The Needle in the Haystack Competition
Problem Statement: Consider the problem of classifying the
general domain to which a given document belongs. Suppose, for example, that a web user wanted to look at (gasp!)
all documents related to astronomy on the web.
What textual clues might reliably identify an HTML, PDF, or PostScript
document as belonging to the domain of astronomy? What evidence might determine that the document really belonged
to some other domain?
The Needle in the Haystack competition is
not intended to definitively answer the above questions so much as to involve
the members of the class in related issues in a rapid fashion. Competitors may implement their choice of
"quick and dirty" methods that they believe may be competitive against
more disciplined approaches in a short time frame. The competition takes two forms, with half the class competing in
each form:
The positive question: Given a set of 2000 documents, all taken
from the world wide web recently, and all definitely belonging to the domains
of biology, chemistry, geology, mathematics, and computer science (and possibly
astronomy and physics, if the instructor has the opportunity), find the small
subset of 10 to 100 files that belong to one targeted domain.
The negative question: Given a set of 2000 documents, most of
which are taken from one domain in the list above, correctly identify the 10 to
100 documents that do not belong to that domain. These documents will, however, belong to one
of the other domains listed above.
The systems will be given scores based on
what fraction of the desired files are found ("recall") and what fraction of
the files proposed by the system are actually right
("precision").
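As a rough illustration of the two scores, here is a minimal sketch that computes them from two lists of filenames. The function name and the example filenames are hypothetical, and the exact scoring formula used in the contest is not specified here, so treat this as the standard textbook definition rather than the official scorer.

```python
def precision_recall(proposed, desired):
    """Return (precision, recall) for two collections of filenames.

    proposed: files the system returned
    desired:  files that actually belong to the target set
    """
    proposed, desired = set(proposed), set(desired)
    hits = len(proposed & desired)  # files that were both proposed and desired
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(desired) if desired else 0.0
    return precision, recall


# Hypothetical example: 4 files proposed, 3 of them right, 6 desired overall.
proposed = ["doc0007.txt", "doc0042.txt", "doc0100.txt", "doc1999.txt"]
desired = ["doc0007.txt", "doc0042.txt", "doc0100.txt",
           "doc0311.txt", "doc0500.txt", "doc0777.txt"]
p, r = precision_recall(proposed, desired)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

A system that proposes very few files can score high on precision while missing most of the targets, which is why both numbers are reported.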
The domains are deliberately chosen to
have overlapping vocabulary and concepts.
Your system will have training data
available for all of the domains, including at least a million words of text
per domain, the same text annotated with part of speech, a set of vocabulary
relevant to each domain, and a list of compound words relevant to each domain.
Requirements
Each student is responsible for the
following, on behalf of the class:
· Collect at least a million words in your
own domain for use as training data.
Keep in mind that what you want from your selection of training data is
a sampling of the most useful phrases that identify the domain.
· In the case of PDF and PostScript files,
translate the documents to text. Run a
part of speech tagger on the text versions.
Put the text and the tagged text files into the class directory.
· Collect a good list of domain-specific
vocabulary (from your dictionaries) as well as harvested phrases in your domain
for the class's use. At a minimum you
must use the Justeson and Katz filters (p. 154 of the text), plus "a few"
phrasal patterns of your own that you have found to be effective in your
domain, such as the "x of y" pattern mentioned in class. "A few"
is generally interpreted to mean >= 3.
· Make all such domain-specific vocabulary
and phrases harvested from your data available to the class. This means that you must generate a list of
the actual phrases, not just their underlying structures, for the collected
text in your domain, and leave that list for the rest of the class. If you think that you would like something
more than the actual phrases, like actual frequencies, relative frequencies,
mutual information values, t-test scores, chi-squared scores, and so on,
from whatever your classmates leave in their lists, then consider that they
would like the same information from you.
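The phrase-harvesting requirement can be sketched as follows. This is only an illustration, assuming the tagged text is available as (word, tag) pairs with Penn Treebank-style tags; your tagger's output format and tag set may differ. The seven patterns are the Justeson and Katz part-of-speech patterns (A = adjective, N = noun, P = preposition); the helper names are hypothetical. Note that the "x of y" pattern is a special case of N P N.

```python
from collections import Counter

# Justeson and Katz tag patterns: A = adjective, N = noun, P = preposition.
JK_PATTERNS = {"AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"}


def coarse_tag(tag):
    """Map a Penn Treebank-style tag (an assumption) to A, N, P, or '-'."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":  # note: IN also covers subordinating conjunctions
        return "P"
    return "-"


def harvest_phrases(tagged_tokens):
    """Count every 2- and 3-word span whose tag sequence matches a pattern."""
    words = [word.lower() for word, _ in tagged_tokens]
    tags = "".join(coarse_tag(tag) for _, tag in tagged_tokens)
    counts = Counter()
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            if tags[i:i + n] in JK_PATTERNS:
                counts[" ".join(words[i:i + n])] += 1
    return counts


# Hypothetical tagged sentence, for illustration only.
tagged = [("The", "DT"), ("igneous", "JJ"), ("rock", "NN"),
          ("formation", "NN"), ("shows", "VBZ"), ("zones", "NNS"),
          ("of", "IN"), ("alteration", "NN"), (".", ".")]

phrase_counts = harvest_phrases(tagged)
for phrase, freq in phrase_counts.most_common():
    print(f"{phrase}\t{freq}")  # actual phrase plus its raw frequency
```

Writing out the actual phrases together with their raw frequencies, as above, gives classmates enough to compute relative frequencies or association scores on their own.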
Since this is a graded exercise, it is
important that you know what can be shared among the group and what
cannot. Feel free to share good
sources, good references, outside code (in fact, if you use code taken from the
web, you are expected to let others know about it), any tools for
preprocessing the raw text to prepare for tagging, and hints or actual help
with tagging. In fact, if the class
wants to make a joint project out of tagging all the text, you are welcome to
do so.
Things you are not allowed to share: code written to collect either phrases or
information leading to getting the phrases (unless that code was written by
someone other than yourself, in which case you are required to share it; see
above). In cases where you are not sure, consult with the instructor! The
underlying concept here is that there is a basic level that everyone shares,
and that you are going to do some kind of "value added" beyond what
everyone has, as your personal effort, on which you will be graded. That includes a reasonable set of terms
(vocabulary plus phrases) for your domain, which you will make available to
everyone else. It also includes
whatever strategies you implement to be successful in the "needle in a
haystack" contest.
After putting the vocabulary and phrases (perhaps
with frequency information) in the class directory for your domain, you
should turn your attention to the contest.
On the basis of class discussions about the breadth of some of the domains,
you will be pleased to learn that the chosen target domain is geology. Although by no means a clearly bounded
discipline with regard to document content, it probably is less voluminous in
coverage and somewhat easier to pin down than the others. Also, as it happens, since there are two
students working in this domain, there will be at least 2 million words of
training text available to the class.
The contest itself will feature either 2000 files of text mostly from
geology, with 10 to 100 of the files actually clearly from one of the other
domains, or it will feature 2000 files from the five domains, of which only 10
to 100 are from geology - depending on which group you are in. So you can fine-tune your personal entry
into the contest for the domain of geology.
However, please do write most of your system so that if you were
required to change to a different domain, you would not have to change much
more than directory paths and filenames.
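One way to keep the domain swappable is to isolate every path behind a small configuration object, so that the rest of the system never mentions geology directly. The sketch below is only an illustration: the directory layout, the file names, and the class name DomainConfig are all hypothetical, not the actual structure of the class directory.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DomainConfig:
    """All domain-specific locations in one place (hypothetical layout)."""
    name: str          # e.g. "geology"
    class_dir: Path    # root of the shared class directory (assumed)

    @property
    def text_dir(self) -> Path:
        return self.class_dir / self.name / "text"

    @property
    def tagged_dir(self) -> Path:
        return self.class_dir / self.name / "tagged"

    @property
    def vocab_file(self) -> Path:
        return self.class_dir / self.name / "vocabulary.txt"

    @property
    def phrase_file(self) -> Path:
        return self.class_dir / self.name / "phrases.txt"


# Switching the target domain then means changing a single argument.
target = DomainConfig(name="geology", class_dir=Path("class_dir"))
print(target.vocab_file)
```

Everything downstream takes a DomainConfig as input, so retargeting to, say, chemistry would touch only the line that constructs it.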
Because I will do my best to be sure that
you do not train on the test sets, I request that you send me a list of the URLs
that you downloaded from, if at all possible.
If it's already too late for that, give me as much information as you
can about your sources, so that I have a reasonable chance to get the test
files from somewhere else.
Due dates:
October 18 - all training files (text, tagged text, vocabulary, phrases) in the class
directory. If you're at a conference out
of the country on that date, please get as much as possible online before you
leave, but you can finish when you get back. :)
October 18 - your proposed approach must be described in at least general terms to the
instructor. The instructor reserves the
right to suggest or require modifications to your proposed approach. Once that approach is mutually agreed upon,
let the instructor know about your progress at least once per week, preferably
twice a week.
November 1 - Your program/system must be ready to run on the test sets by this
date. Test sets will be made
available. Run the test, report the
names of the target files identified by your system (or, equivalently, their
position in the list of filenames), inspect the actual files that your system
returns, give your assessment of accuracy, and describe what problems your system may be
having. This should be contained in a
written report that you will hand in on Friday, November 3, and will also be
the subject of our class discussion on that date.
Between October 18 and November 1, I
expect that one or two class discussions will be devoted to questions and
comments that members of the class will have regarding the tasks they are
attempting. Since no two students will
be using the same approach on the same version of the contest, there is no
reason to be super-secretive about general approaches. In my experience, it has been very helpful
for each student to describe in general terms (in about five minutes) what they are
doing, with the rest of the class contributing advice.
As mentioned in class, if you do a
disciplined job of whatever you agree with the instructor to do, your grade
does not depend on how successful your approach is, either as direct
performance (even though the various systems will be scored on how many files
are correctly identified and how many of the files identified are incorrect,
your grade is not dependent on that score) or as relative performance (you
don't have to worry about whether all of your classmates' systems' scores are
higher or lower than your system's).
However, your written report turned in on November 3 should contain
enough details of what you have done to convince a reader that you have done a
disciplined implementation of your proposed approach.