The Needle in the Haystack Competition

 

Problem Statement:  Consider the problem of classifying the general domain to which a given document belongs.  Suppose, for example, that a web user wanted to look at (gasp!) all documents related to astronomy on the web.  What textual clues might reliably identify an HTML, PDF, or PostScript document as belonging to the domain of astronomy?  What evidence might show that the document really belongs to some other domain?

 

The Needle in the Haystack competition is not intended to answer the above questions definitively so much as to get the members of the class working on the related issues quickly.  Competitors may implement their choice of "quick and dirty" methods that they believe can compete, within a short time frame, against more disciplined approaches.  The competition takes two forms, with half the class competing in each form:

 

The positive question:  Given a set of 2000 documents, all taken recently from the world wide web, and each definitely belonging to one of the domains of biology, chemistry, geology, mathematics, and computer science (and possibly astronomy and physics, if the instructor has the opportunity), find the small subset of 10 to 100 files that belong to one targeted domain.

 

The negative question:  Given a set of 2000 documents, most of which are taken from one domain in the list above, correctly identify the 10 to 100 documents that do not belong to that domain.  Each of these documents will, however, belong to one of the other domains listed above.

 

The systems will be given scores based on the fraction of the desired files that are found ("recall") and the fraction of the files proposed by the system that are actually right ("precision").
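As a rough illustration of the two scores, here is a minimal sketch that computes precision and recall from two sets of filenames.  The function name and the example filenames are hypothetical, and how the two numbers will be combined into a contest score is not specified here.

```python
def precision_recall(proposed, desired):
    """Return (precision, recall) for the files a system proposes,
    measured against the files that were actually desired."""
    proposed, desired = set(proposed), set(desired)
    hits = len(proposed & desired)  # proposed files that are truly desired
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(desired) if desired else 0.0
    return precision, recall


# Hypothetical example: 50 desired files; the system proposes 40 files,
# 30 of which are among the desired ones.
desired = {f"doc{i:04d}.txt" for i in range(50)}
proposed = {f"doc{i:04d}.txt" for i in range(20, 60)}
p, r = precision_recall(proposed, desired)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Note the trade-off: proposing more files can only help recall, but it usually costs precision.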

 

The domains are deliberately chosen to have overlapping vocabulary and concepts.

 

Your system will have training data available for all of the domains, including at least a million words of text per domain, the same text annotated with part-of-speech tags, a set of vocabulary relevant to each domain, and a list of compound words relevant to each domain.

 

Requirements

Each student is responsible for the following, on behalf of the class:

·         Collect at least a million words in your own domain for use as training data.  Keep in mind that what you want from your selection of training data is a sampling of the most useful phrases that identify the domain.

·         In the case of PDF and PostScript files, translate the documents to text.  Run a part-of-speech tagger on the text versions.  Put the text and the tagged text files into the class directory.

·         Collect a good list of domain-specific vocabulary (from your dictionaries) as well as harvested phrases in your domain for the class's use.  At a minimum you must use the Justeson and Katz filters (p. 154 of the text), plus "a few" phrasal patterns of your own that you have found to be effective in your domain, like the "x of y" pattern mentioned in class.  "A few" is generally interpreted to mean at least 3.

·         Make all such domain-specific vocabulary and phrases harvested from your data available to the class.  This means that you must generate a list of the actual phrases, not just their underlying structures, for the collected text in your domain, and leave that list for the rest of the class.  If you think that you would like something more than the actual phrases from whatever your classmates leave in their lists (actual frequencies, relative frequencies, mutual information values, t test scores, chi-squared scores, or whatever), then consider that they would like the same information from you.
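To make the phrase-harvesting requirement concrete, here is a minimal sketch of a tag-pattern filter in the style of Justeson and Katz.  It rests on several assumptions that you should check against the text and your own tagger: that the tagged files use `word/TAG` tokens with Penn Treebank tags, and that the patterns are the seven usually quoted (A N, N N, A A N, A N N, N A N, N N N, N P N, with A = adjective, N = noun, P = preposition).  The function names and the sample sentence are hypothetical.

```python
from collections import Counter

# Tag patterns as commonly attributed to Justeson and Katz (assumed here;
# verify against the text).  A = adjective, N = noun, P = preposition.
JK_PATTERNS = {"AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"}


def coarse_tag(tag):
    """Collapse a Penn Treebank tag to A, N, P, or '-' for anything else."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"
    return "-"


def harvest_phrases(tagged_text):
    """Count the 2- and 3-word phrases whose tag sequence matches a pattern.

    Expects whitespace-separated word/TAG tokens (an assumed format).
    """
    tokens = [tok.rsplit("/", 1) for tok in tagged_text.split() if "/" in tok]
    words = [word.lower() for word, _ in tokens]
    tags = "".join(coarse_tag(tag) for _, tag in tokens)
    counts = Counter()
    for n in (2, 3):
        for i in range(len(tokens) - n + 1):
            if tags[i:i + n] in JK_PATTERNS:
                counts[" ".join(words[i:i + n])] += 1
    return counts


sample = ("igneous/JJ rock/NN is/VBZ formed/VBN by/IN "
          "cooling/NN of/IN magma/NN ./.")
counts = harvest_phrases(sample)
for phrase, freq in counts.most_common():
    print(freq, phrase)
```

Because punctuation and verbs map to `-`, no matched phrase crosses a sentence boundary.  The resulting counts give you the actual phrases plus raw frequencies; relative frequencies or association scores can be computed from the same counter.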

 

Since this is a graded exercise, it is important that you know what can be shared among the group and what cannot.  Feel free to share good sources, good references, outside code (in fact, if you use code taken from the web, you are expected to let others know about it), any tools for preprocessing the raw text to prepare for tagging, and hints or actual help with tagging.  If the class wants to make a joint project out of tagging all the text, you are welcome to do so.

 

Things you are not allowed to share:  code written to collect either phrases or information leading to getting the phrases, unless that code was written by someone other than yourself, in which case you are required to share it (see above).  In cases where you are not sure, consult with the instructor!  The underlying concept here is that there is a basic level that everyone shares, and that you are going to do some kind of "value added" beyond what everyone has, as your personal effort, on which you will be graded.  That includes a reasonable set of terms (vocabulary plus phrases) for your domain, which you will make available to everyone else.  It also includes whatever strategies you implement to be successful in the "needle in a haystack" contest.

 

After putting the vocabulary and phrases (perhaps with frequency information) out on the class directory for your domain, you should turn your attention to the contest.  On the basis of class discussions about breadth of some of the domains, you will be pleased to learn that the chosen target domain is geology.  Although by no means a clearly bounded discipline with regard to document content, it probably is less voluminous in coverage and somewhat easier to pin down than the others.  Also, as it happens, since there are two students working in this domain, there will be at least 2 million words of training text available to the class.  The contest itself will feature either 2000 files of text mostly from geology, with 10 to 100 of the files actually clearly from one of the other domains, or it will feature 2000 files from the five domains, of which only 10 to 100 are from geology - depending on which group you are in.  So you can fine-tune your personal entry into the contest for the domain of geology.  However, please do write most of your system so that if you were required to change to a different domain, you would not have to change much more than directory paths and filenames.
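One way to keep the system portable across domains is to gather everything domain-specific into a single configuration object.  The sketch below is only an illustration under stated assumptions: the directory layout, the file names, and the term-density score are all hypothetical, and the scoring is a deliberately crude "quick and dirty" baseline rather than a recommended approach.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DomainConfig:
    """Everything domain-specific lives here (all paths are hypothetical)."""
    name: str
    term_list: Path   # one vocabulary term per line
    test_dir: Path    # directory holding the contest files as .txt


def load_terms(path):
    """Read a term list, one term per line, lowercased."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def term_density(text, terms):
    """Fraction of word tokens found in the term list (single words only)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in terms for word in words) / len(words)


def rank_files(config):
    """Return the test filenames, highest term density first."""
    terms = load_terms(config.term_list)
    scores = {
        path.name: term_density(
            path.read_text(encoding="utf-8", errors="ignore"), terms)
        for path in sorted(config.test_dir.glob("*.txt"))
    }
    return sorted(scores, key=scores.get, reverse=True)


# Switching domains should only require a different config object.
GEOLOGY = DomainConfig(
    name="geology",
    term_list=Path("class/geology/terms.txt"),   # hypothetical path
    test_dir=Path("contest/positive"),           # hypothetical path
)
```

For the positive question you would propose the files at the top of `rank_files(GEOLOGY)`; for the negative question, the files at the bottom.  Deciding where to cut the list (somewhere between 10 and 100 files) is part of your own strategy.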

 

Because I will do my best to be sure that you do not train on the test sets, I request that you send me a list of the URLs that you downloaded from, if at all possible.  If it's already too late for that, give me as much information as you can about your sources, so that I have a reasonable chance to get the test files from somewhere else.

 

Due dates:

 

October 18 - all training files (text, tagged text, vocabulary, phrases) in the class directory.  If you're at a conference out of the country on that date, please get as much as possible online before you leave, but you can finish when you get back. :)

October 18 - your proposed approach must be described in at least general terms to the instructor.  The instructor reserves the right to suggest or require modifications to your proposed approach.  Once that approach is mutually agreed upon, let the instructor know about your progress at least once per week, preferably twice a week.

November 1 - Your program/system must be ready to run on the test sets by this date.  Test sets will be made available.  Run the test, report on the names of the target files identified by your system (or, equivalently, their position in the list of filenames), inspect the actual files that your system returns, give your assessment of accuracy, and what problems your system may be having.  This should be contained in a written report that you will hand in on Friday, November 3, and will also be the subject of our class discussion on that date.

 

Between October 18 and November 1, I expect that one or two class discussions will be devoted to questions and comments that members of the class have regarding the tasks they are attempting.  Since no two students will be using the same approach on the same version of the contest, there is no reason to be super-secretive about general approaches.  In my experience, it has been very helpful for each student to describe in general terms (in about five minutes) what they are doing, with the rest of the class contributing advice.

 

As mentioned in class, if you do a disciplined job of whatever you agree with the instructor to do, your grade does not depend on how successful your approach is.  It does not depend on direct performance: the various systems will be scored on how many files are correctly identified and how many of the files identified are incorrect, but your grade is not dependent on that score.  Nor does it depend on relative performance: you don't have to worry about whether your classmates' systems score higher or lower than yours.  However, your written report turned in on November 3 should contain enough detail about what you have done to convince a reader that you have done a disciplined implementation of your proposed approach.