CS 8633
Natural Language Processing
Instructor:
Lois Boggess
321 Butler (7507)
Office Hours:
Monday 10-12,
Tuesday, Wednesday 1-2:30,
Other hours by appointment
Class meets: MWF 9-9:50, Butler 102
Text: Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press
Other Reference Material:
James Allen. 1995. Natural Language Understanding, 2nd edition. The Benjamin/Cummings Publishing Company.
Terry Winograd. 1983. Language as a Cognitive Process, Volume 1: Syntax. Reading, MA: Addison-Wesley Publishing Company. My copy lives in the room known as the Huddle room, third floor of Butler. Solutions to many grammar problems using ATNs are provided. Appendix B is an 80-page outline of English syntax that is a great first place to go if you have questions about fundamental features of clauses and phrases.
Topics to be covered:
For three weeks of lecture we will be covering the basics of English grammar in general and chart parsers and ATNs in particular, sufficiently well that everyone can
1) algorithmically solve a parsing problem with both tools (i.e., trace through a parse of a sentence with a given grammar appropriate to the parsing tool, with pencil and paper), and
2) actually parse a set of sentences (a different set of sentences for each member of class) on the computer using an ATN.
In the following seven weeks we will study statistical methods for syntax (word collocations, lexical acquisition, n-gram models, hidden Markov models, part-of-speech tagging, and probabilistic grammars) and word sense disambiguation (a semantic issue).
For the last three to four weeks we will talk about applications of statistical language processing: clustering techniques, information retrieval, and text categorization. If class members have particular topics in NLP that they'd like to have included in the course, it may be possible to include papers on those topics in these latter weeks.
Extensive knowledge of English grammar is not necessary to be successful in this class. Over the history of the course, the grade distributions for native and non-native speakers of English have been almost the same. The major elements of English grammar which are needed for the ATN assignment are covered in lecture. All students are invited to discuss in detail with the instructor all expected parsed forms for their individual sentences for this assignment. Furthermore, any student who completes the assignment before the deadline may have either their algorithm or their output (but not both) critiqued by the instructor in advance of submitting it for a grade. The algorithms and techniques used for lab assignments after the first are applicable to any large body of text in any language - Japanese, for example.
Credit for the course comes from the following:
Projects:
ATN or chart parser for 5 to 7 sentences: 15%
Midterm project: 20-25%
In-class and take-home test problems: 25-30%
Final project: 20-25%
Final paper: 15%
The in-class and take-home problems are scattered throughout the course. Taken all together, they account for one-quarter to slightly less than one-third of the credit for the class.
The midterm project is expected to be a joint project worthy of being written up and submitted to the Conference on Empirical Methods in Natural Language Processing. The grades for this project are individual grades: each student will have individual responsibilities for this project, including data collection, preparation of training and test data, checking another student's segment of the training and test data, performing an experiment similar to, but distinctly different from, the other students' experiments on the common training and test data, and reporting the results in a form which facilitates comparison with the results of the other members of the class. Although the grade for this assignment normally is associated with the experiment itself and the report thereof, notable lateness or laxness in preparing the common training and testing materials could affect an individual's grade. If a conference paper results from the experiment, the bulk of the paper will be written by the instructor, but all students who participate in the experiment will have co-authorship. (The last time this course was offered, all the students were co-authors of a paper of this sort which was published and presented at an IEEE-sponsored international conference.)
The final project is negotiated between the individual student and the instructor. It is expected to be a substantial undertaking, but normally is related to a topic of investigation which is of interest to the student. Note that the final project, together with its paper, accounts for more than a third of the grade for the course (35-40%). It is not unusual for a student to provide documentation for the experimental portion of the final project which is separate from and more detailed than the description of that experiment which is appropriate to a conference-style paper. Moreover, the paper has requirements, such as a literature review of relevant research, which are distinct from the minimal requirements related to reporting the design and execution of an experiment.
Academic Honesty:
Students are expected to follow the CS department policy on academic honesty with regard to homework, labs, and citations for papers. Academic honesty violations on any assignment will result in a minimum penalty of a grade of zero on that assignment and may be grounds for assignment of an F in the course. All students who take a CS course are required to read the academic honesty policy, and are responsible for its content, even if they choose not to read it.
Schedule - under constant construction. Due dates will appear here.
| Date | Topic | Reference/Please read |
| August 21-25 | Overview: NLP at the lexical, morphological, part-of-speech, syntactic, semantic, pragmatic, discourse, and dialog levels. Various "war stories" regarding creative use of language as the norm for communication. Need for corpus-based language processing. Intro to ATNs - demo. | |
| August 29-Sept 1 | Recursive transition networks. Demo highlighting passive and relative clause treatment in the sample ATN that the class builds on. Grammar issues in sentences of the first assignment. | Parsing assignment handed out August 30. Lexicons (ATN dictionaries) due by email Tuesday, Sept. 5 at noon. |
| | Information on ATNs | Handout on ATNs (see above). Chapter 3, Section 5 of Allen. Also pp. 52-66 of Winograd. The appendix of Winograd has a full-blown ATN grammar for English. |
| Labor Day | Holiday | |
| September 5, noon | ATN dictionaries due | |
| September 6, 8 | Grammar issues - passives, gerund phrases, relative clauses, infinitives, subordinate clauses, strans verbs, thematic roles | |
| | Chart Parsing | |
| | Case grammars/thematic roles | |
| September 18 | ATNs due | Review of those parts of M&S Ch. 3 that we didn't encounter firsthand in the ATNs. |
| September 22 | Chart parsing problem due | |
| Sept 22-29 | Corpora, function words, text types, web vs. technical writing, document styles, description of information gathering, sources and domains for our next project. Collocations, frequencies, Zipf's law, KWIC, mutual information. | Ch. 1, parts of Ch. 2 that relate to mutual information, p. 180 of Ch. 5 (see also pp. 169, 170). MI is subject to the sparsity-of-data problem. |
| Oct 2-4 | No class. | |
| Oct 6, 9-13 | Probability, conditional probability, Bayes' theorem, text manipulation issues, morphology, tagging, t test, chi-squared test | Parts of Ch. 2, all of Ch. 4, Ch. 5 up to but not including 5.3. See the description of the Needle in the Haystack problem (mostly a copy of email that you've already seen, but including the initial description given orally in class in September). |
| Oct 18 | Due: domain-dependent data at class directory: 1 million words of text from your domain, tagged text, vocabulary and phrases. Due: Proposal for your approach to the Needle in the Haystack competition - preferably through conference with the instructor (jury duty permitting) followed by email summarization of the agreed-upon approach | |
| November 1 | Due: Your system for the Needle in the Haystack competition (must be operational and ready for you to run the test) | |
| November 3 | Due: Report on system performance, analysis of strengths and weaknesses | |