Project Title: What can the internet tell us about human language?
Project Description: What are the mental representations of sentences, and how can people’s writing on the internet help us probe these structures? This project uses a new methodology: studying inline hyperlinks in a hypertext corpus, under the hypothesis that inline links are constituents of their host sentences.
The project aims to verify (or refine) this hypothesis by compiling a link-rich hypertext corpus from link blogs and other sources, then using a stochastic parser and other tools to test whether the links are in fact constituents. A hypertext corpus annotated with both links and parses can also be used to probe further questions, such as which kinds of syntactic and semantic structures are good or bad candidates for links, and to study interesting cases of non-constituent hyperlinks.
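As a rough illustration of the constituency check described above, here is a minimal Python sketch. It assumes a parse is available as nested tuples of the form (label, child, ...), with plain strings as word tokens, and that a link is given as a half-open token span. The tree format, the example sentence, and the names collect_spans and is_constituent are hypothetical choices made for this sketch, not the project's actual tools.

```python
# Hypothetical parse of "Kim read the new article yesterday".
# A node is (label, child, ...); a leaf is a plain string (one token).
TREE = (
    "S",
    ("NP", ("NNP", "Kim")),
    ("VP",
        ("VBD", "read"),
        ("NP", ("DT", "the"), ("JJ", "new"), ("NN", "article")),
        ("NP", ("NN", "yesterday"))),
)


def collect_spans(tree, start, spans):
    """Add the (start, end) token span of every node to `spans`.

    Returns the token offset just past the end of `tree`.
    """
    if isinstance(tree, str):
        return start + 1  # a leaf covers exactly one token
    end = start
    for child in tree[1:]:  # tree[0] is the node label
        end = collect_spans(child, end, spans)
    spans.add((start, end))
    return end


def is_constituent(tree, link_start, link_end):
    """True if tokens [link_start, link_end) are exactly one node's yield."""
    spans = set()
    collect_spans(tree, 0, spans)
    return (link_start, link_end) in spans


print(is_constituent(TREE, 2, 5))  # link text "the new article"
print(is_constituent(TREE, 1, 3))  # link text "read the"
```

Under this particular parse, a link on "the new article" matches the object NP, while a link on "read the" matches no node. The answer depends on the parse chosen: with the flat NP used here, "new article" would not count as a constituent, whereas a parse with an intermediate nominal node would accept it.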
UROP responsibilities and projects may vary by student interest and ability. Here are some examples:
- tools and algorithms: refine our set of corpus-building (scraping) tools and our constituency-checking algorithms. Strong PHP (or Python or Perl) experience required. Interest in tree/graph-theoretic algorithms and Java experience a plus.
- pure syntax: research differing notions of constituency and see what our corpus and tools can tell us. Study and classify cases of non-constituent hyperlinks. No programming experience required.
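For a sense of what the corpus-building side involves, the following is a small sketch of pulling inline links and their anchor text out of an HTML fragment using Python's standard html.parser module. The class name InlineLinkExtractor, the helper extract_links, and the sample sentence and URL are assumptions made for illustration; the project's own scraping tools may work quite differently.

```python
from html.parser import HTMLParser


class InlineLinkExtractor(HTMLParser):
    """Collect plain text plus the character span of each <a> element."""

    def __init__(self):
        super().__init__()
        self.pieces = []   # chunks of plain text, in document order
        self.length = 0    # number of characters of text seen so far
        self.links = []    # (href, start_char, end_char)
        self._open = None  # (href, start_char) of the link currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), self.length)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, self.length))
            self._open = None

    def handle_data(self, data):
        self.pieces.append(data)
        self.length += len(data)


def extract_links(html):
    """Return (plain_text, [(href, anchor_text), ...]) for an HTML fragment."""
    parser = InlineLinkExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.pieces)
    return text, [(href, text[s:e]) for href, s, e in parser.links]


sample = 'Kim read <a href="http://example.com/a">the new article</a> yesterday.'
text, links = extract_links(sample)
print(text)
print(links)
```

The plain text can then be passed to a parser, and each anchor's character span mapped onto token offsets to decide whether the link lines up with a constituent.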
Prerequisites: 24.900 or equivalent background required. 24.902 preferred. Required programming skills depend on the individual project (see above).