Introduction and Negative Classification

Hello! My name’s Patrick Hulin, and I’m a new UROP working on the project. I’m a freshman at MIT, and I’m probably going to be studying mathematics.

My first task has been classifying non-constituent hyperlinks. I’ve gone through about 50 of them, chosen randomly. Non-constituents are already split up into failure categories, so I sampled within each category, taking a number of links proportional to the size of that category. Here’s the final data:
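A rough sketch of that kind of proportional sampling, for illustration only: the function name, the category sizes, and the "at least one per category" rule are all assumptions, not the project's actual script or counts.

```python
import random

def proportional_sample(groups, total, seed=0):
    """Draw about `total` items, each group's share proportional to its size."""
    rng = random.Random(seed)
    population = sum(len(items) for items in groups.values())
    sample = {}
    for name, items in groups.items():
        # Round the proportional share; keep at least one item per non-empty group
        # (an assumption, so that small categories are still represented).
        k = max(1, round(total * len(items) / population)) if items else 0
        sample[name] = rng.sample(items, min(k, len(items)))
    return sample

# Hypothetical category sizes, NOT the real counts from the project.
groups = {
    "not_constituent, missing_after": list(range(600)),
    "multiple_constituents, missing_before": list(range(300)),
    "not_constituent, missing_before_after": list(range(100)),
}
sample = proportional_sample(groups, total=50)
for name, picked in sample.items():
    print(name, len(picked))
```

Because of the rounding and the one-item floor, the final count can land slightly off the target.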

not_constituent, missing_after:
68131: Incorrect. Completely stupid parse.
23048: Correct. Dropped relative clause.
23747: Incorrect. Completely stupid parse.
34020: Incorrect. Completely stupid parse.
93262: Parser error.
40805: Incorrect. Completely stupid parse.
74397: Incorrect. Completely stupid parse.
13462: Incorrect.
76720: Correct. Dropped complement of PP.
1495: Ambiguous. Attachment ambiguity.
79405: Correct. Dropped adjunct, end of VP. Could be sentence.
8691: Incorrect. Completely stupid parse.
28191: Incorrect. Proper noun.
48932: Incorrect. Completely stupid parse.
15864: Incorrect. Completely stupid parse.
89638: Incorrect. Proper noun. Completely stupid parse.
5602: Incorrect. Stupid linking.
38405: Correct. Dropped PP. Completely stupid parse.
13618: Incorrect. Completely stupid parse.
59394: Incorrect. Completely stupid parse.
71513: Incorrect. Proper noun. Fragment.
9121: Incorrect. Stupid linking.
5737: Incorrect. Proper noun. Colon.
51898: Incorrect. Completely stupid parse.
18481: Incorrect. Completely stupid parse.
25320: Correct. Dropped relative clause after (restrictive). Completely stupid parse.
34849: Correct. Dropped complement of PP. Completely stupid parse.
654: Incorrect. Proper noun. D is specifier.
53692: Incorrect. Proper noun. Completely stupid parse.
57855: Incorrect. Completely stupid parse.

multiple_constituents, missing_before:
68860+0: Incorrect. D is specifier. Completely stupid parse.
89748+0: Incorrect. D is specifier. Completely stupid parse.
9946+2: Not a sentence. Completely stupid parse.
76995+3: Incorrect. D is specifier.
22793+2: Incorrect. D is specifier. Completely stupid parse.
79375+1: Incorrect. D is specifier.
96862+5: Incorrect. Possessive. Completely stupid parse.
37698+2: Incorrect. Punctuation. Completely stupid parse.
46544+2: Incorrect. NP branching. Completely stupid parse.
74354+1: Incorrect. NP branching. Completely stupid parse.
48911+3: Incorrect. D is specifier. Completely stupid parse.
62017+0: Incorrect. NP branching. Completely stupid parse.
94082+3: Incorrect. D is specifier.
89569+10: Incorrect. Movement. Completely stupid parse.
96724+3: Incorrect. D is specifier. Completely stupid parse.
92307+0: Incorrect. Punctuation.
88518+11: Incorrect. Comma-separated list. Completely stupid parse.

not_constituent, missing_before:
63138+4: Incorrect. Comma-separated list. Completely stupid parse.
85211+3: Incorrect. Unknown. Completely stupid parse.
83933+0: Incorrect. Proper noun. Completely stupid parse.
67273+4: Mislink by poster. Completely stupid parse.
85094+6: Incorrect. D is specifier. Completely stupid parse.
57566+8: Not a sentence. Completely stupid parse.
20552+2: Incorrect. D is specifier.
58616+4: Incorrect. Punctuation. Completely stupid parse.

multiple_constituents, missing_after:
29457+1: Incorrect. Completely stupid parse.
71020+0: Correct. Dropped PP.
85340+1: Incorrect. NP branching.
52976+0: Incorrect. Punctuation. Completely stupid parse.
19937+0: Incorrect. Stupid linking. Completely stupid parse.
8163+0: Incorrect. Completely stupid parse.
50757+0: Incorrect. Fragment. Completely stupid parse.
84700+1: Correct. Dropped relative clause.

multiple_constituents, missing_before_after:
95747+0: Incorrect. NP branching.
73617+3: Incorrect. NP branching.

not_constituent, missing_before_after:
70047+6: Correct. Dropped relative clause.
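
Most entries above follow a loose "id: verdict. note. note." pattern, so they can be tallied mechanically. Here is a minimal sketch under that assumption; `parse_entry` is a made-up helper, and only a handful of the lines above are included as sample input.

```python
from collections import Counter

def parse_entry(line):
    """Split 'id: Verdict. Note. Note.' into (id, verdict, [notes])."""
    link_id, rest = line.split(": ", 1)
    parts = [p.strip() for p in rest.rstrip(".").split(". ")]
    return link_id, parts[0], parts[1:]

# A few lines copied from the data above.
entries = [
    "68131: Incorrect. Completely stupid parse.",
    "23048: Correct. Dropped relative clause.",
    "93262: Parser error.",
    "68860+0: Incorrect. D is specifier. Completely stupid parse.",
]

verdicts = Counter()
notes = Counter()
for line in entries:
    _, verdict, tags = parse_entry(line)
    verdicts[verdict] += 1
    notes.update(tags)

print(verdicts.most_common())
print(notes.most_common())
```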

I chose semi-standardized terminology to annotate each entry. “Completely stupid parse” means the parser parsed multiple sentences as one, which generally required some very interesting acrobatics on its part. There were definitely some true negatives (actual non-constituents), but most of the errors came from the parser itself. The other big factor is NP structure: when the Penn Treebank was originally constructed, the researchers left all the NPs flat, since annotating their internal structure slowed them down significantly for little benefit. As a result, any parser trained on the Treebank leaves NPs flat as well. Many links that the parser marks as non-constituents are really just spans inside a flat NP: they are dominated by an NP node but get no node of their own.
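
To make the flat-NP problem concrete, here is a small sketch that checks whether a word span is a constituent of a bracketed tree. The example phrase, the helper names, and the inner node label (NML, used here purely for illustration) are assumptions, not output from the project's parser.

```python
def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()

def constituent_spans(tree):
    """Return every (start, end) word span that some node covers exactly."""
    spans = set()

    def walk(node, start):
        if isinstance(node, str):     # a single word occupies one position
            return start + 1
        end = start
        for child in node[1]:
            end = walk(child, end)
        spans.add((start, end))
        return end

    walk(tree, 0)
    return spans

flat = parse_tree("(NP (DT the) (NN computer) (NN science) (NN department))")
nested = parse_tree("(NP (DT the) (NML (NN computer) (NN science)) (NN department))")

link = (1, 3)  # the words "computer science"
print(link in constituent_spans(flat))
print(link in constituent_spans(nested))
```

In the flat tree the linked words have no node of their own, so the check fails even though the link is a perfectly sensible unit. With internal NP structure it succeeds.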

So what are the next steps? One of the most important is to break sentences ourselves so multi-sentence entries don’t confuse the parser. For the other main cause of false negatives, NP branching, we’re working on incorporating the work of David Vadas at the University of Sydney into the parsing script.
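
As a very naive sketch of what breaking sentences ourselves could look like (an illustration only, not the approach the project has settled on), a regular expression can split on sentence-final punctuation followed by whitespace and a capital letter. The function name is made up.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Return the sentences in `text`, split on a simple punctuation heuristic."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]

for sentence in split_sentences("I read the post. Then I left a comment."):
    print(sentence)  # each sentence would then go to the parser on its own
```

A heuristic like this will wrongly split after abbreviations such as "Dr.", so a real pipeline would likely need a proper sentence tokenizer.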
