Non-constituent Findings

From the first 6-7 “non-constituent” links that I hand-checked, a pattern emerged: hyperlinks that were nearly constituents but left off adjuncts such as prepositional phrases.

Since then, I have hand-checked about 60 non-constituent links and have seen very similar patterns.

The results mostly show complements/adjuncts (I didn’t attempt to distinguish between the two) being left out of the hyperlink. Other results show determiners being left out of the hyperlink. The latter, however, is expected: if we made trees according to DP theory (where the NP is the complement of a D head), a determiner-less link would still be a constituent, and everything would be fine and dandy. However, the Stanford trees have their determiners (DT) as part of the noun phrase (possibly in the specifier position, if there were such a distinction here). As a result, links with missing determiners get judged as non-constituents, even though under a DP analysis they would be constituents.
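To make the contrast concrete (these fragments are my own illustration, not parser output):

(DP (D the) (NP (N water)))   <- DP analysis: “water” alone is a node
(NP (DT the) (NN water))      <- Stanford/Penn style: no node excludes “the”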

In other cases, leading adjectives were also left out. For example, given “a slimy, warty, green frog”, the author would only make “green frog” or “warty, green frog” a hyperlink, leaving out (predictably) the determiner and (maybe unpredictably) one or more leading adjectives; here, “slimy” or “slimy, warty” were left out. I believe AdjPs are adjuncts to the NP (so most substrings of that string should be constituents), but the flat noun phrase structure given by the parser means no node dominates exactly those substrings.
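Concretely (again, my own illustration rather than actual parser output), a nested adjunction structure would make the smaller strings constituents, while the parser’s flat NP does not:

(NP (AdjP slimy) (NP (AdjP warty) (NP (AdjP green) (NP frog))))      <- nested: “warty, green frog” is a node
(NP (DT a) (JJ slimy) (, ,) (JJ warty) (, ,) (JJ green) (NN frog))   <- flat: it is not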

Surprisingly, a large number of these links were simply victims of erroneous parsing and data preparation. Where the Stanford parser fails most is on comma-delimited lists and the like. Items in conjoined sequences are most definitely constituents, but these tend to be misparsed.

Another fault of the Stanford parser (and perhaps my own) is that final punctuation caused constituents to be judged as non-constituents. Take, for example, a hypothetical link with the text (without quotes) “the water.”, presumably as the direct object of a verb. “The water” is parsed normally, and if you checked it for constituency yourself, it would be a constituent. But the additional period “.” at the end is part of the string. Why is this a problem? Because, even though punctuation is given its own node in the tree, it is normally placed at the very end of the tree, attached high outside the phrase it follows, i.e. not where we expect it to be.

We would expect “the water.” to match something like:

(DT the) (NN water) (. .))))…

Here, there would probably be at least one node dominating either just “the water” or all of “the water .”, and that would give us our constituent. However, the Stanford parser usually parses the phrase like so:

(DT the) (NN water))))…(. .)

And this is not what we expect at all. A simple (but not perfect) fix would probably be to strip out all punctuation save for quotation marks (though these cause problems as well) and perhaps commas when creating the regular expression pattern for a link.
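A minimal sketch of that fix in PHP (clean_link_text is a hypothetical helper, not something from the actual scripts):

<?php
// Hypothetical helper: strip punctuation from a link's text before
// building its regular expression pattern, keeping quotation marks
// and commas per the proposed fix.
function clean_link_text($text) {
    // Remove everything that is not a word character, whitespace,
    // a quotation mark, an apostrophe, or a comma.
    $text = preg_replace('/[^\w\s\'",“”]/u', '', $text);
    // Collapse any doubled-up whitespace left behind.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo clean_link_text('the water.'); // prints "the water"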

Below is a list of my findings and comments. Anything marked with “incorrect” is what I think should be a constituent under a manual parse, but was misjudged. Anything without an “incorrect” was judged correctly as a non-constituent. I added various comments to try to categorize each kind of failure. I initially didn’t note what kind of phrase was left off from complement/adjunct-less hyperlinks (noted as “C/A chopped off”).

87943+1 = incorrect, det chopped off
87943+2 = C/A chopped off
87943+8 = C/A chopped off
87944+0 = C/A chopped off
87944+5 = C/A chopped off
87944+6 = incorrect, stanford tree is wrong
87944+9 = incorrect, constituent
87947+0 = incorrect, final punctuation was included
87949+2 = C/A chopped off, stanford tree is ambiguous
87949+3 = incorrect, final punctuation was included
87952+3 = incorrect, stanford tree is wrong
87952+4 = C/A chopped off
87952+8 = have no clue, a string containing the name of the news source?
87952+15 = incorrect, final punctuation was included
87954+3 = incorrect, final punctuation was included
87954+4 = incorrect, final punctuation was included
87956+1 = incorrect, det chopped off (in NP->D system, correct, else DP->NP, incorrect)
87959+0 = C/A chopped off
87960+0 = C/A (appositive) chopped off
87960+2 = incorrect, det chopped off
87961+3 = incorrect, stanford tree is ambiguous
87962+0 = C/A (adverb/past participle) chopped off
87962+2 = C/A (non-restrictive? relative) chopped off
87963+2 = incorrect, stanford tree is wrong: misparsed an undelimited list
87963+3 = incorrect, stanford tree is wrong: misparsed an undelimited list
87964+0 = incorrect, stanford tree is wrong: misparsed strange (SLYT), just bad entry formatting
87971+2 = incorrect, stanford tree is wrong: misparsed list and used verb variant of the noun
87971+4 = C/A (second half of conjunction nested in VP) chopped off
87972+0 = incorrect, stanford tree is wrong
87974+6 = incorrect, stanford tree is wrong, misparsed verbal arguments, misplaced preposition
87975+0 = incorrect, stanford tree is wrong, misparsed topicalization/clefting, something like that
87976+1 = incorrect, initial punctuation (") was included
87976+2 = incorrect, final punctuation (.") was included
87976+3 = incorrect, final punctuation (.) was included
87977+0 = complement to P (of) chopped off...strange..."most of"
87978+1 = C/A (preposition) chopped off
87981+0 = C/A (appositive) chopped off
87981+4 = C/A (preposition) chopped off, stanford tree is wrong, misparsed "of" possession
87982+8 = C/A (preposition) chopped off, det ("'s" possession) chopped off
87982+9 = det and leading adjective chopped off
87983+1 = C/A (second half of conjunction) chopped off
87984+0 = C/A (appositive) chopped off
87986+2 = C/A (preposition) chopped off, det and leading adjective chopped off
87986+4 = incorrect, stanford tree is wrong, misparsed participle attachment after complete VP
87987+4 = det and leading adjectives chopped off
87987+26 = following adjectives and noun head chopped off...strange..."the only"
87987+36 = incorrect, no clue, should be a constituent
87988+9 = incorrect, stanford tree is wrong, misparsed list of things
87988+11 = C/A (other parts of conjunction, comma-list) chopped off
87989+1 = det chopped off
87989+2 = C/A (preposition in passive construction) chopped off
87989+7 = ??? unsure, what to do with "as"
87991+4 = C/A (other parts of conjunction nested in VP, comma-list) chopped off
87991+17 = det and leading adjectives chopped off
87991+20 = incorrect, det chopped off
87991+23 = C/A (preposition) chopped off
87991+24 = det chopped off, final punctuation included
87991+27 = vP chopped off from TP, strange..."already have (won one)"...
87991+39 = C/A (preposition) chopped off
87992+1 = C/A (preposition) chopped off, passive construction may be a problem

I also wrote some code to get the rest of an “incomplete” constituent, but haven’t fully tested it yet. It reads downwards through the tree and should be able to handle “C/A chopped off” links. As for “det chopped off” and “leading adjectives chopped off” links, I just need to reverse the direction in which it looks.
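Here is the gist of it (a sketch under my own assumptions about the tree representation; yield_of and complete_constituent are names I made up for illustration):

<?php
// Sketch only: assume each tree node is an array with either a
// 'word' (leaf) or a list of 'children' (internal node).
function yield_of($node) {
    if (isset($node['word'])) {
        return array($node['word']);
    }
    $words = array();
    foreach ($node['children'] as $child) {
        $words = array_merge($words, yield_of($child));
    }
    return $words;
}

// Find the smallest node whose yield begins with the link's tokens
// and return its full yield: the matched text plus whatever
// complement/adjunct was chopped off. Comparing against the *end*
// of the yield instead would handle the "det chopped off" and
// "leading adjectives chopped off" cases.
function complete_constituent($node, $matched) {
    if (isset($node['children'])) {
        foreach ($node['children'] as $child) {
            $found = complete_constituent($child, $matched);
            if ($found !== null) {
                return $found;
            }
        }
    }
    $words = yield_of($node);
    if (array_slice($words, 0, count($matched)) === $matched) {
        return $words;
    }
    return null;
}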

More Issues

Just kidding! I have success now. There was weird stuff going on with PHP, but I got that fixed up and ran the fix_entries.php script on my remote Windows machine. Unlike on Scripts, the Stanford parser on a Windows 7 machine with only 512 MB of RAM worked almost like a charm: entries that were previously unparsable due to the memory ceiling were now mostly parsable. (It still choked on some entries, because my remote system doesn’t have a lot of RAM to work with.) And since it worked on my remote system, I could run it on my own laptop, which I’m doing right now, and entries that ran out of memory on my remote system are parsing correctly! (The ones that still aren’t parsing correctly are, after inspection, “strange” entries with lists of book titles, etc.)

Everything seems to be going swimmingly. At some point, I’m going to have to rerun the constituency script (or modify it to skip ones it already has judgements for).
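If I go the modification route, it is probably just a query filter; something like this (the judgements table and its columns are made up for illustration):

<?php
// Sketch: only fetch hyperlinks that do not already have a judgement.
$result = mysql_query(
    "SELECT h.* FROM hyperlinks h
     WHERE h.id NOT IN (SELECT hyperlink_id FROM judgements)"
);
while ($row = mysql_fetch_assoc($result)) {
    // ... run the existing constituency test on $row ...
}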

Memory Issues

After testing various things with the parsers, on my own laptop and on the Scripts servers, I modified the code to exclude Berkeley parses, not because I haven’t figured out how to get the output, but because the Berkeley parser runs out of memory.

I was reading through the Scripts blog and they mentioned something about the JVM and memory issues. In fact, they said the option -Xmx128M is the default for their JVM; that is, they initially only let the JVM allocate 128 MB of memory. This happens to be a problem for the Berkeley parser, which runs out of memory rather quickly.

For parsing the sentence “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”, Windows commits 775.75 MB for the parser (it actually only uses 571.08 MB), which is way over the 128 MB initial allocation that Scripts imposes. Scripts claims you can try increasing the memory to 256 MB, perhaps even 512 MB, but not 768 MB; they said it became unstable at 768 MB. When I tried it on the Scripts server with 256 MB, it still wouldn’t run. And at 512 MB, the JVM couldn’t allocate that much memory, because apparently it isn’t available.
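For reference, raising the ceiling just means passing a bigger -Xmx to the JVM; when shelling out from PHP it would look something like this (the jar, grammar, and input file names are illustrative):

<?php
// Sketch: invoke the Stanford parser with a 512 MB heap instead of
// the Scripts default of 128 MB. On Scripts this allocation fails;
// on my own machines it succeeds.
$cmd = 'java -Xmx512m -cp stanford-parser.jar '
     . 'edu.stanford.nlp.parser.lexparser.LexicalizedParser '
     . 'englishPCFG.ser.gz sentences.txt';
$output = shell_exec($cmd);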

Now, the Stanford parser is better at memory management. For the same parse, the JVM committed 242.64 MB and only actually used 147.1 MB, so the Stanford parser is better suited to running on the Scripts server. However, as the parsing script (parse_entries.php) ran through all 5,865 rows of entry data, I noticed that some of the longer sentences produced the error message “Sentence skipped: no PCFG fallback. SENTENCE_SKIPPED_OR_UNPARSABLE”.

This was before I did any memory tests, so I thought the Stanford parser was choking on malformed text: my HTML-stripping regexp was failing, or I wasn’t decoding HTML entities like &quot; to “, or the parser wasn’t handling special whitespace characters well. I fixed those, but the errors still persisted. It turns out the Stanford parser fails on longer sentences on the Scripts server due to lack of sufficient memory.
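The fixes amounted to something like this (a sketch of the idea, not the exact parse_entries.php code; clean_entry is a name I made up):

<?php
// Sketch: clean an entry's HTML before handing it to the parser.
function clean_entry($html) {
    // Strip tags instead of relying on a hand-rolled regexp.
    $text = strip_tags($html);
    // Decode entities such as &quot; into real characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Normalize non-breaking spaces and other special whitespace.
    $text = preg_replace('/[\x{00A0}\x{2000}-\x{200B}]/u', ' ', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}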

On longer sentences, Windows commits around 400+ MB of RAM to the parser (though it only uses around 255 MB). Either way, I think it must be running out of memory on the Scripts server.

The solution is simply to run the PHP script locally, on my own computer, so that the parsing isn’t subject to the memory ceiling. But my Apache installation broke after I updated PHP, so I currently can’t run any PHP scripts locally. As an alternative, I let the parser run through all of the rows, but added the ids of entries that produced the memory error to a new table I called hyperlinks_bad_entries. Of the 5,864 entries, 2,642 had errors, so that’s close to half. When doing constituency tests, I can probably ignore the bad entries for the time being.
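Ignoring them should just be a matter of filtering against that table (hyperlinks_bad_entries is real; the entries table and column names are my own guesses):

<?php
// Sketch: run constituency tests only on entries that did not hit
// the memory error.
$result = mysql_query(
    "SELECT * FROM entries
     WHERE id NOT IN (SELECT entry_id FROM hyperlinks_bad_entries)"
);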

Getting My Feet Wet

Today (12:00-5:30), I downloaded both the Stanford Parser and the Berkeley Parser and parsed the following text I found on the Stanford NLP site:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

From it I got these two corresponding trees: Stanford parse and Berkeley parse.

They both do pretty well, though after comparing the two, it’s clear that Stanford’s is better for this one sentence (that’s probably why they put it on their website: it produces an accurate result with their parser). The main difference is that the Berkeley parser mishandles the first past participle after “Mumbai”: “snapped …” is NOT supposed to be an adjunct to “Mumbai”, but rather one verb phrase in a series of verb phrases whose subject is “The strongest rain”. Additionally, the Berkeley parser produces an odd SBAR node (which isn’t documented, but is most likely intended to be an S’ projection, judging by its name).

It seems that the Stanford parser may indeed be “better”, but this can only be confirmed by parsing more sentences.

Meanwhile, I also looked at the Penn Treebank project to hopefully find more information on the strange node labels the Stanford (and the Berkeley) parsers use. I found a PostScript file containing the label descriptions and converted it to a PDF here.

The Penn Treebank seems to make distinctions among certain parts of speech that are rather unnecessary (at least for my own uses).