Getting My Feet Wet

Today (12:00-5:30), I downloaded both the Stanford Parser and the Berkeley Parser and parsed the following text I found on the Stanford NLP site:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

For which I got these two corresponding trees: Stanford parse and Berkeley parse.

They do pretty well, though after comparing the two, it’s clear that Stanford’s is better for that one sentence (that’s probably why they put it on their website, because it produces an accurate result with their parser). The differences are that the Berkeley parser messes up with the first past participle after “Mumbai”: “snapped …” which is NOT supposed to be an adjunct to “Mumbai”, but rather a verb phrase in a series of verb phrases with the subject “The strongest rain”. Additionally, the Berkeley parser produces an odd SBAR node (which isn’t documented, but is most likely intended to be an S’ projection, judging by its name).

It seems that the Stanford parser may indeed be “better”, but this can only be confirmed by parsing more sentences.

Meanwhile, I also looked at the Penn Treebank project to hopefully find more information on the strange node labels the Stanford (and the Berkeley) parsers use. I found a PostScript file containing the label descriptions and converted it to a PDF here.

The Penn Treebank seems to make (rather unnecessary) distinctions among certain parts of speech (at least for my own uses).