Labeling

I just pushed to the repository. There is very good immediately-dominating node identification code in functions.php that hasn’t yet been utilized in the main judgement loop.

The “link” problem of finding the right subtree of all possible subtrees is handled fairly well, at least for links that have more than one or two nodes in the subtree. If they have less, then it’s up in the air, which subtree the code actually picked out. But for smaller links of one or two nodes, it is very possible that it doesn’t matter which subtree we pick up ’cause the structure will still be the same. The word “for” will pretty much always be “(IN for)” no matter what subtree it’s in; it’s not a problem if we mismatch a subtree consisting of just one (or even two) words.

I tested the node id-ing code on the first 30 non-error rows of the links table and so far, everything checks out. If the logic is correct (and I just hand-traced this myself), it should even correctly identify the node in the two-pronged case:

(X
    (α
        (A Ø)
        <<<LINK(B Ø)
    )
    (β
        (C Ø)
LINK;
        (D Ø)
    )
)

In fact, I’m very confident that it should be able to pick out the X node.

Here is a link to an output file of my test run showing correct node labels for each of the parsed links.

Getting My Feet Wet

Today (12:00-5:30), I downloaded both the Stanford Parser and the Berkeley Parser and parsed the following text I found on the Stanford NLP site:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

For which I got these two corresponding trees: Stanford parse and Berkeley parse.

They do pretty well, though after comparing the two, it’s clear that Stanford’s is better for that one sentence (that’s probably why they put it on their website, because it produces an accurate result with their parser). The differences are that the Berkeley parser messes up with the first past participle after “Mumbai”: “snapped …” which is NOT supposed to be an adjunct to “Mumbai”, but rather a verb phrase in a series of verb phrases with the subject “The strongest rain”. Additionally, the Berkeley parser produces an odd SBAR node (which isn’t documented, but is most likely intended to be an S’ projection, judging by its name).

It seems that the Stanford parser may indeed be “better”, but this can only be confirmed by parsing more sentences.

Meanwhile, I also looked at the Penn Treebank project to hopefully find more information on the strange node labels the Stanford (and the Berkeley) parsers use. I found a PostScript file containing the label descriptions and converted it to a PDF here.

The Penn Treebank seems to make (rather unnecessary) distinctions among certain parts of speech (at least for my own uses).