Getting My Feet Wet

Today (12:00-5:30), I downloaded both the Stanford Parser and the Berkeley Parser and parsed the following text I found on the Stanford NLP site:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

This gave me two corresponding trees: the Stanford parse and the Berkeley parse.
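For reference, here is a minimal sketch of how the same sentence could be parsed programmatically with the Stanford parser, loosely following the ParserDemo class that ships with it. I’m reconstructing this from memory of the demo, so the exact class and method names may differ between releases (older versions load the grammar through a constructor rather than loadModel, for example):

```java
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class ParseDemo {
    public static void main(String[] args) {
        // Load the serialized English PCFG grammar that ships with the parser.
        LexicalizedParser lp = LexicalizedParser.loadModel("englishPCFG.ser.gz");

        // The test sentence, pre-tokenized (whitespace tokenization is enough here).
        String[] sent = ("The strongest rain ever recorded in India shut down the "
                + "financial hub of Mumbai , snapped communication lines , closed airports "
                + "and forced thousands of people to sleep in their offices or walk home "
                + "during the night , officials said today .").split(" ");

        // Wrap the tokens and parse them.
        List<CoreLabel> words = Sentence.toCoreLabelList(sent);
        Tree parse = lp.apply(words);

        // Print the tree in Penn Treebank bracketing.
        parse.pennPrint();
    }
}
```

In practice I just ran both parsers from the command line on a small text file today; the programmatic route will matter more once we start wiring the parsers into other code.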

They both do pretty well, though after comparing the two, it’s clear that Stanford’s parse is better for this particular sentence (which is probably why it appears on their website: it’s a sentence their parser gets right). The main difference is that the Berkeley parser mishandles the first past participle after “Mumbai”: it attaches “snapped …” as an adjunct to “Mumbai”, when it should instead be one verb phrase in a series of verb phrases whose subject is “The strongest rain” (see the schematic below). The Berkeley parser also produces an odd SBAR node, which isn’t documented but, judging by its name, is most likely intended as an S’ projection.
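To make the attachment difference concrete, here is a simplified schematic of the two analyses. This is my own abbreviation, not the literal output of either parser, and it leaves out the enclosing “officials said today” clause:

```
Intended structure (roughly what the Stanford tree has):
  (S (NP The strongest rain ever recorded in India)
     (VP (VP shut down the financial hub of Mumbai)
         (VP snapped communication lines)
         (VP closed airports)
         (VP forced thousands of people ...)))

Berkeley's misattachment, schematically (exact labels aside):
  (VP shut down
      (NP (NP the financial hub of Mumbai)
          (... snapped communication lines ...)))
```

In the second bracketing, “snapped …” ends up inside the object noun phrase as a modifier of “Mumbai” instead of being coordinated with the other verb phrases.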

It seems that the Stanford parser may indeed be “better”, but this can only be confirmed by parsing more sentences.

Meanwhile, I also looked at the Penn Treebank project, hoping to find more information on the unfamiliar node labels that both the Stanford and Berkeley parsers use. I found a PostScript file containing the label descriptions and converted it to a PDF here.

The Penn Treebank seems to make some rather unnecessary distinctions among parts of speech, at least for my purposes; for example, it splits verbs across six tags (VB, VBD, VBG, VBN, VBP, VBZ) and nouns across four (NN, NNS, NNP, NNPS).


4 thoughts on “Getting My Feet Wet”

  1. Re: SBAR. I don’t understand the logic of the SBAR at all. Some hypotheses I had:

    1. all subordinated or embedded clauses have to have an SBAR projection—can’t be right because there are some { VP -> X SBAR } but also { VP -> X S } .
    2. they’re trying to do strict X’ schema—can’t be right because other projections don’t have bar levels, and there’s no “SP.”
    3. SBAR is S’ but for some reason only comes up sometimes—this also is odd as there’s an { SBAR -> VP } in there too.

  2. General comments:

    Anton, did the Stanford and/or Berkeley teams say what corpora these were trained on and/or where they got their baseline rewrite rules (X -> A B, etc.) from? Perhaps they have different architectures but were trained on the same data, or maybe on different treebanks. This determines whether we’re comparing apples to apples or apples to oranges. I would also expect each to have an overview paper reporting some % correct on a standard subset of the Penn Treebank, or some similar standard metric.

    Of course, that being said, “correct based on the Penn Treebank”, or any similar measure, is not exactly “better” for our purposes. Our ultimate goal is to see whether hyperlinks actually map to the constituency we linguists posit, not necessarily what the parsers posit. This may begin with figuring out which types of projections/constructions these parsers get right uncontroversially (perhaps DPs, or VPs, and maybe some finer-grained units), so which parser we end up using may not be particularly crucial at this point, if the differences are as you highlighted above (my takeaway: in the grand scheme of things, not very different).

    Perhaps ease of coding, integration, and building on top of the parser should also be a factor; if you have a preference of that sort, I’d love to hear about it.

  3. I’m not sure. I didn’t exactly look for them; my goal that day was simply to run the parsers from the command line.

    As I look through your code, I’m starting to realize my PHP is pretty rusty (by rusty, I mean I just need to look up lots of functions and certain PHP-specific language constructs). Other than that, it’s pretty readable.

    It **may** be easier to do it all in Java… I haven’t done any database access or web requests in Java before, so that part would be new to me (there’s a rough sketch of what it might look like at the end of this thread). But we’ll see once I look through the code, especially the parser code, to see whether there’s anything else we can squeeze out of the parsers in addition to the syntax trees.

  4. According to the limited documentation on the Berkeley Parser, the English grammar file provided was trained on the Wall Street Journal (1989) portion of the Penn Treebank. I’m assuming it was the second release.

    The Stanford parser is trained on various things, depending on which grammar you supply it. I was using the WSJ grammar (trained on Wall Street Journal sections 2-21, apparently). However, on their website they always mention a generic English grammar trained on WSJ sections 1-21 and a couple of other sources, including some sentences they hand-parsed themselves and some sentences from more recent newswire.

    It seems that the Stanford parser has had more training; whether that’s good or not, I don’t know. They say that the generic English grammar should work just a *bit* better for anything that isn’t a WSJ text from the 1980s (otherwise the WSJ grammar should suffice).

    From a programming standpoint, there is a LOT of code to sift through. From the get-go, there is at least some JavaDoc documentation for the Stanford parser (there’s none for the Berkeley parser). I’ll take a closer look at the code, but since both are command-line programs, it seems most of the tweaking we’ll do will be through the command-line options.

    We could try to place print statements in certain spots within the actual code to get more internal information that we may deem useful, but only if we can find those spots.
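(Referenced from comment 3 above.) Since database access and web requests in Java came up, here is a minimal sketch of what both might look like using only the standard JDBC and java.net APIs. The database URL, table, and credentials are made-up placeholders, not anything from our actual setup, and a JDBC driver (e.g. MySQL’s) would need to be on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JavaIoSketch {
    public static void main(String[] args) throws Exception {
        // Database access via JDBC (hypothetical database "linkdb" and table "sentences").
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/linkdb", "user", "password");
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, sentence FROM sentences WHERE parsed = 0");
        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getInt("id") + ": " + rs.getString("sentence"));
        }
        rs.close();
        stmt.close();
        conn.close();

        // A simple HTTP GET with java.net (fetching an arbitrary page).
        URL url = new URL("http://nlp.stanford.edu/");
        HttpURLConnection http = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(http.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
        http.disconnect();
    }
}
```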
