Sentence-splitting: again.

Last week, after pushing the DP-structure and sentence-splitting changes, we reran the parser over the whole corpus. We discovered that about 10% of our links which were formerly marked as constituents are now no longer marked as such. I went through about 25 of them, and there weren’t really any originally false positives. Many of the errors were due to errors with the parser (about 4%, leaving 6% real issues). The others were mainly attributable to the sentence splitter being stupid, along with some issues with the parser.

As a result, I’ve replaced the sentence-splitter with Sebastian Nagel‘s tokenizer, which also does sentence-splitting. It seems to work better – more of our test cases pass, which is a good thing. I’m trying to find some more computing resources so we can test different combinations of the improvements, because we aren’t sure how they interrelate.

Sentence splitting and DP structure

I just pushed sentence-splitting code up to the repository. parse_entries.php now splits sentences before feeding them to the parser, which makes a lot more sense, as the parser did not handle multi-sentence paragraphs well. We’re using Adwait Ratnaparkhi’s MXTERMINATOR sentence-splitter.

We’ve also retrained the parser and reparsed the whole database, using David Vadas’ NP structure additions to the Penn Treebank. The two have increased the constituency percentage by about +6%, which is slightly less than I expected.

Introduction and Negative Classification

Hello! My name’s Patrick Hulin, and I’m a new UROP working on the project. I’m a freshman at MIT, and I’m probably going to be studying mathematics.

My first task has been classifying non-constituent hyperlinks. I’ve gone through about 50 of them, chosen randomly. Since non-constituents are split up into failure categories, I separated by that and took a number of links in each category proportional to the size of that category. Here’s the final data:

Continue reading

Labeling

I just pushed to the repository. There is very good immediately-dominating node identification code in functions.php that hasn’t yet been utilized in the main judgement loop.

The “link” problem of finding the right subtree of all possible subtrees is handled fairly well, at least for links that have more than one or two nodes in the subtree. If they have less, then it’s up in the air, which subtree the code actually picked out. But for smaller links of one or two nodes, it is very possible that it doesn’t matter which subtree we pick up ’cause the structure will still be the same. The word “for” will pretty much always be “(IN for)” no matter what subtree it’s in; it’s not a problem if we mismatch a subtree consisting of just one (or even two) words.

I tested the node id-ing code on the first 30 non-error rows of the links table and so far, everything checks out. If the logic is correct (and I just hand-traced this myself), it should even correctly identify the node in the two-pronged case:

(X
    (α
        (A Ø)
        <<<LINK(B Ø)
    )
    (β
        (C Ø)
LINK;
        (D Ø)
    )
)

In fact, I’m very confident that it should be able to pick out the X node.

Here is a link to an output file of my test run showing correct node labels for each of the parsed links.

Polishing Here and There

functions.php is mostly complete. The constituency testing with node labeling, missing node determination, and punctuation-secondary passes, are working quite well.

I copied the database back to my own to run tests to ensure the code worked. The results so far are quite satisfactory. Looking through logs of stderr with stdout, I noticed that some of the unknown errors were in fact due to inline HTML in the hyperlink text, including tags such as “<i>” (and of course, “</i>”). I now invoke stripTags() on the link text before generating the regexp pattern, though my stripTags() is a very simple preg_replace() with a simple regexp.

In hindsight, it’ll miss self-closing tags like “<br />”, but somehow, I doubt people will be using HTML much (especially the line break element, since a simple keyboard return will have the same effect, and links tend to only span one line anyways) in Mefi entries. However, it’ll also have some false positives, though I doubt anyone would ever type in a string like “< and >”.

Here are some preliminary stats from my test run on my own database compared to the unaltered constituency database:

anguyen+linguistics
+-----------------------+---------------------+
| constituency          | COUNT(constituency) |
+-----------------------+---------------------+
| constituent           |               18219 |
| error                 |                2644 |
| multiple_constituents |                5023 |
| not_constituent       |                5295 |
+-----------------------+---------------------+

constituency+hyperlinks
+-----------------------+---------------------+
| constituency          | count(constituency) |
+-----------------------+---------------------+
| constituent           |               18163 |
| error                 |                2647 |
| multiple_constituents |                5014 |
| not_constituent       |                5357 |
+-----------------------+---------------------+

constituent+56
error-3
multiple_constituents+9
not_constituent-62

Note, this is without the HTML stripping, so we can expect to have even fewer errors, in subsequent runs. Other errors included “)” and “/” in the PHP warnings when it parsed the link patterns, but I have no clue where they came from. I’ll check it later.

Non-constituent Findings

From only 6-7 “non-constituent” links that I hand-checked, we saw a pattern of hyperlinks that were nearly constituents but left off adjuncts such as prepositional phrases.

Since then, I have hand-checked about 60 non-constituent links and have seen very similar patterns.

The results mostly show complements/adjuncts (I didn’t attempt to distinguish between the two) being left out of the hyperlink. Other results show that determiners are being left out of the hyperlink. This, however, is expected. If we make trees according to DP theory (where NPs are complement to DPs), then this would be fine and dandy. However, the Stanford trees have their determiners (DT) as part of the noun phrase (possibly in the specifier position, if there was such a distinction here). As a result, links with missing determiners are incorrectly judged as non-constituents.

In other cases, leading adjectives were also left out. For example: “a slimy, warty, green frog” but the author would only make “green frog” or “warty, green frog” a hyperlink and leave out (predictably) the determiner and (maybe unpredictably) one or more leading adjectives. In this case, “slimy” or “slimy, warty” were left out. I believe AdjPs are adjunct to the NP (so most substrings of that string should be constituents), but here, the noun phrase structure given by the parser makes it so constituency doesn’t happen for such substrings.

Surprisingly, a large number of these links were simply victims of erroneous parsing and data preparation. Where the Stanford parser fails the most is in comma-delimited lists and the like. Items in conjoined sequences are most definitely constituents, but these tend to fail.

Another fault of the Stanford parser (and perhaps my own) is that final punctuation caused constituents to be judged as non-constituents. Take for example a hypothetical link with the text (without quotes) “the water.” presumably as the direct object of the verb. What happens is that “the water” is parsed normally and if you checked it for constituency yourself, it would be a constituent. But, the additional period “.” at the end is part of the string. Why is this a problem? It’s because, even though punctuation is given its own node in the tree, it is normally placed at the very end of the tree, outside of every node, i.e. not where we expect it to be.

We would expect “the water.” to match something like:

(DT the) (NN water) (. .))))…

Here, there probably would be at least a single node dominating either just “the water” or “the water.” nodes and that would give us our constituent. However, the Stanford parser usually parses the phrase like so:

(DT the) (NN water))))…(. .)

And this is not what we expect at all. A simple (but not perfect) fix would be to probably strip out all punctuation save for quotation marks (though these cause problems as well) and perhaps commas when creating the regular expression pattern for a link.

Below is a list of my findings and comments. Anything marked with “incorrect” is what I think should be a constituent under a manual parse, but was misjudged. Anything without an “incorrect” was judged correctly as a non-constituent. I added various comments to try to categorize each kind of failure. I initially didn’t note what kind of phrase was left off from complement/adjunct-less hyperlinks (noted as “C/A chopped off”).

87943+1 = incorrect, det chopped off
87943+2 = C/A chopped off
87943+8 = C/A chopped off
87944+0 = C/A chopped off
87944+5 = C/A chopped off
87944+6 = incorrect, stanford tree is wrong
87944+9 = incorrect, constituent
87947+0 = incorrect, final punctuation was included
87949+2 = C/A chopped off, stanford tree is ambiguous
87949+3 = incorrect, final punctuation was included
87952+3 = incorrect, stanford tree is wrong
87952+4 = C/A chopped off
87952+8 = have no clue, a string containing the name of the news source?
87952+15 = incorrect, final punctuation was included
87954+3 = incorrect, final punctuation was included
87954+4 = incorrect, final punctuation was included
87956+1 = incorrect, det chopped off (in NP->D system, correct, else DP->NP, incorrect)
87959+0 = C/A chopped off
87960+0 = C/A (appositive) chopped off
87960+2 = incorrect, det chopped off
87961+3 = incorrect, stanford tree is ambiguous
87962+0 = C/A (adverb/past participle) chopped off
87962+2 = C/A (non-restrictive? relative) chopped off
87963+2 = incorrect, stanford tree is wrong: misparsed an undelimited list
87963+3 = incorrect, stanford tree is wrong: misparsed an undelimited list
87964+0 = incorrect, stanford tree is wrong: misparsed strange (SLYT), just bad entry formatting
87971+2 = incorrect, stanford tree is wrong: misparsed list and used verb variant of the noun
87971+4 = C/A (second half of conjunction nested in VP) chopped off
87972+0 = incorrect, stanford tree is wrong
87974+6 = incorrect, stanford tree is wrong, misparsed verbal arguments, misplaced preposition
87975+0 = incorrect, stanford tree is wrong, misparsed topicalization/clefting, something like that
87976+1 = incorrect, initial punctuation (") was included
87976+2 = incorrect, final punctuation (.") was included
87976+3 = incorrect, final punctuation (.) was included
87977+0 = complement to P (of) chopped off...strange..."most of"
87978+1 = C/A (preposition) chopped off
87981+0 = C/A (appositive) chopped off
87981+4 = C/A (preposition) chopped off, stanford tree is wrong, misparsed "of" possession
87982+8 = C/A (preposition) chopped off, det ("'s" possession) chopped off
87982+9 = det and leading adjective chopped off
87983+1 = C/A (second half of conjuction) chopped off
87984+0 = C/A (appositive) chopped off
87986+2 = C/A (preposition) chopped off, det and leading adjective chopped off
87986+4 = incorrect, stanford tree is wrong, misparsed participle attachment after complete VP
87987+4 = det and leading adjectives chopped off
87987+26 = following adjectives and noun head chopped off...strange..."the only"
87987+36 = incorrect, no clue, should be a constituent
87988+9 = incorrect, stanford tree is wrong, misparsed list of things
87988+11 = C/A (other parts of conjunction, comma-list) chopped off
87989+1 = det chopped off
87989+2 = C/A (preposition in passive construction) chopped off
87989+7 = ??? unsure, what to do with "as"
87991+4 = C/A (other parts of conjuction nested in VP, comma-list) chopped off
87991+17 = det and leading adjectives chopped off
87991+20 = incorrect, det chopped off
87991+23 = C/A (preposition) chopped off
87991+24 = det chopped off, final punctuation included
87991+27 = vP chopped off from TP, strange..."already have (won one)"...
87991+39 = C/A (preposition) chopped off
87992+1 = C/A (preposition) chopped off, passive construction may be a problem

I also wrote some code to get the rest of an “incomplete” constituent, but haven’t fully tested it yet. It reads downwards and should be able to handle “C/A chopped off” links. As for “det chopped off” and “leading adjectives chopped off”, I just need to reverse the direction in which it looks.

Statistics

After rerunning the “missing_tree” links, I tallied the links up:

mysql> SELECT stanford, COUNT(stanford) FROM `hyperlinks_links` GROUP BY stanford;
+----------------------------------+-----------------+
| stanford                         | COUNT(stanford) |
+----------------------------------+-----------------+
| almost_constituent:-LRB-         |               1 |
| almost_constituent:ADJP          |              17 |
| almost_constituent:ADVP          |              14 |
| almost_constituent:CC            |               1 |
| almost_constituent:CD            |              41 |
| almost_constituent:FRAG          |               5 |
| almost_constituent:JJ            |              57 |
| almost_constituent:NN            |              58 |
| almost_constituent:NNP           |              41 |
| almost_constituent:NNPS          |               6 |
| almost_constituent:NNS           |              43 |
| almost_constituent:NP            |             283 |
| almost_constituent:PP            |               8 |
| almost_constituent:PRT           |               1 |
| almost_constituent:QP            |               3 |
| almost_constituent:RB            |               3 |
| almost_constituent:ROOT          |              70 |
| almost_constituent:S             |              15 |
| almost_constituent:SBAR          |              16 |
| almost_constituent:SBARQ         |               1 |
| almost_constituent:VBN           |               1 |
| almost_constituent:VP            |              31 |
| almost_constituent:X             |               4 |
| almost_constituent_2ndpass_:NN   |               1 |
| almost_constituent_2ndpass_:NNP  |               1 |
| almost_constituent_2ndpass_:NP   |               5 |
| almost_constituent_2ndpass_:PRP$ |               1 |
| constituent                      |           17435 |
| missing_tree                     |            1338 |
| multiple_constituents            |            5014 |
| not_constituent                  |            5056 |
| unknown_error                    |            1309 |
| xclausal                         |             301 |
+----------------------------------+-----------------+

Variable Value
Almost constituents 728
Constituents 17435
Multiple constituents 5014
Non-constituents 5056
Cross-clausal links 301
Links with missing trees 1338
Links with unknown errors 1309

Now, we tally up further and say that intended and actual constituents are the sum of almost-constituent, constituent, and multiple constituent links. This sum is 23177.

For ultimately non-constituents, this is the sum of the non-constituents and cross-clausal links: 5357.

The grand total of correctly parsed links is 28534.

So intended/actual constituents counted for 81.23% of the correctly parsed links, while non-constituents counted for 18.77% of the correctly parsed links.

More Issues

Just kidding! I have success now. There was weird stuff going on with PHP, but I got that fixed up and I had the fix_entries.php script run on my remote Windows machine and unlike running the Stanford parser on Scripts, the Stanford parser on a Windows 7 machine with only 512 MB of RAM worked almost like a charm. Entries that were previously unparsable due to the memory ceiling were now mostly parsable. (It still choked up on some entries, because my remote system doesn’t have a lot of RAM to work with). But since it could work on my remote system, I could run on it on my own laptop, which I’m doing right now and entries that ran out of memory on my remote system are parsing correctly! (The ones that still aren’t parsing correctly are, after inspection, “strange” entries with lists of book titles, etc.).

Everything seems to be going swimmingly. At some point, I’m going to have to rerun the constituency script (or modify it to skip ones it already has judgements for).

Memory Issues

After testing various things with the parsers, on my own laptop and on the Scripts servers, I modified the code to exclude Berkeley parses, not because I haven’t figured out how to get the output, but because the Berkeley parser runs out of memory.

I was reading through the Scripts blog and they mentioned something about the JVM and memory issues. In fact, they said the option -Xmx128M is the default for their JVM, that is, they initially only let the JVM allocate 128 MB of memory. This happens to be a problem for the Berkeley parser which runs out of memory rather quickly.

For parsing the sentence, “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”, Windows commits 775.75 MB for the parser (it actually only uses 571.08 MB), which way over the 128 MB initial allocation that Scripts places. Scripts claims that you can try to increase the memory to 256 MB, perhaps even 512 MB, but not 768 MB. They said it became unstable at 768 MB. When I did try it on the Scripts server, with 256 MB, it wouldn’t run still. And at 512 MB, the JVM can’t allocate that much memory because apparently it’s not available.

Now, the Stanford parser is better at memory management. For the same parse, the JVM committed 242.64 MB and only actually used 147.1 MB. The Stanford parser is more ideal for running on the Scripts server. However, with longer sentences, as the parsing script (parse_entries.php) ran through all 5,865 rows of entry data, I noticed that some of the parses had an error message “Sentence skipped: no PCFG fallback. SENTENCE_SKIPPED_OR_UNPARSABLE”.

This was before I did any memory tests, so I thought the Stanford parser was choking up on malformed text, like my HTML-stripping regexp was failing, or I didn’t decode HTML entities like &quot to “, or the parser wasn’t handling special whitespace characters well. I fixed those, but these errors still persisted. It turns out the Stanford parser fails on longer sentences on the Scripts server due to lack of sufficient memory.

On longer sentences, Windows commits around 400+ MB of RAM (but only used around 255 MB) to the parser. Either way, I think it must run out of memory on the Scripts server.

The solution is simply to run it on my own computer, run the PHP script locally so that I can run the parsing without the memory ceiling. But my Apache installation broke after I updated PHP, so I currently can’t run any PHP scripts. As an alternative, I let the parser just run through all of the rows, but add the entry ids of entries that contained the memory error to a new table I called hyperlinks_bad_entries. Of the 5864 entries, 2,642 had errors, so that’s close to half. When doing constituency tests, I can probably ignore the bad entries for the time being.

Coding and Parsing

So I’ve cleaned up some code for scraping Metafilter.com and have setup a SQL database to hold it all. I’ve run the code and it’s definitely running the code and populating the database correctly. The only problem is that the PHP script stops running after a while, there’s probably a script timeout somewhere that I should set to have it run (almost) indefinitely. I haven’t yet tested the constituency code and will be doing that now. Once the code has been tested for accuracy, I will go back to something I started last week.

I wanted to get around the PHP script timeout (for which I can probably assume that there’s a variable that controls that), but I also wanted to see how hard it would be to implement a similar program in Java. So far, writing the same Metafilter scraper in Java hasn’t been so easy.

First, I hate streams, everything in Java is in streams. They never give you a simple “loader” that gives you all of the loaded data at once. One issue I came across when running the PHP code was that there were a TON of HTML warnings, malformed markup, etc. Java’s Swing HTML parser (and SAX-based XML parsing), I’ve read aren’t too reliable for real-life HTML that you’ll find on websites. Fortunately, I found the Mozilla HTML Parser for which someone created a Java wrapper for (it’s originally written in C++) and am currently using that (in conjunction with dom4j.

So, I have that set up, I just need to write some regular expressions (I hope Java’s implementation is at least similar to PHP’s) to pull out data and some code to push it to the database. If successful, I’m sure I could just let this Java program run forever.

My immediate goals are to write the code, make sure it works, and run it. After I’ve got some parses and constituency tests to look at, then I can begin to think about the failures and how we can rate constituency.

Also, since the Stanford and Berkeley parsers are based on the WSJ portions of the Penn Treebank, it may be helpful if we could find a news source that is like Metafilter.com, but has WSJ styled writing (maybe this is impossible). But it would be much more accurate if we did, because that’s what the parsers were trained on.

Oh, I should also set up the subversion (Mercurial) on Google Code project hosting. I’ve been having terrible luck with SVN recently. Maybe it’s time to try Mercurial.

EDIT: max_execution_time in php.ini defines script execution time. The default is 30 seconds, but server configurations like Apache servers may have other defaults (say, 300 seconds). I set it to 600 seconds (10 minutes).

I’m curious about the entries that produce HTML warnings and the ones that say “no content on vwxyz”. I wonder if there really isn’t any content. Maybe I should keep track of what entries have warnings and what entries have “no content”. Additionally, there seems to be code missing to strip the HTML tags so I can feed it into the parsers, but that should just be an easy regular expression anyways.