Sentence-splitting: again

Last week, after pushing the DP-structure and sentence-splitting changes, we reran the parser over the whole corpus. We discovered that about 10% of the links that were formerly marked as constituents are no longer marked as such. I went through about 25 of them, and almost none were originally false positives. Parser errors account for roughly 4 of those 10 percentage points, leaving about 6 points of real regressions; those are mostly attributable to the sentence splitter being stupid, with a few remaining parser problems.

As a result, I’ve replaced the sentence-splitter with Sebastian Nagel’s tokenizer, which also does sentence-splitting. It seems to work better: more of our test cases pass, which is a good thing. I’m trying to find some more computing resources so we can test the different combinations of these improvements, because we aren’t sure how they interact; a rough sketch of that combination testing is below.
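
To make that concrete, here’s a minimal sketch of the combination testing, in PHP to match the rest of the pipeline. The change names and run_test_suite() are hypothetical stand-ins; the real version would reparse the corpus with each combination enabled and report how it scores.

<?php
// Sketch only: enumerate every on/off combination of the recent
// changes and score each one. run_test_suite() is a placeholder for
// reparsing the corpus and measuring (e.g. the constituency percentage).

$changes = ['dp_structure', 'new_splitter', 'vadas_nps']; // hypothetical names

function run_test_suite(array $enabled): float {
    return 0.0; // placeholder score
}

$n = count($changes);
for ($mask = 0; $mask < (1 << $n); $mask++) {
    $enabled = [];
    for ($i = 0; $i < $n; $i++) {
        if ($mask & (1 << $i)) {
            $enabled[] = $changes[$i];
        }
    }
    printf("[%s] => %.1f%%\n", implode(', ', $enabled), run_test_suite($enabled));
}

With three changes that’s only eight runs, so the cost is dominated by the reparsing itself, which is why the extra computing resources matter.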

Sentence splitting and DP structure

I just pushed sentence-splitting code up to the repository. parse_entries.php now splits each paragraph into sentences before feeding them to the parser, which makes a lot more sense, as the parser did not handle multi-sentence paragraphs well. We’re using Adwait Ratnaparkhi’s MXTERMINATOR sentence-splitter; a sketch of the new flow is below.
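
For illustration, here’s roughly what the new flow looks like, as a minimal self-contained sketch. split_sentences() and parse_sentence() are hypothetical wrappers: the real code calls out to MXTERMINATOR and to the parser, and the trivial regex split below exists only so the sketch runs on its own.

<?php
// Sketch only: split each paragraph into sentences first, then hand
// the parser one sentence at a time.

function split_sentences(string $paragraph): array {
    // Stand-in for the MXTERMINATOR call; a naive split on
    // sentence-final punctuation followed by whitespace.
    return preg_split('/(?<=[.!?])\s+/', trim($paragraph));
}

function parse_sentence(string $sentence): string {
    // Stand-in for the real parser invocation.
    return '(PARSE ' . $sentence . ')';
}

$paragraph = 'Links are hyperlinks. Some of them are constituents.';
foreach (split_sentences($paragraph) as $sentence) {
    echo parse_sentence($sentence), "\n";
}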

We’ve also retrained the parser and reparsed the whole database, using David Vadas’ NP structure additions to the Penn Treebank. Together, the two changes have increased the constituency percentage by about 6 percentage points, which is slightly less than I expected.
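
As a point of reference, here is how one might compute that constituency percentage, assuming it is the fraction of links whose token span exactly matches some constituent span in the parse; spans are written as [start, end) token offsets, and both arrays are taken as given. This is a sketch of the metric under that assumption, not the project’s actual implementation.

<?php
// Sketch only: a link counts as a constituent when its token span
// exactly matches a constituent span from the parse tree.

function constituency_percentage(array $linkSpans, array $constituentSpans): float {
    $matched = 0;
    foreach ($linkSpans as $span) {
        if (in_array($span, $constituentSpans)) {
            $matched++;
        }
    }
    return count($linkSpans) > 0 ? 100.0 * $matched / count($linkSpans) : 0.0;
}

// Toy example: two of the three links line up with constituents.
$links = [[0, 2], [3, 5], [6, 9]];
$constituents = [[0, 2], [2, 5], [6, 9]];
printf("%.1f%%\n", constituency_percentage($links, $constituents)); // 66.7%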

Introduction and Negative Classification

Hello! My name’s Patrick Hulin, and I’m a new UROP working on the project. I’m a freshman at MIT, and I’m probably going to be studying mathematics.

My first task has been classifying non-constituent hyperlinks. I’ve gone through about 50 of them, chosen randomly. Since the non-constituents are split up into failure categories, I stratified the sample, taking a number of links from each category proportional to that category’s size, as sketched below.
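
The sampling itself is straightforward to sketch in PHP; the category names and counts below are made up, and per-category rounding means a draw can come out slightly above or below the requested total.

<?php
// Sketch only: from links grouped by failure category, draw a
// per-category count proportional to that category's share.

function stratified_sample(array $byCategory, int $total): array {
    $all = array_sum(array_map('count', $byCategory));
    $sample = [];
    foreach ($byCategory as $category => $links) {
        $k = (int) round($total * count($links) / $all);
        shuffle($links);
        $sample[$category] = array_slice($links, 0, $k);
    }
    return $sample;
}

// Hypothetical categories; draw ~50 links overall.
$byCategory = [
    'extra_determiner' => range(1, 300),
    'partial_np'       => range(1, 150),
    'crosses_clause'   => range(1, 50),
];
print_r(array_map('count', stratified_sample($byCategory, 50)));

Here’s the final data: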
