Last week, after pushing the DP-structure and sentence-splitting changes, we reran the parser over the whole corpus. We discovered that about 10% of the links formerly marked as constituents are no longer marked as such. I went through about 25 of them, and essentially none were originally false positives. About 4 of those 10 percentage points were due to parser errors; the remaining 6 points were real issues, mainly attributable to the sentence splitter handling the text badly, along with a few further parser problems.
As a result, I’ve replaced the sentence splitter with Sebastian Nagel’s tokenizer, which also does sentence splitting. It seems to work better: more of our test cases now pass. I’m trying to find some more computing resources so we can test different combinations of the improvements, because we aren’t sure how they interrelate.
I just pushed sentence-splitting code up to the repository. parse_entries.php now splits sentences before feeding them to the parser, which makes a lot more sense, as the parser did not handle multi-sentence paragraphs well. We’re using Adwait Ratnaparkhi’s MXTERMINATOR sentence-splitter.
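The pipeline change itself is simple to sketch. Here is a minimal Python illustration of the idea; the regex-based splitter below is a hypothetical stand-in, not our actual code (which lives in parse_entries.php and calls MXTERMINATOR, a far more robust splitter that handles abbreviations and edge cases):

```python
import re

def split_sentences(paragraph):
    """Naive sentence splitter: break on ., ?, or ! followed by
    whitespace and a capital letter. A toy stand-in for MXTERMINATOR."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', paragraph)
    return [s.strip() for s in parts if s.strip()]

def parse_paragraph(paragraph, parse):
    # Feed the parser one sentence at a time instead of the whole
    # paragraph, since the parser mishandles multi-sentence input.
    return [parse(sentence) for sentence in split_sentences(paragraph)]
```

The point is only the shape of the pipeline: split first, then parse each sentence independently.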
We’ve also retrained the parser and reparsed the whole database, using David Vadas’ NP structure additions to the Penn Treebank. Together, the two changes have increased the constituency percentage by about 6 points, which is slightly less than I expected.
Hello! My name’s Patrick Hulin, and I’m a new UROP working on the project. I’m a freshman at MIT, and I’m probably going to be studying mathematics.
My first task has been classifying non-constituent hyperlinks. I’ve gone through about 50 of them, chosen randomly. Since non-constituents are split up into failure categories, I stratified by category, taking a number of links from each category proportional to its size. Here’s the final data:
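The sampling step can be sketched as proportional stratified sampling. This is a hypothetical reconstruction, not our actual script; the function name and category dictionary are illustrative assumptions:

```python
import random

def proportional_sample(links_by_category, total=50, seed=0):
    """Draw a sample whose per-category counts are proportional to
    category sizes (illustrative sketch of the stratified sampling)."""
    rng = random.Random(seed)
    n = sum(len(links) for links in links_by_category.values())
    sample = {}
    for category, links in links_by_category.items():
        # Allocate this category's share of the total sample.
        k = round(total * len(links) / n)
        sample[category] = rng.sample(links, min(k, len(links)))
    return sample
```

Rounding means the category counts may not sum exactly to the requested total, which is fine for a hand-inspection task like this one.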