Last week, after pushing the DP-structure and sentence-splitting changes, we reran the parser over the whole corpus. We discovered that about 10% of the links that were formerly marked as constituents are no longer marked as such. I went through about 25 of them, and almost none had been false positives to begin with, so the links we lost were mostly correct ones. Many of the losses were due to parser errors (about 4%, leaving 6% real issues). The others were mainly attributable to the sentence splitter splitting in the wrong places, along with some further issues with the parser.
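For illustration, here is a minimal sketch of how a before/after comparison like this could be scripted. It assumes each link has a stable ID and that each parser run yields the set of IDs marked as constituents; the function name and the IDs are hypothetical, not the project's actual code.

```python
def constituent_regressions(before, after):
    """Return (lost_ids, fraction): links marked as constituents in the
    old run but not in the new one, and their share of the old set."""
    before, after = set(before), set(after)
    lost = before - after
    fraction = len(lost) / len(before) if before else 0.0
    return sorted(lost), fraction


# Hypothetical example: ten links were constituents before, nine still are.
old_run = {f"link-{i:02d}" for i in range(10)}
new_run = old_run - {"link-03"}

lost, fraction = constituent_regressions(old_run, new_run)
print(lost, fraction)
```

The sorted list of lost IDs is what one would then sample from for manual inspection.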
As a result, I’ve replaced the sentence-splitter with Sebastian Nagel’s tokenizer, which also does sentence-splitting. It seems to work better: more of our test cases pass, which is a good thing. I’m trying to find some more computing resources so we can test different combinations of the improvements, because we aren’t sure how they interact.
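Testing combinations means one corpus run per on/off setting of each improvement, which is why the computing resources matter: the number of runs doubles with every improvement added. A rough sketch of enumerating those runs, with hypothetical flag names standing in for the real changes:

```python
from itertools import product


def improvement_combinations(names):
    """Yield one dict per on/off combination of the named improvements."""
    for flags in product([False, True], repeat=len(names)):
        yield dict(zip(names, flags))


# Hypothetical flag names; the real pipeline may expose these differently.
configs = list(improvement_combinations(["dp_structure", "new_sentence_splitter"]))
for config in configs:
    print(config)
```

With two improvements that is four full runs over the corpus, and with three it is eight.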