Sentence-splitting: again.

Last week, after pushing the DP-structure and sentence-splitting changes, we reran the parser over the whole corpus. We discovered that about 10% of the links that were formerly marked as constituents are no longer marked as such. I went through about 25 of them, and hardly any had been false positives to begin with. Many of the errors were due to parser errors (about 4%, leaving about 6% as real issues). The others were mainly attributable to the sentence splitter splitting badly, along with some further issues with the parser.

As a result, I’ve replaced the sentence splitter with Sebastian Nagel’s tokenizer, which also does sentence splitting. It seems to work better: more of our test cases pass. I’m trying to find some more computing resources so we can test different combinations of the improvements, because we aren’t sure how they interact.
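
To make "different combinations of the improvements" concrete, here is a minimal sketch of how the runs could be enumerated. The flag names `dp_structure` and `new_sentence_splitter` are made up for illustration and are not the names used in our code; each resulting configuration would correspond to one full parser run over the corpus.

```python
from itertools import product

# Hypothetical flag names for the two changes discussed in this post.
IMPROVEMENTS = ["dp_structure", "new_sentence_splitter"]


def all_configurations(names):
    """Return one dict per on/off combination of the given improvements."""
    return [
        dict(zip(names, flags))
        for flags in product([False, True], repeat=len(names))
    ]


configs = all_configurations(IMPROVEMENTS)
for config in configs:
    # Each config would be passed to one (expensive) corpus run.
    print(config)
```

With two improvements this is only four runs, but the count doubles with every extra change, which is why the computing resources matter.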
