Polishing Here and There

functions.php is mostly complete. The constituency testing with node labeling, missing node determination, and punctuation-secondary passes, are working quite well.

I copied the database back to my own to run tests to ensure the code worked. The results so far are quite satisfactory. Looking through logs of stderr with stdout, I noticed that some of the unknown errors were in fact due to inline HTML in the hyperlink text, including tags such as “<i>” (and of course, “</i>”). I now invoke stripTags() on the link text before generating the regexp pattern, though my stripTags() is a very simple preg_replace() with a simple regexp.

In hindsight, it’ll miss self-closing tags like “<br />”, but somehow, I doubt people will be using HTML much (especially the line break element, since a simple keyboard return will have the same effect, and links tend to only span one line anyways) in Mefi entries. However, it’ll also have some false positives, though I doubt anyone would ever type in a string like “< and >”.

Here are some preliminary stats from my test run on my own database compared to the unaltered constituency database:

anguyen+linguistics
+-----------------------+---------------------+
| constituency          | COUNT(constituency) |
+-----------------------+---------------------+
| constituent           |               18219 |
| error                 |                2644 |
| multiple_constituents |                5023 |
| not_constituent       |                5295 |
+-----------------------+---------------------+

constituency+hyperlinks
+-----------------------+---------------------+
| constituency          | count(constituency) |
+-----------------------+---------------------+
| constituent           |               18163 |
| error                 |                2647 |
| multiple_constituents |                5014 |
| not_constituent       |                5357 |
+-----------------------+---------------------+

constituent+56
error-3
multiple_constituents+9
not_constituent-62

Note, this is without the HTML stripping, so we can expect to have even fewer errors, in subsequent runs. Other errors included “)” and “/” in the PHP warnings when it parsed the link patterns, but I have no clue where they came from. I’ll check it later.