Labeling

I just pushed to the repository. There is very good immediately-dominating-node identification code in functions.php that hasn’t yet been utilized in the main judgement loop.

The “link” problem of finding the right subtree among all possible subtrees is handled fairly well, at least for links that have more than one or two nodes in the subtree. For links with fewer nodes, it’s up in the air which subtree the code actually picked out. But for such small links of one or two nodes, it’s very possible that the choice doesn’t matter, because the structure will be the same either way. The word “for” will pretty much always be “(IN for)” no matter which subtree it’s in; it’s not a problem if we mismatch a subtree consisting of just one (or even two) words.

I tested the node-identification code on the first 30 non-error rows of the links table, and so far everything checks out. If the logic is correct (and I just hand-traced it myself), it should even correctly identify the node in the two-pronged case:

(X
    (α
        (A Ø)
        <<<LINK(B Ø)
    )
    (β
        (C Ø)
LINK;
        (D Ø)
    )
)

In fact, I’m very confident that it should be able to pick out the X node.
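If I’ve understood the problem right, the two-pronged case reduces to finding the deepest node that dominates every word of the link. Here is a Python sketch of that search; the real code is the PHP in functions.php, and the parser, helper names, and toy tree below are mine, with real words standing in for the Ø placeholders:

```python
import re

def parse(s):
    """Minimal reader for Penn-style bracketed trees: a node is
    (label, children); leaves are plain word strings."""
    toks = re.findall(r'\(|\)|[^\s()]+', s)
    def walk(i):                      # toks[i] == '('
        label, i = toks[i + 1], i + 2
        children = []
        while toks[i] != ')':
            if toks[i] == '(':
                child, i = walk(i)
            else:
                child, i = toks[i], i + 1
            children.append(child)
        return (label, children), i + 1
    return walk(0)[0]

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in leaves(c)]

def lowest_dominating(node, link_words):
    """Label of the deepest node whose leaves include every link word
    (assumes the link's words occur only once in the sentence)."""
    targets = set(link_words)
    covering = [c for c in node[1] if targets <= set(leaves(c))]
    if len(covering) == 1 and not isinstance(covering[0], str):
        return lowest_dominating(covering[0], link_words)
    return node[0] if targets <= set(leaves(node)) else None

# the two-pronged case from above, a link spanning "b" through "c"
tree = parse("(X (ALPHA (A a) (B b)) (BETA (C c) (D d)))")
print(lowest_dominating(tree, ["b", "c"]))  # X
```

On this tree the search walks past the α and β nodes, since neither one covers both prongs, and stops at X, matching the hand trace above.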

Here is a link to an output file of my test run showing correct node labels for each of the parsed links.

Polishing Here and There

functions.php is mostly complete. The constituency test, with its node-labeling, missing-node-determination, and secondary punctuation passes, is working quite well.

I copied the database back to my own so I could run tests and make sure the code worked. The results so far are quite satisfactory. Looking through the combined stderr/stdout logs, I noticed that some of the unknown errors were in fact due to inline HTML in the hyperlink text, including tags such as “<i>” (and of course, “</i>”). I now invoke stripTags() on the link text before generating the regexp pattern, though my stripTags() is just a preg_replace() with a simple regexp.

In hindsight, that regexp will miss self-closing tags like “<br />”, but somehow I doubt people will be using much HTML in Mefi entries (especially the line-break element, since a plain keyboard return has the same effect, and links tend to span only one line anyway). It will also produce some false positives, though I doubt anyone would ever type in a string like “< and >”.
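For reference, the whole tag-stripping step can be done with one substitution. This is a Python sketch, not my actual PHP stripTags(), whose exact pattern differs; the pattern here would also catch self-closing tags, while keeping the “< and >” false positive:

```python
import re

def strip_tags(text):
    """Crude one-regexp tag stripper: removes anything between '<' and '>'.
    This also catches self-closing tags like <br />, but false-positives
    on literal text such as "< and >"."""
    return re.sub(r'<[^>]*>', '', text)

print(strip_tags('see <i>this</i> trick'))   # see this trick
print(strip_tags('one<br />two'))            # onetwo   (self-closing tag handled)
print(strip_tags('x < and > y'))             # x  y     (false positive)
```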

Here are some preliminary stats from my test run on my own database compared to the unaltered constituency database:

anguyen+linguistics
+-----------------------+---------------------+
| constituency          | COUNT(constituency) |
+-----------------------+---------------------+
| constituent           |               18219 |
| error                 |                2644 |
| multiple_constituents |                5023 |
| not_constituent       |                5295 |
+-----------------------+---------------------+

constituency+hyperlinks
+-----------------------+---------------------+
| constituency          | count(constituency) |
+-----------------------+---------------------+
| constituent           |               18163 |
| error                 |                2647 |
| multiple_constituents |                5014 |
| not_constituent       |                5357 |
+-----------------------+---------------------+

constituent           +56
error                  -3
multiple_constituents  +9
not_constituent       -62

Note that this run was without the HTML stripping, so we can expect even fewer errors in subsequent runs. Other errors involved “)” and “/” turning up in PHP warnings when the link patterns were parsed, but I have no clue where they came from. I’ll check it later.

Non-constituent Findings

From only 6-7 “non-constituent” links that I hand-checked, we saw a pattern of hyperlinks that were nearly constituents but left off adjuncts such as prepositional phrases.

Since then, I have hand-checked about 60 non-constituent links and have seen very similar patterns.

The results mostly show complements/adjuncts (I didn’t attempt to distinguish between the two) being left out of the hyperlink. Other results show determiners being left out of the hyperlink. This, however, is expected. If trees were built according to DP theory (where the NP is the complement of a D head), this would be fine and dandy. However, the Stanford trees place their determiners (DT) inside the noun phrase (possibly in specifier position, if such a distinction were made here). As a result, links with missing determiners are incorrectly judged as non-constituents.

In other cases, leading adjectives were also left out. For example, given “a slimy, warty, green frog”, the author would make only “green frog” or “warty, green frog” the hyperlink, leaving out (predictably) the determiner and (maybe unpredictably) one or more leading adjectives; in this case, “slimy” or “slimy, warty”. I believe AdjPs are adjuncts to the NP (so most substrings of that string should be constituents), but the flat noun-phrase structure the parser produces means no such substring gets a node of its own.
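To see why those substrings fail, it helps to check them against a flat Stanford-style NP directly. A Python sketch (toy tree and helper names are mine, not the functions.php code):

```python
import re

def parse(s):
    """Minimal Penn-bracket reader: node = (label, children), leaf = word."""
    toks = re.findall(r'\(|\)|[^\s()]+', s)
    def walk(i):                      # toks[i] == '('
        label, i, children = toks[i + 1], i + 2, []
        while toks[i] != ')':
            child, i = walk(i) if toks[i] == '(' else (toks[i], i + 1)
            children.append(child)
        return (label, children), i + 1
    return walk(0)[0]

def leaves(node):
    return [node] if isinstance(node, str) else [w for c in node[1] for w in leaves(c)]

def subtree_with_leaves(node, words):
    """Return a subtree whose leaves are exactly `words`, else None."""
    if isinstance(node, str):
        return None
    if leaves(node) == words:
        return node
    for c in node[1]:
        found = subtree_with_leaves(c, words)
        if found:
            return found
    return None

# a flat Stanford-style NP: every adjective hangs directly off the NP
np = parse('(NP (DT a) (JJ slimy) (, ,) (JJ warty) (, ,) (JJ green) (NN frog))')
print(subtree_with_leaves(np, ['green', 'frog']))       # None: no node of its own
print(subtree_with_leaves(np, leaves(np)) is not None)  # True: only the full NP matches
```

Because every adjective is a direct child of NP, the only multi-word leaf sequence with a dominating node is the full NP itself; “green frog” and “warty, green frog” come up empty.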

Surprisingly, a large number of these links were simply victims of erroneous parsing and data preparation. The Stanford parser fails most often on comma-delimited lists and the like: items in conjoined sequences are most definitely constituents, but these tend to fail anyway.

Another fault of the Stanford parser (and perhaps of my own code) is that final punctuation caused constituents to be judged as non-constituents. Take, for example, a hypothetical link with the text (without quotes) “the water.”, presumably as the direct object of a verb. “The water” is parsed normally, and if you checked it for constituency yourself, it would be a constituent. But the additional period “.” at the end is part of the string. Why is this a problem? Because even though punctuation is given its own node in the tree, it is normally placed at the very end of the tree, outside of every other node, i.e. not where we expect it to be.

We would expect “the water.” to match something like:

(DT the) (NN water) (. .))))…

Here, there probably would be at least a single node dominating either just “the water” or “the water.” nodes and that would give us our constituent. However, the Stanford parser usually parses the phrase like so:

(DT the) (NN water))))…(. .)

And this is not what we expect at all. A simple (but not perfect) fix would probably be to strip out all punctuation, save for quotation marks (though these cause problems as well) and perhaps commas, when creating the regular expression pattern for a link.
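That fix might look something like the following Python sketch (hypothetical; the real pattern-building is in PHP, and the kept-character set is just the one suggested above):

```python
import re

def clean_link_text(text):
    """Hypothetical pre-processing step: replace punctuation, except
    quotation marks and commas, with spaces before building the
    link-matching pattern."""
    return re.sub(r'[^\w\s,"\']+', ' ', text).strip()

print(clean_link_text('the water.'))            # the water
print(clean_link_text('"the water," he said'))  # "the water," he said
```

Applied before regexp generation, “the water.” would then match the (DT the) (NN water) span regardless of where the parser hung the final period.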

Below is a list of my findings and comments. Anything marked with “incorrect” is what I think should be a constituent under a manual parse, but was misjudged. Anything without an “incorrect” was judged correctly as a non-constituent. I added various comments to try to categorize each kind of failure. I initially didn’t note what kind of phrase was left off from complement/adjunct-less hyperlinks (noted as “C/A chopped off”).

87943+1 = incorrect, det chopped off
87943+2 = C/A chopped off
87943+8 = C/A chopped off
87944+0 = C/A chopped off
87944+5 = C/A chopped off
87944+6 = incorrect, stanford tree is wrong
87944+9 = incorrect, constituent
87947+0 = incorrect, final punctuation was included
87949+2 = C/A chopped off, stanford tree is ambiguous
87949+3 = incorrect, final punctuation was included
87952+3 = incorrect, stanford tree is wrong
87952+4 = C/A chopped off
87952+8 = have no clue, a string containing the name of the news source?
87952+15 = incorrect, final punctuation was included
87954+3 = incorrect, final punctuation was included
87954+4 = incorrect, final punctuation was included
87956+1 = incorrect, det chopped off (in NP->D system, correct, else DP->NP, incorrect)
87959+0 = C/A chopped off
87960+0 = C/A (appositive) chopped off
87960+2 = incorrect, det chopped off
87961+3 = incorrect, stanford tree is ambiguous
87962+0 = C/A (adverb/past participle) chopped off
87962+2 = C/A (non-restrictive? relative) chopped off
87963+2 = incorrect, stanford tree is wrong: misparsed an undelimited list
87963+3 = incorrect, stanford tree is wrong: misparsed an undelimited list
87964+0 = incorrect, stanford tree is wrong: misparsed strange (SLYT), just bad entry formatting
87971+2 = incorrect, stanford tree is wrong: misparsed list and used verb variant of the noun
87971+4 = C/A (second half of conjunction nested in VP) chopped off
87972+0 = incorrect, stanford tree is wrong
87974+6 = incorrect, stanford tree is wrong, misparsed verbal arguments, misplaced preposition
87975+0 = incorrect, stanford tree is wrong, misparsed topicalization/clefting, something like that
87976+1 = incorrect, initial punctuation (") was included
87976+2 = incorrect, final punctuation (.") was included
87976+3 = incorrect, final punctuation (.) was included
87977+0 = complement to P (of) chopped off...strange..."most of"
87978+1 = C/A (preposition) chopped off
87981+0 = C/A (appositive) chopped off
87981+4 = C/A (preposition) chopped off, stanford tree is wrong, misparsed "of" possession
87982+8 = C/A (preposition) chopped off, det ("'s" possession) chopped off
87982+9 = det and leading adjective chopped off
87983+1 = C/A (second half of conjunction) chopped off
87984+0 = C/A (appositive) chopped off
87986+2 = C/A (preposition) chopped off, det and leading adjective chopped off
87986+4 = incorrect, stanford tree is wrong, misparsed participle attachment after complete VP
87987+4 = det and leading adjectives chopped off
87987+26 = following adjectives and noun head chopped off...strange..."the only"
87987+36 = incorrect, no clue, should be a constituent
87988+9 = incorrect, stanford tree is wrong, misparsed list of things
87988+11 = C/A (other parts of conjunction, comma-list) chopped off
87989+1 = det chopped off
87989+2 = C/A (preposition in passive construction) chopped off
87989+7 = ??? unsure, what to do with "as"
87991+4 = C/A (other parts of conjunction nested in VP, comma-list) chopped off
87991+17 = det and leading adjectives chopped off
87991+20 = incorrect, det chopped off
87991+23 = C/A (preposition) chopped off
87991+24 = det chopped off, final punctuation included
87991+27 = vP chopped off from TP, strange..."already have (won one)"...
87991+39 = C/A (preposition) chopped off
87992+1 = C/A (preposition) chopped off, passive construction may be a problem

I also wrote some code to get the rest of an “incomplete” constituent, but haven’t fully tested it yet. It reads downwards and should be able to handle “C/A chopped off” links. As for “det chopped off” and “leading adjectives chopped off”, I just need to reverse the direction in which it looks.
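That reversed-direction pass could be sketched like this in Python (helper names and toy tree are mine; the project code is PHP): from the node matched for a link, collect the leaves of the siblings to its left under the parent.

```python
import re

def parse(s):
    """Minimal Penn-bracket reader: node = (label, children), leaf = word."""
    toks = re.findall(r'\(|\)|[^\s()]+', s)
    def walk(i):                      # toks[i] == '('
        label, i, children = toks[i + 1], i + 2, []
        while toks[i] != ')':
            child, i = walk(i) if toks[i] == '(' else (toks[i], i + 1)
            children.append(child)
        return (label, children), i + 1
    return walk(0)[0]

def leaves(node):
    return [node] if isinstance(node, str) else [w for c in node[1] for w in leaves(c)]

def missing_left_material(root, target):
    """Words under siblings to the LEFT of `target` beneath its parent,
    e.g. a chopped-off determiner or leading adjectives."""
    if isinstance(root, str):
        return None
    kids = root[1]
    for i, c in enumerate(kids):
        if c is target:
            return [w for sib in kids[:i] for w in leaves(sib)]
        found = missing_left_material(c, target)
        if found is not None:
            return found
    return None

# link text was just "frog"; recover the chopped-off determiner and adjective
np = parse('(NP (DT the) (JJ green) (NN frog))')
frog = np[1][2]                            # the (NN frog) node
print(missing_left_material(np, frog))     # ['the', 'green']
```

Swapping kids[:i] for kids[i + 1:] gives the downward/rightward direction needed for the “C/A chopped off” links.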