Labeling

I just pushed to the repository. There is very good immediately-dominating-node identification code in functions.php that hasn’t yet been utilized in the main judgement loop.

The “link” problem of finding the right subtree among all possible subtrees is handled fairly well, at least for links that have more than one or two nodes in the subtree. For links with fewer nodes, it’s up in the air which subtree the code actually picked out. But for such small links of one or two nodes, it’s very possible that the choice doesn’t matter, because the structure will be the same either way. The word “for” will pretty much always be “(IN for)” no matter which subtree it’s in; it’s not a problem if we mismatch a subtree consisting of just one (or even two) words.

I tested the node-identification code on the first 30 non-error rows of the links table, and so far everything checks out. If the logic is correct (and I just hand-traced it myself), it should even correctly identify the node in the two-pronged case:

(X
    (α
        (A Ø)
        <<<LINK(B Ø)
    )
    (β
        (C Ø)
LINK;
        (D Ø)
    )
)

In fact, I’m very confident that it should be able to pick out the X node.
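If I’ve understood the problem right, the two-pronged case reduces to finding the deepest node that dominates every word of the link. Here is a Python sketch of that search; the real code is the PHP in functions.php, and the parser, helper names, and toy tree below are mine, with real words standing in for the Ø placeholders:

```python
import re

def parse(s):
    """Minimal reader for Penn-style bracketed trees: a node is
    (label, children); leaves are plain word strings."""
    toks = re.findall(r'\(|\)|[^\s()]+', s)
    def walk(i):                      # toks[i] == '('
        label, i = toks[i + 1], i + 2
        children = []
        while toks[i] != ')':
            if toks[i] == '(':
                child, i = walk(i)
            else:
                child, i = toks[i], i + 1
            children.append(child)
        return (label, children), i + 1
    return walk(0)[0]

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in leaves(c)]

def lowest_dominating(node, link_words):
    """Label of the deepest node whose leaves include every link word
    (assumes the link's words occur only once in the sentence)."""
    targets = set(link_words)
    covering = [c for c in node[1] if targets <= set(leaves(c))]
    if len(covering) == 1 and not isinstance(covering[0], str):
        return lowest_dominating(covering[0], link_words)
    return node[0] if targets <= set(leaves(node)) else None

# the two-pronged case from above, a link spanning "b" through "c"
tree = parse("(X (ALPHA (A a) (B b)) (BETA (C c) (D d)))")
print(lowest_dominating(tree, ["b", "c"]))  # X
```

On this tree the search walks past the α and β nodes, since neither one covers both prongs, and stops at X, matching the hand trace above.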

Here is a link to an output file of my test run showing correct node labels for each of the parsed links.

Polishing Here and There

functions.php is mostly complete. The constituency test, with its node-labeling, missing-node-determination, and secondary punctuation passes, is working quite well.

I copied the database back to my own so I could run tests and make sure the code worked. The results so far are quite satisfactory. Looking through the combined stderr/stdout logs, I noticed that some of the unknown errors were in fact due to inline HTML in the hyperlink text, including tags such as “<i>” (and of course, “</i>”). I now invoke stripTags() on the link text before generating the regexp pattern, though my stripTags() is just a preg_replace() with a simple regexp.

In hindsight, that regexp will miss self-closing tags like “<br />”, but somehow I doubt people will be using much HTML in Mefi entries (especially the line-break element, since a plain keyboard return has the same effect, and links tend to span only one line anyway). It will also produce some false positives, though I doubt anyone would ever type in a string like “< and >”.
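For reference, the whole tag-stripping step can be done with one substitution. This is a Python sketch, not my actual PHP stripTags(), whose exact pattern differs; the pattern here would also catch self-closing tags, while keeping the “< and >” false positive:

```python
import re

def strip_tags(text):
    """Crude one-regexp tag stripper: removes anything between '<' and '>'.
    This also catches self-closing tags like <br />, but false-positives
    on literal text such as "< and >"."""
    return re.sub(r'<[^>]*>', '', text)

print(strip_tags('see <i>this</i> trick'))   # see this trick
print(strip_tags('one<br />two'))            # onetwo   (self-closing tag handled)
print(strip_tags('x < and > y'))             # x  y     (false positive)
```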

Here are some preliminary stats from my test run on my own database compared to the unaltered constituency database:

anguyen+linguistics
+-----------------------+---------------------+
| constituency          | COUNT(constituency) |
+-----------------------+---------------------+
| constituent           |               18219 |
| error                 |                2644 |
| multiple_constituents |                5023 |
| not_constituent       |                5295 |
+-----------------------+---------------------+

constituency+hyperlinks
+-----------------------+---------------------+
| constituency          | count(constituency) |
+-----------------------+---------------------+
| constituent           |               18163 |
| error                 |                2647 |
| multiple_constituents |                5014 |
| not_constituent       |                5357 |
+-----------------------+---------------------+

constituent           +56
error                  -3
multiple_constituents  +9
not_constituent       -62

Note that this run was without the HTML stripping, so we can expect even fewer errors in subsequent runs. Other errors involved “)” and “/” turning up in PHP warnings when the link patterns were parsed, but I have no clue where they came from. I’ll check it later.

Non-constituent Findings

From only 6-7 “non-constituent” links that I hand-checked, we saw a pattern of hyperlinks that were nearly constituents but left off adjuncts such as prepositional phrases.

Since then, I have hand-checked about 60 non-constituent links and have seen very similar patterns.

The results mostly show complements/adjuncts (I didn’t attempt to distinguish between the two) being left out of the hyperlink. Other results show determiners being left out of the hyperlink. This, however, is expected. If trees were built according to DP theory (where the NP is the complement of a D head), this would be fine and dandy. However, the Stanford trees place their determiners (DT) inside the noun phrase (possibly in specifier position, if such a distinction were made here). As a result, links with missing determiners are incorrectly judged as non-constituents.

In other cases, leading adjectives were also left out. For example, given “a slimy, warty, green frog”, the author would make only “green frog” or “warty, green frog” the hyperlink, leaving out (predictably) the determiner and (maybe unpredictably) one or more leading adjectives; in this case, “slimy” or “slimy, warty”. I believe AdjPs are adjuncts to the NP (so most substrings of that string should be constituents), but the flat noun-phrase structure the parser produces means no such substring gets a node of its own.
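To see why those substrings fail, it helps to check them against a flat Stanford-style NP directly. A Python sketch (toy tree and helper names are mine, not the functions.php code):

```python
import re

def parse(s):
    """Minimal Penn-bracket reader: node = (label, children), leaf = word."""
    toks = re.findall(r'\(|\)|[^\s()]+', s)
    def walk(i):                      # toks[i] == '('
        label, i, children = toks[i + 1], i + 2, []
        while toks[i] != ')':
            child, i = walk(i) if toks[i] == '(' else (toks[i], i + 1)
            children.append(child)
        return (label, children), i + 1
    return walk(0)[0]

def leaves(node):
    return [node] if isinstance(node, str) else [w for c in node[1] for w in leaves(c)]

def subtree_with_leaves(node, words):
    """Return a subtree whose leaves are exactly `words`, else None."""
    if isinstance(node, str):
        return None
    if leaves(node) == words:
        return node
    for c in node[1]:
        found = subtree_with_leaves(c, words)
        if found:
            return found
    return None

# a flat Stanford-style NP: every adjective hangs directly off the NP
np = parse('(NP (DT a) (JJ slimy) (, ,) (JJ warty) (, ,) (JJ green) (NN frog))')
print(subtree_with_leaves(np, ['green', 'frog']))       # None: no node of its own
print(subtree_with_leaves(np, leaves(np)) is not None)  # True: only the full NP matches
```

Because every adjective is a direct child of NP, the only multi-word leaf sequence with a dominating node is the full NP itself; “green frog” and “warty, green frog” come up empty.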

Surprisingly, a large number of these links were simply victims of erroneous parsing and data preparation. The Stanford parser fails most often on comma-delimited lists and the like: items in conjoined sequences are most definitely constituents, but these tend to fail anyway.

Another fault of the Stanford parser (and perhaps of my own code) is that final punctuation caused constituents to be judged as non-constituents. Take, for example, a hypothetical link with the text (without quotes) “the water.”, presumably as the direct object of a verb. “The water” is parsed normally, and if you checked it for constituency yourself, it would be a constituent. But the additional period “.” at the end is part of the string. Why is this a problem? Because even though punctuation is given its own node in the tree, it is normally placed at the very end of the tree, outside of every other node, i.e. not where we expect it to be.

We would expect “the water.” to match something like:

(DT the) (NN water) (. .))))…

Here, there probably would be at least a single node dominating either just “the water” or “the water.” nodes and that would give us our constituent. However, the Stanford parser usually parses the phrase like so:

(DT the) (NN water))))…(. .)

And this is not what we expect at all. A simple (but not perfect) fix would probably be to strip out all punctuation, save for quotation marks (though these cause problems as well) and perhaps commas, when creating the regular expression pattern for a link.
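That fix might look something like the following Python sketch (hypothetical; the real pattern-building is in PHP, and the kept-character set is just the one suggested above):

```python
import re

def clean_link_text(text):
    """Hypothetical pre-processing step: replace punctuation, except
    quotation marks and commas, with spaces before building the
    link-matching pattern."""
    return re.sub(r'[^\w\s,"\']+', ' ', text).strip()

print(clean_link_text('the water.'))            # the water
print(clean_link_text('"the water," he said'))  # "the water," he said
```

Applied before regexp generation, “the water.” would then match the (DT the) (NN water) span regardless of where the parser hung the final period.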

Below is a list of my findings and comments. Anything marked with “incorrect” is what I think should be a constituent under a manual parse, but was misjudged. Anything without an “incorrect” was judged correctly as a non-constituent. I added various comments to try to categorize each kind of failure. I initially didn’t note what kind of phrase was left off from complement/adjunct-less hyperlinks (noted as “C/A chopped off”).

87943+1 = incorrect, det chopped off
87943+2 = C/A chopped off
87943+8 = C/A chopped off
87944+0 = C/A chopped off
87944+5 = C/A chopped off
87944+6 = incorrect, stanford tree is wrong
87944+9 = incorrect, constituent
87947+0 = incorrect, final punctuation was included
87949+2 = C/A chopped off, stanford tree is ambiguous
87949+3 = incorrect, final punctuation was included
87952+3 = incorrect, stanford tree is wrong
87952+4 = C/A chopped off
87952+8 = have no clue, a string containing the name of the news source?
87952+15 = incorrect, final punctuation was included
87954+3 = incorrect, final punctuation was included
87954+4 = incorrect, final punctuation was included
87956+1 = incorrect, det chopped off (in NP->D system, correct, else DP->NP, incorrect)
87959+0 = C/A chopped off
87960+0 = C/A (appositive) chopped off
87960+2 = incorrect, det chopped off
87961+3 = incorrect, stanford tree is ambiguous
87962+0 = C/A (adverb/past participle) chopped off
87962+2 = C/A (non-restrictive? relative) chopped off
87963+2 = incorrect, stanford tree is wrong: misparsed an undelimited list
87963+3 = incorrect, stanford tree is wrong: misparsed an undelimited list
87964+0 = incorrect, stanford tree is wrong: misparsed strange (SLYT), just bad entry formatting
87971+2 = incorrect, stanford tree is wrong: misparsed list and used verb variant of the noun
87971+4 = C/A (second half of conjunction nested in VP) chopped off
87972+0 = incorrect, stanford tree is wrong
87974+6 = incorrect, stanford tree is wrong, misparsed verbal arguments, misplaced preposition
87975+0 = incorrect, stanford tree is wrong, misparsed topicalization/clefting, something like that
87976+1 = incorrect, initial punctuation (") was included
87976+2 = incorrect, final punctuation (.") was included
87976+3 = incorrect, final punctuation (.) was included
87977+0 = complement to P (of) chopped off...strange..."most of"
87978+1 = C/A (preposition) chopped off
87981+0 = C/A (appositive) chopped off
87981+4 = C/A (preposition) chopped off, stanford tree is wrong, misparsed "of" possession
87982+8 = C/A (preposition) chopped off, det ("'s" possession) chopped off
87982+9 = det and leading adjective chopped off
87983+1 = C/A (second half of conjunction) chopped off
87984+0 = C/A (appositive) chopped off
87986+2 = C/A (preposition) chopped off, det and leading adjective chopped off
87986+4 = incorrect, stanford tree is wrong, misparsed participle attachment after complete VP
87987+4 = det and leading adjectives chopped off
87987+26 = following adjectives and noun head chopped off...strange..."the only"
87987+36 = incorrect, no clue, should be a constituent
87988+9 = incorrect, stanford tree is wrong, misparsed list of things
87988+11 = C/A (other parts of conjunction, comma-list) chopped off
87989+1 = det chopped off
87989+2 = C/A (preposition in passive construction) chopped off
87989+7 = ??? unsure, what to do with "as"
87991+4 = C/A (other parts of conjunction nested in VP, comma-list) chopped off
87991+17 = det and leading adjectives chopped off
87991+20 = incorrect, det chopped off
87991+23 = C/A (preposition) chopped off
87991+24 = det chopped off, final punctuation included
87991+27 = vP chopped off from TP, strange..."already have (won one)"...
87991+39 = C/A (preposition) chopped off
87992+1 = C/A (preposition) chopped off, passive construction may be a problem

I also wrote some code to get the rest of an “incomplete” constituent, but haven’t fully tested it yet. It reads downwards and should be able to handle “C/A chopped off” links. As for “det chopped off” and “leading adjectives chopped off”, I just need to reverse the direction in which it looks.
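That reversed-direction pass could be sketched like this in Python (helper names and toy tree are mine; the project code is PHP): from the node matched for a link, collect the leaves of the siblings to its left under the parent.

```python
import re

def parse(s):
    """Minimal Penn-bracket reader: node = (label, children), leaf = word."""
    toks = re.findall(r'\(|\)|[^\s()]+', s)
    def walk(i):                      # toks[i] == '('
        label, i, children = toks[i + 1], i + 2, []
        while toks[i] != ')':
            child, i = walk(i) if toks[i] == '(' else (toks[i], i + 1)
            children.append(child)
        return (label, children), i + 1
    return walk(0)[0]

def leaves(node):
    return [node] if isinstance(node, str) else [w for c in node[1] for w in leaves(c)]

def missing_left_material(root, target):
    """Words under siblings to the LEFT of `target` beneath its parent,
    e.g. a chopped-off determiner or leading adjectives."""
    if isinstance(root, str):
        return None
    kids = root[1]
    for i, c in enumerate(kids):
        if c is target:
            return [w for sib in kids[:i] for w in leaves(sib)]
        found = missing_left_material(c, target)
        if found is not None:
            return found
    return None

# link text was just "frog"; recover the chopped-off determiner and adjective
np = parse('(NP (DT the) (JJ green) (NN frog))')
frog = np[1][2]                            # the (NN frog) node
print(missing_left_material(np, frog))     # ['the', 'green']
```

Swapping kids[:i] for kids[i + 1:] gives the downward/rightward direction needed for the “C/A chopped off” links.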