From only 6-7 “non-constituent” links that I hand-checked, we saw a pattern of hyperlinks that were nearly constituents but left off adjuncts such as prepositional phrases.
Since then, I have hand-checked about 60 non-constituent links and have seen very similar patterns.
The results mostly show complements/adjuncts (I didn’t attempt to distinguish between the two) being left out of the hyperlink. Other results show that determiners are being left out of the hyperlink. This, however, is expected. If we make trees according to DP theory (where NPs are complement to DPs), then this would be fine and dandy. However, the Stanford trees have their determiners (DT) as part of the noun phrase (possibly in the specifier position, if there was such a distinction here). As a result, links with missing determiners are incorrectly judged as non-constituents.
In other cases, leading adjectives were also left out. For example: “a slimy, warty, green frog” but the author would only make “green frog” or “warty, green frog” a hyperlink and leave out (predictably) the determiner and (maybe unpredictably) one or more leading adjectives. In this case, “slimy” or “slimy, warty” were left out. I believe AdjPs are adjunct to the NP (so most substrings of that string should be constituents), but here, the noun phrase structure given by the parser makes it so constituency doesn’t happen for such substrings.
Surprisingly, a large number of these links were simply victims of erroneous parsing and data preparation. Where the Stanford parser fails the most is in comma-delimited lists and the like. Items in conjoined sequences are most definitely constituents, but these tend to fail.
Another fault of the Stanford parser (and perhaps my own) is that final punctuation caused constituents to be judged as non-constituents. Take for example a hypothetical link with the text (without quotes) “the water.” presumably as the direct object of the verb. What happens is that “the water” is parsed normally and if you checked it for constituency yourself, it would be a constituent. But, the additional period “.” at the end is part of the string. Why is this a problem? It’s because, even though punctuation is given its own node in the tree, it is normally placed at the very end of the tree, outside of every node, i.e. not where we expect it to be.
We would expect “the water.” to match something like:
(DT the) (NN water) (. .))))…
Here, there probably would be at least a single node dominating either just “the water” or “the water.” nodes and that would give us our constituent. However, the Stanford parser usually parses the phrase like so:
(DT the) (NN water))))…(. .)
And this is not what we expect at all. A simple (but not perfect) fix would be to probably strip out all punctuation save for quotation marks (though these cause problems as well) and perhaps commas when creating the regular expression pattern for a link.
Below is a list of my findings and comments. Anything marked with “incorrect” is what I think should be a constituent under a manual parse, but was misjudged. Anything without an “incorrect” was judged correctly as a non-constituent. I added various comments to try to categorize each kind of failure. I initially didn’t note what kind of phrase was left off from complement/adjunct-less hyperlinks (noted as “C/A chopped off”).
87943+1 = incorrect, det chopped off
87943+2 = C/A chopped off
87943+8 = C/A chopped off
87944+0 = C/A chopped off
87944+5 = C/A chopped off
87944+6 = incorrect, stanford tree is wrong
87944+9 = incorrect, constituent
87947+0 = incorrect, final punctuation was included
87949+2 = C/A chopped off, stanford tree is ambiguous
87949+3 = incorrect, final punctuation was included
87952+3 = incorrect, stanford tree is wrong
87952+4 = C/A chopped off
87952+8 = have no clue, a string containing the name of the news source?
87952+15 = incorrect, final punctuation was included
87954+3 = incorrect, final punctuation was included
87954+4 = incorrect, final punctuation was included
87956+1 = incorrect, det chopped off (in NP->D system, correct, else DP->NP, incorrect)
87959+0 = C/A chopped off
87960+0 = C/A (appositive) chopped off
87960+2 = incorrect, det chopped off
87961+3 = incorrect, stanford tree is ambiguous
87962+0 = C/A (adverb/past participle) chopped off
87962+2 = C/A (non-restrictive? relative) chopped off
87963+2 = incorrect, stanford tree is wrong: misparsed an undelimited list
87963+3 = incorrect, stanford tree is wrong: misparsed an undelimited list
87964+0 = incorrect, stanford tree is wrong: misparsed strange (SLYT), just bad entry formatting
87971+2 = incorrect, stanford tree is wrong: misparsed list and used verb variant of the noun
87971+4 = C/A (second half of conjunction nested in VP) chopped off
87972+0 = incorrect, stanford tree is wrong
87974+6 = incorrect, stanford tree is wrong, misparsed verbal arguments, misplaced preposition
87975+0 = incorrect, stanford tree is wrong, misparsed topicalization/clefting, something like that
87976+1 = incorrect, initial punctuation (") was included
87976+2 = incorrect, final punctuation (.") was included
87976+3 = incorrect, final punctuation (.) was included
87977+0 = complement to P (of) chopped off...strange..."most of"
87978+1 = C/A (preposition) chopped off
87981+0 = C/A (appositive) chopped off
87981+4 = C/A (preposition) chopped off, stanford tree is wrong, misparsed "of" possession
87982+8 = C/A (preposition) chopped off, det ("'s" possession) chopped off
87982+9 = det and leading adjective chopped off
87983+1 = C/A (second half of conjuction) chopped off
87984+0 = C/A (appositive) chopped off
87986+2 = C/A (preposition) chopped off, det and leading adjective chopped off
87986+4 = incorrect, stanford tree is wrong, misparsed participle attachment after complete VP
87987+4 = det and leading adjectives chopped off
87987+26 = following adjectives and noun head chopped off...strange..."the only"
87987+36 = incorrect, no clue, should be a constituent
87988+9 = incorrect, stanford tree is wrong, misparsed list of things
87988+11 = C/A (other parts of conjunction, comma-list) chopped off
87989+1 = det chopped off
87989+2 = C/A (preposition in passive construction) chopped off
87989+7 = ??? unsure, what to do with "as"
87991+4 = C/A (other parts of conjuction nested in VP, comma-list) chopped off
87991+17 = det and leading adjectives chopped off
87991+20 = incorrect, det chopped off
87991+23 = C/A (preposition) chopped off
87991+24 = det chopped off, final punctuation included
87991+27 = vP chopped off from TP, strange..."already have (won one)"...
87991+39 = C/A (preposition) chopped off
87992+1 = C/A (preposition) chopped off, passive construction may be a problem
I also wrote some code to get the rest of an “incomplete” constituent, but haven’t fully tested it yet. It reads downwards and should be able to handle “C/A chopped off” links. As for “det chopped off” and “leading adjectives chopped off”, I just need to reverse the direction in which it looks.