Labeling

I just pushed to the repository. functions.php now contains very good code for identifying the immediately dominating node of a link, but it hasn’t yet been wired into the main judgement loop.

The “link” problem of finding the right subtree among all possible subtrees is handled fairly well, at least for links with more than one or two nodes in the subtree. For links with fewer, it’s up in the air which subtree the code actually picks out, but for such small links it very possibly doesn’t matter, because the structure will be the same either way. The word “for” will pretty much always be “(IN for)” no matter what subtree it’s in; mismatching a subtree consisting of just one (or even two) words isn’t a problem.

I tested the node-identification code on the first 30 non-error rows of the links table, and so far everything checks out. If the logic is correct (and I just hand-traced it myself), it should even correctly identify the node in the two-pronged case, where the link (delimited heredoc-style below) spans terminals sitting in two different subtrees:

(X
    (α
        (A Ø)
        <<<LINK(B Ø)
    )
    (β
        (C Ø)
LINK;
        (D Ø)
    )
)

In fact, I’m very confident that it should be able to pick out the X node.
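For illustration, here is a rough sketch of one way to find that node (this is my reconstruction of the idea, not the actual functions.php code): scan the bracketed parse, and the innermost “(” that opens before the link span and is still open at its end belongs to the node that dominates every terminal in the span.

<?php
// Sketch only - not the actual functions.php implementation.
// $parse is the bracketed parse string; $start and $end are the character
// offsets of the link's span within it.
function immediatelyDominatingNode($parse, $start, $end) {
    $open = array(); // offsets of '(' still open at the current position
    for ($i = 0; $i < $end; $i++) {
        if ($parse[$i] === '(') {
            $open[] = $i;
        } elseif ($parse[$i] === ')') {
            array_pop($open);
        }
    }
    // $open now holds every node still open at the end of the span; the
    // innermost one that opened *before* the span dominates all of it.
    $dominating = null;
    foreach ($open as $offset) {
        if ($offset < $start) {
            $dominating = $offset;
        }
    }
    if ($dominating === null) {
        return null; // the span isn't inside the tree at all
    }
    preg_match('/\((\S+)/', substr($parse, $dominating), $m);
    return $m[1]; // the node label, e.g. "X" in the example above
}

On the example above, with the span running from (B Ø) through (C Ø), α has closed by the end of the span and β opened after its start, so X is the innermost node open across the whole span.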

Here is a link to an output file of my test run showing correct node labels for each of the parsed links.

Polishing Here and There

functions.php is mostly complete. Constituency testing with node labeling, missing-node determination, and the secondary punctuation passes are working quite well.

I copied the database back to my own to run tests and make sure the code worked. The results so far are quite satisfactory. Looking through the combined stderr and stdout logs, I noticed that some of the unknown errors were in fact due to inline HTML in the hyperlink text, including tags such as “<i>” (and of course “</i>”). I now invoke stripTags() on the link text before generating the regexp pattern, though my stripTags() is just a simple preg_replace() with a simple regexp.

In hindsight, it’ll miss self-closing tags like “<br />”, but I doubt people use HTML much in Mefi entries (especially the line-break element, since a simple keyboard return has the same effect, and links tend to span only one line anyway). It will also produce some false positives, though I doubt anyone would ever type a string like “< and >”.
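For reference, here’s a minimal sketch of what such a stripTags() might look like (the exact pattern is my guess; this one happens to reproduce the trade-offs just described):

<?php
// Sketch of a bare-bones stripTags(); the real pattern may differ.
function stripTags($text) {
    // Matches "<i>", "</i>", "<em>", etc. It misses "<br />" (the space
    // and trailing slash break the match) and false-positives on strings
    // like "< and >", exactly the trade-offs noted above.
    return preg_replace('/<\/?[^>\/]*>/', '', $text);
}

// e.g. stripTags('a <i>slimy</i> frog') yields 'a slimy frog'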

Here are some preliminary stats from my test run on my own database compared to the unaltered constituency database:

anguyen+linguistics
+-----------------------+---------------------+
| constituency          | COUNT(constituency) |
+-----------------------+---------------------+
| constituent           |               18219 |
| error                 |                2644 |
| multiple_constituents |                5023 |
| not_constituent       |                5295 |
+-----------------------+---------------------+

constituency+hyperlinks
+-----------------------+---------------------+
| constituency          | count(constituency) |
+-----------------------+---------------------+
| constituent           |               18163 |
| error                 |                2647 |
| multiple_constituents |                5014 |
| not_constituent       |                5357 |
+-----------------------+---------------------+

Differences (anguyen+linguistics minus constituency+hyperlinks):
constituent: +56
error: -3
multiple_constituents: +9
not_constituent: -62

Note that this was without the HTML stripping, so we can expect even fewer errors in subsequent runs. Other errors involved “)” and “/” showing up in PHP warnings when the link patterns were parsed, but I have no clue where they came from. I’ll look into it later.
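One plausible culprit (purely my speculation): if the link text is dropped into the pattern without escaping, “)” acts as a regex metacharacter and “/” as the pattern delimiter, which would produce exactly such warnings. preg_quote() would guard against that:

<?php
// Speculative fix: escape regex metacharacters and the '/' delimiter in the
// link text before embedding it in a pattern.
$pattern = '/' . preg_quote($linkText, '/') . '/';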

Non-constituent Findings

When I had hand-checked only 6-7 “non-constituent” links, a pattern had already emerged of hyperlinks that were nearly constituents but left off adjuncts such as prepositional phrases.

Since then, I have hand-checked about 60 non-constituent links and have seen very similar patterns.

The results mostly show complements/adjuncts (I didn’t attempt to distinguish between the two) being left out of the hyperlink. Other results show determiners being left out of the hyperlink. This, however, is expected: if we built trees according to DP theory (where the NP is the complement of D), this would be fine and dandy. However, the Stanford trees put the determiner (DT) inside the noun phrase (possibly in specifier position, if there were such a distinction here). As a result, links with missing determiners are incorrectly judged as non-constituents.

In other cases, leading adjectives were also left out. For example, given “a slimy, warty, green frog”, the author would hyperlink only “green frog” or “warty, green frog”, leaving out (predictably) the determiner and (maybe unpredictably) one or more leading adjectives; here, “slimy” or “slimy, warty” were dropped. I believe AdjPs are adjuncts to the NP (so most substrings of that string should be constituents), but the noun-phrase structure the parser actually produces means such substrings don’t come out as constituents.

Surprisingly, a large number of these links were simply victims of erroneous parsing and data preparation. The Stanford parser fails most often on comma-delimited lists and the like. Items in conjoined sequences are most definitely constituents, but these tend to fail.

Another fault of the Stanford parser (and perhaps of my own code) is that final punctuation causes constituents to be judged as non-constituents. Take, for example, a hypothetical link with the text (without quotes) “the water.” presumably serving as the direct object of a verb. “the water” is parsed normally, and if you checked it for constituency yourself, it would be a constituent; but the additional period “.” at the end is part of the string. Why is this a problem? Because even though punctuation is given its own node in the tree, that node is normally placed at the very end of the tree, outside every other node, i.e. not where we expect it to be.

We would expect “the water.” to match something like:

(DT the) (NN water) (. .))))…

Here, there probably would be at least a single node dominating either just “the water” or “the water.” nodes and that would give us our constituent. However, the Stanford parser usually parses the phrase like so:

(DT the) (NN water))))…(. .)

And this is not what we expect at all. A simple (but not perfect) fix would probably be to strip out all punctuation, save for quotation marks (though those cause problems as well) and perhaps commas, when creating the regular expression pattern for a link.
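Something along these lines might do, applied right before the pattern is built (the function name is mine):

<?php
// Sketch: drop sentence punctuation from the link text before building its
// regexp pattern, keeping quotation marks and commas as discussed above.
function stripLinkPunctuation($linkText) {
    return preg_replace('/[.!?;:]+/', '', $linkText);
}

// e.g. stripLinkPunctuation('the water.') yields 'the water'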

Below is a list of my findings and comments. Anything marked “incorrect” is what I think should be a constituent under a manual parse but was misjudged. Anything without an “incorrect” was correctly judged a non-constituent. I added various comments to try to categorize each kind of failure. (Early on, I didn’t note which kind of phrase was left off of complement/adjunct-less hyperlinks, noted simply as “C/A chopped off”.)

87943+1 = incorrect, det chopped off
87943+2 = C/A chopped off
87943+8 = C/A chopped off
87944+0 = C/A chopped off
87944+5 = C/A chopped off
87944+6 = incorrect, stanford tree is wrong
87944+9 = incorrect, constituent
87947+0 = incorrect, final punctuation was included
87949+2 = C/A chopped off, stanford tree is ambiguous
87949+3 = incorrect, final punctuation was included
87952+3 = incorrect, stanford tree is wrong
87952+4 = C/A chopped off
87952+8 = have no clue, a string containing the name of the news source?
87952+15 = incorrect, final punctuation was included
87954+3 = incorrect, final punctuation was included
87954+4 = incorrect, final punctuation was included
87956+1 = incorrect, det chopped off (in NP->D system, correct, else DP->NP, incorrect)
87959+0 = C/A chopped off
87960+0 = C/A (appositive) chopped off
87960+2 = incorrect, det chopped off
87961+3 = incorrect, stanford tree is ambiguous
87962+0 = C/A (adverb/past participle) chopped off
87962+2 = C/A (non-restrictive? relative) chopped off
87963+2 = incorrect, stanford tree is wrong: misparsed an undelimited list
87963+3 = incorrect, stanford tree is wrong: misparsed an undelimited list
87964+0 = incorrect, stanford tree is wrong: misparsed strange (SLYT), just bad entry formatting
87971+2 = incorrect, stanford tree is wrong: misparsed list and used verb variant of the noun
87971+4 = C/A (second half of conjunction nested in VP) chopped off
87972+0 = incorrect, stanford tree is wrong
87974+6 = incorrect, stanford tree is wrong, misparsed verbal arguments, misplaced preposition
87975+0 = incorrect, stanford tree is wrong, misparsed topicalization/clefting, something like that
87976+1 = incorrect, initial punctuation (") was included
87976+2 = incorrect, final punctuation (.") was included
87976+3 = incorrect, final punctuation (.) was included
87977+0 = complement to P (of) chopped off...strange..."most of"
87978+1 = C/A (preposition) chopped off
87981+0 = C/A (appositive) chopped off
87981+4 = C/A (preposition) chopped off, stanford tree is wrong, misparsed "of" possession
87982+8 = C/A (preposition) chopped off, det ("'s" possession) chopped off
87982+9 = det and leading adjective chopped off
87983+1 = C/A (second half of conjunction) chopped off
87984+0 = C/A (appositive) chopped off
87986+2 = C/A (preposition) chopped off, det and leading adjective chopped off
87986+4 = incorrect, stanford tree is wrong, misparsed participle attachment after complete VP
87987+4 = det and leading adjectives chopped off
87987+26 = following adjectives and noun head chopped off...strange..."the only"
87987+36 = incorrect, no clue, should be a constituent
87988+9 = incorrect, stanford tree is wrong, misparsed list of things
87988+11 = C/A (other parts of conjunction, comma-list) chopped off
87989+1 = det chopped off
87989+2 = C/A (preposition in passive construction) chopped off
87989+7 = ??? unsure what to do with "as"
87991+4 = C/A (other parts of conjunction nested in VP, comma-list) chopped off
87991+17 = det and leading adjectives chopped off
87991+20 = incorrect, det chopped off
87991+23 = C/A (preposition) chopped off
87991+24 = det chopped off, final punctuation included
87991+27 = vP chopped off from TP, strange..."already have (won one)"...
87991+39 = C/A (preposition) chopped off
87992+1 = C/A (preposition) chopped off, passive construction may be a problem

I also wrote some code to get the rest of an “incomplete” constituent, but haven’t fully tested it yet. It reads downwards, so it should be able to handle the “C/A chopped off” links; for “det chopped off” and “leading adjectives chopped off”, I just need to reverse the direction in which it looks.
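As an untested sketch of the idea (the names are my own invention): once the dominating node is known, everything it spans minus the link’s own terminals is the chopped-off material.

<?php
// Untested sketch: diff the terminals under the dominating node against the
// terminals the link covers; the remainder is what was "chopped off".
function missingWords($nodeSpan, $linkSpan) {
    // terminals in Stanford output look like "(TAG word)"
    preg_match_all('/\(\S+ ([^()]+)\)/', $nodeSpan, $all);
    preg_match_all('/\(\S+ ([^()]+)\)/', $linkSpan, $link);
    return array_values(array_diff($all[1], $link[1]));
}

// For "(NP (DT the) (JJ green) (NN frog))" with a link over "(NN frog)",
// this yields array('the', 'green').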

More Issues

Just kidding! I have success now. There was weird stuff going on with PHP, but I got that fixed up and ran the fix_entries.php script on my remote Windows machine. Unlike the Stanford parser on Scripts, the Stanford parser on a Windows 7 machine with only 512 MB of RAM worked almost like a charm: entries that were previously unparsable due to the memory ceiling were now mostly parsable. (It still choked on some entries, because my remote system doesn’t have much RAM to work with.) And since it worked on my remote system, I tried it on my own laptop, where it’s running right now, and entries that ran out of memory on the remote system are parsing correctly! (The ones that still aren’t parsing correctly are, on inspection, “strange” entries with lists of book titles and the like.)

Everything seems to be going swimmingly. At some point, I’m going to have to rerun the constituency script (or modify it to skip ones it already has judgements for).
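One easy way to do the skipping (table and column names here are guesses at my own schema) is to pull only the rows that don’t have a judgement yet:

<?php
// Sketch: rerun the constituency script on unjudged links only.
// Table and column names are illustrative.
$result = mysql_query("SELECT * FROM links WHERE constituency IS NULL");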

Coding and Parsing

So I’ve cleaned up some code for scraping Metafilter.com and have set up a SQL database to hold it all. I’ve run the code, and it’s definitely scraping and populating the database correctly. The only problem is that the PHP script stops running after a while; there’s probably a script timeout somewhere that I should raise to let it run (almost) indefinitely. I haven’t yet tested the constituency code and will be doing that now. Once that code has been tested for accuracy, I’ll go back to something I started last week.

I wanted to get around the PHP script timeout (I can probably assume there’s a variable that controls it), but I also wanted to see how hard it would be to implement a similar program in Java. So far, writing the same Metafilter scraper in Java hasn’t been so easy.

First, I hate streams; everything in Java is streams. They never give you a simple “loader” that hands you all of the loaded data at once. Another issue I came across when running the PHP code was a TON of HTML warnings, malformed markup, etc., and I’ve read that Java’s Swing HTML parser (and SAX-based XML parsing) isn’t too reliable for the real-life HTML you find on websites. Fortunately, I found the Mozilla HTML Parser, for which someone has created a Java wrapper (the parser itself is written in C++), and am currently using that in conjunction with dom4j.

So I have that set up; I just need to write some regular expressions (I hope Java’s implementation is at least similar to PHP’s) to pull out the data, plus some code to push it to the database. If successful, I’m sure I could just let the Java program run forever.

My immediate goals are to write the code, make sure it works, and run it. After I’ve got some parses and constituency tests to look at, I can begin to think about the failures and how we can rate constituency.

Also, since the Stanford and Berkeley parsers are trained on the WSJ portion of the Penn Treebank, it may be helpful to find a news source that is like Metafilter.com but with WSJ-style writing (maybe this is impossible). The parses would be much more accurate, because that’s the register the parsers were trained on.

Oh, I should also set up version control (Mercurial rather than Subversion) on Google Code project hosting. I’ve been having terrible luck with SVN recently; maybe it’s time to try Mercurial.

EDIT: max_execution_time in php.ini controls the maximum script execution time. The default is 30 seconds, though server setups (Apache configurations, say) may impose other defaults (say, 300 seconds). I set it to 600 seconds (10 minutes).
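The same thing can be done at runtime at the top of the script, which avoids editing php.ini:

<?php
ini_set('max_execution_time', 600); // 10 minutes, same as the php.ini change
// set_time_limit(0);               // or lift the limit entirely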

I’m curious about the entries that produce HTML warnings and the ones that report “no content on vwxyz”. I wonder whether there really is no content; maybe I should keep track of which entries have warnings and which have “no content”. Additionally, there seems to be code missing for stripping HTML tags before the text is fed to the parsers, but that should just be an easy regular expression anyway.

Consistency

My goals yesterday from 10am to 3pm were to read over a text by Birgitta Bexten called Salience in Hypertext and begin looking over some code that Mitcho had already written. My ultimate goal is to begin scraping hyperlinks and text from Metafilter.com as soon as possible and begin determining the constituency of the hyperlinks using the Stanford and Berkeley parsers.

I got almost nothing from Bexten, except the observation that hyperlinks (at least in her examples) can be both constituents and non-constituents; beyond that, the text suggested pretty much nothing.

I took a quick peek at the constituency-determining code, and from a first glance (and from our prior meeting), I believe all it does is check for an equal number of left and right parens (the parsers’ version of the square brackets used in bracket notation). I got confused as to why that alone should suffice, so I did some research and reading to refresh my definition of a constituent. Yeah, there are constituency tests based on grammaticality judgements, but simple code can’t do all of that. In Carnie’s Syntax: A Generative Introduction (2007), the final definition of a constituent is “a set of terminal nodes exhaustively dominated by a particular node”. A matched span with balanced parentheses covers only complete subtrees, so to a first approximation you can indeed just do parens balancing to determine constituency.

That should mean that any single word (head) is a constituent as long as there aren’t any complements to it.
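Here’s a minimal sketch of the balancing check as I understand it (my reconstruction, not the code I was reading): scan the span of the parse matched by the link, and make sure no node closes that wasn’t opened inside the span, and that every node opened inside it also closes there.

<?php
// Sketch of the parens-balancing constituency check described above.
function parensBalanced($span) {
    $depth = 0;
    for ($i = 0; $i < strlen($span); $i++) {
        if ($span[$i] === '(') {
            $depth++;
        } elseif ($span[$i] === ')') {
            $depth--;
            if ($depth < 0) {
                return false; // closes a node that opened outside the span
            }
        }
    }
    return $depth === 0; // true iff the span covers only complete subtrees
}

// e.g. parensBalanced('(DT the) (NN water)') is true;
//      parensBalanced('water) (. .)') is false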

One thing I read in Bexten that confused me was on page 15, where she gives an example of a text and a hyperlink in German: “Bei Simulationen handelt es sich um spezielle interaktive Programme, die dynamische Modelle von Apparaten, Prozessen und Systemen abbilden”, which translates to “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”

Bexten claims that “It also occurs that … not a whole constituent is link-marked but only, e.g., an adjective.” That implies to me that the adjective isn’t a constituent.

This is what confused me and sent me back to refresh my definition of a constituent. By throwing the English sentence into the Stanford parser (I’m assuming DPs, AdjPs, and NPs work similarly in German; that assumption may not even be necessary), I got a tree in which the adjective was very much a constituent (by parens balancing). I also hand-traced the syntax tree for the DP “special interactive programs”, and by Carnie’s definition it is a constituent:

[DP
    [D'
        [D Ø]
        [NP
            [N'
                [AdjP
                    [Adj'
                        [Adj special]
                    ]
                ]
                [N'
                    [AdjP
                        [Adj'
                            [Adj interactive]
                        ]
                    ]
                    [N'
                        [N programs]
                    ]
                ]
            ]
        ]
    ]
]

It’s not a big deal; I just want to make sure my definition of a constituent is correct, because Bexten made it seem like the adjective wasn’t one.

Constituency in Hyperlinks

Hi everyone! My name is Anton! I’m a rising junior at MIT in the Department of Linguistics. This blog consists of notes and logs of my UROP project at the MIT Department of Linguistics. You may find part of my project proposal below:

The internet has proven to be an incredible medium of communication. As with any medium of communication, be it text, audio, or video, there is an extraordinary corpus of linguistic data on the internet, primarily in the form of text, ranging from captions on pictures to weblog entries. In 2008, software engineers at Google declared that there were at least one trillion unique URLs on the web (Alpert, 2008). With modest amounts of text on each page, these webpages altogether provide an enormous body of linguistic utterances that linguists can study.

Of course, much of this linguistic data is no different from data recorded in elicitations with language consultants or from the everyday speech of the masses. However, the web has certain aspects that the spoken word does not. One particular characteristic of the web is the hyperlink. A hyperlink is any visual element that redirects the user to another location presenting additional content. Though hyperlinks are not limited to plain text, the vast majority of active hyperlinks are indeed plain text. By seemingly natural convention, the text that constitutes a hyperlink is semantically related to the content to which it redirects the user. There is consequently some semantic meaning to be retrieved from the text of a hyperlink. Because of this, one may speculate that the texts of hyperlinks are linguistic constituents; hyperlinks may naturally delimit whole units of syntactic data.

My work will investigate the constituency of hyperlink text and determine whether or not the text composing a hyperlink is in fact a constituent. I will be mining hyperlinks from bodies of text on social news websites and possibly other online communities with large bodies of text and hyperlinks. With these hyperlinks, I will manually and programmatically determine whether or not they are constituents. If they are indeed constituents, then my findings may further support the existence of constituents and the practice of segmenting linguistic utterances into syntactic constituents.