Statistics

After rerunning the “missing_tree” links, I tallied the links up:

mysql> SELECT stanford, COUNT(stanford) FROM `hyperlinks_links` GROUP BY stanford;
+----------------------------------+-----------------+
| stanford                         | COUNT(stanford) |
+----------------------------------+-----------------+
| almost_constituent:-LRB-         |               1 |
| almost_constituent:ADJP          |              17 |
| almost_constituent:ADVP          |              14 |
| almost_constituent:CC            |               1 |
| almost_constituent:CD            |              41 |
| almost_constituent:FRAG          |               5 |
| almost_constituent:JJ            |              57 |
| almost_constituent:NN            |              58 |
| almost_constituent:NNP           |              41 |
| almost_constituent:NNPS          |               6 |
| almost_constituent:NNS           |              43 |
| almost_constituent:NP            |             283 |
| almost_constituent:PP            |               8 |
| almost_constituent:PRT           |               1 |
| almost_constituent:QP            |               3 |
| almost_constituent:RB            |               3 |
| almost_constituent:ROOT          |              70 |
| almost_constituent:S             |              15 |
| almost_constituent:SBAR          |              16 |
| almost_constituent:SBARQ         |               1 |
| almost_constituent:VBN           |               1 |
| almost_constituent:VP            |              31 |
| almost_constituent:X             |               4 |
| almost_constituent_2ndpass_:NN   |               1 |
| almost_constituent_2ndpass_:NNP  |               1 |
| almost_constituent_2ndpass_:NP   |               5 |
| almost_constituent_2ndpass_:PRP$ |               1 |
| constituent                      |           17435 |
| missing_tree                     |            1338 |
| multiple_constituents            |            5014 |
| not_constituent                  |            5056 |
| unknown_error                    |            1309 |
| xclausal                         |             301 |
+----------------------------------+-----------------+

+---------------------------+-------+
| Variable                  | Value |
+---------------------------+-------+
| Almost constituents       |   728 |
| Constituents              | 17435 |
| Multiple constituents     |  5014 |
| Non-constituents          |  5056 |
| Cross-clausal links       |   301 |
| Links with missing trees  |  1338 |
| Links with unknown errors |  1309 |
+---------------------------+-------+

Tallying further, we say that the intended and actual constituents are the sum of the almost-constituent, constituent, and multiple-constituent links. This sum is 23177.

The ultimately non-constituent links are the sum of the non-constituent and cross-clausal links: 5357.

The grand total of correctly parsed links (the sum of those two figures) is 28534.

So intended/actual constituents accounted for 81.23% of the correctly parsed links, while non-constituents accounted for 18.77%.
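As a sanity check, here’s a minimal sketch of how those sums and percentages could be recomputed from the GROUP BY results above; the bucketing by label prefix is just an illustration, not the actual tally script.

<?php
// $counts maps values of the `stanford` column to their COUNT()s, as in the table above.
$almost = 0;
foreach ($counts as $label => $n) {
    if (strpos($label, 'almost_constituent') === 0) {
        $almost += $n; // 728
    }
}
$intended = $almost + $counts['constituent'] + $counts['multiple_constituents']; // 23177
$nonconst = $counts['not_constituent'] + $counts['xclausal'];                    // 5357
$parsed   = $intended + $nonconst;                                               // 28534
printf("constituents: %.2f%%\n", 100 * $intended / $parsed);     // 81.23%
printf("non-constituents: %.2f%%\n", 100 * $nonconst / $parsed); // 18.77%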

More Issues

Just kidding! I have success now. There was weird stuff going on with PHP, but I got that fixed up and ran the fix_entries.php script on my remote Windows machine. Unlike on Scripts, the Stanford parser on a Windows 7 machine with only 512 MB of RAM worked almost like a charm: entries that were previously unparsable due to the memory ceiling were now mostly parsable. (It still choked on some entries, because my remote system doesn’t have a lot of RAM to work with.) But since it could work on my remote system, I could run it on my own laptop, which I’m doing right now, and entries that ran out of memory on my remote system are parsing correctly! (The ones that still aren’t parsing correctly are, on inspection, “strange” entries with lists of book titles, etc.)

Everything seems to be going swimmingly. At some point, I’m going to have to rerun the constituency script (or modify it to skip ones it already has judgements for).
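Here’s a minimal sketch of that skip logic, assuming the judgement lives in the stanford column of hyperlinks_links and that unjudged rows have NULL there (the NULL convention is a guess on my part, not necessarily how the table is actually populated):

<?php
// Fetch only links that don't yet have a constituency judgement.
// Assumes NULL in `stanford` means "not yet judged" -- an illustrative guess.
$result = mysql_query("SELECT * FROM hyperlinks_links WHERE stanford IS NULL");
while ($row = mysql_fetch_assoc($result)) {
    // ... run the constituency test on $row and UPDATE its `stanford` value ...
}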

Memory Issues

After testing various things with the parsers, on my own laptop and on the Scripts servers, I modified the code to exclude Berkeley parses, not because I haven’t figured out how to get the output, but because the Berkeley parser runs out of memory.

I was reading through the Scripts blog and they mentioned something about the JVM and memory issues. In fact, they said the option -Xmx128M is the default for their JVM; that is, they initially only let the JVM allocate 128 MB of memory. This is a problem for the Berkeley parser, which runs out of memory rather quickly.

For parsing the sentence “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”, Windows commits 775.75 MB for the parser (it actually only uses 571.08 MB), which is way over the 128 MB initial allocation that Scripts imposes. Scripts claims that you can try to increase the memory to 256 MB, perhaps even 512 MB, but not 768 MB; they said it became unstable at 768 MB. When I did try it on the Scripts server with 256 MB, it still wouldn’t run. And at 512 MB, the JVM can’t allocate that much memory because apparently it isn’t available.
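For reference, the heap ceiling is controlled by the JVM’s -Xmx option; here’s a rough sketch of how a PHP script might shell out to a parser with a larger heap. The jar name and argument are placeholders, not the actual command either parser uses.

<?php
// Ask the JVM for a 512 MB heap when invoking a parser from PHP.
// "parser.jar" and $inputFile are placeholders for illustration only.
$cmd = 'java -Xmx512m -jar parser.jar ' . escapeshellarg($inputFile);
$output = shell_exec($cmd);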

Now, the Stanford parser is better at memory management: for the same parse, the JVM committed 242.64 MB and only actually used 147.1 MB, which makes it better suited to running on the Scripts server. However, with longer sentences, as the parsing script (parse_entries.php) ran through all 5,865 rows of entry data, I noticed that some of the parses failed with the error message “Sentence skipped: no PCFG fallback. SENTENCE_SKIPPED_OR_UNPARSABLE”.

This was before I did any memory tests, so I thought the Stanford parser was choking on malformed text: my HTML-stripping regexp was failing, or I wasn’t decoding HTML entities like &quot; to “, or the parser wasn’t handling special whitespace characters well. I fixed those, but the errors still persisted. It turns out the Stanford parser fails on longer sentences on the Scripts server due to insufficient memory.

On longer sentences, Windows commits 400+ MB of RAM to the parser (though the parser only uses around 255 MB). Either way, I think it must run out of memory on the Scripts server.

The solution is simply to run the PHP script locally on my own computer, where the parsing isn’t subject to the memory ceiling. But my Apache installation broke after I updated PHP, so I currently can’t run any PHP scripts locally. As an alternative, I let the parser run through all of the rows but added the entry ids of entries that hit the memory error to a new table I called hyperlinks_bad_entries. Of the 5,864 entries, 2,642 had errors, so that’s close to half. When doing constituency tests, I can probably ignore the bad entries for the time being.
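A minimal sketch of how the bad entries could be excluded from the constituency tests (hyperlinks_bad_entries is the table named above, while the hyperlinks_entries table and its column names are my guesses at the schema):

<?php
// Select only entries that are NOT listed in hyperlinks_bad_entries.
// `hyperlinks_entries`, `id`, and `entry_id` are assumed names for illustration.
$sql = "SELECT e.*
        FROM hyperlinks_entries e
        LEFT JOIN hyperlinks_bad_entries b ON b.entry_id = e.id
        WHERE b.entry_id IS NULL";
$result = mysql_query($sql);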

Coding and Parsing

So I’ve cleaned up some code for scraping Metafilter.com and have set up a SQL database to hold it all. I’ve run the code and it’s definitely populating the database correctly. The only problem is that the PHP script stops running after a while; there’s probably a script timeout somewhere that I should raise so it can run (almost) indefinitely. I haven’t yet tested the constituency code and will be doing that now. Once the code has been tested for accuracy, I will go back to something I started last week.

I wanted to get around the PHP script timeout (I can probably assume there’s a setting that controls it), but I also wanted to see how hard it would be to implement a similar program in Java. So far, writing the same Metafilter scraper in Java hasn’t been so easy.

First, I hate streams; everything in Java is in streams. They never give you a simple “loader” that hands you all of the loaded data at once. One issue I came across when running the PHP code was that there were a TON of HTML warnings, malformed markup, etc. Java’s Swing HTML parser (and SAX-based XML parsing), I’ve read, aren’t too reliable for the real-life HTML you’ll find on websites. Fortunately, I found the Mozilla HTML Parser, which someone has written a Java wrapper for (it’s originally written in C++), and am currently using that in conjunction with dom4j.

So, I have that set up; I just need to write some regular expressions (I hope Java’s implementation is at least similar to PHP’s) to pull out the data, and some code to push it to the database. If successful, I’m sure I could just let this Java program run forever.

My immediate goals are to write the code, make sure it works, and run it. After I’ve got some parses and constituency tests to look at, then I can begin to think about the failures and how we can rate constituency.

Also, since the Stanford and Berkeley parsers are trained on the WSJ portion of the Penn Treebank, it may be helpful if we could find a news source that is like Metafilter.com but has WSJ-style writing (maybe this is impossible). The results would be much more accurate if we did, because that’s what the parsers were trained on.

Oh, I should also set up version control (Subversion or Mercurial) on Google Code project hosting. I’ve been having terrible luck with SVN recently; maybe it’s time to try Mercurial.

EDIT: max_execution_time in php.ini defines the maximum script execution time. The default is 30 seconds, but server configurations such as Apache setups may have other defaults (say, 300 seconds). I set it to 600 seconds (10 minutes).
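For reference, the same limit can also be raised per script instead of globally in php.ini; both of the calls below are standard PHP, though I haven’t decided whether the scraper will use them.

<?php
// Per-script alternatives to editing php.ini (max_execution_time = 600):
ini_set('max_execution_time', 600);
// or, equivalently:
set_time_limit(600);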

I’m curious about the entries that produce HTML warnings and the ones that say “no content on vwxyz”. I wonder if there really isn’t any content. Maybe I should keep track of which entries have warnings and which have “no content”. Additionally, there’s still code missing to strip the HTML tags before I feed the text into the parsers, but that should just be an easy regular expression anyway.
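Here’s a minimal sketch of that tag-stripping step, assuming the entry body is an HTML string; whether it ends up as a bare regexp or PHP’s built-ins is still open.

<?php
// Strip markup, decode entities, and normalize whitespace before parsing.
function clean_entry_html($html) {
    $text = preg_replace('/<[^>]+>/', ' ', $html);   // crude tag-stripping regexp
    $text = html_entity_decode($text, ENT_QUOTES);   // &quot; -> ", &amp; -> &, etc.
    return trim(preg_replace('/\s+/', ' ', $text));  // collapse runs of whitespace
}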

Consistency

My goals yesterday from 10am to 3pm were to read over a text by Birgitta Bexten called Salience in Hypertext and begin looking over some code that Mitcho had already written. My ultimate goal is to begin scraping hyperlinks and text from Metafilter.com as soon as possible and begin determining the constituency of the hyperlinks using the Stanford and Berkeley parsers.

I got almost nothing from Bexten, except that she suggests hyperlinks (at least in her examples) can be constituents as well as non-constituents; beyond that, the text suggested pretty much nothing.

I took a quick peek at the constituency-determining code and, from a first glance (and from our prior meeting), I believe all it does is check for the same number of left and right parens (the parsers’ version of the square brackets used in bracket notation). I got confused as to why that was the only test, so I did some research and reading to refresh my definition of a constituent. Yes, there are constituency tests based on grammaticality judgements, but simple code can’t do all of that. In Carnie’s Syntax: A Generative Introduction (2007), the final definition of a constituent is “A set of terminal nodes exhaustively dominated by a particular node”. Since the words dominated by a single node correspond to a balanced, bracketed span in the parser output, checking that the linked words span balanced parens is a reasonable way to determine constituency.

That should mean that any single word (head) is a constituent as long as there aren’t any complements to it.
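Here’s a minimal sketch of the balancing check itself, assuming we already have the substring of the bracketed parse that spans the link’s words. It tracks nesting depth as well as the raw left/right count, which is slightly stricter than the check described above, but it’s the same idea.

<?php
// Returns true if the parentheses in $span are balanced: equal numbers of
// "(" and ")", with the nesting depth never dropping below zero.
function parens_balanced($span) {
    $depth = 0;
    for ($i = 0, $len = strlen($span); $i < $len; $i++) {
        if ($span[$i] === '(') {
            $depth++;
        } elseif ($span[$i] === ')') {
            if (--$depth < 0) {
                return false;
            }
        }
    }
    return $depth === 0;
}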

One thing I read in Bexten that confused me was on page 15, where she has an example of a text with a hyperlink in German, “Bei Simulationen handelt es sich um spezielle interaktive Programme, die dynamische Modelle von Apparaten, Prozessen und Systemen abbilden”, which translates to “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”

Bexten claims that “It also occurs that … not a whole constituent is link-marked but only, e.g., an adjective.” That implies to me that she considers the adjective not to be a constituent.

This is what confused me and caused me to refresh my definition of a constituent. By throwing the English sentence into the Stanford parser (I’m assuming that DPs, AdjPs, and NPs work similarly in German, which may not even be necessary), I got a tree in which the adjective was very much a constituent (by parens balancing). I also hand-traced the syntax tree for the DP “special interactive programs” and, by Carnie’s definition, it is a constituent:

[DP
    [D'
        [D Ø]
        [NP
            [N'
                [AdjP
                    [Adj'
                        [Adj special]
                    ]
                ]
                [N'
                    [AdjP
                        [Adj'
                            [Adj interactive]
                        ]
                    ]
                    [N'
                        [N programs]
                    ]
                ]
            ]
        ]
    ]
]

It’s not a big deal; I just want to make sure my definition of a constituent is correct, because Bexten made it seem like the adjective wasn’t one.

Getting My Feet Wet

Today (12:00-5:30), I downloaded both the Stanford Parser and the Berkeley Parser and parsed the following text I found on the Stanford NLP site:

The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

From this I got two corresponding trees: the Stanford parse and the Berkeley parse.

They both do pretty well, though after comparing the two, it’s clear that Stanford’s is better for this one sentence (that’s probably why they put it on their website: it produces an accurate result with their parser). The main difference is that the Berkeley parser mishandles the first past participle after “Mumbai”: “snapped …” is NOT supposed to be an adjunct to “Mumbai”, but rather a verb phrase in a series of verb phrases whose subject is “The strongest rain”. Additionally, the Berkeley parser produces an odd SBAR node (which isn’t documented, but is most likely intended to be an S′ projection, judging by its name).

It seems that the Stanford parser may indeed be “better”, but this can only be confirmed by parsing more sentences.

Meanwhile, I also looked at the Penn Treebank project to hopefully find more information on the strange node labels the Stanford (and the Berkeley) parsers use. I found a PostScript file containing the label descriptions and converted it to a PDF here.

The Penn Treebank seems to make (rather unnecessary) distinctions among certain parts of speech (at least for my own uses).

Constituency in Hyperlinks

Hi everyone! My name is Anton! I’m a rising junior at MIT in the Department of Linguistics. This blog consists of notes and logs of my UROP project at the MIT Department of Linguistics. You may find part of my project proposal below:

The internet has proven to be an incredible medium of communication. As with any medium of communication, be it text, audio, or video, there is an extraordinary corpus of linguistic data on the internet, primarily in the form of text, ranging from captions on pictures to weblog entries. In 2008, software engineers at Google declared that there were at least one trillion unique URLs on the web (Alpert, 2008). With modest amounts of text on each page, these webpages altogether provide an enormous body of linguistic utterances that linguists can study.

Of course, much of this linguistic data is no different from data recorded in elicitations with language consultants or from the everyday speech of the masses. However, there are certain aspects of the web that the spoken word of a person does not have. One particular characteristic of the web is the hyperlink. A hyperlink is any visual element that redirects the user to another location presenting additional content. Though hyperlinks are not limited to plain text, the vast majority of active hyperlinks are indeed plain text. By seemingly natural convention, the text that constitutes a hyperlink is semantically related to the content to which it redirects the user. There is consequently some semantic meaning to be retrieved from the text of a hyperlink. Because of this, one may speculate that the texts of hyperlinks are linguistic constituents; hyperlinks may naturally delimit whole units of syntactic data.

My work will investigate the constituency of the text of hyperlinks and determine whether or not the text that composes a hyperlink is in fact a constituent. I will be mining hyperlinks from bodies of text on social news websites and possibly other online communities with large bodies of text and hyperlinks. With these hyperlinks, I will manually and programmatically determine whether or not they are constituents. If they are indeed constituents, then my findings may provide further evidence for the existence of constituents and support the concept of segmenting linguistic utterances into syntactic constituents.