Labeling

I just pushed to the repository. functions.php now contains solid code for identifying the node that immediately dominates a link, but it hasn’t yet been hooked into the main judgement loop.

The “link” problem of finding the right subtree among all possible subtrees is handled fairly well, at least for links whose subtrees contain more than one or two nodes. If they contain fewer, it’s up in the air which subtree the code actually picked out. But for small links of one or two nodes, it likely doesn’t matter which subtree we pick, because the structure will be the same either way. The word “for” will pretty much always parse as “(IN for)” no matter what subtree it’s in, so mismatching a subtree of just one (or even two) words is harmless. A sketch of the matching idea is below.
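To make that concrete, here is a minimal sketch of the leaf-sequence matching idea in Java. Everything here is my own illustration, not the actual functions.php code (which is PHP): the Node class and the method names are hypothetical. The point is just that a short link can match several subtrees, all of which look alike internally.

import java.util.ArrayList;
import java.util.List;

class Node {
    String label;                       // category or word, e.g. "IN" or "for"
    List<Node> children = new ArrayList<Node>();
    Node(String label) { this.label = label; }
    boolean isLeaf() { return children.isEmpty(); }

    // The leaf words under this node, left to right.
    List<String> leaves() {
        List<String> out = new ArrayList<String>();
        if (isLeaf()) { out.add(label); return out; }
        for (Node c : children) out.addAll(c.leaves());
        return out;
    }
}

class SubtreeMatcher {
    // Every subtree whose leaf sequence equals the link's words.
    // For a one-word link like "for", every node dominating exactly
    // that word matches (the bare leaf, its preterminal, and any
    // other occurrence of the word elsewhere in the sentence).
    static List<Node> matches(Node root, List<String> linkWords) {
        List<Node> found = new ArrayList<Node>();
        collect(root, linkWords, found);
        return found;
    }

    private static void collect(Node n, List<String> words, List<Node> found) {
        if (n.leaves().equals(words)) found.add(n);
        for (Node c : n.children) collect(c, words, found);
    }
}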

I tested the node-identification code on the first 30 non-error rows of the links table, and so far everything checks out. If the logic is correct (I hand-traced it myself), it should even identify the right node in the two-pronged case, where the link begins under one branch and ends under another:

(X
    (α
        (A Ø)
        (B Ø)    ← link begins here
    )
    (β
        (C Ø)    ← link ends here
        (D Ø)
    )
)

In fact, I’m very confident that it will pick out the X node, since X is the lowest node dominating both ends of the link.
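Here is a sketch of the climb-down as I understand it, reusing the hypothetical Node class from the sketch above; the method name and the leaf-index convention are mine, not the actual functions.php implementation. The idea: start at the root and descend as long as a single child’s leaf span covers the whole link; the node where no single child suffices is the immediately-dominating node.

class DominatingNode {
    // Lowest node whose leaves cover the 0-based leaf span [start, end].
    // In the tree above the leaves are A, B, C, D at indices 0..3, so a
    // link over B and C is the span [1, 2]. Neither α (leaves 0-1) nor
    // β (leaves 2-3) covers it alone, so the search stops at X.
    static Node lowestDominating(Node root, int start, int end) {
        Node current = root;
        int offset = 0;                 // leaf index where current begins
        outer:
        while (true) {
            for (Node c : current.children) {
                int width = c.leaves().size();
                if (start >= offset && end < offset + width) {
                    current = c;        // one child covers the whole span:
                    continue outer;     // descend and keep looking
                }
                offset += width;        // move past this child's leaves
            }
            return current;             // no single child covers the span
        }
    }
}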

Here is a link to an output file of my test run showing correct node labels for each of the parsed links.

Coding and Parsing

So I’ve cleaned up some code for scraping Metafilter.com and have set up a SQL database to hold it all. I’ve run the scraper, and it’s definitely fetching pages and populating the database correctly. The only problem is that the PHP script stops after a while; there’s probably a script timeout somewhere that I should raise so it can run (almost) indefinitely. I haven’t yet tested the constituency code and will be doing that now. Once that code has been tested for accuracy, I’ll go back to something I started last week.

I wanted to get around the PHP script timeout (presumably there’s a configuration variable that controls it), but I also wanted to see how hard it would be to implement a similar program in Java. So far, writing the same Metafilter scraper in Java hasn’t been easy.

First, I hate streams; everything in Java is streams. There’s no simple “loader” that hands you all of the data at once (a workaround is sketched below). One issue I ran into with the PHP code was a TON of HTML warnings, malformed markup, etc., and from what I’ve read, Java’s Swing HTML parser (and SAX-based XML parsing) isn’t too reliable on the real-life HTML you find on websites. Fortunately, I found the Mozilla HTML Parser (originally written in C++), for which someone has created a Java wrapper, and I’m currently using that in conjunction with dom4j.
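For the record, the kind of one-shot “loader” I’m missing is easy enough to fake; here’s a sketch (class and method names are mine) that slurps a whole page into a String, roughly what PHP’s file_get_contents() gives you in a single call:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

class PageLoader {
    // Read a URL's entire response body into one String.
    static String fetch(String address) throws IOException {
        InputStream in = new URL(address).openStream();
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        reader.close();
        return sb.toString();
    }
}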

So, I have that set up; I just need to write some regular expressions (I hope Java’s implementation is at least similar to PHP’s) to pull out the data, plus some code to push it to the database. If that works, I’m sure I could just let the Java program run forever.
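On the regex question: Java’s java.util.regex syntax is, as far as I can tell, largely compatible with PHP’s PCRE-based preg_* functions, though the calling convention differs. A toy example (the pattern and input are made up, not my actual scraper code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {
    public static void main(String[] args) {
        // Roughly preg_match_all('|<a href="([^"]+)"|', $html, $m):
        Pattern href = Pattern.compile("<a href=\"([^\"]+)\"");
        Matcher m = href.matcher("<a href=\"http://example.com/\">hi</a>");
        while (m.find()) {
            System.out.println(m.group(1)); // each captured URL
        }
    }
}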

My immediate goals are to write the code, make sure it works, and run it. Once I’ve got some parses and constituency tests to look at, I can begin to think about the failures and how we can rate constituency.

Also, since the Stanford and Berkeley parsers are trained on the WSJ portion of the Penn Treebank, it might help to find a news source that is like Metafilter.com but has WSJ-style writing (maybe that’s impossible). Parses of such text would be much more accurate, because that’s the register the parsers were trained on.

Oh, I should also set up version control (Subversion or Mercurial) on Google Code project hosting. I’ve been having terrible luck with SVN recently; maybe it’s time to try Mercurial.

EDIT: max_execution_time in php.ini defines the maximum script execution time. The default is 30 seconds, but server configurations (e.g., Apache setups) may use other defaults (say, 300 seconds). I set it to 600 seconds (10 minutes).

I’m curious about the entries that produce HTML warnings and the ones that say “no content on vwxyz”. I wonder whether there really isn’t any content; maybe I should keep track of which entries have warnings and which have “no content”. Additionally, there’s still no code to strip the HTML tags so the text can be fed to the parsers, but that should just be an easy regular expression anyway (a sketch follows).
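Something like the following should do for the tag stripping; a crude sketch (names are mine), good enough for feeding text to a parser but not a general-purpose HTML sanitizer:

class TagStripper {
    // Drop anything between < and >, then collapse leftover whitespace.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }
}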

Constituency

My goals yesterday from 10am to 3pm were to read over a text by Birgitta Bexten called Salience in Hypertext and begin looking over some code that Mitcho had already written. My ultimate goal is to begin scraping hyperlinks and text from Metafilter.com as soon as possible and begin determining the constituency of the hyperlinks using the Stanford and Berkeley parsers.

I got almost nothing from Bexten, except the suggestion that hyperlinks (at least in her examples) can be either constituents or non-constituents; beyond that, the text offered pretty much nothing.

I took a quick peek at the constituency-determining code, and from a first glance (and from our prior meeting), I believe all it does is check for the same number of left and right parens (the parsers’ version of the square brackets used in bracket notation). I got confused about why that alone would be enough, so I did some research and reading to refresh my definition of a constituent. Yes, there are constituency tests based on grammaticality judgements, but simple code can’t perform those. In Carnie’s Syntax: A Generative Introduction (2007), the final definition of a constituent is “a set of terminal nodes exhaustively dominated by a particular node”. So it does follow that you can determine constituency just by balancing parens over the link’s span: the linked words should be exactly the terminals under one node.
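For concreteness, here is the balance check as I understand it, sketched in Java (the actual code is PHP, and may literally just count parens; the depth check below additionally rejects spans like “) … (” whose counts happen to match):

class ConstituencyCheck {
    // A link's span in the parser output balances if every ')' closes
    // a '(' opened inside the span and nothing is left open at the end.
    static boolean balanced(String span) {
        int depth = 0;
        for (char ch : span.toCharArray()) {
            if (ch == '(') depth++;
            else if (ch == ')') depth--;
            if (depth < 0) return false; // closes a bracket opened outside
        }
        return depth == 0;
    }
}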

That should mean that any single word (a head) is a constituent, as long as it doesn’t take any complements.

One thing I read in Bexten that confused me was on page 15, where she gives an example of a text with a hyperlink in German: “Bei Simulationen handelt es sich um spezielle interaktive Programme, die dynamische Modelle von Apparaten, Prozessen und Systemen abbilden”, which translates to “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”

Bexten claims that “It also occurs that … not a whole constituent is link-marked but only, e.g., an adjective.” That implies to me that she considers the adjective not to be a constituent.

This is what confused me and sent me back to refresh my definition of a constituent. Throwing the English sentence into the Stanford parser (I’m assuming DPs, AdjPs, and NPs work similarly in German, though that assumption may not even be necessary), I got a tree in which the adjective was indeed a constituent (by paren balancing). I also hand-traced the syntax tree for the DP “special interactive programs”, and by Carnie’s definition it is a constituent:

[DP
    [D'
        [D Ø]
        [NP
            [N'
                [AdjP
                    [Adj'
                        [Adj special]
                    ]
                ]
                [N'
                    [AdjP
                        [Adj'
                            [Adj interactive]
                        ]
                    ]
                    [N'
                        [N programs]
                    ]
                ]
            ]
        ]
    ]
]

It’s not a big deal; I just want to make sure my definition of a constituent is correct, because Bexten made it seem like the adjective wasn’t one.