So I’ve cleaned up some code for scraping Metafilter.com and have set up a SQL database to hold it all. I’ve run the code and it’s definitely populating the database correctly. The only problem is that the PHP script stops running after a while; there’s probably a script timeout somewhere that I should raise so it can run (almost) indefinitely. I haven’t yet tested the constituency code and will be doing that now. Once the code has been tested for accuracy, I will go back to something I started last week.
I wanted to get around the PHP script timeout (I can probably assume there’s a setting that controls it), but I also wanted to see how hard it would be to implement a similar program in Java. So far, writing the same Metafilter scraper in Java hasn’t been so easy.
First, I hate streams — everything in Java is done through streams. They never give you a simple “loader” that hands you all of the loaded data at once. One issue I came across when running the PHP code was that there were a TON of HTML warnings, malformed markup, etc. Java’s Swing HTML parser (and SAX-based XML parsing), I’ve read, isn’t too reliable for the real-life HTML you’ll find on websites. Fortunately, I found the Mozilla HTML Parser, which someone created a Java wrapper for (it’s originally written in C++), and am currently using that (in conjunction with dom4j).
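For the stream complaint, the workaround is just a small helper that drains an `InputStream` into one `String`. Here’s a minimal sketch (the class and method names `PageLoader`/`slurp` are my own, not from any library):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class PageLoader {
    // Read an entire InputStream into one String, since Java's
    // stream-based I/O has no built-in "give me everything" call.
    static String slurp(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // For a real page you'd pass: new java.net.URL("http://www.metafilter.com/").openStream()
        InputStream demo = new ByteArrayInputStream("<html>demo</html>".getBytes("UTF-8"));
        System.out.println(slurp(demo)); // prints <html>demo</html>
    }
}
```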
So, I have that set up; I just need to write some regular expressions (I hope Java’s implementation is at least similar to PHP’s) to pull out the data, and some code to push it to the database. If successful, I’m sure I could just let this Java program run forever.
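The good news is that Java’s `Pattern`/`Matcher` classes support largely the same regex syntax as PHP’s `preg_*` functions. A quick sketch of the PHP `preg_match_all()` idiom in Java (the sample markup and URLs here are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    public static void main(String[] args) {
        // Roughly PHP's preg_match_all(): find every href in a chunk of markup.
        String html = "<a href=\"http://www.metafilter.com/1\">first</a> "
                    + "<a href='http://www.metafilter.com/2'>second</a>";
        Pattern href = Pattern.compile("href=[\"']([^\"']+)[\"']");
        Matcher m = href.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // prints each captured URL
        }
    }
}
```

The main difference from PHP is that Java has no `/pattern/` literal syntax, so backslashes in patterns have to be doubled inside string literals.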
My immediate goals are to write the code, make sure it works, and run it. After I’ve got some parses and constituency tests to look at, I can begin to think about the failures and how we can rate constituency.
Also, since the Stanford and Berkeley parsers are trained on the WSJ portions of the Penn Treebank, it may be helpful if we could find a news source that is like Metafilter.com but has WSJ-style writing (maybe this is impossible). The results would be much more accurate if we did, because that’s what the parsers were trained on.
Oh, I should also set up version control (Mercurial) on Google Code project hosting. I’ve been having terrible luck with SVN recently; maybe it’s time to try Mercurial.
EDIT: max_execution_time in php.ini defines the maximum script execution time. The default is 30 seconds, but server configurations (e.g., under Apache) may have other defaults (say, 300 seconds). I set it to 600 seconds (10 minutes).
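For reference, the relevant php.ini line looks like this:

```ini
; php.ini — raise the per-script limit from the 30-second default
max_execution_time = 600
```

Alternatively, PHP’s `set_time_limit(0)` can be called at the top of the script to disable the timeout for that script only, without touching the server-wide configuration.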
I’m curious about the entries that produce HTML warnings and the ones that say “no content on vwxyz”. I wonder if there really isn’t any content. Maybe I should keep track of which entries have warnings and which entries have “no content”. Additionally, there seems to be code missing to strip the HTML tags so I can feed the text into the parsers, but that should just be an easy regular expression anyway.
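The tag-stripping step could be as simple as the following sketch (a quick-and-dirty regex approach, not real HTML parsing; the `TagStripper` name is mine):

```java
public class TagStripper {
    // Replace anything between < and > with a space, then collapse whitespace.
    // Fine for producing plain text to feed the parsers; it will mangle
    // pathological markup like a literal "<" inside text or attributes.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));
        // → Hello, world !
    }
}
```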