Coding and Parsing

So I’ve cleaned up some code for scraping Metafilter.com and have set up a SQL database to hold it all. I’ve run the code, and it is populating the database correctly. The only problem is that the PHP script stops running after a while; there’s probably a script timeout somewhere that I should set so it can run (almost) indefinitely. I haven’t yet tested the constituency code and will be doing that now. Once the code has been tested for accuracy, I will go back to something I started last week.

I wanted to get around the PHP script timeout (presumably there’s a setting that controls it), but I also wanted to see how hard it would be to implement a similar program in Java. So far, writing the same Metafilter scraper in Java hasn’t been so easy.

First, I hate streams, and everything in Java is streams. They never give you a simple “loader” that hands you all of the loaded data at once. One issue I came across when running the PHP code was that there were a TON of HTML warnings, malformed markup, etc. From what I’ve read, Java’s Swing HTML parser (and SAX-based XML parsing) isn’t too reliable for the real-life HTML that you’ll find on websites. Fortunately, I found the Mozilla HTML Parser, which is originally written in C++ and for which someone created a Java wrapper, and I’m currently using that (in conjunction with dom4j).
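For what it’s worth, the missing “loader” is only a few lines to write by hand. Here is a minimal sketch; the class name `StreamLoader` and the method `readAll` are made up for illustration (they are not part of the scraper), and it assumes the page is UTF-8 encoded and that `java.nio.charset.StandardCharsets` (Java 7 or later) is available.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the original scraper: drains an InputStream
// into one String so the rest of the code can treat a page as plain text.
class StreamLoader {

    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 once the stream is exhausted
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Assumption: the bytes are UTF-8; a real page may declare another charset
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a network stream such as the one returned by URL.openStream()
        InputStream in = new ByteArrayInputStream(
                "<p>hello</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in));
    }
}
```

The same method should work on any `InputStream`, including the one you get from an open URL connection, so the streaming detail stays in one place.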

So, I have that set up, I just need to write some regular expressions (I hope Java’s implementation is at least similar to PHP’s) to pull out data and some code to push it to the database. If successful, I’m sure I could just let this Java program run forever.
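On the regular expression worry: `java.util.regex` uses a Perl-style syntax that is close to PHP’s `preg_*` functions. The main practical difference is that backslashes and quotes have to be escaped inside Java string literals. Below is a rough sketch of the extraction step. The markup it matches (`<div class="post">` and `<span class="byline">`) is invented for the example and has not been checked against the real Metafilter pages, and the class `PostExtractor` is hypothetical. The database step is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor. The page layout it assumes is made up:
//   <div class="post">BODY<span class="byline">posted by <a href="...">USER</a> ...
class PostExtractor {

    // DOTALL lets "." match newlines, since a post body can span several lines.
    private static final Pattern POST = Pattern.compile(
            "<div class=\"post\">(.*?)<span class=\"byline\">posted by <a[^>]*>([^<]+)</a>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns {body, user}, or null when the page does not match the assumed layout.
    static String[] extract(String html) {
        Matcher m = POST.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1).trim(), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String html = "<div class=\"post\">Hello <b>world</b>\n"
                + "<span class=\"byline\">posted by <a href=\"/user/1\">alice</a></span></div>";
        String[] post = extract(html);
        System.out.println(post[1] + ": " + post[0]);
    }
}
```

Returning `null` on a non-match makes it easy to log which entry IDs failed to match instead of silently skipping them.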

My immediate goals are to write the code, make sure it works, and run it. After I’ve got some parses and constituency tests to look at, then I can begin to think about the failures and how we can rate constituency.

Also, since the Stanford and Berkeley parsers are trained on the WSJ portions of the Penn Treebank, it may be helpful to find a news source that is like Metafilter.com but has WSJ-styled writing (maybe this is impossible). The parses would be much more accurate if we did, because that’s the kind of text the parsers were trained on.

Oh, I should also set up version control (Subversion or Mercurial) on Google Code project hosting. I’ve been having terrible luck with SVN recently. Maybe it’s time to try Mercurial.

EDIT: max_execution_time in php.ini sets the maximum script execution time. The default is 30 seconds, but server configurations (Apache, for example) may use other values (say, 300 seconds). I set it to 600 seconds (10 minutes).

I’m curious about the entries that produce HTML warnings and the ones that say “no content on vwxyz”. I wonder if there really isn’t any content. Maybe I should keep track of what entries have warnings and what entries have “no content”. Additionally, there seems to be code missing to strip the HTML tags so I can feed it into the parsers, but that should just be an easy regular expression anyways.
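The tag-stripping step could look something like the sketch below. It is a naive regular expression approach, and the class name `TagStripper` is made up for illustration. Each tag is replaced with a space (so that text on either side of a `<br>` does not get glued together) and the whitespace is then collapsed.

```java
import java.util.regex.Pattern;

// Hypothetical helper: naive regex-based tag stripping, not a real HTML parser.
class TagStripper {

    // Anything from "<" up to the next ">" is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        // Replace each tag with a space, then collapse runs of whitespace.
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p><p>Bye</p>"));
    }
}
```

This will not decode entities such as `&amp;`, and it can go wrong on comments, script blocks, or a `>` inside an attribute value, so it is worth spot-checking the output before feeding it to the parsers.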


3 thoughts on “Coding and Parsing”

mitcho on July 15, 2010 at 12:05 am said:

Sorry if I missed this, but what’s the point of rewriting it in Java? Doesn’t the MetaFilter dump PHP already work?

Keeping track of which entries return “no content on …” might be a good idea, if only to know what percentage it is. If it’s just under a percent or something, I wouldn’t worry at all. At the end of the day, if we’re taking a random sample, it won’t matter.

mitcho on July 15, 2010 at 12:06 am said:

ps: if you’re looking to try out hg, there’s never been a better time! I can give you tips to bring you into the light. ^^

anguyen on July 16, 2010 at 9:10 am said:

So, I have no idea why, but the PHP script to scrape Metafilter stops randomly, regardless of the script timeout. Actually, I think part of it is the HTML parser in PHP, because it completely chokes on the following entries that I’ve tried to parse: 88880, 89043, 89100, 89678, etc. It just stops, and I’m not sure why. These entries tend to have a lot of HTML parse warnings, but I’m not sure whether that actually caused it.

What I did learn from half-implementing the same thing in Java is that you can use just one regexp to pull out the same exact information without ever having to parse the HTML into a DOM (I had a ridiculously hard time trying to compile the Mozilla HTML Parser and gave up).

I’m going to set up the project on Google Code today.


constituency

A research project at MIT Linguistics
