More Issues

Just kidding! I have success now. There was weird stuff going on with PHP, but I got that fixed up and ran the fix_entries.php script on my remote Windows machine. Unlike on Scripts, the Stanford parser on a Windows 7 machine with only 512 MB of RAM worked almost like a charm. Entries that were previously unparsable due to the memory ceiling were now mostly parsable. (It still choked on some entries, because my remote system doesn’t have a lot of RAM to work with.) But since it could work on my remote system, I figured I could run it on my own laptop, which I’m doing right now, and entries that ran out of memory on my remote system are parsing correctly! (The ones that still aren’t parsing correctly are, after inspection, “strange” entries with lists of book titles, etc.)

Everything seems to be going swimmingly. At some point, I’m going to have to rerun the constituency script (or modify it to skip ones it already has judgements for).
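
The skip logic itself should be trivial, something along these lines (a sketch only; the table and column names here are placeholders, not my actual schema):

<?php
// Sketch only: skip entries that already have a constituency judgement.
// Table/column names (entries, constituency_judgements, entry_id) are placeholders.
$db = new mysqli('localhost', 'user', 'password', 'hyperlinks');

$entries = $db->query('SELECT id, body FROM entries');
while ($row = $entries->fetch_assoc()) {
    $stmt = $db->prepare('SELECT 1 FROM constituency_judgements WHERE entry_id = ? LIMIT 1');
    $stmt->bind_param('i', $row['id']);
    $stmt->execute();
    $stmt->store_result();
    $already_done = $stmt->num_rows > 0;
    $stmt->close();
    if ($already_done) {
        continue; // already have a judgement for this entry, skip it
    }
    // ... otherwise run the constituency check on this entry ...
}
?>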

Memory Issues

After testing various things with the parsers on my own laptop and on the Scripts servers, I modified the code to exclude Berkeley parses, not because I haven’t figured out how to get the output, but because the Berkeley parser runs out of memory.

I was reading through the Scripts blog and they mentioned something about the JVM and memory issues. In fact, they said the option -Xmx128M is the default for their JVM; that is, they initially only let the JVM allocate 128 MB of memory. This happens to be a problem for the Berkeley parser, which runs out of memory rather quickly.

For parsing the sentence, “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”, Windows commits 775.75 MB for the parser (it actually only uses 571.08 MB), which is way over the 128 MB initial allocation that Scripts imposes. Scripts claims that you can try increasing the memory to 256 MB, perhaps even 512 MB, but not 768 MB; they said it became unstable at 768 MB. When I did try it on the Scripts server with 256 MB, it still wouldn’t run. And at 512 MB, the JVM couldn’t allocate that much memory because apparently it’s not available.
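
For reference, the invocation looks roughly like this from PHP, with the heap size spelled out. This is a sketch rather than the exact command in parse_entries.php, and the jar, grammar, and input file names are placeholders:

<?php
// Sketch only: invoking the Stanford parser with an explicit JVM heap size.
$heap    = '-Xmx512m'; // Scripts defaults to -Xmx128M; my laptop can go higher
$jar     = 'stanford-parser.jar';
$grammar = 'englishPCFG.ser.gz';
$infile  = escapeshellarg('sentence.txt');

$cmd = "java $heap -cp $jar edu.stanford.nlp.parser.lexparser.LexicalizedParser "
     . "$grammar $infile 2>&1";
exec($cmd, $output, $status);
echo implode("\n", $output);
?>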

Now, the Stanford parser is better at memory management. For the same parse, the JVM committed 242.64 MB and only actually used 147.1 MB, so the Stanford parser is better suited to running on the Scripts server. However, as the parsing script (parse_entries.php) ran through all 5,865 rows of entry data, I noticed that some of the longer sentences failed with the error message “Sentence skipped: no PCFG fallback. SENTENCE_SKIPPED_OR_UNPARSABLE”.

This was before I did any memory tests, so I thought the Stanford parser was choking on malformed text: maybe my HTML-stripping regexp was failing, or I wasn’t decoding HTML entities (like &quot;), or the parser wasn’t handling special whitespace characters well. I fixed those, but the errors still persisted. It turns out the Stanford parser fails on longer sentences on the Scripts server simply due to insufficient memory.
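
For the curious, the cleanup amounts to something like this (a sketch of the idea, not the exact code in parse_entries.php):

<?php
// Sketch only: strip HTML, decode entities, and normalize odd whitespace
// before handing entry text to the parser.
function clean_entry_text($html) {
    $text = strip_tags($html);                              // drop HTML tags
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // &quot;, &amp;, etc.
    $text = preg_replace('/[\x{00A0}\s]+/u', ' ', $text);   // non-breaking spaces and other whitespace runs
    return trim($text);
}
?>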

On longer sentences, Windows commits around 400+ MB of RAM to the parser (though it only actually uses around 255 MB). Either way, I think it must run out of memory on the Scripts server.

The solution is simply to run the PHP script locally on my own computer, so that the parsing isn’t subject to the memory ceiling. But my Apache installation broke after I updated PHP, so I currently can’t run any PHP scripts. As an alternative, I let the parser run through all of the rows anyway, but added the entry ids of entries that hit the memory error to a new table I called hyperlinks_bad_entries. Of the 5,864 entries, 2,642 had errors, so that’s close to half. When doing constituency tests, I can probably ignore the bad entries for the time being.
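
The bookkeeping is nothing fancy, roughly this (the column name is a placeholder; the table name hyperlinks_bad_entries and the error string are the real ones):

<?php
// Sketch only: record an entry as "bad" when the Stanford output contains the
// skipped-sentence marker, instead of storing a parse for it.
function record_if_bad($db, $entry_id, $parser_output) {
    if (strpos($parser_output, 'SENTENCE_SKIPPED_OR_UNPARSABLE') !== false) {
        $stmt = $db->prepare('INSERT INTO hyperlinks_bad_entries (entry_id) VALUES (?)');
        $stmt->bind_param('i', $entry_id);
        $stmt->execute();
        $stmt->close();
        return true; // no usable parse for this entry
    }
    return false;
}
?>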