Overview of our code and workflow

This documentation is out of date as we’ve moved away from automatic constituency detection. – mitcho Jan 7 2012


The files currently in this repository are scripts for (1) building our corpus (scraping MetaFilter and putting the entries in our database), (2) parsing those entries with the Stanford Parser, and (3) checking individual links within those parses to see whether or not they are constituents.

We have no real search/researching tools (yet!).

Prerequisites and environment

To actually run these scripts and have the results reflected on the dashboard, you will have to get the appropriate connect_mysql.php file from mitcho. If you would like to run these scripts locally or with another database server, you will have to create the appropriate database schema yourself. (We should document that schema.)

In reality, I (mitcho) don’t normally run these scripts on my own machine, except for development and running the test suite. I use the [http://web.mit.edu/dialup/www/ssh.html MIT Athena dialup Linux machines] to run the scripts.

The scripts

Here are the scripts in the repository and what we use them for:


This script scrapes individual MetaFilter entries and enters each entry and its links into our database. It takes two required arguments: a start and an end entry number. Luckily, MetaFilter’s URLs are very straightforward and contain numerical IDs, so that’s what the start and end values correspond to.
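Since the entry IDs are sequential integers, the scrape loop amounts to walking a numeric range and fetching each entry’s page. A minimal sketch (in Python rather than the repository’s PHP; the URL pattern and the fetch_and_store callback are assumptions for illustration):

```python
# Sketch of the scrape loop: walk a numeric range of entry IDs and
# build the URL for each entry. The URL pattern and the
# fetch_and_store callback are assumptions for illustration.

def entry_url(entry_id):
    """Build a MetaFilter entry URL from its numerical ID."""
    return "https://www.metafilter.com/%d/" % entry_id

def scrape_range(start, end, fetch_and_store):
    """Visit every entry from start to end, inclusive."""
    for entry_id in range(start, end + 1):
        fetch_and_store(entry_url(entry_id))

# Example: collect the URLs that would be fetched for entries 100-102.
urls = []
scrape_range(100, 102, urls.append)
```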


This script (parse_entries) parses each entry (the entire chunk of text, with links stripped out) using the Stanford Parser and puts the result into our database’s hyperlinks_entries table.
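Stripping the links while keeping their anchor text can be done with a simple substitution before the text is handed to the parser. A sketch of that step (Python; the exact stripping rules in the real script may differ):

```python
import re

def strip_links(html):
    """Replace each <a ...>text</a> with just its anchor text, so the
    parser sees one continuous chunk of plain text."""
    return re.sub(r'<a\b[^>]*>(.*?)</a>', r'\1', html,
                  flags=re.IGNORECASE | re.DOTALL)

text = strip_links('Check out <a href="http://example.com">this essay</a> today.')
```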

It takes two arguments: a start entry number and an end entry number. There’s also an optional filter value which can be set to reparse previously failed entries.
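The filter amounts to deciding which entries in the batch are eligible: by default only unparsed ones, or, when reparsing is requested, previously failed ones as well. A sketch of that selection logic (Python; the status values here are assumptions for illustration):

```python
def entries_to_parse(entries, reparse_failed=False):
    """Pick the entries in a batch that still need parsing.
    Each entry is an (id, status) pair; the status values
    ('unparsed', 'failed', 'parsed') are illustrative."""
    wanted = {'unparsed'}
    if reparse_failed:
        wanted.add('failed')
    return [eid for eid, status in entries if status in wanted]

batch = [(1, 'parsed'), (2, 'unparsed'), (3, 'failed')]
```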


This script (judge_constituency) looks at the links within each entry, finds the corresponding chunk within the parse of the entire entry, and checks whether that chunk corresponds to a constituent.

It also takes two arguments: a start entry number and an end entry number. A third value acts as a filter, determining which entries will be processed.
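The core check can be sketched as: read the bracketed parse into a tree, record the token span each subtree covers, and ask whether the link’s span coincides with one of those spans. A self-contained sketch (Python; the real logic lives in the PHP functions file and surely handles more edge cases):

```python
# Sketch of the constituency check against a bracketed parse.

def parse_sexp(s):
    """Read a bracketed (Penn-style) parse string into nested lists."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    def read(i):
        if tokens[i] == '(':
            node, i = [], i + 1
            while tokens[i] != ')':
                child, i = read(i)
                node.append(child)
            return node, i + 1
        return tokens[i], i + 1
    tree, _ = read(0)
    return tree

def subtree_spans(tree):
    """Return the set of (start, end) token spans covered by subtrees."""
    spans = set()
    def walk(node, start):
        pos = start
        for child in node[1:]:      # node[0] is the category label
            if isinstance(child, list):
                pos = walk(child, pos)
            else:
                pos += 1            # a leaf word consumes one token
        spans.add((start, pos))
        return pos
    walk(tree, 0)
    return spans

def is_constituent(tree, span):
    """True iff the (start, end) token span matches some subtree."""
    return span in subtree_spans(tree)

tree = parse_sexp("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
# "the cat" (tokens 0-2) is an NP; "cat sat" (tokens 1-3) crosses
# the NP/VP boundary and is not a constituent.
```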


The functions file (which could probably stand to be reorganized) holds the bulk of our “logic” for parsing and constituency checking. Many functions here are used by both parse_entries and judge_constituency.

Coding conventions and style guide

Test suite

The code includes a unit test suite. Individual test files live in the tests directory, and the bootstrapping code is tests.php. Running tests.php will execute all of the tests in succession.
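The bootstrap pattern is simple: run every test in succession and report which ones failed. A minimal analogue of that runner (Python; tests.php presumably does the PHP equivalent, and the test names here are illustrative):

```python
# Sketch of a test bootstrap: run each test in succession, collect
# failures, and report them at the end.

def run_all(tests):
    """Run each named zero-argument test callable; a test fails by
    raising AssertionError. Returns the names of failing tests."""
    failures = []
    for name, test in sorted(tests.items()):
        try:
            test()
        except AssertionError:
            failures.append(name)
    return failures

def ok():
    assert 1 + 1 == 2

def broken():
    assert 1 + 1 == 3

failures = run_all({'ok': ok, 'broken': broken})
```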

We use unit tests to make sure our code doesn’t regress and that our parsing output is as expected. Tests are written for our constituency-checking and parsing code; there are no tests for the actual scraping part. Please run the tests and verify that they pass before committing code. If a test doesn’t pass, either (a) you broke some functionality and need to fix it, or (b) the test specification became outdated and the test file itself must be updated (though there should be a good reason for doing so). In addition, bugfixes and new features in our constituency-checking code should be accompanied by a new test which explicitly exercises the new expected functionality.