Constituency coding guidelines

General philosophy

The overarching philosophy is:

We mark something as “constituent” only if it is undisputedly and unambiguously a constituent.

Speaking more practically, the two key rules of our manual constituency judgment task are:

  1. Tag things as “unsure” if you are not completely sure.
  2. If the link in question is a constituent only under particular theoretical assumptions, mark it as “nonconstituent.”
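
As a rough illustration, the two rules can be read as a small decision procedure. The sketch below is only one possible reading: the names (Tag, choose_tag and its parameters) are hypothetical, and checking uncertainty before theory-dependence is an assumption, since the guidelines do not state which rule takes priority.

```python
from enum import Enum


class Tag(Enum):
    CONSTITUENT = "constituent"
    NONCONSTITUENT = "nonconstituent"
    UNSURE = "unsure"


def choose_tag(completely_sure: bool, theory_dependent: bool, is_constituent: bool) -> Tag:
    """Pick a tag for a link from the annotator's own judgments.

    All three inputs are judgments made by a human annotator;
    nothing here decides constituency automatically.
    """
    # Rule 1: anything short of complete certainty is "unsure".
    # (Checking this first is an assumption of this sketch.)
    if not completely_sure:
        return Tag.UNSURE
    # Rule 2: constituency that holds only under some theories
    # is recorded as "nonconstituent".
    if theory_dependent:
        return Tag.NONCONSTITUENT
    return Tag.CONSTITUENT if is_constituent else Tag.NONCONSTITUENT
```

For instance, a link that the annotator is sure about but that is a constituent only under one analysis would come out as Tag.NONCONSTITUENT.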

Constituency tests

TODO

Attachment ambiguities

An attachment ambiguity arises when a phrase can be interpreted as attaching at more than one position in the clause. A classic example is “I saw a person with binoculars”: “with binoculars” could be an instrumental modifier on the whole sentence (modifying the “seeing”) or a modifier of “person.” In many cases, we cannot decide from context which interpretation is intended.

Always tag links involving attachment ambiguities as “unsure” so that mitcho will be sure to check them as well.

Punctuation

TODO

Parentheticals

TODO

Affixes and contractions

English sometimes combines morphemes into a single word, both orally and orthographically. This can affect constituency judgments. Here is the rule of thumb to use:

If the morphemes are not separable in written English, the word counts as a constituent. If they are separable in written English, judge the link as if they were separated.

For example, “John’s” in “John’s dog” should be marked as a constituent because the “’s” is never separable orthographically (or phonologically). “John’s” in “John’s a nice guy” is not a constituent, because it is a contraction of “John is”, and thus we should evaluate the link as if it were “John is a nice guy”. Similarly, “mom’n” in “mom’n pop” should be judged as if it were “mom and pop”, and thus should be categorized as “nonconstituent” (see below for notes on coordination).

Coordination

Different theories of coordination treat different subparts of a coordinated expression as constituents. As such, when you see a coordinate structure, only (a) the entire conjunction including all conjuncts and (b) an individual conjunct (or a subpart thereof) can be marked as constituents.

Examples:

#42862: Trompe L’oeil geometry that is a visual cousin to surrealism, visual games and mosaics.

“surrealism” is a constituent, “visual games” is a constituent, but “and mosaics” is not.

#81223: Governor Charlie (“No H.”) Crist has come out in support of the bill (or at least in support of not vetoing it).

Here, the parenthesized material is really part of the same sentence rather than a separate interjection (as evidenced by the fact that we interpret “at least in support…” as “Governor Crist has come out at least in support…”), so the link is within the conjunction “[in support of the bill] or [at least in support of not vetoing it].” However, the link in question is fully contained within a single conjunct, so we don’t need to worry about the conjunction here. “support of the bill” is a constituent.
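
The coordination rule can be sketched as a check on token spans. This is only an illustration, not part of the coding tool: the function name, the half-open token-index representation, and the hand-marked conjunct boundaries are all assumptions. It covers only links that lie inside the coordinate structure, and passing the check means the link is still eligible to be a constituent, not that it is one.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def within(inner: Span, outer: Span) -> bool:
    """True if inner lies entirely inside outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def allowed_within_coordination(link: Span, conjuncts: List[Span]) -> bool:
    """Check a link against the coordination rule.

    Returns True if the link is (a) the whole coordinate structure or
    (b) contained in a single conjunct; False otherwise.
    """
    whole = (min(s for s, _ in conjuncts), max(e for _, e in conjuncts))
    if not within(link, whole):
        raise ValueError("this sketch only handles links inside the coordination")
    if link == whole:
        return True
    return any(within(link, c) for c in conjuncts)


# Example #42862, coordinate structure only (hand-marked, assumed parse):
tokens = ["surrealism", ",", "visual", "games", "and", "mosaics"]
conjuncts = [(0, 1), (2, 4), (5, 6)]  # "surrealism", "visual games", "mosaics"
```

With these spans, (2, 4) “visual games” passes the check, while (4, 6) “and mosaics” does not, matching the judgments given for #42862.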

TODO: be careful of RNR

Sub-words and non-linguistic links

If either edge of the link is in the middle of a word, it should be tagged “sub-word.” There is no need to judge sub-words for constituency.
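
A minimal sketch of the sub-word check, assuming links are given as character offsets into the sentence and that a “word” is a run of letters or digits. Both are assumptions for illustration (as is the name is_sub_word); in particular, how apostrophes inside words such as “Kellogg’s” should be treated is not settled by these guidelines.

```python
def is_sub_word(text: str, start: int, end: int) -> bool:
    """Return True if either edge of text[start:end] falls inside a word.

    An edge at position p is inside a word when the characters on both
    sides of it (text[p - 1] and text[p]) are letters or digits.
    """
    def splits_word(pos: int) -> bool:
        return 0 < pos < len(text) and text[pos - 1].isalnum() and text[pos].isalnum()

    return splits_word(start) or splits_word(end)


sentence = "Kellogg's Rice Krispies"
whole_word = is_sub_word(sentence, 10, 14)  # link text: "Rice"
mid_word = is_sub_word(sentence, 11, 19)    # link text: "ice Kris"
```

Here the link “Rice” has both edges at word boundaries, whereas “ice Kris” starts and ends in the middle of a word and would be tagged “sub-word.”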

TODO: nonlinguistic.

Examples:

#109953: Kirshner’s former creation, The Monkees, had earlier recorded a series of commercials for Kellogg’s Rice Krispies (123456), but by 1970 songs such as “The Day We Fall in Love” and “Forget That Girl” were being featured on Post cereal boxes.

Links within non-English sentences (we’ve seen French and Morse code, for example) should also be tagged “non-linguistic.” There is no need to judge non-linguistic links for constituency.

The “title at the beginning of the line” tag

This tag (perhaps to be renamed) should be applied to any link that behaves as a unit but is not part of the adjacent sentence and is not set off from it by adequate punctuation. TODO: why this is important to do.

Examples:

#3917: Nader/Gore Vote swap in effect 1,107 votes changed hands so far

#4003: Clinton’s final days – this whimsical take on what Clinton’s doing now that he hasn’t got much time left is pretty funny.

TODO:

SV without overt O is always nonconstituent
– quotes don’t change things: tab 7
– gaps: don’t count that against them (tab 2)
DP-internal structures: generally right branching, feel free to mark things as unsure
(tab 3, 4)
be careful about dropped D: tab 6, tab 10