Constituency

My goals yesterday from 10am to 3pm were to read over a text by Birgitta Bexten called Salience in Hypertext and begin looking over some code that Mitcho had already written. My ultimate goal is to begin scraping hyperlinks and text from Metafilter.com as soon as possible and begin determining the constituency of the hyperlinks using the Stanford and Berkeley parsers.

I got almost nothing from Bexten, except her suggestion that hyperlinks (at least in her examples) can be either constituents or non-constituents. The text offered pretty much nothing else.

I took a quick peek at the constituency-determining code, and at first glance (and from our prior meeting), I believe all it does is check that the numbers of left and right parens match (parens being the parsers’ version of the square brackets used in bracket notation). I got confused as to why that was the only way, so I did some research and reading to refresh my definition of a constituent. Yes, there are constituency tests based on grammaticality judgements, but simple code can’t do all of that. In Carnie’s Syntax: A Generative Introduction (2007), the final definition of a constituent is “A set of terminal nodes exhaustively dominated by a particular node”. So, given a parser’s bracketed output, you can determine constituency just by checking whether the parens around a span of words balance.

That should mean that any single word (head) is a constituent as long as there aren’t any complements to it.
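To make Carnie’s definition concrete, here is a minimal sketch in Python. It is not Mitcho’s code, and it does not count parens directly; instead it reads a bracketed parse into a tree and asks whether some node exhaustively dominates a given span of words. The function names (parse_brackets, constituent_spans, is_constituent) and the EXAMPLE_PARSE string are my own assumptions for illustration; the parse is hand-written in Penn Treebank style, not real parser output.

```python
# Hypothetical, hand-written parse in Penn Treebank style (not actual parser output).
EXAMPLE_PARSE = "(NP (JJ special) (JJ interactive) (NNS programs))"


def parse_brackets(text):
    """Parse a bracketed string into (label, children) tuples; words are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening paren"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a terminal node (a word)
                pos += 1
        pos += 1  # consume the closing paren
        return (label, children)

    return read_node()


def constituent_spans(tree):
    """Collect every (start, end) word span, end exclusive, that one node exhaustively dominates."""
    spans = set()

    def walk(node, start):
        if isinstance(node, str):
            return start + 1  # a word occupies one position
        end = start
        for child in node[1]:
            end = walk(child, end)
        spans.add((start, end))
        return end

    walk(tree, 0)
    return spans


def is_constituent(tree, start, end):
    """True if words start..end-1 are exactly the terminals under some node."""
    return (start, end) in constituent_spans(tree)


if __name__ == "__main__":
    tree = parse_brackets(EXAMPLE_PARSE)
    print(is_constituent(tree, 0, 1))  # "special" on its own
    print(is_constituent(tree, 0, 2))  # "special interactive"
    print(is_constituent(tree, 0, 3))  # "special interactive programs"
```

In this flat example parse, the single word “special” and the whole phrase each sit under a node of their own, while “special interactive” does not, even though its brackets balance. A different tree shape, such as one with nested N' levels, would license different spans.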

One thing I read in Bexten that confused me was on page 15, where she gives an example of a text and a hyperlink in German: “Bei Simulationen handelt es sich um spezielle interaktive programme, die dynamische Modelle von Apparaten Prozessen und Systemen abbilden”, which translates to “Simulations are special interactive programs which represent dynamic models of devices, processes and systems.”

Bexten claims that “It also occurs that … not a whole constituent is link-marked but only, e.g., an adjective.” That implies to me that the adjective isn’t a constituent.

This is what confused me and caused me to refresh my definition of a constituent. By throwing the English sentence into the Stanford parser (I’m assuming that DPs, AdjPs, and NPs work similarly in German, though that assumption may not even be necessary), I got a tree where the adjective was indeed a constituent (by parens balancing). I also hand-traced the syntax tree for the DP “special interactive programs”, and by Carnie’s definition, the adjective is a constituent:

[DP
    [D'
        [D Ø]
        [NP
            [N'
                [AdjP
                    [Adj'
                        [Adj special]
                    ]
                ]
                [N'
                    [AdjP
                        [Adj'
                            [Adj interactive]
                        ]
                    ]
                    [N'
                        [N programs]
                    ]
                ]
            ]
        ]
    ]
]

It’s not a big deal; I just want to make sure my definition of a constituent is correct, because Bexten made it seem like the adjective wasn’t a constituent.

Constituency in Hyperlinks

Hi everyone! My name is Anton! I’m a rising junior at MIT in the Department of Linguistics. This blog consists of notes and logs of my UROP project at the MIT Department of Linguistics. You may find part of my project proposal below:

The internet has proven to be an incredible medium of communication. As with any medium of communication, be it text, audio, or video, there is an extraordinary corpus of linguistic data on the internet, primarily in the form of text, ranging from captions on pictures to weblog entries. In 2008, software engineers at Google declared that there were at least one trillion unique URLs on the web (Alpert, 2008). With modest amounts of text on each page, these webpages altogether provide an enormous body of linguistic utterances for linguists to study.

Of course, much of this linguistic data is no different from data recorded in elicitations with language consultants or from data recorded from the everyday speech of the masses. However, the web has certain features that a person’s spoken word does not. One particular characteristic of the web is the hyperlink. A hyperlink is any visual element that redirects the user to another location presenting additional content. Though hyperlinks are not limited to plain text, the vast majority of active hyperlinks are indeed plain text. By seemingly natural convention, the text that constitutes a hyperlink is semantically related to the content to which it redirects the user. There is consequently some semantic meaning to be retrieved from the text of a hyperlink. Because of this, one may speculate that the text of a hyperlink is a linguistic constituent; hyperlinks may naturally delimit whole units of syntactic data.

My work will investigate the constituency of the text of hyperlinks and determine whether or not the text that composes a hyperlink is in fact a constituent. I will be mining hyperlinks from bodies of text on social news websites and possibly other online communities with large bodies of text and hyperlinks. With these hyperlinks, I will manually and programmatically determine whether or not they are constituents. If they are indeed constituents, then my findings may provide further evidence for the existence of constituents and support the concept of segmenting linguistic utterances into syntactic constituents.