Constituency in Hyperlinks

Hi everyone! My name is Anton! I’m a rising junior at MIT in the Department of Linguistics. This blog collects notes and logs from my UROP project in the department. Part of my project proposal appears below:

The internet has proven to be an incredible medium of communication. Like any such medium, be it text, audio, or video, it holds an extraordinary corpus of linguistic data, primarily in the form of text, ranging from captions on pictures to weblog entries. In 2008, software engineers at Google reported that there were at least one trillion unique URLs on the web (Alpert, 2008). With even modest amounts of text on each page, these webpages together provide an enormous body of linguistic utterances for linguists to study.

Of course, much of this linguistic data is no different from data recorded in elicitations with language consultants or from the everyday speech of the masses. However, the web has certain features that speech lacks. One of them is the hyperlink. A hyperlink is any visual element that redirects the user to another location presenting additional content. Though hyperlinks are not limited to plain text, the vast majority of active hyperlinks are indeed plain text. By seemingly natural convention, the text that constitutes a hyperlink is semantically related to the content to which it redirects the user. There is consequently some meaning to be retrieved from the text of a hyperlink. Because of this, one may speculate that the text of a hyperlink is a linguistic constituent; hyperlinks may naturally delimit whole syntactic units.

My work will investigate whether the text that composes a hyperlink is in fact a constituent. I will mine hyperlinks from social news websites and possibly other online communities with large bodies of text and hyperlinks. I will then determine, both manually and programmatically, whether these hyperlinks are constituents. If they are, my findings may provide further evidence for the existence of constituents and support the practice of segmenting linguistic utterances into syntactic constituents.
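To make the programmatic side of this plan a little more concrete, here is a rough sketch in Python of what the two steps might look like: pulling the text out of each hyperlink on a page, and checking whether that text lines up with a node in a syntactic parse of the sentence. This is only an illustration, not the project's actual code. The names LinkTextExtractor, constituent_spans, and is_constituent are made up for this example, the sample HTML and the bracketed parse are written by hand, and the sketch assumes that a parse of each sentence is already available (from a parser or from manual annotation) and that the words are separated by whitespace.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # href of the link currently open, if any
        self._chunks = []    # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def constituent_spans(bracketed):
    """Return (words, spans) for a bracketed parse such as "(NP (DT a) (NN dog))".

    spans holds one (start, end) word-index pair per node in the tree.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    words, spans, open_nodes = [], set(), []
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            open_nodes.append(len(words))  # this node starts at the next word
            i += 2                         # skip "(" and the node label
        elif tokens[i] == ")":
            spans.add((open_nodes.pop(), len(words)))
            i += 1
        else:
            words.append(tokens[i])
            i += 1
    return words, spans


def is_constituent(link_text, bracketed):
    """True if link_text matches exactly the words under some node of the parse."""
    words, spans = constituent_spans(bracketed)
    target = link_text.split()
    n = len(target)
    if n == 0:
        return False
    return any(
        words[start:start + n] == target and (start, start + n) in spans
        for start in range(len(words) - n + 1)
    )


# Hand-written sample page and parse, purely for illustration.
html = '<p>The engineer wrote <a href="https://example.com/post">a blog post</a>.</p>'
parse = "(S (NP (DT The) (NN engineer)) (VP (VBD wrote) (NP (DT a) (NN blog) (NN post))))"

extractor = LinkTextExtractor()
extractor.feed(html)
for href, text in extractor.links:
    print(text, "->", is_constituent(text, parse))
```

In this toy sentence the link text "a blog post" is the object noun phrase, so the check succeeds. A link covering only "wrote a" would fail, because no single node in the parse spans exactly those two words. Real data would also need punctuation handling and a decision about links that cross sentence boundaries, neither of which this sketch attempts.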