Session Title: Corpus Linguistics Panel


John Newman
Professor

Department of Linguistics

4-32 Assiniboia Hall
University of Alberta
Edmonton, Alberta
T6G 2E7

john.newman@ualberta.ca

The availability of electronic corpora has led to new and exciting developments in theoretical and descriptive linguistics. This panel will present current work illustrating corpus-based methodologies in language research.

Title: Electronic Corpora and Linguistics

John Newman

john.newman@ualberta.ca

I briefly review the impact that the widespread availability of electronic texts, compiled into corpora, has had on the field of linguistics. Electronic corpora and the array of tools available to search them have facilitated an empirical turn in contemporary linguistics, not just creating new methodologies for the study of language but, to some extent, also defining a new object of study. I report on my own research, which explores some ways in which the study of English corpora can reveal tendencies, or probabilities, that mirror "categorical" phenomena in other languages. In particular, tendencies observable in coordinated verbal structures (/come and play/, /go and tell/, etc.) show patterns reminiscent of categorical facts in other languages. In this way, the corpus-based study of English usage reveals patterns which might appear unusual or exotic in grammars of other languages.

Title: Computational Methods for Corpus-based Semantic Analysis

Suzanne Stevenson
Associate Professor

Department of Computer Science

6 King's College Road
Pratt Building, Room PT 290F
University of Toronto
Toronto, Ontario
M5S 3H5

suzanne[at]cs[dot]toronto[dot]edu

Electronic text is invaluable in the study of language as a source of data on the distribution of words and constructions. Much of this work is focused on "corpus counts", raising issues of exactly what to count and how to analyze the counts. These questions become increasingly important when we try to go beyond shallow analysis by using corpora to determine semantic properties which are not directly represented in text. In this talk, I'll illustrate some recent approaches to these issues, drawing on our work on multiword predicates, as in /give a groan/ or /take a jog/. By relating their underlying linguistic properties to detectable patterns of usage, we have devised automatic methods for determining various semantic properties of these constructions. This type of research shows the importance of combining linguistic theory with computational techniques to extract knowledge about the meaning of linguistic constructs from text.


Title: Exploiting Large Corpora in Variationist Sociolinguistics

Gerard Van Herk

University of Ottawa

gvanherk@uottawa.ca


The corpus requirements of variationist sociolinguistics are driven by two often competing forces. On the one hand, multivariate linguistic analysis requires large, searchable data sets; on the other, the focus of sociolinguistics on non-standard and/or vernacular language discourages the use of existing electronic (written) texts. This talk will describe the corpora that we have built by transcribing and concordancing large collections of sociolinguistic interviews (Ottawa-Hull French, Quebec English, African American English), and by adapting historical sources (folktales, letters) to sociolinguistic purposes. These corpora are flexible enough to have permitted a quarter-century of study of 42 different linguistic variables in French and English. In addition, they permit exploratory studies without the high front-end time costs associated with much sociolinguistic work. I illustrate the utility of such corpora with data drawn from my studies of variation in question formation, negation, and the marking of past, perfect, and present tense verbs.