Session Title: Corpus Linguistics Panel
John Newman
Professor
Department of Linguistics
4-32 Assiniboia Hall
University of Alberta
Edmonton, Alberta
T6G 2E7
john.newman@ualberta.ca
The availability of electronic corpora has led to new and exciting developments in
theoretical and descriptive linguistics. This panel will present current
work illustrating corpus-based methodologies in language research.
Title: Electronic Corpora and Linguistics
John Newman
Professor
Department of Linguistics
4-32 Assiniboia Hall
University of Alberta
Edmonton, Alberta
T6G 2E7
john.newman@ualberta.ca
I briefly review the impact that the
widespread availability of electronic texts, compiled into corpora, has had on
the field of linguistics. Electronic corpora and the array of tools available to
search them have facilitated an empirical turn in contemporary linguistics, not
just creating new methodologies for the study of language but, to some extent,
also defining a new object of study. I report on my own research, which explores
some ways in which the study of English corpora can reveal tendencies, or
probabilities, which mirror "categorical" phenomena in other languages. In
particular, tendencies observable in the nature of coordinated verbal structures
(/come and play/, /go and tell/, etc.) show patterns which are reminiscent of
categorical facts in other languages. In this way, the corpus-based study of
English usage reveals patterns which might appear unusual or exotic in grammars
of other languages.
Title: Computational Methods for Corpus-based Semantic Analysis
Suzanne Stevenson
Associate Professor
Department of Computer Science
6 King's College Road
Pratt Building, Room PT 290F
University of Toronto
Toronto, Ontario
M5S 3H5
suzanne[at]cs[dot]toronto[dot]edu
Electronic text is invaluable in the study of language, as a source of data on the distribution of
words and constructions. Much of this work is focused on "corpus counts",
raising issues of exactly what to count and how to analyze the counts. These
questions become increasingly important when we try to go beyond shallow
analysis, by using corpora to determine semantic properties which are not
directly represented in text. In this talk, I'll illustrate some recent
approaches to these issues drawing on our work on multiword predicates, as in
/give a groan/ or /take a jog/. By relating their underlying linguistic
properties to detectable patterns of usage, we have devised automatic methods
for determining various semantic properties of these constructions. This type of
research shows the importance of bringing together linguistic theory with
computational techniques to extract knowledge about the meaning of linguistic
constructs from text.
Title: Exploiting Large Corpora in Variationist Sociolinguistics
Gerard Van Herk
University of Ottawa
gvanherk@uottawa.ca
The corpus requirements of variationist sociolinguistics are driven by two often
competing forces. On the one hand, multivariate linguistic analysis requires
large, searchable data sets; on the other, the focus of sociolinguistics on
non-standard and/or vernacular language discourages the use of existing
electronic (written) texts. This talk will describe the corpora that we have
built through transcribing and concordancing large collections of
sociolinguistic interviews (Ottawa-Hull French, Quebec English, African American
English), as well as adapting historical sources (folktales, letters) to
sociolinguistic purposes. These corpora are flexible enough to have permitted a
quarter century of study of 42 different linguistic variables in French and
English. In addition, they permit exploratory studies without the high front-end
time costs associated with much sociolinguistic work. I illustrate the utility
of such corpora with data drawn from my studies of variation in question
formation, negation, and the marking of past, perfect, and present tense verbs.