LibGuides: From Hwæt to Whattup: A Bibliography of the History of English (July 2014): Corpora and Databases

Corpora and Databases

Computer technology and corpus linguistics (the computer analysis of texts) have become increasingly central to English historical linguistics. A variety of corpora and tools are available; these reveal the fine-grained, variable, and incremental nature of language change, and help to identify when changes begin and end and how they vary socially and stylistically. Corpus Linguistics by Tony McEnery and Andrew Wilson gives a book-length overview and history of corpus linguistics and places it in the context of historical research and functionalist linguistic theory. Two edited volumes discussed earlier in this essay, The Handbook of the History of English and The Oxford History of English, both include lists of corpora, and Momma and Matto’s edited volume A Companion to the History of the English Language (also mentioned previously) includes Anne Curzan’s fine essay on using corpora, “Corpus-Based Linguistic Approaches to the History of English.”

A key electronic resource for historical research is the Helsinki Corpus of English Texts, compiled by Matti Rissanen et al. and completed in 1992. The 1.5-million-word corpus contains a historical part with texts from 750 to 1750 and a dialect part based on transcriptions of interviews recorded in the 1970s with speakers of British rural dialects. Early English in the Computer Age: Explorations through the Helsinki Corpus, ed. by Matti Rissanen, Merja Kytö, and Minna Palander-Collin, discusses the principles of compilation and offers a number of pilot studies illustrating the use of the material. Volume two of Hogg’s A Grammar of Old English (mentioned above), coauthored with R. D. Fulk, draws on data from the Helsinki Corpus to provide an in-depth analysis of Old English word structure. Christiane Dalton-Puffer’s The French Influence on Middle English Morphology likewise uses the corpus in analyzing Middle English morphology and French-derived suffixes.

ARCHER: A Representative Corpus of Historical English Registers, a corpus developed by Douglas Biber and Edward Finegan in the 1990s, includes multigenre material from 1650 to 1990 and complements the Helsinki Corpus. Some of the findings from the ARCHER corpus are reported in Corpus Linguistics: Investigating Language Structure and Use, by Biber, Susan Conrad, and Randi Reppen. For historical research on American English, The Corpus of Historical American English (COHA), the work of Mark Davies, is an excellent free online resource. A collection of 400 million words covering the period from 1810 to 2009, COHA allows researchers to track the frequency of words and phrases and to identify words that have increased in use (like “freak out” and “guys”) or decreased in use (like “beauteous” and “fellow”) over time. COHA can also track parts of words (morphemes) so that researchers can follow the ups and downs of phrasal verbs, prefixes and suffixes, and compounds. Davies also maintains The Corpus of Contemporary American English (COCA), which includes 450 million words from 1990 to 2012. Anthony Kroch and his collaborators at the University of Pennsylvania have contributed the Penn-Helsinki Parsed Corpus of Middle English and the Penn-Helsinki Parsed Corpus of Early Modern English, a set of parsed historical corpora (now available together on CD-ROM as Penn-Helsinki Parsed Corpora of Historical English) in which the syntactic annotation allows searching not only for words and word sequences but also for syntactic structure.
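For readers new to corpus methods, the kind of frequency tracking described above can be illustrated with a minimal Python sketch. It is not COHA's interface or Davies's method: the sample sentences are invented for illustration, and the function names per_million_by_decade and trend are hypothetical. The sketch counts a word or phrase in each decade, normalizes the count per million tokens so that decades of different sizes can be compared, and reports whether use rose or fell between the first and last decade.

```python
from collections import Counter

# Hypothetical toy "corpus": (decade, text) pairs invented for illustration.
# These sentences are NOT drawn from COHA or any other real corpus.
SAMPLES = [
    (1850, "the beauteous maid greeted the fellow"),
    (1850, "a beauteous morning for every fellow"),
    (1990, "you guys should not freak out"),
    (1990, "the guys met a fellow traveler"),
]


def per_million_by_decade(samples, phrase):
    """Return {decade: occurrences of `phrase` per million tokens}."""
    target = phrase.lower().split()
    n = len(target)
    hits = Counter()    # raw matches per decade
    totals = Counter()  # total tokens per decade
    for decade, text in samples:
        tokens = text.lower().split()
        totals[decade] += len(tokens)
        # Slide a window of n tokens across the text and count exact matches.
        hits[decade] += sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
        )
    return {d: hits[d] / totals[d] * 1_000_000 for d in sorted(totals)}


def trend(freqs):
    """Compare the earliest and latest decade in a frequency table."""
    decades = sorted(freqs)
    first, last = freqs[decades[0]], freqs[decades[-1]]
    if last > first:
        return "increased"
    if last < first:
        return "decreased"
    return "unchanged"


if __name__ == "__main__":
    for item in ["freak out", "guys", "beauteous", "fellow"]:
        print(item, trend(per_million_by_decade(SAMPLES, item)))
```

Normalizing per million words matters because historical corpora rarely contain the same amount of text for every period; raw counts alone would overstate change in the better-documented decades.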

Never-Ending History

Studying the history of English allows one to reflect on one’s origins and on the myths and ideology surrounding language. English historical linguistics also allows one to interrogate the details and theories of that history and of language change itself, from the Celtic substratum to the Great Vowel Shift to the emergence of the passive voice. English historical linguistics includes questions of standardization, diversity, social class, and global trajectory: Whose English is studied and accepted? How do varieties influence one another? Questions, debates, and plenty of mysteries remain, and the tools of sociolinguistics and corpus linguistics complement traditional philology, dialectology, and linguistic analysis to enable continual refinements. The history of the English language is an inexhaustible subject.

Works Cited

Corpus Linguistics by Tony McEnery and Andrew Wilson
ISBN: 9780748604821
Publication Date: 1996

The Handbook of the History of English by Ans van Kemenade and Bettelou Los (editors)
ISBN: 9780631233442
Publication Date: 2006

The Oxford History of English by Lynda Mugglestone (editor)
ISBN: 9780199249312
Publication Date: 2006

A Companion to the History of the English Language by Haruko Momma and Michael Matto (editors)
ISBN: 9781405129923
Publication Date: 2008

Helsinki Corpus of English Texts by Matti Rissanen et al.

Early English in the Computer Age by Matti Rissanen, Merja Kytö, and Minna Palander-Collin (editors)
ISBN: 9783110137392
Publication Date: 1993

A Grammar of Old English by Richard M. Hogg
ISBN: 9781444396218
Publication Date: 1992/2011

The French Influence on Middle English Morphology by Christiane Dalton-Puffer
ISBN: 9783110149906
Publication Date: 1996

ARCHER: A Representative Corpus of Historical English Registers, first constructed by Douglas Biber and Edward Finegan

Corpus Linguistics by Douglas Biber, Susan Conrad, and Randi Reppen
ISBN: 9780521499576
Publication Date: 1998

The Corpus of Historical American English: 400 Million Words, 1810-2009 by Mark Davies

The Corpus of Contemporary American English (COCA): 450 million words, 1990-2012 by Mark Davies

Penn-Helsinki Parsed Corpus of Middle English (PPCME2) by Anthony Kroch and Ann Taylor (editors)
Publication Date: 2010

Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) by Anthony Kroch, Beatrice Santorini, and Ariel Diertani (editors)
Publication Date: 2010