From Hwæt to Whattup: A Bibliography of the History of English (July 2014): Corpora and Databases

by Edwin Battistella

Corpora and Databases

Computer technology and corpus linguistics (the computer analysis of texts) have become increasingly central to English historical linguistics.  A variety of corpora and tools are available; these reveal the fine-grained, variable, and incremental nature of language change, and help to identify when changes begin and end and how they vary socially and stylistically.  Corpus Linguistics by Tony McEnery and Andrew Wilson gives a book-length overview and history of corpus linguistics and places it in the context of historical research and functionalist linguistic theory.  Two edited volumes discussed earlier in this essay, The Handbook of the History of English and The Oxford History of English, both include lists of corpora, and Momma and Matto’s edited volume A Companion to the History of the English Language (also mentioned previously) includes Anne Curzan’s fine essay on using corpora, “Corpus-Based Linguistic Approaches to the History of English.”

A key electronic resource for historical research is the Helsinki Corpus of English Texts, compiled by Matti Rissanen et al. and completed in 1992.  The 1.5-million-word corpus contains a historical part with texts from 750 to 1750 and a dialect part based on transcripts of interviews recorded in the 1970s with speakers of British rural dialects.  Early English in the Computer Age: Explorations through the Helsinki Corpus, ed. by Matti Rissanen, Merja Kytö, and Minna Palander-Collin, discusses the principles of compilation and offers a number of pilot studies illustrating the use of the material.  Coauthored with R. D. Fulk, volume two of Hogg’s A Grammar of Old English (mentioned above) draws on data from the Helsinki Corpus to provide an in-depth analysis of Old English word structure.  Christiane Dalton-Puffer’s The French Influence on Middle English Morphology likewise uses the corpus in analyzing Middle English morphology and French-derived suffixes.

ARCHER: A Representative Corpus of Historical English Registers, a corpus developed by Douglas Biber and Edward Finegan in the 1990s, includes multigenre material from 1650 to 1990 and complements the Helsinki Corpus.  Some of the findings from the ARCHER corpus are reported in Corpus Linguistics: Investigating Language Structure and Use, by Biber, Susan Conrad, and Randi Reppen.  For historical research on American English, The Corpus of Historical American English (COHA), the work of Mark Davies, is an excellent free online resource.  A collection of 400 million words covering the period from 1810 to 2009, COHA allows researchers to track the frequency of words and phrases and to identify those that have become more frequent (like freak out and guys) or less frequent (like beauteous and fellow) over time.  COHA can also track parts of words (morphemes) so that researchers can follow the ups and downs of phrasal verbs, prefixes and suffixes, and compounds.  Davies also maintains The Corpus of Contemporary American English (COCA), which includes 450 million words from 1990 to 2012.  Anthony Kroch and his collaborators at the University of Pennsylvania have contributed the Penn-Helsinki Parsed Corpus of Middle English and the Penn-Helsinki Parsed Corpus of Early Modern English (now available together on CD-ROM as Penn-Helsinki Parsed Corpora of Historical English).  In these parsed historical corpora, the syntactic annotation allows searching not only for words and word sequences but also for syntactic structure.

Never-Ending History

Studying the history of English allows one to reflect on one’s origins and on the myths and ideology surrounding language.  English historical linguistics also allows one to interrogate the details and theories of that history and of language change itself, from the Celtic substratum to the Great Vowel Shift to the emergence of the passive voice.  English historical linguistics includes questions of standardization, diversity, social class, and global trajectory: Whose English is studied and accepted?  How do varieties influence one another?  Questions, debates, and plenty of mysteries remain, and the tools of sociolinguistics and corpus linguistics complement traditional philology, dialectology, and linguistic analysis to enable continual refinements.  The history of the English language is an inexhaustible subject.

