Corpus use and translating

Kim Lacroix
(Language Update, Volume 9, Number 4, 2013, page 8)

The knowledge of how to compile and use corpora is an essential part of modern translational competence….
Krista Varantola (2003)

In order to translate effectively, you need a good grasp of not only your target language, but also your source language. And one of the most effective tools for getting to know a language—its quirks and traits, tricky turns of phrase, idiomatic expressions and collocations—is a corpus.

Corpora (or "corpuses," if you prefer) for use as translation resources have been around for a long time. Linguists use them to study language patterns and change, and most modern translators use corpora daily when translating—but they may not even be aware of doing it!

What is a corpus, exactly? A corpus is a collection of documents that have been compiled for a specific use. Today, these documents are mainly in electronic form, and we use programs called concordancers to investigate their contents more easily. A concordancer retrieves all the occurrences of a particular search pattern in their immediate contexts and displays them in an easy-to-read format. Corpora (and concordancers) can be unilingual or bilingual, and I’ll discuss how each of these types of corpora can be useful for professional translators.
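To make the idea of a concordancer concrete, here is a minimal sketch in Python of a keyword-in-context display. It is only an illustration, not how any particular concordancer is built: the function name `concordance`, the `width` parameter and the sample sentences are all invented for this example.

```python
import re


def concordance(text, pattern, width=30):
    """Return one keyword-in-context line per match of `pattern` in `text`.

    `width` (an assumed parameter) is the number of characters of context
    shown on each side of the match.
    """
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start, end = match.span()
        # Take the characters just before and just after the match.
        left = text[max(0, start - width):start].replace("\n", " ")
        right = text[end:end + width].replace("\n", " ")
        # Right-align the left context so the matches line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text standing in for a real corpus.
sample = (
    "The committee will take into account the comments received. "
    "Please take into account that delays are possible. "
    "We took the figures into account before deciding."
)

for line in concordance(sample, r"\binto account\b"):
    print(line)
```

Each printed line shows one occurrence of the search pattern with a little text on either side, which is the "easy-to-read format" described above.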

Bilingual corpora

Most (if not all) translation firms keep archives of their completed translations. Compiled, these archived documents form bilingual corpora (also called parallel corpora) that translators can use for reference, or to see how something was translated in the past, for example. For more effective searching, bilingual concordancers align the pairs of English and French documents section by section. Using a concordancer means that when search results are presented, you can see the corresponding translated section immediately.
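As a rough sketch of what an aligned search does, the Python snippet below stores each English segment next to its French counterpart and returns both when the English side contains the search term. It assumes the alignment has already been done; the name `bilingual_search` and the sample segment pairs are hypothetical and not taken from any real archive or tool.

```python
def bilingual_search(aligned_pairs, term):
    """Return the (English, French) pairs whose English segment contains `term`.

    `aligned_pairs` is assumed to be a list of already-aligned segment pairs.
    The match is a simple case-insensitive substring test.
    """
    term_lower = term.lower()
    return [(en, fr) for en, fr in aligned_pairs if term_lower in en.lower()]


# Invented, hand-aligned segments standing in for an archive of translations.
aligned_pairs = [
    ("It is time to take stock of the situation.",
     "Il est temps de faire le point sur la situation."),
    ("We should take stock before moving on.",
     "Nous devrions faire le point avant de continuer."),
    ("The meeting is adjourned.",
     "La séance est levée."),
]

for english, french in bilingual_search(aligned_pairs, "take stock"):
    print(english)
    print("    " + french)
```

Real bilingual concordancers do far more (automatic alignment, pattern searches, searching in either language), but the principle of showing the matching segment together with its translation is the same.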

In addition to local corpora that you compile with your own translations, you can find online corpora powered by bilingual concordancers that also give you aligned results. These are corpora that have been compiled by humans, usually with public documents, and made freely available for use by anyone. Knowing that the content of the corpus is monitored by humans means that the results are usually reliable examples of English and French usage and can inspire your own translations.

Unilingual corpora

It’s easy enough to see how pairs of translated documents could be useful search tools for translators. But what about unilingual corpora? Translators can—and should—use unilingual corpora to investigate how language works and how it is used.

The simplest unilingual corpus-searching tool may just be Google. How many translators use a Web search engine like Google to look up an expression that they don’t understand in the text they are translating, to check whether an expression is common in a specific subject field, or to find collocations for a term or expression? When you perform any of these searches, you are using the Web as a corpus. There are many advantages to using Google this way, but also some risks.

The main advantages of using the Web as a corpus of documents are (a) the sheer size of the corpus (enormous!) and (b) the speed at which Google delivers search results. When you want to quickly check whether a term or expression exists (somewhere, anywhere), then Google is a good tool. It’s also a good way to look up new expressions or terms that have just appeared in the language. As for frequency counts, you can get a rough idea of an expression’s frequency with Google, but you can’t rely blindly on the "number of hits" that the search engine provides. Why? Because the number of hits that Google provides on the first page of its search results is actually just an estimate. Sometimes, if you refresh the page, or click through to the second, fifth or twentieth page of results, you’ll see that the "number of hits" has changed; Google has revised its estimate. You may also notice that all the results on the page are from the same site, or from identical pages that have been copied from one site to another. So although the number can give you a very general idea of a term’s frequency, it’s not as reliable as the number of hits provided by a concordancer.

The other disadvantage of using Google is that you don’t know exactly what is in the corpus. It’s easy for anyone to put a Web page online, and Google indexes all sorts of pages, not only serious websites with well-written documents, but also personal blogs and sites, shopping sites, spam pages, etc. The quality of the language used on those websites isn’t necessarily reliable. Many pages are also written by non-native speakers and may contain some non-idiomatic usage. Try it out yourself: type "les de" in Google, with quotation marks. Based on the number of hits you get, can you conclude that "les de" is a common, correct expression in French?

With a validated French corpus—that is, a unilingual corpus containing documents that were written by native speakers of French—you can reliably find out how an expression is used or what it means. Unilingual source-language corpora can provide both linguistic and encyclopedic information about terms and expressions that we are asked to translate. Sometimes dictionary definitions aren’t enough! Looking at different contexts in which these terms and expressions are used can certainly shine a light on their meaning. You can also see collocations that you may not have noticed, get a better idea of the level of language of an expression, or see what subject field a term or expression is used in, which can help orient your research.

It stands to reason that a unilingual English corpus can be useful as well. Using a target-language corpus can help you find collocations for different terms or expressions (and write more idiomatically); determine which expressions are more commonly used (because unless you’re producing a literary translation, you should use common expressions in your translation rather than obscure ones); identify a calqued structure in your texts (if you can’t find it used in your corpus, maybe it’s not very idiomatic!); and establish the "ordinary" meaning of a term or expression (that is, how it is currently used as opposed to how dictionaries define it).

Translators are language professionals who need to know how language works in order to use it effectively and produce accurate, authentic-sounding translations. Bilingual and unilingual corpora are part of the modern translator’s arsenal of tools, just like dictionaries and terminology databases. Aren’t we lucky to have all of these tools at our disposal?

Examples of online unilingual corpora

The Corpus of Contemporary American English (Brigham Young University)
British National Corpus
Lexiqum (University of Montréal)
Corpus de français parlé au Québec (University of Sherbrooke)
Corpus français (Leipzig University)

Copyright notice for Favourite Articles

© His Majesty the King in Right of Canada, represented by the Minister of Public Services and Procurement
A tool created and made available online by the Translation Bureau, Public Services and Procurement Canada
