Legalese as Seen Through the Lens of Corpus Linguistics. An Introduction to Software Tools for Terminological Analysis

María José Marín


In spite of the plethora of possibilities that Corpus Linguistics offers for the study of legal English, corpus-based research on this variety has not been as fruitful as that devoted to other branches of ESP. The present research can be regarded as an introduction to major issues in the design and compilation of a legal corpus, such as the application of appropriate sampling strategies to ensure its representativeness. This study also examines the implementation of Automatic Term Recognition (ATR) methods for the analysis of legal terminology and the automatic deployment of collocate networks. The first section explores the controversial issue of establishing the ideal size for a specialised corpus, applying the type/term ratio to a corpus of judicial decisions, the BLaRC, used as reference. Section 3 describes the assessment of different ATR methods. Out of five methods, Drouin’s (2003) TermoStat is found to be the most efficient for legal term mining and is recommended accordingly. Finally, sections 4 and 5 demonstrate the practicality of collocate networks (Williams, 1998; 2001), which reveal lexico-grammatical patterns that provide plenty of information for the study of legal texts. A case study of the sub-technical legal term party using Lancsbox, designed by Brezina, McEnery & Wattam (2015), is presented in section 5.2, where its general and specialised contexts are examined. Such scrutiny brings to the foreground interesting data such as the relevance of marriages of convenience in a collection of judicial decisions.

Cite as: Marín, JLL 6 (2017), 18–45, DOI: 10.14762/jll.2017.018



Keywords: Legal English, Corpus Linguistics, Terminology, Automatic Term Recognition, Collocate Networks, Lancsbox




Alcaraz Varó, Enrique (1994). El inglés jurídico: textos y documentos. Madrid: Ariel Derecho.

Anthony, Laurence (2014). AntConc (Version 3.4.3) [Computer Software]. Tokyo, Japan: Waseda University.

Baker, Paul (2005). Public Discourses of Gay Men. London: Routledge.

Baker, Paul (2016). The shapes of collocation. International Journal of Corpus Linguistics, 21 (2), 139–164. DOI: 10.1075/ijcl.21.2.01bak.

Bhatia, Vijay (1993). Analysing Genre: Language Use in Professional Settings. London: Longman.

Biber, Douglas (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8 (4), 243–257. DOI: 10.1007/978-0-585-35958-8_20.

Biber, Douglas, Conrad, Susan, & Reppen, Randy (1998). Corpus Linguistics. Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Biel, Łucja & Engberg, Jan (2013). Research models and methods in legal translation. Linguistica Antverpiensia, 12, 1–11.

Borja Albí, Anabel (2000). El texto jurídico en inglés y su traducción. Barcelona: Ariel.

Breeze, Ruth (2015). Teaching the vocabulary of legal documents: a corpus-driven approach. ESP Today, 3 (1), 44–63.

Brezina, Vaclav, McEnery, Tony & Wattam, Stephen (2015). A new perspective on collocation networks. International Journal of Corpus Linguistics, 20 (2), 139–173. DOI: 10.1075/ijcl.20.2.01bre.

British National Corpus (2007). BNC XML Edition version 3, distributed by Oxford University Computing Services on behalf of the BNC Consortium.

Cabré Castellví, María Teresa, Estopà Bagot, Rosa & Vivaldi Palatresi, Jordi (2001). Automatic term detection: a review of current systems. In Bourigault, Jacquemin & L’Homme (Eds.), Recent Advances in Computational Terminology (53–87). Amsterdam: John Benjamins. DOI: 10.1075/nlp.2.04cab.

Chomsky, Noam (1965). Aspects of the Theory of Syntax. Boston: The Massachusetts Institute of Technology (MIT).

Chung, Teresa (2003). A corpus comparison approach for terminology extraction. Terminology, 9 (2), 221–246. DOI: 10.1075/term.9.2.05chu.

Church, Kenneth Ward & Hanks, Patrick (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16 (1), 22–29.

Corpas Pastor, Gloria & Seghiri Domínguez, Míriam (2010). El concepto de representatividad en lingüística de corpus: aproximaciones teóricas y consecuencias para la traducción. Málaga: Servicio de Publicaciones de la Universidad de Málaga.

Cruse, David Alan (1986). Lexical semantics. Cambridge: Cambridge University Press.

Danet, Brenda (1980). Language in the Legal Process. Law and Society Review, 14 (3), 445–564. DOI: 10.2307/3053192.

Drouin, Patrick (2003). Term extraction using non-technical corpora as a point of leverage. Terminology, 9 (1), 99–117. DOI: 10.1075/term.9.1.06dro.

Dudley-Evans, Tony & St John, Maggie Jo (1998). Developments in English for Specific Purposes. Cambridge: Cambridge University Press.

Dunning, Ted (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19 (1), 61–74.

Firth, John Rupert (1957). Papers in Linguistics 1934–1951. London: Oxford University Press.

Flowerdew, Lynne (2004). The argument for using English specialised corpora to understand academic and professional language. In Connor & Upton (Eds.), Discourse in the Professions: Perspectives from Corpus Linguistics. Amsterdam/Philadelphia: John Benjamins. DOI: 10.1075/scl.16.02flo.

Flowerdew, Lynne (2009). Applying corpus linguistics to pedagogy: A critical evaluation. International Journal of Corpus Linguistics, 14 (3), 393–417. DOI: 10.1075/ijcl.14.3.05flo.

Gearey, Adam & Morrison, Wayne (2012). Common Law Reasoning and Institutions. London: University of London.

Goźdź-Roszkowski, Stanisław & Pontrandolfo, Gianluca (2014). Legal phraseology today: corpus-based applications across legal languages and genre. Fachsprache: International Journal of Specialized Communication, 3–4, 130–138.

Gries, Stefan Thomas (2013). 50-something years of work on collocations: What is or should be next. International Journal of Corpus Linguistics, 18 (1), 137–166. DOI: 10.1075/ijcl.18.1.09gri.

Gries, Stefan Thomas & Wulff, Stephanie (Eds.) (2010). Corpus-linguistics applications. Current studies, new directions. Amsterdam/New York: Rodopi.

Heaps, Harold Stanley (1978). Information Retrieval: Computational and Theoretical Aspects. New York: Academic Press.

Kennedy, Graeme (1998). An introduction to corpus linguistics. New York: Longman.

Kilgarriff, Adam, Baisa, Vít, Bušta, Jan, Jakubíček, Miloš, Kovář, Vojtěch, Michelfeit, Jan, Rychlý, Pavel & Suchomel, Vít (2014). The Sketch Engine: Ten Years On. Lexicography, 1, 7–36. DOI: 10.1007/s40607-014-0009-9.

Kit, Chunyu & Liu, Xiaoyue (2008). Measuring mono-word termhood by rank difference via corpus comparison. Terminology, 14 (2), 204–229. DOI: 10.1075/term.14.2.05kit.

Lemay, Chantal, L’Homme, Marie-Claude & Drouin, Patrick (2005). Two Methods for Extracting ‘Specific’ Single-word Terms from Specialised Corpora: Experimentation and Evaluation. International Journal of Corpus Linguistics, 10 (2), 227–255. DOI: 10.1075/ijcl.10.2.05lem.

Maley, Yon (1994). The Language of the Law. In J. Gibbons (Ed.), Language and the Law. London: Longman.

Marín, María José (2014). Evaluation of five single-word term recognition methods on a legal corpus. Corpora, 9 (1), 83–107. DOI: 10.3366/cor.2014.0052.

Marín, María José (2015). Measuring precision in legal term mining: a corpus-based validation of single and multi-word term recognition methods. ESP World, 46, 1–23. Available at www.esp-world.info/Articles_46/MARIN_MEASURING%20PRECISION%20IN%20LTM-AN.pdf.

Marín, María José (2016). Measuring the degree of specialisation of sub-technical legal terms through corpus comparison: a domain-independent method. Terminology, 22 (1), 80–102. DOI: 10.1075/term.22.1.04mar.

Marín, María José & Rea Rizzo, Camino (2012). Structure and design of the BLRC: a legal corpus of judicial decisions from the UK. Journal of English Studies, 10, 131–145. DOI: 10.18172/jes.184.

Maynard, Diana & Ananiadou, Sophia (2000). TRUCKS: A model for automatic multi-word term recognition. Journal of Natural Language Processing, 8 (1), 101–125. DOI: 10.5715/jnlp.8.101.

McEnery, Tony (2006). Swearing in English: Bad Language, Purity and Power from 1586 to the Present. Abingdon, UK: Routledge.

McEnery, Tony & Wilson, Andrew (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.

McEnery, Tony, Xiao, Richard & Tono, Yukio (2006). Corpus-based language studies: an advanced resource book. New York: Routledge.

Mellinkoff, David (1963). The Language of the Law. Boston: Little, Brown & Co.

Nesi, Hilary & Gardner, Sheena (2012). Genres across the disciplines: Student writing in higher education. Cambridge: Cambridge University Press.

Orts Llopis, María Ángeles (2006). Aproximación al discurso jurídico en inglés: las pólizas de seguro marítimo de Lloyd’s. Madrid: Edisofer.

Orts Llopis, María Ángeles (2009). Legal genres in English and Spanish: some attempts of analysis. Ibérica, 18, 109–130.

Partington, Adam (1998). Patterns and Meanings. Using Corpora for English Language Research and Teaching. Amsterdam: John Benjamins.

Pazienza, Maria Teresa, Pennacchiotti, Marco & Zanzotto, Fabio Massimo (2005). Terminology extraction: An Analysis of Linguistic and Statistical Approaches. Studies in Fuzziness and Soft Computing, 185, 225–279. DOI: 10.1007/3-540-32394-5_20.

Pearson, Jennifer (1998). Terms in Context. Amsterdam: John Benjamins.

Sánchez, Aquilino & Cantos Gómez, Pascual (1997). Predictability of Word Forms (Types), and Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the CUMBRE Corpus. International Journal of Corpus Linguistics, 2 (2), 251–272. DOI: 10.1075/ijcl.2.2.06san.

Scott, Mike (2008). WordSmith Tools version 5. Liverpool: Lexical Analysis Software. Available at www.

Sinclair, John (1991). Corpus, Concordance and Collocation. Oxford: Oxford University Press.

Sinclair, John (2005). Corpus and Text: Basic Principles. In Wynne 2005 (see below). Available at uk/documents/creating/dlc/chapter1.htm.

Sinclair, Stéfan, Rockwell, Geoffrey & the Voyant Tools team (2012). Voyant Tools [Computer software].

Sparck Jones, Karen (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11–21. DOI: 10.1108/eb026526.

Sternfeld, Joshua (2012). Pedagogical Principles of Digital Historiography. In Hirsch (Ed.), Digital Humanities Pedagogy. London: Open Book Publishers.

Stubbs, Michael (2001). Words and Phrases. London: Blackwell.

Tiersma, Peter (1999). Legal Language. Chicago: The University of Chicago Press.

Tognini-Bonelli, Elena (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins.

Vivaldi, Jorge, Cabrera-Diego, Luis Adrián, Sierra, Gerardo & Pozzi, María (2012). Using Wikipedia to Validate the Terminology Found in a Corpus of Basic Textbooks. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC12). Istanbul, Turkey.

Widdowson, Henry (2000). The limitations of linguistics applied. Applied Linguistics, 21 (1), 3–25. DOI: 10.1093/applin/21.1.3.

Williams, Geoffrey (1998). Collocational Networks: Interlocking Patterns of Lexis in a Corpus of Plant Biology Research Articles. International Journal of Corpus Linguistics, 3 (1), 151–171. DOI: 10.1075/ijcl.3.1.07wil.

Williams, Geoffrey (2001). Mediating between lexis and texts: collocational networks in specialised corpora. ASp, la revue du GERAS, 31, 63–76.

Wynne, Martin (Ed.) (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books.




Copyright (c) 2017 María José Marín

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.