C is for (COCA) Corpus

12 05 2013

A recent ELT Chat on using corpora suggested that what might be helpful for teachers coming to online corpora (such as COCA- the Corpus of Contemporary American English) for the first time would be some kind of screencast or video, demonstrating some of the more useful functions. Since I’ve just been doing a unit on using corpora with my MA class, I put together this screencast, using Zoom and uploaded on to YouTube:

The steps that I demonstrate are summarised here:

summary of COCA searches

Actions

Information

Date : May 12, 2013
Categories : English language teaching

37 responses

12 05 2013: eflnotes (09:02:06) :

nice screencast you got a jump on Phil Longwell who is going to do some, you’ve set the benchmark 🙂

i’d recommend Ken Lackman’s Classroom Games from Corpora http://www.kenlackman.com/files/CorporaGamesBook103.pdf since it shows some nice COCA searches used for games focusing on both form and meaning.

gotta blag my quick cup of coca posts http://eflnotes.wordpress.com/tag/cupofcoca/

one typo in video KWIC – keyword in context

ta
mura

Reply
12 05 2013: Laura Adele (09:50:22) :

Hadn’t seen this link before I posted. Thanks for the resources 🙂

Reply
12 05 2013: eflnotes (15:04:37) :

yr welcome 🙂

Reply
12 05 2013: Scott Thornbury (13:23:13) :

Thanks, Mura – your blog is one I direct my students to, so your endorsement means a lot. I’m so embarrassed, though, about getting KWIC wrong – see my comment to Jeremy below. 😦

Reply
12 05 2013: eflnotes (15:03:35) :

wow thanks that’s a massive endoresment! the slip-up is minor compared to quality of screencast, loved the recap list at the end of video.
ta
mura

Reply
13 05 2013: Scott Thornbury (09:55:09) :

Thanks, Mura. Incidentally, one of the beauties of Present.Me (I’m starting to sound like I invented the thing!) is that you can re-record your presentation without having to change the slideshow – which I shall eventually do, so if anyone spots any other gaffes, please let me know!
12 05 2013: Laura Adele (09:47:50) :

Thanks a lot for making this video, Scott! I still have some screencasts you shared with us (using Jing) on how to use COCA, but this present.me version is an extremely useful short tutorial 🙂 Good to review the compare function, I don’t use it as often as I could. Phras.in used to be my tool to compare words or expressions, but it’s not working anymore. No idea why.

Do you know of any website or resource for ideas on how to use corpus tools with students?

Reply
12 05 2013: Emilia Siravo (10:53:16) :

Dear Scott,

Thanks for sharing this video on how to use corpus (COCA). Since being introduced to corpus data last summer in your class, I often refer to the database to ensure what I teach is not prescriptive but rather reflective of ‘real language in use.’ Most of my students (especially in more advanced classes) are well aware of corpus – in fact just the other day when talking about the grammar of reported language one student ask me, “What does corpus say?”. This, of course, was music to my ears.

While my students have bought into the idea of referring to corpus data especially for figuring out typical collocations, almost all are reluctant to use the COCA database itself. Many students say it’s too cumbersome for their purposes (quick searches on commonly used collocations and use of grammar). As a result, they leave most of the querying work to me.

While trying to find easier ways to query data, I’ve recently discovered Google’s ngrams corpus database. In his article for the Atlantic, Ben Zimmer highlights some of the key features in Google’s ngrams including: database size – (half of trillion words in English), ability to query by parts of speech, availability in 8 languages and ability to compare AE and BE language usage. Zimmer also notes some of ngrams drawbacks including its lack of the lemma function (which is available in COCA).

To read Zimmer’s article go to: http://www.theatlantic.com/technology/archive/2012/10/bigger-better-google-ngrams-brace-yourself-for-the-power-of-grammar/263487/

For more information on how to use ngrams go to: http://books.google.com/ngrams/info and for the website itself go to: http://books.google.com/ngrams/

Although querying through these databases has been daunting for me (as I constantly wonder if I am interpreting the data correctly, if the data sources and result are really valid, etc) I think and hope that in the next few years, our language material, course books and teaching will be strongly influenced by ‘big data’ that is based on authentic sources rather than simply prescriptive rules.

emilia

Reply
12 05 2013: Scott Thornbury (13:36:33) :

Thanks for the comment, Emilia. And to Laura, too – both of you having had to endure my ‘introducing corpora’ sessions on the MA. My list of bookmarked corpus websites has grown steadily longer – just keeping up with Mark Davies at BYU (site of COCA) is a full-time job! My current favorite is StringNet at http://nav.stringnet.org/ although it has a more limited functionality than COCA – but it’s accompanying blog is well worth checking out. I also like ForbetterEnglish at http://forbetterenglish.com/ which is very user friendly – and learner friendly in that its example sentences have been selected on the basis of intelligibility.

As for the Google Ngram site – yes this is really amazing, and great for comparing the history of particular words or phrases (check out ‘gay’, ‘queer’ for example, or ‘beatnik’ vs ‘hippie’) – but it is a little limited in that it draws only on written texts that exist in book form.

Reply
14 05 2013: wible45 (10:36:58) :

Thanks Scott for the mention of StringNet, http://nav.stringnet.org/. The context of your mentioning us above has inspired me to write an entry for the StringNet blog on what distinguishes corpora and concordancing on the one hand from StringNet as we have tried to design it on the other. Here’s part of it, in case it might help readers find a better match between tools and tasks.

Some basic distinctions influence what we can get out of corpora, yet these distinctions seem to fly under the radar. Some that we’ve tried to keep in mind in designing StringNet are: (1) token vs type; (2) syntagmatic vs paradigmatic patterns of word behavior; and (3) finding what I am looking for in a corpus vs discovering what I wouldn’t have thought to look for. We’ve published some work elaborating on these (see a reference below), but here’s a thumbnail of just one of them–-tokens vs types.

Corpora are collections of tokens, that is, of instances. Saussure’s ‘parole’. What puts off many learners is that KWIC searches yield tokens and tokens are of little interest in and of themselves. Tokens are interesting to corpus users mainly as windows onto what they betoken; and what they betoken is types, that is, patterns–in this case, patterns of word use. StringNet has tried to distill the recurrent patterns of word use. So rather than listing tokens, a response to a StringNet query is a list of types, of the patterns in which the query word participates. Clicking on one of those listed patterns, in turn, yields tokens of those patterns used in sentences (would that be PIC? patterns in context?).

By the way, click on everything you see in the StringNet query results. The clickability is intended to encourage exploration through the net that is StringNet.

In the following chapter we say more on all this stuff:

David Wible and Nai-Lung Tsao (2011).“Towards a New Generation of Corpus-derived Lexical Resources for Language Learning,”in Meunier F., De Cock S., Gilquin G. and Paquot M. (eds). A Taste for Corpora. Amsterdam: John Benjamins.

Reply
14 05 2013: Scott Thornbury (12:25:12) :

Thanks so much for this, David. I knew I liked StringNet but I couldn’t quite articulate why. Now you have put the words into my mouth! ‘Rather than listing tokens, a response to a StringNet query is a list of types, of the patterns in which the query word participates’.

Exactly, and if you take the view (as I do) that ‘the acquisition of grammar is the piecemeal learning of many thousands of constructions and the frequency-biased abstraction of regularities within them’ (N.Ellis), then StringNet is the perfect tool for facilitating this process, both for teachers and for learners.
12 05 2013: Marisa Constantinides (11:22:10) :

Thanks ever so much for responding to our #ELTchat discussion and adding this great tutorial to your blog – will add to our summary; shame we cannot embed as the #ELTchat blog is also WordPress.

Would this be your first edtech tutorial?

It’s really great and thanks again 🙂

Marisa

Reply
12 05 2013: Scott Thornbury (11:34:20) :

Thanks Marisa – and thanks for creating the link. I’ve been playing with this tool for the last few weeks, and have to date embedded three presentations in my university’s Blackboard site for my MA students. The feedback has been very positive. Of course, I have done PowerPoints before with an audio narration, but I think the video adds a more human touch. A great asset if you teach online.

Reply
12 05 2013: Jeremy harmer (12:52:26) :

I enjoyed watching the presentation (don’t get out much, you see!). Lots of questions to ask you about present.me. I can do that soon f2f!

The question I have (ever since Chris Tribble did corpus work way way back) is how or whether to get students using COCA or any other corpus. Or rather, is the act of searching around better/more efficacious than just using a really good dictionary. Maybe the cognitive effort required to do all this is its own reward?

KWIC? I always thought ‘in context’….

Jeremy

Reply
12 05 2013: eflnotes (15:01:47) :

a perennial question this, as far as i know the evidence points to students benefiting more from teacher mediated corpus output rather than student mediated corpus output.
i think that the development of multi-media or pedagogically mediated corpora hold the key for student led corpus investigation, as examples such corpora include the backbone corpus. i link to this corpus fleuron http://web.atilf.fr/fleuron on french student life as a great example due to ease of use of interface and the linked video and audio.
ta
mura

Reply
12 05 2013: Rob (13:20:05) :

As has been mentioned, a screencast on how teachers and students use COCA and other corpora for learning English would interest me. I’m also interested in the limits of corpus data (eg, lack of context, typically not much spoken text-I know there are exceptions) and how a conversation-driven approach would in-corpor-ate COCA.

For a cultural (?) perspective on connotation, look at how American English uses ‘impending’ compared to British English. 😉

Rob

Reply
12 05 2013: Scott Thornbury (13:20:45) :

Oops. KWIC = Key Words in Context – you’re right, Jeremy. That’s what comes of not scripting the presentation – you get spontaneity, but also goofs!

I’ll happily talk to you about present.me – but check out Russell’s training video. His enthusiasm is infectious (well, it infected me!)

As for your question about the value of learners using a corpus … I’ll leave that one quivering fitfully in cyberspace, expecting others will address it better than I can!

Reply
12 05 2013: Peter Cox (13:38:17) :

I have introduced some of my more inquisitive students to corpora and concordancers via http://www.lextutor.ca/concordancers/ to emphasise the point that language is not fixed and rule constrained. The idea that (even) the dictionary is the product of research and not that of a super-being who arbitrarily decides on the definition of words is a liberating revelation. If they can come to the same understanding of the origin of grammar so much the better. I doubt and indeed don’t expect them to use corpora themselves, that’s not their interest, they need to focus on communicating not academic language research but having some knowledge of the “clockwork” behind any process is surely helpful.

Reply
13 05 2013: Scott Thornbury (09:58:11) :

Yes, Peter . The Compleat Lexical Tutor is one of my favorite sites (as I think it says on the ‘front page’!) The VocabProfile tool is alone worth the price of the subscription. (Actually the site is FREE and, unlike COCA, you are not even required to register. Ah! Canadians!)

Reply
12 05 2013: Neill (14:59:43) :

Brilliant

Reply
12 05 2013: Kathy (19:22:18) :

Some fellow teachers and I are currently involved in a vocabulary project and my biggest question so far has been “How can I use corpora to support vocabulary learning?” And voilà, here is a practical introduction — thank you so much! I’m also happy to see tips and links from other readers. This is great!!

Reply
13 05 2013: Scott Thornbury (09:59:33) :

Hi Kathy – have a look at the summary of the ELT Chat discussion on corpora, which might point you in useful directions (the link is in my introductory comments before the slide show).

Reply
13 05 2013: Kathy (17:54:59) :

Thanks, I see there’s a lot there to consider. Interesting idea, creating a corpus of student work for analysis!

Reply
13 05 2013: alannahfitz (00:04:25) :

Nice to see a post on one of the most exciting monitor corpora out there, the COCA.

Mark Davies has put some useful screencasts onto YouTube for using the new WordandPhrase interface he has developed for this corpus: http://www.youtube.com/user/CorpusProf/videos?flow=grid&view=0
His Guided Tour on-screen help information is also useful.

Tipped to be the most impactful derivative resources stemming from the COCA are the Academic Vocabulary Lists (AVLs) he and Dee Gardner developed from the COCA academic subcorpus as an improvement on the Academic Word List (AWL) from Coxhead. This work shows the limitations of the AWL which was derived from a comparatively small academic corpus and which was built on top of the General Service List by West from the early 1950s. More on the COCA AVLs here: http://www.academicwords.info/compare.asp

I wonder when the ELT coursebook publishing world will pick up on this momentous contribution to the research into ELT? Possibly never, as coursebooks can’t compete with freely accessible online resources that draw on massive databases and which link in data from further super resources such as the online WordNet project at Princeton. There’s only so much you can cram into a paper-based coursebook or dictionary for that matter. And this is where specificity comes into play with specialist corpora and the growing trend in do-it-yourself corpora – something we’ll never get from a coursebook where the publishers are looking to attract large general audiences. I’ll soon be putting some training videos on Russell Stannard’s TTV site on how to build your own DIY corpora using many of the open educational resources available to us now using the FLAX open source software.

As Jeremy Harmer has noted above in relation to Chris Tribble’s ongoing research into why the use of corpora in language teaching has never become a mainstream sport, you can get a fair idea about the type of questions he has been asking here: http://www.surveyconsole.com/console/TakeSurvey?id=742964

Accessibility with corpora is being blown open with developments in technology and the growing mandate for much publicly-funded research to share its findings and assets. Accessibility issues with corpora, however, are two-fold in my opinion – related to both cost with subscriptions and complexity with resource design, which I have discussed in a recent blog post http://www.alannahfitzgerald.org/the-middleman-in-data-driven-learning/. We have a good number of free corpus-based resources available to us now, e.g. we can use AntConc rather than subscribe to WordSmith Tools. However, more work needs to be done with creating interfaces for analysing corpora that are easier to manipulate by non-expert corpus users with designs that are for language teachers and learners rather than tools that have been designed by, and for, corpus linguists.

Mainstream ELT teacher training bodies could also do more to get up to speed with technologies that have been designed for linguistic analysis and support rather than focusing on corpus-derived resources e.g. course books and dictionaries which take you away from actual corpus analyses with much valuable training time spent instead on how to use mainstream technologies in language teaching such as PowerPoint etc. Meeting somewhere half-way would be nice where teachers and learners know what and how they can get what they need out of corpora and where researcher-developers have moved away from traditional KWIC interface designs that teachers can’t remember what the acronym stands for anyway…

Reply
13 05 2013: Scott Thornbury (10:05:13) :

Thanks, Alannah – I’ll check out those training vids of Mark Davies’s – I wish I’d known about them when I inflicted a corpus research assignment on my long-suffering students!

Your point (that ‘more work needs to be done with creating interfaces for analysing corpora that are easier to manipulate by non-expert corpus users ‘) is very well made. I’m also very much looking forward to more from you on how to build your own corpora.

Reply
13 05 2013: Scott Thornbury (10:05:59) :

‘… traditional KWIC interface designs that teachers can’t remember what the acronym stands for anyway…’ Ouch. 😉

Reply
13 05 2013: chazpugliese (15:18:38) :

Thanks for the presentation. One aspect that is often overlooked is that corpora are a tremendously empowering tool for the gazillions of teachers out there whose first language is not English. I see lots of opportunities for some serious contrastive analysis work between EFL-ese as represented in coursebooks and the more real, authentic (yes, VERY loaded terms, these) language.

Reply
13 05 2013: Ken Edwards (16:22:47) :

COCA: ‘Campaign for the Outlawing of Contrived Acronyms’

Reply
15 05 2013: Rima (19:02:37) :

Very useful indeed. But there’s a question I’ve been mulling over for quite some time and would really love to know your opinion or that of the contributors to this bog.
With the newly assigned status of English as a Lingua Franca, I wonder whether it would be permissible for non native speakers of English – in their position as the new owners of the language since they substantially outnumber native speakers- to authoritatively produce new collocations and incorporate them in the English language. This would call for a more inclusive corpus data that would take into account the new dimension English acquired as a language that no longer belong to its traditional owners. Or would any attempt to produce a new collocation be doomed as it will always sound off-key to a native ear?

Rima

Reply
15 05 2013: Scott Thornbury (19:46:55) :

Great question, Rima. There has certainly been a lot of discussion regarding the assumption that NS corpora are somehow privileged, and that, for normative purposes (such as the writing of dictionaries or grammars), they should be kept free of such ‘impurities’ as dialects, non-standard varieties and non-native speaker content. Or that corpora of ‘learner English’ should be assembled separately (with the implication that learners are somehow deficient, and that a corpus of learner English is really a collection of errors!). However, there have been significant developments in the direction of ‘user English corpora’, irrespective of the ‘nativeness’ of the users, or whether they are ‘learners’. One of these is the VOICE project at the University of Vienna, which claims to be the first corpus of English as a Lingua Franca. You can access it online here: https://www.univie.ac.at/voice/page/corpus_availability_online

Reply
25 06 2013: Neil McMillan (15:45:34) :

Another corpus which may be useful for ELF is the recently available Corpus of Global Web-based English (http://corpus2.byu.edu/glowbe/). Although the data is drawn from countries where English is a native or official language, the range of these is broad and it may give up some info on non-standard Englishes.

It might be worth pointing out that you get access to the above, COCA, the BNC and other corpora with one (free) sign-in, if you enter via http://corpus.byu.edu/

And thanks for the video, I learned a few tricks with COCA I didn’t know before!

Reply
15 05 2013: Scott Thornbury (22:53:42) :

A short postscript, having a just read a blog post that Mura referred me to, which reports on a study of idiomaticity in an ELF corpus:

Creativity and color in academic ELF

where (in the light of your comment, Rima) the following really stood out:

Part of the study design was checking to see if the idiomatic language used by ELF speakers was also attested in native-speaker reference material. Of the 128 instances of idiomatic language she found, 78 were attested. That means the other 50 cases were “wrong”, right? This is where the ELF paradigm parts ways with native-speakerists who call anything that isn’t in the Oxford English Dictionary an error or “not a real word” (in other words, “it’s not a real word unless we say it first”). Linguistic creativity is inherent to natural language, and it seems both narrow-minded and counterintuitive to deny this basic linguistic reality to non-native speakers of English.

Reply
16 05 2013: Rima (12:26:58) :

Thank you, Scott for your insightful reply and for the useful links you provided. This is the first time I heard about the VOICE project and I think this one significant positive step forward.

Rima

Reply
17 05 2013: geoffjordan (22:37:34) :

Well done, Scott for drawing attention to COCA. Your presentation is enough to awake interest among many but may I respectfully suggest that you follow it up with a slower, more step-by-step account of how to use the concordance program. Those who know, as you know, often underestimate the difficulties newcomers have of getting the hang of how a new tool works.

I was inspired by comments on various blogs to do my own account of concordancing on my blog, which I hope you don’t mind if I mention here. The blog is Aplinglinks, the address is http://canlloparot.wordpress.com/: and the posts are Concordancers, Corpora and ELT, and Using Concordancers with Students. .

Reply
28 05 2013: Kerri Rizzotto (18:03:57) :

Scott,
Thank you so very much for this. I am speaking at NJTESOL tomorrow and I am referring to authentic language used in speaking about opinions. I will be referring and citing this wonderful video on how to use COCA ……
Thanks again!
Kerri Rizzotto

Reply
13 05 2016: James Pengelley (03:26:24) :

very useful thanks!

A trainee recently asked me about adjective order – citing that two different coursebooks had given conflicting information about the correct order.

So I suggested he look into using an online corpus to find out – I wonder what search query he would need to run in order to answer his question.

colour / substance / origin NOUN. OR colour / origin / substance / (purpose) NOUN.

I wonder if searching for some adjectives of origin or substance, and refining that search for pre and post collocating adjectives would be the simplest way to do this?

Reply
13 05 2016: Glenys Hanson (09:01:07) :

James,
That’s very interesting! I’d love to know the results.

Some time ago I created a series of exercises on adjective word order before nouns: http://www.esl-exos.info/english-grammar-exercises/adjectives/adjective-order/ I just based my examples on my intuition but I’d prefer to have a more solid basis. My feeling is that there are no rigid rules – the context determines the order to a large extent

Reply

	Jeff Buck on T is for Technology
	Janet Mournard on F is for Forty years on
	Mohammad Alnahas on N is for Native-speakerism
	Evan Millner on P is for (Thomas) Prender…
	Nili Pinhasi on I is for Imitation
	Mohammad Jahangir Ho… on R is for Rapport
	Md. Mahbubr Rahman on R is for Rapport
	Mahmoud Heikal on V is for Vocabulary teach…
	Joe Bonner on G is for Grammar McNugget…
	Philip on G is for Gist

An A-Z of ELT