C is for Core vocabulary

5 11 2017

“Lexis is the core or heart of language”, wrote Michael Lewis (Lewis, 1993, p. 89). Yes, but which lexis? Given the hundreds of thousands of words that there are, which ones should we be teaching soonest? Is there a ‘core’ vocabulary? If so, where can we find it? If it is a list, how is it organized? And on what principles of selection is it based?

These questions were prompted by a student on my MA TESOL who asked if the measure of an item’s ‘core-ness’ was simply its frequency. I suspected that there might be more to it than this, and this impelled me to look at the literature on word lists.

The most famous of these, of course, is Michael West’s General Service List (GSL), first published in 1936 and then revised and enlarged in 1953. I am the proud owner of not just one but two copies of West, one of which clearly once belonged to a writer (see pic), who used it to keep within the 3000-word limit imposed by his or her publishers.

Compiled before the days of digitized corpora, the GSL was based on a print corpus of up to 5 million words, diligently trawled through by a small army of researchers (‘of high intelligence and especially trained for the task’) for the purposes of establishing frequency counts – not just of individual words but of their different meanings.

But frequency was not the only criterion for inclusion in the GSL. West and his collaborators also assessed whether a word was relatively infrequent but necessary, because it lacked a viable equivalent – ‘vessel’ being one: ‘container’ doesn’t work for ‘blood vessels’, for example. Conversely, some words may be frequent but unnecessary, because there are adequate non-idiomatic alternatives, i.e. their meaning is already covered by other words on the list. Finally, informal and highly emotive words were excluded, on the grounds that they would not be a priority for learners.

In the end the GSL comprised around 2000 word families (but over 4000 different lemmas, i.e. words that have the same stem and are the same part of speech: dance, danced, dancing, but not dancer) and even today, despite its age, the GSL gives a coverage of nearly 85% of the running words in any corpus of non-specialist texts (according to Brezina & Gablasova, 2015 – see below).
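To make ‘coverage’ concrete: it is simply the proportion of running words (tokens) in a text that appear on the list. The sketch below is a rough illustration only. The function name, the toy sentence and the toy list are all made up for the example, and it matches surface forms, whereas real coverage studies count lemmas or word families.

```python
import re


def coverage(text, word_list):
    """Return the share of running words in `text` that are on `word_list`.

    Simplifying assumption: tokens are matched as lower-cased surface forms,
    not grouped into lemmas or word families as in published studies.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in word_list)
    return known / len(tokens)


# Toy data, invented purely for illustration.
toy_list = {"the", "cat", "on"}
toy_text = "The cat sat on the mat"
print(f"{coverage(toy_text, toy_list):.0%} of the running words are on the list")
```

Run on a real corpus with a real list, a figure like West’s 85% would mean that roughly one running word in seven falls outside the list.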

Subsequently, Carter (1998) has elaborated on the criteria for what constitutes ‘core-ness’. One is a core word’s capacity to define other words. Hence the words chosen by lexicographers for dictionary definitions are a reliable source of core vocabulary. One such is the Longman Defining Vocabulary (LDV): you can find it at the back of the Longman Dictionary of Contemporary English (my edition is that of 2003) or at a number of websites, including this one.

The publishers comment, ‘The words in the Defining Vocabulary have been carefully chosen to ensure that the definitions are clear and easy to understand, and that the words used in explanations are easier than the words being defined.’

Entry for ‘laugh’ in the GSL

Another test of coreness is superordinateness: ‘Core words have generic rather than specific properties’ (Carter 1998, p. 40). Hence, flower is more core than rose; tool more core than hammer. For this reason, perhaps, core words are the words writers tend to use when they are writing summaries.

Core words are also more likely to have opposites than non-core words: fat vs. thin, laugh vs. cry. But what is the opposite of corpulent, say? Or giggle?

Core words also tend to have a greater range of collocates – compare start vs commence (start work/an argument/a career/a rumour/a conversation etc.). And they have high word-building potential, i.e. they combine productively with other morphemes: startup, headstart, starter, starting line, etc. Core words are also neutral: they do not have strong emotional associations; they do not index particular cultures (dress vs sari, for example), nor are they specific to certain discourse fields: compare galley, starboard, and below deck with kitchen, left, and downstairs (i.e. a nautical discourse vs. a less marked one).

On this last aspect, an important test of a word’s coreness is not just its overall frequency but its frequency in a wide range of contexts and genres – its dispersion. In a recent attempt to update the GSL, and to eliminate the subjectivity of West’s criteria, Brezina & Gablasova (2015) tested for the ‘average reduced frequency’ (ARF): ‘ARF is a measure that takes into account both the absolute frequency of a lexical item and its distribution in the corpus… Thus if a word occurs with a relatively high absolute frequency only in a small number of texts, the ARF will be small’ (op. cit., p. 8). Brezina and Gablasova also drew on not just one corpus but a range of corpora, including the 12-billion word EnTenTen12 corpus, to produce a New General Service List which, while much trimmer than West’s original (2500 vs. 4000 lemmas), and therefore perhaps more ‘learnable’, still gives a comparable coverage of corpus-based text – around 80%. (The full text of the article, along with the word list itself, can be found here).
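For the curious, here is a minimal sketch of how ARF can be calculated, following one commonly cited formulation: divide the corpus length N by the word’s frequency f to get a segment length v, measure the distance between each occurrence and the next (wrapping round from the last occurrence to the first), and add up min(distance, v) / v. I am assuming this formulation for illustration; it is not taken from Brezina and Gablasova’s own code, and the function name and the toy positions are invented.

```python
def average_reduced_frequency(positions, corpus_size):
    """Compute ARF from the token positions of a word in a corpus.

    positions   -- 0-based token indices at which the word occurs
    corpus_size -- total number of tokens in the corpus (N)
    """
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f  # average gap if the word were spread perfectly evenly
    ordered = sorted(positions)
    # Gaps between consecutive occurrences ...
    distances = [ordered[i] - ordered[i - 1] for i in range(1, f)]
    # ... plus the wrap-around gap from the last occurrence back to the first.
    distances.append(ordered[0] + corpus_size - ordered[-1])
    return sum(min(d, v) for d in distances) / v


# Toy example: the same raw frequency (4) in a 1000-token corpus.
evenly_spread = [0, 250, 500, 750]
clumped = [0, 1, 2, 3]
print(average_reduced_frequency(evenly_spread, 1000))  # 4.0
print(average_reduced_frequency(clumped, 1000))        # 1.012
```

The evenly spread word keeps its full frequency of 4, while the clumped one is reduced to roughly 1, which is the behaviour the quotation describes: high frequency concentrated in one place counts for little.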

More impressive still, and also called a New General Service List, is the one compiled by Browne et al. (2013) which, with 2800 lemmas, claims to provide more than 90% coverage of the kinds of general English texts learners are likely to read.

Other potentially useful word lists include The Oxford 3000: ‘a list of the 3000 most important words to learn in English’ – accessible here. Again, dispersion – not just frequency – has been an important criterion in choosing these: ‘We include as keywords only those words which are frequent across a range of different types of text. In other words, keywords are both frequent and used in a variety of contexts.’ And the publishers add:

In addition, the list includes some very important words which happen not to be used frequently, even though they are very familiar to most users of English. These include, for example, words for parts of the body, words used in travel, and words which are useful for explaining what you mean when you do not know the exact word for something. These words were identified by consulting a panel of over seventy experts in the fields of teaching and language study.

Inevitably, there is a lot of overlap in these lists (they would hardly be ‘core vocabulary’ lists if there were not) but the differences, more than the similarities, are intriguing – and suggestive, not only of the corpora from which the lists were derived, but also of the criteria for selection, including their intended audience and purpose. To give you a flavor:

Words in West’s GSL not in LDV: plaster, jealous, gay, inch, widow, elephant, cushion, cork, chimney, pupil, quart.

Words in LDV not in GSL: traffic, sexual, oxygen, nasty, infectious, piano, computer, prince.

Words in Oxford 3000 not in either GSL or LDV: fridge, gamble, garbage, grandchild, sleeve, software, vocabulary… Note also that the Oxford 3000 includes phrasal verbs, which are not systematically included in the other lists, e.g. pull apart/ down/ off/ in/ over/ through/ up + pull yourself together.

Of course, the key question is: what do you actually do with these lists? Are they simply guidelines for materials writers and curriculum planners? Or should learners be encouraged to memorize them? In which case, how?

Discuss!

References

Brezina, V. and Gablasova, D. (2015) ‘Is there a core general vocabulary? Introducing the New General Service List,’ Applied Linguistics, 36/1. See also this website: http://corpora.lancs.ac.uk/vocab/index.php

Browne, C., Culligan, B. & Phillips, J. (2013) ‘New General Service List’ http://www.newgeneralservicelist.org/

Carter, R. (1998) Vocabulary: Applied linguistic perspectives (2nd edition) London: Routledge.

Lewis, M. (1993) The lexical approach. Hove: LTP.

West, M. (1953) A general service list of English words. London: Longman.