C is for Corpus

10 01 2010

In the entry for corpus I made the point that “the availability, now, of corpus and concordancing software on the internet is a useful tool for teachers, especially when it comes to checking hunches, such as whether people really say I’m going to go to the shops (rather than I’m going to the shops) or she mustn’t have known…(rather than she can’t have known…) or me and my sister…(rather than my sister and I…)”

Out of interest there are 595 examples of going to go in the online British National Corpus (courtesy Mark Davies’ Brigham Young University site at http://corpus.byu.edu/bnc/x.asp) of which the majority (339) are in the spoken language section of the corpus. But there are a good number in the written corpus too, e.g. Secretary Cecil Parkinson took the view that cars were not going to go away; he stormed off to France, saying that he was going to go to Jerusalem – certainly, enough to make a nonsense of this advice (from the Collins COBUILD Student’s Grammar, 1991):

You do not normally use ‘going to’ with the verb ‘go’. You usually just say ‘I’m going’ rather than ‘I’m going to go’.

But even more fun than checking hunches is discovering areas of grammar that – theoretically – ought to exist, but don’t exist in the standard grammars. I’ve always argued, for example, that the tense and aspect system of English is compositional. That is, the so-called tenses in English, such as past continuous or present perfect continuous etc – are nothing more than the combination of their separate components – not just in terms of their form, but also their meaning. So, the past continuous combines past tense (meaning) plus continuous aspect (meaning). Viewed this way, the past continuous needn’t be taught as a special ‘tense’ in its own right, but can simply be inferred from what is known about the past (from the past simple) and what is known about the continuous (from the present continuous).  

But, according to that theory, if the present perfect continuous is nothing more than present tense + perfect aspect + continuous aspect, it should have an “indefinite time in the past” meaning. That is, if we can say Have you ever been to Rome? we should also be able to say Have you ever been doing X [at some indefinite time in the past]? Using WebCorp (a corpus that uses web-based search engines: http://www.webcorp.org.uk/), I found enough examples to confirm my hypothesis. For example:

Have you ever been driving when you get almost hit by a tornado

Have you ever been camping when you didn’t want to spend a lot of time preparing meals? 

Or, have you ever been watching sport on tv and wanted….

In other words, it is possible to use the present perfect continuous for finished past situations (although the only examples I have found so far are in the question form with ever, which suggests that this use of the present perfect continuous is severely constrained).
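For readers who want to try a similar search on their own texts, here is a minimal sketch of the kind of pattern match involved. It is not how WebCorp works; the pattern, the function name and the sample text are assumptions made for illustration, and matching any word ending in -ing is a crude stand-in for a proper part-of-speech tag.

```python
import re

# Hypothetical pattern: "have you ever been" followed by a word ending in -ing.
# A crude heuristic: it cannot tell a participle from a noun such as "evening".
PATTERN = re.compile(r"\bhave you ever been (\w+ing)\b", re.IGNORECASE)

def find_ever_been_ing(text):
    """Return the -ing words that directly follow 'have you ever been'."""
    return [match.group(1).lower() for match in PATTERN.finditer(text)]

# Sample text built from the citations quoted above, plus one simple-aspect question.
sample = (
    "Have you ever been driving when you get almost hit by a tornado? "
    "Have you ever been to Rome? "
    "Or, have you ever been watching sport on tv and wanted to know the score?"
)

print(find_ever_been_ing(sample))
```

Run on the sample, this picks out the two continuous examples and skips "Have you ever been to Rome?", which is the distinction the hypothesis depends on.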

Finally, here’s another area of grammar that doesn’t exist in the standard grammars but does exist in corpora: the use of would have + past participle for real (as opposed to hypothetical) past situations.  Recall that normally would have done is taught as a counterfactual, compatible with but didn’t, as in “I would have baked you a cake [but I didn’t]”. But check these (also found using WebCorp)  – none are counterfactual, judging by their contexts:

my elder brother, Leonard, was killed during the Southern African War while I was still a boy. Naturally, my father would have felt this loss keenly;

Stone-age man would have noticed that birds navigating by means of the magnetic properties of the ley lines…

Whatever local radio you were listening to during the 3rd week of June, it is likely that you would have heard Delahunty’s editorial director, Paul Mace, on the hour, every hour, bringing you those live reports from the Pilkington Glass Ladies’ Championships at Eastbourne.

When it was on scheduled service, the jumbo would have had up to 450 passengers and 12 crew on board.

Of all the grammars I’ve consulted, only one makes any mention of this usage: English as a Foreign Language, by R.A. Close (1962)

 Would have –ed

 …It can also be a present, hesitant assumption about past action…as in:

A. Someone telephoned you last night.

B. That would have been Jeremy, I would say.

(Not something I’ve ever seen mentioned in a coursebook!)

Does anyone know of any other attested grammar items (i.e. ones for which there’s corpus evidence), but which the conventional grammars ignore?



50 responses

11 01 2010
Elizabeth Anne

Apart from the obvious “corpora are only as valid as their constituent texts” I jumped in here because I can “hear” so clearly the person using your example “Or, have you ever been watching sport on tv and wanted to …” in a turn which is entirely in the “narrative present” and full of the likes of:
“so we gets up and make them a cuppa, and there were like 10 of them “.
i.e. a young (British?) English dialect which I would not wish to confuse my struggling ELF learners with 🙂

11 01 2010
Scott Thornbury

Thanks, Elizabeth for your comment. Here’s the fuller context of that citation:

Have you ever been watching sport on tv when the score pops up and blocks the view? Or, have you ever been watching sport on tv and wanted to know the score, but it’s not on screen at that moment?


This appears to be a blog or news site of some description, but not one written by a teenager, necessarily. To me this seems to be, not an example of narrative present, but of the timeless present. Here are a couple of invented examples that seem consistent with this usage:

Have you ever been in the bath when the lights have gone off? (i.e. present perfect simple)
Have you ever been having a bath when the lights have gone off?(i.e. present perfect continuous)
Have you ever been having a bath when the lights go off? (ditto, but with present simple in the subordinate clause)

Does anyone else find these (un)acceptable?

11 01 2010
Marisa Constantinides

I was hoping that Leech in his “Meaning & the English Verb” might have something, but no… neither in past nor in hypothetical meaning…

11 01 2010

I can’t believe ‘Practical English Usage’ doesn’t have it. Round this way, if it’s not in the ‘Bible’, as it’s called, it doesn’t exist.

11 01 2010
Scott Thornbury

No, I’ve checked the latest edition of Swan, and all his examples of would have done are counterfactual. Not so with may have done and might have done, which are used “to say that it is possible that something happened or was true in the past”. Would have done appears to belong to the same set, except that it implies more certainty. Compare:

A: Someone phoned last night.
B. 1. That might have been Jeremy.
2. That may have been Jeremy.
3. That could have been Jeremy.
4. That would have been Jeremy.

Incidentally, I’ve tried this out on speakers of American English, and they find #4 strange, even unacceptable.

11 01 2010

In which case, this is very exciting – like being witness to a discovery. Actually, if you don’t mind me saying, this whole blog is very exciting. Most of the debate is fascinating.

11 01 2010
Scott Thornbury

Thanks for the comment. Let me now take the argument a step further and suggest that would have done always means “is likely to have done” and it is only the co-text that determines whether the situation is real or unreal. Compare:

1. When it was on scheduled service, the jumbo would have had up to 450 passengers and 12 crew on board. (i.e. it did)

2. If it had been on scheduled service, the jumbo would have had up to 450 passengers and 12 crew on board. (i.e. it didn’t)

Which is another way of saying that hypothetical meaning is (at least partly) pragmatic (i.e. context dependent) and not encoded in the grammar. The third conditional is a convenient simplification.

12 01 2010

“Incidentally, I’ve tried this out on speakers of American English, and they find #4 strange, even unacceptable.”

Do they know Jeremy?

It seems hard to believe that “would have” is missing from coursebooks. I am sure I remember teaching it alongside “must have” etc for making deductions. This, of course, was some time back when I used to believe that the best way of spending time in a class was to teach grammar.

12 01 2010
Scott Thornbury

It’s not entirely missing from course materials, Diarmuid. I included would have done for speculating about the past in a unit I wrote for an online advanced course (when I used to do that kind of thing). That’s in fact when I “discovered” it: I was adapting a text about Easter Island and I kept finding this structure: The islanders would have used palm-trunks to roll the stone heads into position… The island would have been covered in palm-trees…etc. I looked it up in all the grammars and there was no mention of it. At the time I didn’t have access to an online corpus, so I had to trust my intuitions.

13 01 2010
cindy hauert

I’m an American English speaker and I find no. 4 quite acceptable. But as I haven’t lived in the US for many years, maybe that’s the explanation.

14 01 2010
Scott Thornbury

Further thoughts on X would have been/done:

It seems to me that this always means it was predictable that X was/did…, but that, according to the context, it can be applied to either real or unreal situations – the realness or unrealness being explicit in the co-text, or implied in the context, and not inherent in would have done at all.

To illustrate – in the following examples he would already have been an old man is simply a prediction about the past, but in the first two examples, it’s a real past that is being referred to, while in the third it’s a hypothetical past – signalled by the backshift into the past perfect:

1. When Shakespeare wrote “Two Noble Kinsmen” he would already have been an old man. (The assumption is that he did write it)
2. If Shakespeare wrote “Two Noble Kinsmen” he would already have been an old man. (The assumption is that he might have written it)
3. If Shakespeare had written “Two Noble Kinsmen” he would already have been an old man. (The assumption is that he didn’t write it)

Thus, what we call the third conditional (3), is simply an unreal past situation (the if- clause) + its predictable consequence. Sentences 1 and 2 also describe predictable consequences, but about real past situations.

11 01 2010

Hi Scott and thanks for another great entry. As someone following grammar syllabi for coursebooks (cringe) I agree that there are often things “left out” and that most materials writers are following the grammar books we have so the “canon” keeps getting repeated. However, there are signs this is changing, partly due to works like Natural Grammar (a staple now of most materials writers) and books by David Willis (Rules, Patterns and Words). The “non-canonical” grammar items, for lack of a better term, sometimes now get categorized under vocabulary. One example I’ve included is the pattern go +verb+ing (go swimming, go shopping etc) which is very generative and I picked up from a talk of yours.

I would say little by little things are changing thanks in no small part to work you have done.

However, something in your post does make me wonder. Is 595 examples of “going to go” in a 100 million word corpus enough to “make rubbish” of the Collins claim that we don’t usually say this? I found 1069 examples in the same corpus of “delightful” and it’s not a word I would use that often. Or the 3.3 million words for salutations that I found on Google? I rarely hear salutations, and would really just say – and teach – hello.

11 01 2010
Scott Thornbury

Thanks Lindsay – for your comments and flattery!

Regarding your question re the relative importance of going to go, I agree that 595 occurrences out of 100m words doesn’t sound like much. To put this in perspective, though, there are 574 occurrences of was watching in the same corpus – a combination that I would guess is introduced at pre-intermediate level in most courses.

What’s more, a follow-up search on the BNC corpus (at http://corpus.byu.edu/bnc/) shows that the verb go is the seventh most common verb that follows going to – the others being (in order) be, do, have, get, take, say. This would seem to suggest that, if you present going to without go, you are misrepresenting – or at best under-representing – its typical collocations.

(When I say ‘you’, I mean of course ‘one’ – not ‘you, Lindsay’!)
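To make the raw counts in this exchange easier to compare, they can be normalised to occurrences per million words. The sketch below assumes the round figure of 100 million words quoted above for the BNC; the function name is made up for illustration.

```python
# Approximate corpus size, taken from the "100 million word" figure quoted in the thread.
CORPUS_SIZE = 100_000_000

# Raw counts as reported in the comments above.
counts = {
    "going to go": 595,
    "was watching": 574,
    "delightful": 1069,
}

def per_million(count, corpus_size=CORPUS_SIZE):
    """Convert a raw count into occurrences per million words."""
    return count * 1_000_000 / corpus_size

for phrase, count in counts.items():
    print(f"{phrase}: {per_million(count):.2f} per million words")
```

On these figures going to go and was watching come out at roughly six per million each, and delightful at a little under eleven. A per-million rate says nothing about whether an item is worth teaching, but it does put the three numbers on the same scale.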

12 01 2010
Diana England

Hi, there Scott!
Really sorry you couldn’t make it to the IH Dos Conference – would have been one of my conference highlights.

How about the use of ‘there’s’ followed by plural nouns as in ‘there’s lots of things I’d like to do next year’? I’ve been listening out to see how prevalent this is and in the last three or more months not one person I’ve heard has used ‘there are’. I don’t think the same problem arises in the negative form (ie people ‘correctly’ use ‘there are’ plus plural noun), so it must be being retrieved as a chunk. The speakers I’ve heard are not correcting themselves either, indicating they don’t recognise it as a slip. And I’ve heard this in a variety of situations, from informal to more formal rehearsed BBC reports by journalists.

I’m now downgrading my clarification and correction of this with my learners – what’s the point of getting them to sweat blood and tears learning to pronounce ‘there are’ and remembering to use it if native speakers can’t be bothered and no one’s noticing anyway?!


12 01 2010
Scott Thornbury

Hi Diana (yes, I’m sorry I didn’t make it to the conference too – apart from anything else it would have been [sic] the first time I’d presented on a ship!).

As for your point about there’s + pl. noun, I did a quick search of the online BNC Corpus and these are the plural nouns that most commonly collocate with there’s: loads, things, bits, times, men, others, women, kids…
Some examples:

1. There ‘s loads of people I know sitting up to watch this.
2. there ‘s things are not necessarily wrong, there’s the legitimate things,
3. Do you know that there ‘s men running clubs up there who would murder somebody like you or me
4. There ‘s others who are probably more motivated by fame and hit records and money
5. there ‘s kids on sledges coming at you at about fifteen miles an hour

It’s significant perhaps that they all come from spoken data (the 2nd example actually comes from a sermon!), so it does seem to be a feature of spoken, rather than written, grammar.

The Corpus of Contemporary American English, at http://www.americancorpus.org/ gives plenty of results, too, but the typical nouns don’t include loads. The most frequent combinations are with: things, ways, problems, others, women, kids. E.g.

1. as a kid there ‘s things you learn that people don’t do in front of other people
2. there ‘s problems everywhere when people illegally use guns.
3. There ‘s women everywhere, every shape and size.
4. there are ways to pursue food and medicine, there ‘s ways to buy food and medicine

(The last example suggests that there is a certain amount of free variation in the choice of verb)

12 01 2010

Here’s one I’ve often wondered about and never seen anywhere (maybe I haven’t been very attentive): Using was/were + to + inf. for counterfactuals (with action verbs).

For example in the song “Light my Fire” by the Doors when Jim Morrison says “You know that I would be untrue, you know that I would be a liar/If I was to say to you, girl we couldn’t get much higher”.

It’s a structure that gets skipped over almost entirely, I imagine because it would further complicate the presentation of 2nd conditional that most books do. But I’d be interested to see what the frequency would be of such a structure.

12 01 2010
Scott Thornbury

Great one, Nicky!

I did a quick search of the Corpus of Contemporary American English (http://www.americancorpus.org/) and found that this is indeed quite frequent, but seems to be much more frequent a) with pronoun subjects and b) with were, rather than was. The stats:

If [noun] was to = 61 instances
If [noun] were to = 291
If [pronoun] was to = 321
If [pronoun] were to = 3108

Of these 3108 instances of if [pn] were to, a third occurred in the spoken data, but the others were fairly evenly distributed across the different written genres (fiction, academic, and journalism). I.e. it’s not specifically a spoken thing.
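The preference for were over was is easier to see as a proportion than as raw counts. The following is a small sketch that uses only the four COCA figures reported above; the dictionary layout and function name are assumptions for illustration.

```python
# Counts for "if [subject] was/were to", as reported from COCA above.
counts = {
    "noun": {"was": 61, "were": 291},
    "pronoun": {"was": 321, "were": 3108},
}

def were_share(subject_type):
    """Percentage of 'were to' among all 'was/were to' hits for a subject type."""
    was = counts[subject_type]["was"]
    were = counts[subject_type]["were"]
    return were / (was + were) * 100

for subject_type in counts:
    print(f"{subject_type} subjects: {were_share(subject_type):.1f}% were")
```

On these numbers were accounts for a little over four fifths of the noun-subject cases and about nine tenths of the pronoun-subject cases, which fits the observation that were is the majority choice with both kinds of subject.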

Here are some examples of all combinations, fairly randomly chosen:

1. If prices were to move up again, it would be a force against improving profits
2. If Congress were to implement that idea, American imports probably would decline as intended.
3. sometimes I think, like, maybe if I was to have a baby, maybe it will make me a stronger person
4. What if I was to not come back? Is them your last words?
5. And if you was to see him in court in his wig and his long robe!
6. It’s the same if you were to meet a beautiful girl and go bowling.
7. That soda can is so toxic that if you were to stand within 3 feet of it, it would kill you in
8. These members forewarned that if they were to be punished, they would auction their farms

The question, then, is: is this structure interchangeable with if [n/pn] past tense? And if not, why not? E.g. are these pairs essentially the same?:

1. a) If prices were to move up again, it would be a force against improving profits
b) If prices moved up again, it would be a force against improving profits

2. a) It’s the same if you were to meet a beautiful girl and go bowling.
b) It’s the same if you met a beautiful girl and went bowling.

12 01 2010

Yeah, “were” does sound much more natural than “was”, but you know, our friend Li’l Jimmy Morrison was the Lizard King and he could do anything 🙂

I have a nagging feeling that “were to move up” is more counterfactual than “moved up”, though I don’t know why that is or if that’s correct. Or maybe I just associate it with a specific kind of discourse, or a function such as (I’m kind of inventing this) “‘Holding forth’ on a subject” (i.e., like a political commentator or a financial analyst), though not all the examples fit in that context.

In fact it’s interesting in that, it seems to pop up in quite formal contexts (“If Congress were to implement that idea”) and totally informal ones (“sometimes I think, like, maybe if I was to have a baby”). Weird, that.

Will promptly bookmark that American corpus, thanks!

12 01 2010

I need to choose my words carefully here because I certainly don’t want to offend anyone. I’ll start with two examples:

If I was to go to Paris next week, we could meet up.

If I went to Paris next week, we could meet up

I think that in the above case, the use of “went” and “was to go” can be seen as broadly synonymous. It might be argued that there is an implied question in the first and an offer in the second, BUT this would depend very much on context and intonation. In other examples, it may be possible to resort to the ELT “get out of jail free” card – “You could say that, but nobody would.” There are often arbitrary choices made by users that have no explanation. They simply are that way.

I’ve always understood English to be a high-context language, and therefore the exact meaning of whole sentences, or the choice of grammatical forms, cannot always be identified without a context.

My French teacher once wrote in a school report of a classmate, “John is like the Swiss watchmaker who has become so obsessed with the mechanism of the watch that he has forgotten what the time is.” I often think of this when I see discussions about grammar. Without the full context and (for spoken language) the intonation, things that are essentially the same can have very different meanings. Or they can be synonymous and reflect simply the user’s choice.

Of course in many cases, the differences may be small but very significant, but we should be careful of thinking that any difference must have another meaning.

For the sake of full reporting, the same French teacher wrote on my report that I was “like a man lying poisoned on his deathbed. Sadly he is the only one who has the antidote, and he refuses to administer it.” I hope he would be pleased that he taught me something (though he may well be horrified to learn that I have become a teacher!)

13 01 2010
Scott Thornbury

Hi Olaf – no cause to be afraid of offending!

Your comment: “I’ve always understood English to be a high-context language, and therefore the exact meaning of whole sentences, or the choice of grammatical forms, cannot always be identified without a context” is spot on (and I think the same can be said for any language, not just English).

The fiercest criticisms of an over-reliance on corpus data (e.g. by Widdowson or Guy Cook) make more or less the same point – in the absence of both co-text and context, any claims, apart from the purely quantitative, about usage have to be heavily qualified. In defence, the nice thing about the corpus sites I’ve been using (i.e. those designed by Mark Davies at Brigham Young University) is that they allow you to expand the concordance line in order to see the fuller textual context, and that they tag each corpus line with its genre (spoken, academic etc) and provenance (CNN, Fox News etc). So, at least at the level of co-text, you have access to data that can inform the analysis. But of course you have no data as to how the utterance was intended or interpreted in its immediate context of use, and you lack crucial phonological and paralinguistic information, not to mention sociolinguistic information (Where is the speaker from? What gender? What age? Is he/she a native speaker? etc).

But corpus data does at least help identify patterns that might otherwise have slipped through the net, especially in terms of teaching grammars, including those that have been mentioned in this thread.

Incidentally, this is what Carter and McCarthy have to say about the “were to do” structure, in their Cambridge Grammar of English (2006), itself based on extensive use of corpus data:

If + noun phrase + was/were to + verb (base form) also refers to hypothetical situations and actions (with a singular subject, was to is less formal than were to). (p. 751)

But there is no indication as to what might motivate this choice.

12 01 2010
Willy C Cardoso

Hi all, this is a great discussion.
I’m new in posting here so maybe you’ve talked about what I’ll ask in another post, anyway:

All coursebooks I’ve used so far, and there were many, emphasized the differences between the use of ‘will’ and ‘going to’, as regards the certainty of the event/action. However, none of the American teachers, here in Brazil at least, seem to acknowledge this difference; they simply say it’s the same.
i.e. I’ll go to the beach next week and I’m going to the beach next week can both be a ‘definitely’ or a ‘maybe’.

Any hints on this?

Thanks @thornburyscott! btw About Language is the book I most lend to my fellow teachers. I haven’t seen it for months, so I can’t say whether the answer to my question is there.

13 01 2010

Willy- answering questions like this makes me worried I am going to unveil my ignorance, but I would venture this as a possible answer: it doesn’t really matter. However, as this doesn’t do you much justice, I would also tentatively suggest that you could think of “I’ll go to the beach next week” as making some sort of commitment, either to another or to yourself. The same might be said of things like, “I’ll go to the gym next year,” “I’ll have fewer donuts next time” etc. Rather than definite, I would say that it expresses a definite resolution.

“I’m going to the beach next week” on the other hand does indicate a definite plan in the speaker’s mind. The resolution has been made and mental space has been cleared to allow for the planning of the trip to the beach. At the moment of speaking, our hero has no intention of doing anything other than going to the beach. Rather than a definite resolution, it is now a definite intention.

There are far, far more knowledgeable people than me here – not least our honourable host- so I will stand by to be corrected.

PS. Am I wrong to miss the days when teachers wrote reports in the style that Olaf refers to? If we were to write reports like that, wouldn’t life be more interesting for all?

13 01 2010
Scott Thornbury

Diarmuid has answered this one neatly. I would only add that both will and going to (like most modals or modal phrases) are used to express two kinds of meanings: 1. meanings related to how we see the likelihood of events (sometimes called extrinsic, or epistemic, modality); and 2. meanings related to how we intervene in, or exert change on, events (intrinsic or deontic modality). Thus, a sentence like She may drink is ambiguous, since it could mean It’s likely that she will drink (extrinsic modality) or She’s allowed to drink (intrinsic modality).

Will and going to can both be used in the first (extrinsic) sense, to talk about likelihood, i.e. to make predictions (It’ll rain tomorrow/It’s going to rain tomorrow). And they can both be used to talk about volition (I’ll do the washing up/I’m going to do the washing up). When used for predictions, they are often interchangeable – the difference being more stylistic than semantic.

When used in the volitional sense, however, the difference is more acute. As Diarmuid pointed out, I’ll do the washing up implies “volition at the moment of speaking”, i.e. making a decision, while I’m going to do the washing up implies “retrospective volition”, i.e. reporting an intention.

The difference is buried in the etymology: I will do the washing up = I wish to do it; I’m going to do the washing up = I’m already on my way to doing it.

13 01 2010
Alice M

“The difference is buried in the etymology: I will do the washing up = I wish to do it; I’m going to do the washing up = I’m already on my way to doing it.”

Thank you for this!
It is different from the two future tenses in French, but the “going to” / I’m on my way to, seems to match “aller+infinitive”.
In French when you say “je la ferai” (using the futur simple), and talking about the washing up, it does not mean at all (I’ll do it/I wish to do it)! on the contrary!
If I hear a (male) friend say “je la ferai”, it implies “maybe next year” !!
I much prefer to hear him say “je vais la faire”.
Your explanation helps me understand why most of my English speaking students have difficulties using future tenses in French.
In French I would say the aller+infinitive form is by far the most common.
Which form of future would you say is the most frequent in English?
Thanks again, this topic is a great source of questioning for me.

13 01 2010
Scott Thornbury

Hi Alice, thanks for the comment. As for your question (Which form of future would you say is the most frequent in English?), the Longman Grammar of Spoken and Written English (Biber et al. 1999, based on the British National Corpus) provides a neat graph showing the relative frequencies. The total occurrences of will in the corpus by far outnumber the total occurrences of going to. But in the conversation sub-corpus, going to is more common than will in the volitional/intentional sense, but less common in the predictive sense. However, the vast majority of instances of will, in the conversational data, are labelled as ambiguous – meaning they could be either volitional or predictive. In the academic sub-corpus, will trumps going to on all counts.

13 01 2010
Willy C Cardoso

Nice comment Alice, I’ve realized that in French as well.

In case anyone wants to know, in Brazilian Portuguese there’s no such thing in futurity. We have both structures, as in French, but what would be ‘going to’ is far, far more frequent than a ‘simple future’, and still the intended meaning is 100% the same. Comparing the two possibilities is so irrelevant that my Portuguese students almost refuse to practice the simple future, which means learning the conjugation. As long as they understand it when someone uses it, they don’t feel like using it themselves at all.

Of course I’m saying this for practical teaching purposes; to a doctor in Pt this will sound like blasphemy.

14 01 2010

If I were to

When learning German my teachers used this as the only example in English of the verb changing its form for the conditional subjunctive – ie was to were – whereas in German all verbs change (or use an auxiliary) – ie war to wäre, werde to würde…

15 01 2010
Jason Renshaw

Fantastic post and discussion! I don’t really have anything to add or ask of it, except to say (as someone who has written a 4-level grammar series and constantly struggled with the desire to feature context, co-text and pragmatics where editorially it was almost purely a matter of featuring the grammar McNuggets from several MoE lists), it’s wonderfully informative and thought-provoking. Despite the dark reputation it often cops, you have to admit, with grammar, there are always more fascinating discoveries awaiting!

15 01 2010
Ramesh Krishnamurthy

Hi Scott

We met several years ago in Madrid at Spain TESOL – you gave a great talk on ‘the McNuggetization of Grammar’ (and you liked my anecdote about Panini receiving his inspiration in the form of a dream in which Shiva DANCED the grammar; has anyone written the Dance of Grammar for English yet? I thought you were going to?)…

Anyway, I’m not really a grammarian, but… I was the Senior Editor of the Collins COBUILD Student’s Grammar, 1991, and would like to be allowed to add a small corrective (rather than correction) to your comment in paragraph 2: “there are 595 examples of going to go in the online British National Corpus… certainly, enough to make a nonsense of this advice (from the Collins COBUILD Student’s Grammar, 1991): You do not normally use ‘going to’ with the verb ‘go’. You usually just say ‘I’m going’ rather than ‘I’m going to go’.”

I would like to raise the following points:

a) BNC contains only British English; Cobuild advice was based on data from a wider global English-speaking community, including UK, USA, Australia, etc. It may be that the usage ‘going to go’ is more common in British English?

b) You are correct to say that there are 595 examples of ‘going to go’ in the BYU-BNC. However, this is only 595 out of 27299 examples of ‘going to + VERB’ or 2.18% (and 32900 examples for ‘going to’ in total). I think this is substantive evidence for saying that ‘going to’ plus the verb ‘go’ is not very common – especially for the purposes of intermediate learners of English, the target audience for the Student’s Grammar?

I also noticed the later comment you made “the verb go is the seventh most common verb that follows going to – the others being (in order) be, do, have, get, take, say. This would seem to suggest that, if you present going to without go, you are misrepresenting – or at best under-representing – its typical collocations. ”

Again, you are correct… BUT, ‘be’ is 10 times more frequent, and ‘do, have,and get’ are 2-3 times more frequent:
1 GOING TO BE 5942
2 GOING TO DO 1795

I also looked at ‘will go’, and was surprised to find that there are only 2180 examples in BNC, of which 421 are for ‘will go to’ and only 54 for ‘will go to + VERB’.

c) By the way, I also checked the corresponding figures for the COCA (American English) corpus at BYU: 9048 for ‘going to go’, out of 251559 for ‘going to + VERB’, or 3.6% (and 331456 for ‘going to’ in total). This surprised me, but that’s the excitement of working with corpora for me! My intuition was wrong, yet again!
Isn’t it fun learning about authentic language?

Out of 8630 examples for ‘will go’, there are 1782 for ‘will go to’ and 205 for ‘will go to + VERB’.

A couple more comments, if I may:

d) “Or, have you ever been watching sport on tv and wanted….” = continuous for interrupted actions? e.g. “I was watching TV when X happened”.

e) I think Dave Willis in The Lexical Syllabus drew attention to various uses of ‘would’ that were not given in previous grammars.

f) “Would have –ed
…It can also be a present, hesitant assumption about past action…as in:
A. Someone telephoned you last night.
B. That would have been Jeremy, I would say.
(Not something I’ve ever seen mentioned in a coursebook!)”

The Cobuild Dictionary, 1987, has at paragraph 17 of the entry for ‘would’:
“You use ‘would have’ when you are mentioning something that you are assuming or guessing happened or was true. EG They’d had a hundred the previous year. That would’ve been in ’62… She wouldn’t have noticed. She was too far away.”

g) “Does anyone know of any other attested grammar items (i.e. ones for which there’s corpus evidence), but which the conventional grammars ignore?”

I think Dave Willis and other grammarians working with Cobuild (Gill Francis, Susan Hunston, Elizabeth Manning) have probably given examples in various Cobuild publications…?

h) “1. When it was on scheduled service, the jumbo would have had up to 450 passengers and 12 crew on board. (i.e. it did)
2. If it had been on scheduled service, the jumbo would have had up to 450 passengers and 12 crew on board. (i.e. it didn’t)
Which is another way of saying that hypothetical meaning is (at least partly) pragmatic (i.e. context dependent) and not encoded in the grammar.”

Doesn’t ‘if’ as opposed to ‘when’ signal/allow for hypothetical meaning?

Best wishes

16 01 2010
Scott Thornbury

Hi Ramesh, nice to catch up with you again, and thanks for posting so insightfully – I’m honoured to have a “real” corpus linguist visiting the blog. Just a few comments on your comments:

On going to go – thanks for the corrective. It’s so easy to be selective with quantitative data, and thereby skew the facts. What I thought you were going to point out, though, is that a lot of the occurrences of going to go are in fact phrasal verb combinations, and therefore often idiomatic. Of the 595 occurrences, 229 (over a third) are going to go + adverb, of which the most common are going to go away (as in the debate on energy policy is not going to go away) and going to go back. It strikes me that it would be more awkward to conflate an idiomatic use than a literal one, viz:

I’m not going to go to the pub = I’m not going to the pub
it’s not going to go away = ?it’s not going away

2. would’ve done – thanks for pointing out that the COBUILD dictionary mentions the “real” use of this. Odd, therefore, that the COBUILD grammar doesn’t – on p. 226 of the 1990 edition, it says “You use ‘would’ with ‘have’ to talk about actions and events that were possible in the past, although they did not in fact happen: Denial would have been useless; I would have said yes, but Julie talked us into staying at home…” etc. (I.e. nothing about actions or events that were possible in the past and probably did happen: That would have been Jeremy.)

3. Yes, Dave Willis has written about would, but mainly to point out (a) that the past habit use of would (as in We would often take the train) is more common than coursebooks suggest, and (b) that the hypothetical use of would often occurs in the absence of if clauses: I would never eat dog… So, teaching would as if it were primarily an element of conditional structures is another coursebook misrepresentation of usage.

4. Yes, if does signal conditional meaning, but not necessarily unreal meaning. E.g. (When we were on holiday at the beach) If it rained we would play cards. It’s the combination of if + backshift that signals irrealis: If it rained (tomorrow) we would play cards.

And no, I never did write about grammar and the dance of Shiva, although I did try to persuade my publishers to put a dancing Shiva on the cover of Uncovering Grammar, to no avail!

18 01 2010
Ramesh Krishnamurthy

Good to talk to you again, Scott!

1. Yes, I overlooked the phrasal verbs – good point!
2. Shame the COBUILD grammar missed out that item. Not an excuse, but a logistical problem of teamwork…? Some things slip through the collective net…

Shame also about Shiva – I may use the anecdote in some distance-learning materials I’m preparing…

Hope our paths cross again before too long!

17 01 2010
vicki hollett

The ‘would’ in your ‘would have’ examples strikes me as being a past habit ‘would’ (so not counterfactual).

18 01 2010
vicki hollett

Just re-read my posting and realised it might sound like I was trying to pick an argument. Ha!
On the contrary, I meant to provide further support for what you’d said about those ‘would have’ examples and seeing things in terms of time and aspect.
I think they might be helpfully viewed as ‘would’ (in the sense of past habit) + perfect aspect (in the sense of actions where we’re interested in the later consequences of an earlier event)

18 01 2010
Scott Thornbury

Hi Vicki – yes, I agree that the “core” meaning of would in examples like the following (whether past or with the perfect infinitive), is the same, although I would call it ‘predictability’:

1. They would always take the train.
2. They would’ve taken the train.

In the first example, the predictable behaviour occurs more than once, so is understood as habitual; in the second a prediction is being made about a past situation, e.g. A: How did they get home? B. Well, they don’t have a car, so I guess they would’ve taken the train. The prediction can also be about a hypothetical (i.e. unreal) situation:
If they hadn’t had a car, they would’ve taken the train.

So I agree with you, though I would substitute past prediction for past habit, in your analysis. Does that make sense?

6 02 2010

Wonderful thoughts but my head spins when it comes to grammar! However, I have a very basic question to ask others.

Isn’t a corpus something that consists of (for the most part) written material? Doesn’t that have some bearing on the discussion and conclusions? For example, when you say “most people say…”, how can we make that inference and leap of faith when the corpus is based on written entries? The same would hold for Google. The data is biased and for a different “field”.

Just wondering,


PS. I will say in my own defense, I’m computer literate and have compiled and programmed my own corpora. But I’m not aware of anyone who has used speech recognition to compile a corpus that is extensive (there are some small spoken corpora based on transcripts that I’m aware of).

6 02 2010
Scott Thornbury

Fair point, David, and one that is often levelled at corpus data.

The COCA corpus (http://www.americancorpus.org/) which I use a lot, claims to include the following:

“Spoken: (83 million words) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc).”

That’s a fairly extensive collection of spoken text, although you might quibble that the register is a little narrow (TV and radio). But still…

7 02 2010

That’s as large as they come (I had in mind the Michigan spoken corpus (MICASE) which I think is about 10 mil. words). But yes, IMO – still too narrow when you compare it against written text. Also, not representative (in my view) of how people actually speak – the register that is beyond the camera/performance. But I do agree that corpora are probably an underused resource for teachers! Especially when it comes to word frequency and noting change in language use.

Have you ever read George Steiner’s wonderful essay, “The Distribution of Discourse”? I read it years ago (during my poetry days) and it’s a kind of thought experiment on the extent and possibility of recording all the “thought” in the world – now that would be a huge corpus.

By the way – I’ll be making sure my students get a copy of your A-Z …..


7 02 2010
Ramesh Krishnamurthy

Hi David and Scott

Yes, there are fewer spoken corpora than written corpora, and spoken data usually forms a (much) smaller proportion of a general language corpus than written data. But one of the earliest corpora was of speech (see John Sinclair’s OSTI Report, about work done in the early 1960s).

The scarcity of spoken corpora is mainly due to the effort/expense required: in the mid-1990s, I estimated that spoken data (recording, transcribing, etc) costs 25 times as much to collect as written data. This disparity has probably increased, because, as you realised, written data is more available on the Web.

Copyright (if you take it seriously, which we tried to do) is also a greater problem, especially when many people are involved in recorded conversations, as every speaker should have veto/editorial rights. There are also problems of anonymisation.

However, despite this, the COBUILD Bank of English (Collins/Birmingham) of 2002 (448 million words; the last version I was involved in) contained:
22m words (5%) of semi-scripted radio broadcasts from National Public Radio (USA);
19m words (4%) of semi-scripted radio broadcasts from BBC World Service;
20m words (4%) of UK recorded speech: including phone-ins from Talk Radio; British Market Research interviews; volunteers wearing microphones for 1 week; university lectures, public meetings, private phone calls, social events (dinner parties, etc)
2m words (0.5%) of Professional Speech (Academic and Political; USA)
63m words (15%) TOTAL

The BNC contains 10% of spoken data and 90% of written data.
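
The shares above can be re-derived with a few lines of Python. This is only a sketch using the rounded word counts listed in this comment; the labels and the helper name `pct` are illustrative. Recomputed this way, the spoken total comes out nearer 14% than the 15% quoted, a small gap that may simply reflect rounding in the quoted figures.

```python
# Spoken components of the 2002 Bank of English, in millions of words,
# as listed in the comment above (the source figures are rounded).
TOTAL_M = 448
spoken_m = {
    "NPR radio (USA)": 22,
    "BBC World Service": 19,
    "UK recorded speech": 20,
    "Professional speech (USA)": 2,
}

def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal."""
    return round(100 * part / whole, 1)

spoken_total = sum(spoken_m.values())
for name, m in spoken_m.items():
    print(f"{name}: {m}m words, {pct(m, TOTAL_M)}%")
print(f"Spoken total: {spoken_total}m words, {pct(spoken_total, TOTAL_M)}%")
```

Either way, the spoken proportion is of the same order as the BNC’s 10%.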

As far as I know, speech recognition software has not yet become widely used for data collection. I understand it needs to be trained separately for each speaker (just as we need to train optical scanners on each text, as fonts, print quality, paper quality, etc vary so much). However, researchers are now working on audio and video corpora, and software is being developed to search and analyse these modes of data.

Best wishes

24 02 2010
Emma Lay

As a teacher working on corpora-related projects, I am thoroughly enjoying reading these posts and would like to invite readers’ thoughts/comments on the future of corpora. As technology develops, I hope there will be better ways of acquiring more spoken text to highlight native speakers’ usage of language (and not just English) on an everyday level. The Talk radio elements that Ramesh mentioned are certainly interesting and can really show learners that native speakers are quite liberal with grammar ‘rules’.

I have a few questions (maybe too many for one thread…oops!)

1. Why do we think teachers are sometimes unaware of/resistant to using corpora in the classroom?
2. What kinds of developments would we like to see in corpora in the future?
3. How can corpora be improved/set up for pedagogical uses?
4. Do you think we will see coursebooks that are not only more corpora-based/-informed but actually incorporate tasks that help students use corpora themselves to enhance their learning?

I look forward to reading your ideas! 🙂

25 02 2010
Scott Thornbury

Hi Emma, thanks for posting.

These are questions that I have often pondered myself. I think one issue is simply technological: teachers who don’t have “at their fingertips” access to a computer, projector, and screen in their classrooms are not likely to refer to corpora in a spontaneous way, which is one of their great advantages. I remember when I was teacher training in a “smart classroom” for the first time, it was a real boon to be able to answer my trainees’ questions (like, for example, is “I’m loving it” normal usage?) by going online and checking the corpus. Another reason is that I think many teachers are not aware of how many corpus tools are available online and free. I now include a unit in my online MA language analysis course which introduces teachers to some of these sites, and sets easy-peasy tasks, such as identifying the statistically significant collocations in a text.

As to how corpora can be improved (now that you ask) for classroom purposes, would it be a good idea to have a corpus whose content was controlled, e.g. to within a specific vocabulary range, such as the top 2000 most frequent words of English? This way learners wouldn’t have to deal with that kind of rogue lexis that infiltrates concordance lines and can be very distracting. Just a thought.
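
One way such a vocabulary-controlled corpus might be approximated is to filter concordance lines against a frequency list. The sketch below is an illustration only: `ALLOWED` is a tiny stand-in for a real top-2000 word list, and `within_vocab` is a hypothetical helper, not part of any existing corpus tool.

```python
import re

# Tiny stand-in for a real top-2000 frequency list (assumption for illustration).
ALLOWED = {"i", "am", "going", "to", "go", "the", "shops", "he", "is",
           "away", "not", "it", "was"}

def tokens(line):
    """Lowercase word tokens; an apostrophe is kept inside a word."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", line.lower())

def within_vocab(line, allowed, max_unknown=0):
    """True if at most max_unknown tokens fall outside the allowed list."""
    unknown = [t for t in tokens(line) if t not in allowed]
    return len(unknown) <= max_unknown

lines = [
    "I am going to go to the shops",
    "The debate on energy policy is not going to go away",
]
# Keep only lines made up entirely of words on the list.
kept = [line for line in lines if within_vocab(line, ALLOWED)]
print(kept)
```

Raising `max_unknown` would let through lines with one or two off-list words, which may be a gentler compromise than excluding them altogether.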

Finally, the question as to whether coursebooks will incorporate corpus-based tasks — I see this developing first in specialised language areas such as English for Academic Purposes or English for Legal Purposes. In fact I remember seeing recently an article on how corpora were used in an ESP classroom — but I can’t remember, for the life of me, where!

Ah, yes, here it is: http://llt.msu.edu/default.html
That’s ‘Language Learning and Technology’, volume 14 number one: ‘Corpus Assisted Creative Writing…’

Hope these rather random thoughts are of use!

25 02 2010

Hi Scott,

I just have to ask a question about a remark you made in the post above.

What is the position on “I’m loving it.”?

My understanding was that stative verbs aren’t usually used in the continuous form (and yes, I know there are a few exceptions). To me, “I’m loving it” just sounds plain wrong. I can’t ever really remember hearing this phrase before the McDonald’s commercials.

I think it was in the nineties that the comedian Harry Enfield used this style in a series of sketches to highlight the ignorance of the character he was portraying.

Is this a case of people using the form incorrectly, or are there sufficient occurrences of it to make this usage acceptable? I know that the moment stative verbs crop up as a theme in a lesson, this example will be thrown at me.

Alternatively, is this a British English attitude? I know American and Australian native speakers; they all consider it wrong, but the Americans have less trouble with it than the Australians.

Forgive me for roping you into this, but the words of a “guru” would be useful in our long-running discussion on this. (Even if they prove me wrong!)


25 02 2010
Nick Bilbrough

There is a certain irony too in that as corpus technology has developed, so too has the interest in ELF, and the idea that native speaker norms are not necessarily the most appropriate models in all teaching contexts.

I believe some materials (e.g. the Natural English series of coursebooks) were at least partly developed with reference to data on how advanced English learners expressed things, rather than purely native speakers.

I’d like to get my hands on a corpus of spoken English based on advanced learners’ utterances. Does such a thing exist?


26 02 2010
Peter Fenton

I’ve recently become more interested in how I can use corpora with my students in a way that’s useful and relevant to them. I’m only just getting to grips with how to use the BNC and COCA and find that they both take a bit of getting used to. I think if corpora are to be useful to students, then they need to be user-friendly enough for students to access in their own time. Unfortunately, I’m not convinced that either the BNC or the COCA is that user-friendly to the vast majority of English language learners.

More realistic I’ve found is training learners how to use monolingual dictionaries and the frequency information which is included in them.

Both Macmillan (http://www.macmillandictionary.com/) and Longman (http://www.longmandictionariesonline.com/) include this information in their free online versions. Longman has the advantage of distinguishing between spoken and written forms whereas Macmillan gives information on the first 7,500 words compared to 3,000 with Longman.

Oxford also has a free collocation checker (http://www.lixiaolai.com/ocd/) which I’ve found useful and accessible for learners.

While none of these resources are anywhere near as comprehensive as the BNC and the COCA, they are at least accessible to learners, which seems to me to be a pre-requisite for usefulness.

I’ve found them particularly invaluable when using authentic texts as I’ll often ask students to find lexis that they want to learn from them for homework. With these resources, they at least have the means to try and gauge what may or may not be useful.

13 03 2010
Elizabeth Anne

I’m probably off-subject, but as regards using corpora in teaching, I’d like to mention a specific concordancer we have created for our students:

Teaching context: 350 second year university students with widely varying levels of English, 22 teachers and a common exam at the end of the year. (French National Education)

We have selected 700 words which are in either the first 2000 of the General Service List or in Coxhead’s Academic Word List which all students have to know by the end of the year (!) … then, with a simple purpose-built concordancer http://elang.ujf-grenoble.fr/anglais/accueil/outils/concordancier/ the students (and the teachers) have the whole of the internet as corpus.

25 02 2010
Scott Thornbury

Fair question, Olaf.

The problem of punching ‘I’m loving it’ into a corpus is, of course, that the phrase has become institutionalised and therefore has skewed the frequency data (a good example of how language change is effected through idiomatic – even idiolectic – usage – analogous to the way that evolutionary change is triggered by genetic mutations). You’d need to search for loving as a participle on its own, to see if it occurs in other syntactic constructions. Here’s what I found (using the COCA corpus at http://www.americancorpus.org/) by keying in [BE] + loving:

was loving (66 examples)
are loving (66)
’m loving (61)
is loving (40)
’re loving (36)
’s loving (33)
were loving (18)
been loving (16)
am loving (11)

Some examples:

1. he was loving the sheer physicality of being alive.
2. She belts out tunes with all of her might, and audiences are loving it.
3. Hey, Bob, I ’m loving the book. I haven’t finished it, but I’m really enjoying it
4. That’s right, and I ’m loving it here in Atlanta.
5. Suttle is scared to death while Bradley is loving every minute.
6. I guess we’re the new American family, and we ’re loving it, ” she joked.
7. She went from big business to pro bowling, and she ’s loving every minute of it.
8. I got out there, the girls were all screaming and I realized they were loving me, ” Kelly says.
9. The woman I have been loving is not you, but another woman in your shape,
10. O’Keeffe wrote to her friend, ” I am loving the plains more than ever it seems – and the SKY

Some comments: 1. the figures are skewed because there are many examples of verb to be + loving as an adjective (they were loving parents)
2. some examples do seem to be reproducing, or playing on, the McDonald’s slogan, e.g. #6 above.
3. there are some collocations that seem to work particularly well with loving, as in loving every minute [of it]
4. Most of these examples are from spoken English.
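
A search like [BE] + loving can be roughly imitated with a regular expression. The sketch below runs over a few of the lines quoted above rather than the corpus itself; the pattern and the helper name `find_be_loving` are illustrative assumptions, not COCA’s own query syntax. Note that it also catches the adjectival use (they were loving parents), which is why raw counts like these need checking by hand.

```python
import re

# Forms of BE, either as full words or as clitics ('m, 're, 's),
# followed by "loving" and, optionally, the next word.
BE_LOVING = re.compile(
    r"(\b(?:am|is|are|was|were|been)|['’‘](?:m|re|s))\s+loving\b(?:\s+(\w+))?",
    re.IGNORECASE,
)

def find_be_loving(lines):
    """Return (BE form, following word) pairs for each BE + loving hit."""
    hits = []
    for line in lines:
        for m in BE_LOVING.finditer(line):
            # Normalise curly apostrophes so clitics are counted together.
            form = m.group(1).lower().replace("’", "'").replace("‘", "'")
            hits.append((form, m.group(2)))
    return hits

sample = [
    "he was loving the sheer physicality of being alive.",
    "She belts out tunes with all of her might, and audiences are loving it.",
    "Hey, Bob, I 'm loving the book.",
    "Suttle is scared to death while Bradley is loving every minute.",
    "They were loving parents.",  # adjectival use, still matched
]

for form, following in find_be_loving(sample):
    print(f"{form} loving -> {following}")
```

Looking at the word that follows each hit (it, every, parents) is a quick, if crude, way of separating the progressive from the adjective.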

But, on the whole, I think we can safely say that the progressive form of love is well established, and like every progressive form, it connotes a dynamic activity that is seen as having stages, as evolving (the last example is a good instance of this).

Now, you see how wonderful corpora are! (I’m loving them!)

11 03 2010
Scott Thornbury

On the subject of “I’m lovin’ it” – here’s some more comment and history:


(thanks to Karenne Sylvester for this link)

4 03 2010

Dear Scott,

I have just found out about your blog and am having the greatest time exploring the posts.

Well, some teachers and I have recently engaged in a discussion about present perfect and present perfect continuous and I thought that maybe you could help us. Here is the problem:

With this particular book we are using, we first teach (intermediate) students to use the present perfect continuous with “since” and “for” to talk about actions that started in the past and are still going on. Then, in the following unit we teach them to use the present perfect to talk about repeated actions in the past without specific time. Finally, at the end of this unit, the book has a note saying that both pres. perf. cont. and pres. perf. simple can be used with “since” and “for” to talk about actions that started in the past and are still happening (as in “I have lived/have been living in this city for/since…”).

So our question was: is there any situation in which we can only use the pres. perf. cont. and not the pres. perf. simple? We learned that some grammarians say that if the action is long, the preference is for simple (I have lived here all my life) and if the action is short, there is preference for the continuous form (I’ve been living here for a few months). However, if I started reading a book 5 years ago and am still reading it (it’s a very long book – lol), we feel that we could only say “I have been reading this book for 5 years” and not “I have read it for 5 years”, despite the fact that 5 years reading a book is a long action. Is there any reason why? Can corpora help us here?

Thanks in advance!

Kindest regards from Brazil,


5 03 2010
Scott Thornbury

Ronaldo asked: “…is there any situation in which we can only use the pres. perf. cont. and not the pres. perf. simple?”

Very simply, you might choose the pp. cont. rather than the pp. simple when you DON’T wish to imply that the situation is complete. While the pp.cont. doesn’t necessarily mean that a situation is still in progress (I’ve been sleeping would be an impossible statement, if that were the case!), it DOES leave open the possibility that the situation is still evolving, hence incomplete. So, given the choice of:

1. Who’s been eating my porridge?
2. Who’s eaten my porridge?

both are grammatically ‘correct’ (in that both are well-formed) but only (1) implies that there is some porridge left!

Also, the use of the continuous can change what we might normally think of as a stative verb into a dynamic one, with consequent change in meaning. Thus, there is a big difference between these two sentences:

3. Have you seen my daughter?
4. Have you been seeing my daughter?

where, in (4) the dynamic use of the verb to see can mean ‘to go out with’.

But, in essence, the choice between pp. cont. and pp. simple is no different from the choice between the present continuous and the present simple, or the past continuous and the past simple: in all cases the choice of the continuous “opens up” the situation for inspection (which is why the continuous is called an aspect, not a tense). The implications that follow from this “opening up” depend a lot on the choice of the verb, and on the context – something I’d like to address in a future posting (A is for Aspect). But thanks for reminding me!

5 03 2010

Thanks for the enlightening and prompt response, Scott.


4 12 2011

Dear Mr Thornbury,

Sentences like
Have you ever been driving when you get almost hit by a tornado

are described in Renaat Declerck’s book The Grammar of the English Tense System.

Kindest regards from Poland

