How many words are in the English language (9)?
We call a 1-gram ‘common’ if its frequency is greater than one per billion. (This corresponds to the frequency of the words listed in leading dictionaries (7).) We compiled a list of all common 1-grams in 1900, 1950, and 2000 based on the frequency of each 1-gram in the preceding decade. These lists contained 1,117,997 common 1-grams in 1900, 1,102,920 in 1950, and 1,489,337 in 2000.
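The selection rule can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `common_1grams`, the input layout (per-year counts for each 1-gram plus per-year corpus totals), the reading of "preceding decade" as the ten years before the target year, and the toy numbers are all assumptions. Only the one-per-billion threshold comes from the text.

```python
from typing import Dict, Set

COMMON_THRESHOLD = 1e-9  # "greater than one per billion", as stated in the text


def common_1grams(
    counts: Dict[str, Dict[int, int]],
    totals: Dict[int, int],
    year: int,
    window: int = 10,
) -> Set[str]:
    """Return the 1-grams whose pooled frequency over the preceding decade
    exceeds the threshold.

    Assumption: the "preceding decade" is the `window` years before `year`.
    """
    years = range(year - window, year)
    corpus_size = sum(totals.get(y, 0) for y in years)
    if corpus_size == 0:
        return set()
    common = set()
    for gram, by_year in counts.items():
        occurrences = sum(by_year.get(y, 0) for y in years)
        if occurrences / corpus_size > COMMON_THRESHOLD:
            common.add(gram)
    return common


# Toy data (hypothetical): 2 billion tokens per year from 1990 to 1999.
toy_totals = {y: 2_000_000_000 for y in range(1990, 2000)}
toy_counts = {
    "frequent_gram": {y: 40 for y in range(1990, 2000)},  # 400 / 2e10 = 2e-8
    "rare_gram": {1995: 5},                               # 5 / 2e10 = 2.5e-10
}
toy_common = common_1grams(toy_counts, toy_totals, 2000)
```

Pooling counts over the whole decade, rather than averaging yearly frequencies, is a design choice of this sketch; it keeps a single sparse year from dominating the estimate.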
Not all common 1-grams are English words. Many fell into three non-word categories: (i) 1-grams with non-alphabetic characters (‘l8r’, ‘3.14159’); (ii) misspellings (‘becuase’, ‘abberation’); and (iii) foreign words (‘sensitivo’).
To estimate the number of English words, we manually annotated random samples from the lists of common 1-grams (7) and determined what fraction were members of the above non-word categories. The non-word fraction ranged from 51% of all common 1-grams in 1900 to 31% in 2000.
Using this technique, we estimated the number of words in the English lexicon as 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000. The lexicon is enjoying a period of enormous growth: the addition of ~8500 words/year has increased the size of the language by over 70% during the last fifty years (Fig. 2A).
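The arithmetic behind these estimates can be checked directly from the figures quoted in the text. The sketch below is only a consistency check, not the authors' estimation procedure, and the variable names are illustrative. It scales each list of common 1-grams by the fraction judged to be real words, then derives the growth rate from the reported lexicon sizes. The non-word fraction for 1950 is not given in the text, so that year is not recomputed.

```python
# Figures quoted in the text.
common_counts = {1900: 1_117_997, 1950: 1_102_920, 2000: 1_489_337}
non_word_fraction = {1900: 0.51, 2000: 0.31}  # 1950 is not reported in the text
reported_lexicon = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}

# Scale each list by the fraction of sampled 1-grams that were real words.
recomputed = {
    year: round(common_counts[year] * (1 - fraction))
    for year, fraction in non_word_fraction.items()
}

# Growth between 1950 and 2000, from the reported lexicon sizes.
words_per_year = (reported_lexicon[2000] - reported_lexicon[1950]) / 50
relative_growth = reported_lexicon[2000] / reported_lexicon[1950] - 1

print(recomputed)
print(words_per_year)
print(f"{relative_growth:.1%}")
```

The recomputed sizes fall within about 1% of the reported estimates. The small difference presumably reflects rounding of the percentages quoted in the text. The growth figures match the stated ~8500 words/year and "over 70%".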
Fig. 2 Culturomics has profound consequences for the study of language, lexicography, and grammar. (A) The size of the English lexicon over time. Tick marks show the number of single words in three dictionaries (see text). (B) Fraction of words in the lexicon ...
Notably, we found more words than appear in any dictionary. For instance, the 2002 Webster's Third New International Dictionary [W3], which keeps track of the contemporary American lexicon, lists approximately 348,000 single-word wordforms (10); the American Heritage Dictionary of the English Language, Fourth Edition (AHD4) lists 116,161 (11). (Both contain additional multi-word entries.) Part of this gap is because dictionaries often exclude proper nouns and compound words (‘whalewatching’). Even accounting for these factors, we found many undocumented words, such as ‘aridification’ (the process by which a geographic region becomes dry), ‘slenthem’ (a musical instrument), and, appropriately, the word ‘deletable’.
This gap between dictionaries and the lexicon results from a balance that every dictionary must strike: it must be comprehensive enough to be a useful reference, but concise enough to be printed, shipped, and used. As such, many infrequent words are omitted. To gauge how well dictionaries reflect the lexicon, we ordered our year 2000 lexicon by frequency, divided it into eight deciles (ranging upward from 10⁻⁹), and sampled each decile (7). We manually checked how many sample words were listed in the OED (12) and in the Merriam-Webster Unabridged Dictionary [MWD]. (We excluded proper nouns, since neither OED nor MWD lists them.) Both dictionaries had excellent coverage of high-frequency words, but less coverage for frequencies below 10⁻⁶: 67% of words in the 10⁻⁹ range were listed in neither dictionary (Fig. 2B). Consistent with Zipf's famous law, a large fraction of the words in our lexicon (63%) were in this lowest frequency bin. As a result, we estimated that 52% of the English lexicon – the majority of the words used in English books – consists of lexical ‘dark matter’ undocumented in standard references (12).
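A back-of-the-envelope decomposition shows how much of the 52% figure the lowest frequency bin accounts for. The sketch assumes the overall estimate is a lexicon-weighted average of the undocumented fraction in each frequency bin. The text implies this but does not spell it out, and the authors' actual calculation may differ. The variable names are illustrative, and the implied rate for the higher-frequency bins is an inference from the three quoted numbers, not a reported result.

```python
# Numbers quoted in the text.
lowest_bin_share = 0.63         # share of the lexicon in the lowest frequency bin
lowest_bin_undocumented = 0.67  # fraction of that bin listed in neither dictionary
overall_undocumented = 0.52     # reported overall 'dark matter' estimate

# Assumption: overall = sum over bins of (bin share * undocumented fraction in bin).
contribution_lowest = lowest_bin_share * lowest_bin_undocumented
contribution_rest = overall_undocumented - contribution_lowest

# Average undocumented fraction that the remaining bins would need, under that assumption.
implied_rate_rest = contribution_rest / (1 - lowest_bin_share)

print(f"lowest bin contributes {contribution_lowest:.1%} of the lexicon")
print(f"remaining bins contribute {contribution_rest:.1%}")
print(f"implied average undocumented rate in remaining bins: {implied_rate_rest:.1%}")
```

Under this reading, roughly four-fifths of the undocumented share comes from the lowest frequency bin alone, which is consistent with the statement that coverage degrades mainly at low frequencies.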
To keep up with the lexicon, dictionaries are updated regularly (13). We examined how well these changes corresponded with changes in actual usage by studying the 2077 1-gram headwords added to AHD4 in 2000. The overall frequency of these words, such as ‘buckyball’ and ‘netiquette’, has soared since 1950: two-thirds exhibited recent, sharp increases in frequency (>2X from 1950-2000). Nevertheless, there was a lag between lexicographers and the lexicon. Over half the words added to AHD4 were part of the English lexicon a century ago (frequency >10⁻⁹ from 1890-1900). In fact, some newly-added words, such as ‘gypseous’ and ‘amplidyne’, have already undergone a steep decline in frequency.
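The two criteria used here (a more than twofold rise between 1950 and 2000, and a frequency above one per billion in 1890-1900) can be expressed as a small classifier. This is a hedged sketch: the function `classify_headword`, its inputs, and the example frequencies are hypothetical, and treating a word that was absent in 1950 as a sharp riser is an assumption of the sketch rather than something the text states.

```python
from typing import Tuple

LEXICON_THRESHOLD = 1e-9  # frequency cutoff used in the text
SHARP_INCREASE = 2.0      # ">2X from 1950-2000"


def classify_headword(
    freq_1890s: float, freq_1950: float, freq_2000: float
) -> Tuple[bool, bool]:
    """Return (rose_sharply, in_lexicon_a_century_ago) for one headword."""
    if freq_1950 > 0:
        rose_sharply = freq_2000 / freq_1950 > SHARP_INCREASE
    else:
        # Assumption: a word unseen in 1950 but present in 2000 counts as a sharp rise.
        rose_sharply = freq_2000 > 0
    in_lexicon_a_century_ago = freq_1890s > LEXICON_THRESHOLD
    return rose_sharply, in_lexicon_a_century_ago


# Hypothetical frequency profiles (not measurements from the corpus).
new_coinage = classify_headword(freq_1890s=0.0, freq_1950=0.0, freq_2000=5e-8)
fading_term = classify_headword(freq_1890s=3e-8, freq_1950=2e-8, freq_2000=1e-8)
late_bloomer = classify_headword(freq_1890s=2e-9, freq_1950=4e-9, freq_2000=2e-8)
```

A headword like `late_bloomer` satisfies both criteria at once, which is how a word can show a recent surge and still have been in the lexicon a century before it entered the dictionary.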
Not only must lexicographers avoid adding words that have fallen out of fashion, they must also weed obsolete words from earlier editions. This is an imperfect process. We found 2220 obsolete 1-gram headwords (‘diestock’, ‘alkalescent’) in AHD4. Their mean frequency declined throughout the 20th century, and dipped below 10⁻⁹ decades ago (Fig. 2, Inset).
Our results suggest that culturomic tools will aid lexicographers in at least two ways: (i) finding low-frequency words that they do not list; and (ii) providing accurate estimates of current frequency trends to reduce the lag between changes in the lexicon and changes in the dictionary.