Reliable high-quality frequency lists

We provide high-quality word frequency lists (also called dictionaries or lexicons) in many languages. The lists are generated from very large collections of authentic text (text corpora) produced by real users of the language. For some languages, our corpora contain as many as 60 billion words.

Wordlist quality

A relatively small corpus is sufficient to generate a list of the 2,000 most frequent words, or a lexicon of 3,000 or 5,000 words, because such words appear frequently enough in any text.

A very large text database (corpus) is required to provide reliable frequency information for rare and infrequently used words as well. The only viable way to build corpora of billions of words is to download content from the web automatically. Lexical Computing developed a sophisticated procedure for collecting only linguistically valuable content from the web. A series of tools is used to focus on the right content and to perform deduplication and cleaning, which ensures that the statistics are not skewed. This blog post gives more details.

Wordlist size

We are able to generate frequency lists of millions of unique words. The actual size depends on the specifications. By default, we do not include any word that appears fewer than 5 times in the corpus, because such words are typically noise without any linguistic value. Clients can specify different filtering options.
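As a rough illustration of the default threshold described above, the following sketch counts word forms in a text and drops anything that occurs fewer than 5 times. The function name `build_frequency_list`, the whitespace tokenisation and the lowercasing are assumptions made for this example only; they do not describe how the production lists are generated.

```python
from collections import Counter


def build_frequency_list(text, min_freq=5):
    """Return (word, count) pairs with count >= min_freq, most frequent first."""
    # Naive tokenisation for illustration: lowercase and split on whitespace.
    counts = Counter(text.lower().split())
    kept = [(word, n) for word, n in counts.items() if n >= min_freq]
    # Sort by descending frequency, then alphabetically for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


# Toy input: "the" and "cat" occur 6 times, "sat" 5 times, "mat" only twice.
sample = "the cat " * 6 + "sat " * 5 + "mat " * 2
print(build_frequency_list(sample))
```

With the default `min_freq=5`, the word "mat" is filtered out of the toy input; passing a different `min_freq` mimics a client-specified filtering option.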

Enriched frequency wordlists

We are also able to provide lexicons with additional information such as POS tags, lemmas, probabilities of the next word, or any other statistics or morphological information.
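To give an idea of what "probabilities of the next word" can mean in practice, here is a minimal sketch that estimates them from adjacent word pairs by simple relative frequency. The function name `next_word_probabilities` and the toy token list are hypothetical, and real lexicons may well use a different estimation method (for example with smoothing).

```python
from collections import Counter, defaultdict


def next_word_probabilities(tokens):
    """Estimate P(next | word) from adjacent token pairs (relative frequency)."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        pair_counts[current][following] += 1
    probabilities = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probabilities[current] = {w: n / total for w, n in followers.items()}
    return probabilities


# Toy token sequence; a real corpus would contain billions of tokens.
tokens = "a b a c a b".split()
probs = next_word_probabilities(tokens)
print(probs["a"])
```

In the toy sequence, "a" is followed by "b" twice and by "c" once, so the estimates for "a" are two thirds and one third respectively.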

Sample data

The easiest way to obtain sample data is to register for a free trial account in Sketch Engine and use the wordlist tool to generate a wordlist or a lexicon to your specifications. The advanced tab of the wordlist tool allows detailed specifications to be set.

Prices

We will provide a quotation based on the exact specifications of the dictionary or lexicon and its intended use.

Download

The lexicon will be made available for download via a dedicated link within the agreed period of time. It normally takes a week or two to generate the data. Very complex wordlists can be computationally demanding and may take longer to produce.

An example of an Estonian frequency word list showing the word form, lemma, grammatical tag and frequency.

eestlased eestlane S 12529 
esindaja esindaja S 12471 
edukalt edukalt D 12419 
eestlaste eestlane S 12370 
esineb esinema V 12126 
esindajad esindaja S 11809 
ehitada ehitama V 11763
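A list in this layout is straightforward to process. The sketch below assumes that the four columns are separated by whitespace and that no word form or lemma contains a space; the helper names `parse_line` and `lemma_frequencies` are hypothetical. It reads the sample rows above and sums the frequencies of word forms that share a lemma.

```python
from collections import Counter

# The seven sample rows shown above: word form, lemma, tag, frequency.
SAMPLE = """\
eestlased eestlane S 12529
esindaja esindaja S 12471
edukalt edukalt D 12419
eestlaste eestlane S 12370
esineb esinema V 12126
esindajad esindaja S 11809
ehitada ehitama V 11763
"""


def parse_line(line):
    """Split one row into (word form, lemma, tag, frequency)."""
    form, lemma, tag, freq = line.split()
    return form, lemma, tag, int(freq)


def lemma_frequencies(text):
    """Sum the frequencies of all word forms that share a lemma."""
    totals = Counter()
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            _, lemma, _, freq = parse_line(line)
            totals[lemma] += freq
    return totals


totals = lemma_frequencies(SAMPLE)
print(totals.most_common(2))
```

The totals cover only the seven rows in the sample, so they illustrate the aggregation and are not corpus-wide lemma frequencies.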

Word database, lexicon or dictionary available in these languages

Please contact us for a language database in a different language.

Supported languages

Afrikaans
Albanian
Amazigh
Amharic
Ancient Greek
Arabic
Armenian
Azerbaijani
Basque
Belarusian
Bengali
Bosnian
Breton
Bulgarian
Burmese
Cantonese
Catalan
Cebuano
Chinese Simplified
Chinese Traditional
Croatian
Czech
Danish
Dutch
English
Esperanto
Estonian
Filipino
Finnish
French
Frisian
Georgian
German
Greek
Gujarati
Hausa (Boko)
Hebrew
Hindi
Hungarian
Icelandic
Igbo
Indonesian
Irish
Italian
Japanese
Kalaamaya
Kannada
Kazakh
Khmer
Korean
Kurdish (Kurmanji)
Kurdish (Sorani)
Kuwarra
Kyrgyz
Lao
Latin
Latvian
Limburgish
Lithuanian
Macedonian
Maduwongga
Malay
Malayalam
Maldivian
Maltese
Mankulatjarra
Manyjiljar
Maori
Marathi
Marlpa
Mirning
Mongolian
Montenegrin
N'Ko
Ndebele
Nepali
Newspeak
Ngaanyatjarra
Ngaju
Ngalia
Nganta
Northern Sotho
Norwegian Bokmål
Norwegian
Norwegian Nynorsk
Nyakinyaki
Oromo
Pashto
Pintupi
Pitjantjatjara
Polish
Portuguese
Punjabi (Gurmukhi)
Punjabi (Shahmukhi)
Romanian
Russian
Samoan
Sanskrit (romanised)
Scottish Gaelic
Serbian
Serbian (Latin)
Setswana
Sinhalese
Slovak
Slovenian
Somali
Spanish
Swahili
Swazi
Swedish
Syriac
Tagalog
Tajik
Talysh
Tamil
Tatar
Telugu
Thai
Tibetan
Tigrinya
Tjalkatjarra
Tjupan
Tsonga
Turkish
Turkmen
Ukrainian
Urdu
Uzbek
Vietnamese
Wangkatja
Warlpiri
Welsh
Wudjaarri
Xhosa
Yankunytjatjara
Yiddish
Yoruba
Zulu