Frequency lists
These frequency lists were extracted from the WaCky corpora. Lists of words and of lemmas are provided, sorted by frequency. All lists are compressed in '.7z' format.
- Unigram lists. These are the complete lists, i.e. we did not perform any post-processing on them.
- Bigram lists. To produce cleaner, more usable lists, we post-processed the raw data in two steps: 1) we lowercased all words, and 2) we discarded every pair in which one or both word fields contained non-alphabetic characters other than dashes and apostrophes. As an exception, we kept pairs in which one of the two fields was a single punctuation mark.
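
The bigram post-processing described above can be sketched roughly as follows. This is only an illustration, not the script actually used to build the lists. The function names are hypothetical, and several details are assumptions: "alphabetic" is taken to mean any Unicode letter (so that German, French and Italian characters pass), a word field must contain at least one letter, "punctuation mark" is limited to ASCII punctuation, and a pair made of two punctuation marks is discarded.

```python
import string

# Characters tolerated inside a word field besides letters.
ALLOWED_EXTRA = {"-", "'"}


def is_clean_word(token: str) -> bool:
    """True if the token has at least one letter and otherwise only
    letters, dashes or apostrophes (assumption: Unicode letters count)."""
    has_letter = any(ch.isalpha() for ch in token)
    only_allowed = all(ch.isalpha() or ch in ALLOWED_EXTRA for ch in token)
    return has_letter and only_allowed


def is_single_punct(token: str) -> bool:
    """True if the token is exactly one ASCII punctuation character."""
    return len(token) == 1 and token in string.punctuation


def normalize_bigram(w1: str, w2: str):
    """Lowercase both fields; return the pair if it passes the filter,
    or None if it should be discarded."""
    w1, w2 = w1.lower(), w2.lower()
    # Regular case: both fields are clean words.
    if is_clean_word(w1) and is_clean_word(w2):
        return (w1, w2)
    # Exception: one field is a single punctuation mark, the other a clean word.
    # (Assumption: pairs of two punctuation marks are dropped.)
    if is_clean_word(w1) and is_single_punct(w2):
        return (w1, w2)
    if is_single_punct(w1) and is_clean_word(w2):
        return (w1, w2)
    return None


if __name__ == "__main__":
    samples = [("The", "Cat"), ("cat", ","), ("cat", "2009"), ("don't", "well-known")]
    for pair in samples:
        print(pair, "->", normalize_bigram(*pair))
```

Under these assumptions a pair such as ("cat", "2009") is dropped, while ("cat", ",") is kept because one field is a single punctuation mark.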
deWaC (German)
- deWaC unigrams (lemmas)
- deWaC unigrams (words)
- deWaC bigrams (lemmas)
- deWaC bigrams (words)
frWaC (French)
- frWaC unigrams (lemmas)
- frWaC unigrams (words)
itWaC (Italian)
- itWaC unigrams (lemmas)
- itWaC unigrams (words)
- itWaC bigrams (lemmas)
- itWaC bigrams (words)
ukWaC (English)
- ukWaC unigrams (lemmas)
- ukWaC unigrams (words)
- ukWaC bigrams (lemmas)
- ukWaC bigrams (words)
Repubblica (Italian)
- Repubblica unigrams (lemmas)
- Repubblica unigrams (words)