Frequency lists
Frequency lists extracted from the WaCky corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in '.7z' compressed format.
- Unigram lists. These are the complete lists, i.e. we did not perform any post-processing on them.
- Bigram lists. To produce cleaner, more usable lists, we post-processed the raw data in two steps: 1) we lowercased all words, and 2) we discarded every pair in which one or both word fields contained non-alphabetic characters other than dashes and apostrophes. As an exception, we kept pairs in which one of the two fields was a single punctuation mark.
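The bigram post-processing described above can be sketched roughly as follows. This is only an illustration, not the script actually used to build the lists: the function names are hypothetical, and several details are assumptions, namely that a word field must contain at least one letter, that "punctuation mark" means an ASCII punctuation character, that at most one field of a kept pair may be punctuation, and that counts of pairs that become identical after lowercasing are summed.

```python
import string
from collections import Counter

# Characters tolerated inside a word besides letters (from the description above).
ALLOWED_EXTRA = {"-", "'"}


def is_wordlike(token):
    """True if the token has at least one letter (assumption) and consists
    only of letters, dashes and apostrophes."""
    return (any(ch.isalpha() for ch in token)
            and all(ch.isalpha() or ch in ALLOWED_EXTRA for ch in token))


def is_single_punct(token):
    """True if the token is exactly one ASCII punctuation character (assumption)."""
    return len(token) == 1 and token in string.punctuation


def keep_pair(w1, w2):
    """Keep a pair if both fields are word-like, or if one field is word-like
    and the other is a single punctuation mark."""
    if is_wordlike(w1) and is_wordlike(w2):
        return True
    return ((is_wordlike(w1) and is_single_punct(w2))
            or (is_single_punct(w1) and is_wordlike(w2)))


def postprocess(raw_counts):
    """Lowercase both fields, filter the pairs, and sum the counts of pairs
    that coincide after lowercasing (the merging is an assumption)."""
    cleaned = Counter()
    for (w1, w2), freq in raw_counts:
        w1, w2 = w1.lower(), w2.lower()
        if keep_pair(w1, w2):
            cleaned[(w1, w2)] += freq
    return cleaned


# Toy input: ((first word, second word), frequency) tuples, invented for illustration.
raw = [
    (("The", "cat"), 3),
    (("the", "cat"), 2),
    (("cat", ","), 4),
    (("3", "cats"), 1),
    (("l'", "amico"), 5),
    ((",", "."), 2),
    (("e-mail", "x2"), 1),
]
result = postprocess(raw)
```

On this toy input, only the pairs `the cat`, `cat ,` and `l' amico` survive; the pairs containing digits and the punctuation-only pair are dropped.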
deWaC (German)
- deWaC unigrams (lemmas)
- deWaC unigrams (words)
- deWaC bigrams (lemmas)
- deWaC bigrams (words)
frWaC (French)
- frWaC unigrams (lemmas)
- frWaC unigrams (words)
itWaC (Italian)
- itWaC unigrams (lemmas)
- itWaC unigrams (words)
- itWaC bigrams (lemmas)
- itWaC bigrams (words)
ukWaC (English)
- ukWaC unigrams (lemmas)
- ukWaC unigrams (words)
- ukWaC bigrams (lemmas)
- ukWaC bigrams (words)
Repubblica (Italian)
- Repubblica unigrams (words)
- Repubblica unigrams (lemmas)