Frequency lists

These frequency lists were extracted from the WaCky corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are distributed as '.7z' compressed archives.

  • Unigram lists. These are the complete lists, i.e. we did not perform any post-processing on them.
  • Bigram lists. To produce cleaner, more usable lists, we post-processed the raw data in two steps: 1) we lowercased all words, and 2) we discarded every pair in which one or both word fields contained non-alphabetical characters other than dashes and apostrophes. As an exception, we kept the pairs in which one of the two fields was a single punctuation mark.
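
The bigram post-processing described above can be sketched roughly as follows. This is an illustrative reconstruction in Python, not the script actually used to build the lists. The function names (`is_wordlike`, `is_single_punct`, `clean_bigram`, `merge_counts`) are hypothetical, and several details are assumptions: that a "word" must contain at least one letter, that "punctuation mark" means a single character in a Unicode punctuation category, that the punctuation exception applies only when the other field is a valid word, and that frequencies of pairs that become identical after lowercasing are summed.

```python
import unicodedata
from collections import Counter

# Characters tolerated inside a word in addition to letters.
ALLOWED_EXTRA = {"-", "'"}


def is_wordlike(token):
    """True if the token has only letters, dashes and apostrophes.

    Requiring at least one letter is an assumption of this sketch.
    """
    return any(ch.isalpha() for ch in token) and all(
        ch.isalpha() or ch in ALLOWED_EXTRA for ch in token
    )


def is_single_punct(token):
    """True if the token is exactly one Unicode punctuation character."""
    return len(token) == 1 and unicodedata.category(token).startswith("P")


def clean_bigram(first, second):
    """Return the lowercased pair if it passes the filter, else None."""
    first, second = first.lower(), second.lower()
    word1, word2 = is_wordlike(first), is_wordlike(second)
    if word1 and word2:
        return first, second
    # Exception: one field is a single punctuation mark, the other a word
    # (reading "one of the two fields" as "exactly one" is an assumption).
    if (word1 and is_single_punct(second)) or (is_single_punct(first) and word2):
        return first, second
    return None


def merge_counts(rows):
    """Filter (first, second, frequency) rows and sum counts per kept pair."""
    counts = Counter()
    for first, second, freq in rows:
        pair = clean_bigram(first, second)
        if pair is not None:
            counts[pair] += freq
    return counts
```

Under these assumptions, calling `merge_counts` on raw `(word1, word2, frequency)` rows merges case variants such as `("The", "house")` and `("the", "house")` into a single entry and drops pairs such as `("in", "2014")`.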

deWaC (German)

frWaC (French)

itWaC (Italian)

ukWaC (English)

Repubblica (Italian)

frequency_lists.txt · Last modified: 2014/03/27 10:45 by eros