Next revision | Previous revision |
frequency_lists [2008/02/20 15:59] – created eros | frequency_lists [2014/03/27 10:45] (current) – [Repubblica (Italian)] eros |
---|
===== Frequency lists ===== | ===== Frequency lists ===== |
| |
Frequency lists of unigrams extracted from the three corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in [[http://www.7-zip.org/|'.7z']] compressed format. | Frequency lists extracted from the WaCky corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in [[http://www.7-zip.org/|'.7z']] compressed format. |
| |
==== deWaC ==== | * Unigram lists. These are the complete lists: no post-processing was applied to them. |
| |
* {{sorted.de.lemma.unigrams.7z|deWaC unigrams}} (lemmas) | * Bigram lists. To produce cleaner, more usable lists, we post-processed the raw data in two steps: 1) we lowercased all words, and 2) we discarded every pair in which one or both word fields contained non-alphabetic characters other than dashes and apostrophes. Pairs in which one of the two fields was a single punctuation mark were kept. |
* {{sorted.de.word.unigrams.7z|deWaC unigrams}} (words) | |
| |
==== itWaC ==== | ==== deWaC (German) ==== |
| |
* {{sorted.it.lemma.unigrams.7z|itWaC unigrams}} (lemmas) | * {{:frequency_lists:sorted.de.lemma.unigrams.7z|deWaC unigrams}} (lemmas) |
* {{sorted.it.word.unigrams.7z|itWaC unigrams}} (words) | * {{:frequency_lists:sorted.de.word.unigrams.7z|deWaC unigrams}} (words) |
| * {{:frequency_lists:de.lemma.bigrams.7z|deWaC bigrams}} (lemmas) |
| * {{:frequency_lists:de.word.bigrams.7z|deWaC bigrams}} (words) |
| |
==== ukWaC ==== | ==== frWaC (French) ==== |
| |
| * {{:frequency_lists:sorted.fr.lemma.unigrams.7z|frWaC unigrams}} (lemmas) |
| * {{:frequency_lists:sorted.fr.word.unigrams.7z|frWaC unigrams}} (words) |
| |
| ==== itWaC (Italian) ==== |
| |
| * {{:frequency_lists:sorted.it.lemma.unigrams.7z|itWaC unigrams}} (lemmas) |
| * {{:frequency_lists:sorted.it.word.unigrams.7z|itWaC unigrams}} (words) |
| * {{:frequency_lists:it.lemma.bigrams.7z|itWaC bigrams}} (lemmas) |
| * {{:frequency_lists:it.word.bigrams.7z|itWaC bigrams}} (words) |
| |
| ==== ukWaC (English) ==== |
| |
| * {{:frequency_lists:sorted.uk.lemma.unigrams.7z|ukWaC unigrams}} (lemmas) |
| * {{:frequency_lists:sorted.uk.word.unigrams.7z|ukWaC unigrams}} (words) |
| * {{:frequency_lists:uk.lemma.bigrams.7z|ukWaC bigrams}} (lemmas) |
| * {{:frequency_lists:uk.word.bigrams.7z|ukWaC bigrams}} (words) |
| |
| ==== Repubblica (Italian) ==== |
| |
| * {{:frequency_lists:it.repubblica.lemma.unigrams.7z|Repubblica unigrams}} (lemmas) |
| * {{:frequency_lists:it.repubblica.word.unigrams.7z|Repubblica unigrams}} (words) |
| |
* {{sorted.uk.lemma.unigrams.7z|ukWaC unigrams}} (lemmas) | |
* {{sorted.uk.word.unigrams.7z|ukWaC unigrams}} (words) | |
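
The bigram post-processing described above (lowercasing, then discarding pairs with non-alphabetic fields) could be sketched roughly as follows. This is only an illustration, not the script actually used to build the lists. The function names, the ''(word1, word2, frequency)'' row layout, and the Unicode-category test for "punctuation mark" are all assumptions. The sketch also assumes that a pair made of two punctuation marks is discarded, and that counts of pairs merged by lowercasing are summed.

```python
import unicodedata
from collections import Counter


def is_wordlike(token):
    """True if the token has only letters, dashes and apostrophes, and at least one letter."""
    has_letter = False
    for ch in token:
        if ch.isalpha():
            has_letter = True
        elif ch not in "-'":
            return False
    return has_letter


def is_single_punct(token):
    """True if the token is exactly one character in a Unicode punctuation category (assumption)."""
    return len(token) == 1 and unicodedata.category(token).startswith("P")


def clean_bigram(w1, w2):
    """Return the lowercased pair if it passes the filter, otherwise None."""
    w1, w2 = w1.lower(), w2.lower()
    ok1, ok2 = is_wordlike(w1), is_wordlike(w2)
    if ok1 and ok2:
        return (w1, w2)
    # Keep pairs where exactly one field is a single punctuation mark
    # and the other field is an ordinary word (our reading of the rule).
    if ok1 and is_single_punct(w2):
        return (w1, w2)
    if ok2 and is_single_punct(w1):
        return (w1, w2)
    return None


def aggregate(rows):
    """Filter (w1, w2, freq) rows, sum the counts of merged pairs, sort by frequency."""
    counts = Counter()
    for w1, w2, freq in rows:
        cleaned = clean_bigram(w1, w2)
        if cleaned is not None:
            counts[cleaned] += freq
    return counts.most_common()
```

For example, ''aggregate([("The", "house", 3), ("the", "House", 2), ("win", "2008", 9)])'' merges the first two rows into a single ''("the", "house")'' entry with frequency 5 and drops the third, because ''2008'' contains digits.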