Differences
| Both sides previous revision Previous revision Next revision | Previous revision | ||
| frequency_lists [2008/02/20 16:03] – eros | frequency_lists [2014/03/27 10:45] (current) – [Repubblica (Italian)] eros | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ===== Frequency lists ===== | ===== Frequency lists ===== | ||
| - | Frequency lists of unigrams | + | Frequency lists extracted from the WaCky corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in [[http:// |
| + | |||
| + | * Unigram lists. These are the complete lists, i.e. we did not perform any post-processing on them. | ||
| + | |||
| + | * Bigram lists. To produce cleaner, more usable lists, we post-processed the raw data in two steps: 1) we lowercased all words, and 2) we discarded every pair in which one or both word fields contained non-alphabetical characters other than dashes and apostrophes. As an exception, we kept pairs in which one of the two fields was a single punctuation mark. | ||
| ==== deWaC (German) ==== | ==== deWaC (German) ==== | ||
| - | * {{sorted.de.lemma.unigrams.7z|deWaC unigrams}} (lemmas) | + | * {{: |
| - | * {{sorted.de.word.unigrams.7z|deWaC unigrams}} (words) | + | * {{: |
| + | * {{: | ||
| + | * {{: | ||
| + | |||
| + | ==== frWaC (French) ==== | ||
| + | |||
| + | * {{: | ||
| + | * {{: | ||
| ==== itWaC (Italian) ==== | ==== itWaC (Italian) ==== | ||
| - | * {{sorted.it.lemma.unigrams.7z|itWaC unigrams}} | + | * {{: |
| - | * {{sorted.it.word.unigrams.7z|itWaC unigrams}} (words) | + | * {{: |
| + | * {{: | ||
| + | * {{: | ||
| ==== ukWaC (English) ==== | ==== ukWaC (English) ==== | ||
| - | * {{sorted.uk.lemma.unigrams.7z|ukWaC unigrams}} (lemmas) | + | * {{: |
| - | * {{sorted.uk.word.unigrams.7z|ukWaC unigrams}} (words) | + | * {{: |
| + | * {{: | ||
| + | * {{: | ||
| + | |||
| + | ==== Repubblica (Italian) ==== | ||
| + | |||
| + | * {{: | ||
| + | * {{: | ||
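
The bigram post-processing described in the introduction can be sketched roughly as follows. This is an illustrative reconstruction, not the script actually used to build the lists: the function names (`is_wordlike`, `is_single_punct`, `clean_bigram`) are hypothetical, "punctuation mark" is assumed to mean a single ASCII punctuation character, and the punctuation exception is assumed to apply only when the other field is an ordinary word.

```python
import string

# Characters tolerated inside a word besides letters (per the description above).
ALLOWED_EXTRA = {"-", "'"}


def is_wordlike(token: str) -> bool:
    """True if the token is non-empty and made only of letters, dashes, apostrophes."""
    return bool(token) and all(ch.isalpha() or ch in ALLOWED_EXTRA for ch in token)


def is_single_punct(token: str) -> bool:
    """True if the token is exactly one ASCII punctuation character (assumption)."""
    return len(token) == 1 and token in string.punctuation


def clean_bigram(w1: str, w2: str):
    """Return the lowercased pair if it survives filtering, otherwise None."""
    # Step 1: lowercase both word fields.
    w1, w2 = w1.lower(), w2.lower()
    ok1, ok2 = is_wordlike(w1), is_wordlike(w2)
    # Step 2: keep pairs whose fields are both word-like.
    if ok1 and ok2:
        return (w1, w2)
    # Exception: keep pairs where one field is a single punctuation mark
    # and the other is word-like (assumed reading of the rule).
    if (ok1 and is_single_punct(w2)) or (is_single_punct(w1) and ok2):
        return (w1, w2)
    return None


if __name__ == "__main__":
    pairs = [("The", "Cat"), ("cat", ","), ("cat", "42"), ("l'acqua", "well-known")]
    for a, b in pairs:
        print((a, b), "->", clean_bigram(a, b))
```

Because `str.isalpha` is Unicode-aware, accented and language-specific letters (as found in the German, French and Italian corpora) count as alphabetical in this sketch.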