frequency_lists

Differences

This shows you the differences between two versions of the page.

frequency_lists [2008/02/20 16:03] eros
frequency_lists [2014/03/27 10:45] – [Repubblica (Italian)] eros
Line 1: Line 1:
 ===== Frequency lists =====
  
-Frequency lists of unigrams extracted from the three corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in [[http://www.7-zip.org/|'.7z']] compressed format.
+Frequency lists extracted from the WaCky corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in [[http://www.7-zip.org/|'.7z']] compressed format
 + 
 +  * Unigram lists. These are the complete lists, i.e. we did not perform any post-processing on them. 
 + 
 +  * Bigram lists. In order to produce cleaner (and more usable) lists, we post-processed the raw data in this way: 1) we lowercased all words, and 2) we discarded all pairs in which one or both word fields contained non-alphabetical characters, except for dashes and apostrophes (but we kept the pairs in which one of the two fields was a (single) punctuation mark).
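
The bigram post-processing described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script actually used to build the lists: the function names are hypothetical, "punctuation mark" is taken to mean a single ASCII punctuation character, and a pair with a punctuation field is assumed to be kept only when the other field is itself a clean word.

```python
import string

# Characters allowed in a word besides letters (per the description above).
ALLOWED_EXTRA = {"-", "'"}


def is_clean_word(token: str) -> bool:
    """True if the token is non-empty and contains only letters, dashes or apostrophes."""
    return bool(token) and all(ch.isalpha() or ch in ALLOWED_EXTRA for ch in token)


def is_single_punct(token: str) -> bool:
    """True if the token is exactly one ASCII punctuation character (assumption)."""
    return len(token) == 1 and token in string.punctuation


def filter_bigram(w1: str, w2: str):
    """Return the lowercased pair if it passes the filter, otherwise None."""
    # Step 1: lowercase both fields.
    w1, w2 = w1.lower(), w2.lower()
    clean1, clean2 = is_clean_word(w1), is_clean_word(w2)
    # Step 2: keep pairs whose fields are both alphabetical (dashes/apostrophes allowed).
    if clean1 and clean2:
        return (w1, w2)
    # Exception: keep pairs where one field is a single punctuation mark
    # (assumed here: and the other field is a clean word).
    if (clean1 and is_single_punct(w2)) or (clean2 and is_single_punct(w1)):
        return (w1, w2)
    return None


if __name__ == "__main__":
    for pair in [("The", "Cat"), ("Haus", ","), ("in", "2008"), ("foo", "...")]:
        print(pair, "->", filter_bigram(*pair))
```

Since `str.isalpha` accepts accented and umlauted letters, the same check should work for the German, French and Italian data without a per-language alphabet.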
  
 ==== deWaC (German) ====
  
-  * {{sorted.de.lemma.unigrams.7z|deWaC unigrams}} (lemmas)
-  * {{sorted.de.word.unigrams.7z|deWaC unigrams}} (words)
+  * {{:frequency_lists:sorted.de.lemma.unigrams.7z|deWaC unigrams}} (lemmas)
+  * {{:frequency_lists:sorted.de.word.unigrams.7z|deWaC unigrams}} (words)
 +  * {{:frequency_lists:de.lemma.bigrams.7z|deWaC bigrams}} (lemmas) 
 +  * {{:frequency_lists:de.word.bigrams.7z|deWaC bigrams}} (words) 
 + 
 +==== frWaC (French) ==== 
 + 
 +  * {{:frequency_lists:sorted.fr.lemma.unigrams.7z|frWaC unigrams}} (lemmas) 
 +  * {{:frequency_lists:sorted.fr.word.unigrams.7z|frWaC unigrams}} (words)
  
 ==== itWaC (Italian) ====
  
-  * {{sorted.it.lemma.unigrams.7z|itWaC unigrams}}  (lemmas)
-  * {{sorted.it.word.unigrams.7z|itWaC unigrams}} (words)
+  * {{:frequency_lists:sorted.it.lemma.unigrams.7z|itWaC unigrams}}  (lemmas)
+  * {{:frequency_lists:sorted.it.word.unigrams.7z|itWaC unigrams}} (words)
 +  * {{:frequency_lists:it.lemma.bigrams.7z|itWaC bigrams}} (lemmas) 
 +  * {{:frequency_lists:it.word.bigrams.7z|itWaC bigrams}} (words)
  
 ==== ukWaC (English) ====
  
-  * {{sorted.uk.lemma.unigrams.7z|ukWaC unigrams}} (lemmas)
-  * {{sorted.uk.word.unigrams.7z|ukWaC unigrams}} (words)
+  * {{:frequency_lists:sorted.uk.lemma.unigrams.7z|ukWaC unigrams}} (lemmas)
+  * {{:frequency_lists:sorted.uk.word.unigrams.7z|ukWaC unigrams}} (words)
 +  * {{:frequency_lists:uk.lemma.bigrams.7z|ukWaC bigrams}} (lemmas) 
 +  * {{:frequency_lists:uk.word.bigrams.7z|ukWaC bigrams}} (words) 
 + 
 +==== Repubblica (Italian) ==== 
 + 
 +   * {{:frequency_lists:it.repubblica.word.unigrams.7z|Repubblica unigrams}} (words) 
 +   * {{:frequency_lists:it.repubblica.lemma.unigrams.7z|Repubblica unigrams}} (lemmas)
  • frequency_lists.txt
  • Last modified: 2014/03/27 10:45
  • by eros