download

Differences

This shows you the differences between two versions of the page.

download [2008/02/19 17:49] adriano
download [2016/10/28 09:45] eros
Line 1: Line 1:
-====== Download ======
+===== Use the corpus directly (no download necessary) =====
  
+  * The WaCky corpora are available on our **official corpus repository** here: http://corpora.dipintra.it
  
-===== Corpora =====
+=== Other free web interfaces ===
  
-If you are interested in downloading and using our [[corpora]] (for free, we're not selling them) please [[people|contact us]].
+  * the Jožef Stefan Institute hosts a web interface where many of our corpora can be used directly for free: http://nl.ijs.si/noske/wacs.cgi/first_form
+  * the University of Lancaster hosts (among other corpora) ItWaC and a 50% sample of UkWaC (registration is required but the service is free): http://cqpweb.lancs.ac.uk/
+  * the Charles University in Prague also hosts DeWaC, FrWaC, ItWaC and UkWaC (here again registration is required but the service is free): http://korpus.cz/english/index.php
  
-===== Seed words and tuples =====
+===== Download =====
  
-The first step in the creation of our Corpora was coming up with lists of basic words and mid-frequency words collected from other corpora. We then randomly combined these words in pairs and sent each pair as a query to Google in order to obtain seed URLs (which you'll find below).
+**NB**: when you download the corpora, you need to use your own tools to consult them. If you don't know what this means, then you probably don't want to download them and should use an online tool instead (see the section "Other free web interfaces" above).
  
+  * if you are interested in **downloading** the corpora described [[corpora|here]] (for free, we're not selling them) please [[people|contact us]]
  
-==== deWaC ====
+  * the semantically and syntactically annotated Italian Wikipedia is available for direct download from here:
+    * [[http://medialab.di.unipi.it/Project/QA/wikiCoNLL.bz2|CoNLL format]] ([[http://medialab.di.unipi.it/wiki/Tanl_Tagsets|tagset]])
+    * [[http://medialab.di.unipi.it/Project/QA/wikiMT.bz2|MultiTag format]]
  
-  * {{dewac_seed_words.zip|deWaC seed words}}
-  * {{dewac_seed_pairs.zip|deWaC seed pairs}}
+===== Lists =====
  
-==== itWaC ====
-
-  * {{itwac_seed_words.zip|itWaC seed words}}
-  * {{itwac_seed_words_pairs.zip|itWaC seed pairs}}
-
-==== ukWaC ====
-
-  * {{ukwac_seed_words_bnc.zip|ukWaC seed words}} (collected from the BNC)
-  * {{ukwac_seed_word_pairs.zip|ukWaC seed pairs}}
-
-===== Seed URLs =====
-
-The URLs returned by our queries to Google for the word pairs above were used to initiate the crawls.
-
-  * {{dewac_seed_urls.zip|deWaC seed URLs}}
-  * {{itwac_seed_urls.zip|itWaC seed URLs}}
-  * {{ukwac_seed_urls.zip|ukWaC seed URLs}}
-
-
-
-
-
-
-===== Frequency lists =====
-
-Frequency lists of unigrams extracted from the three corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in [[http://www.7-zip.org/|'.7z']] compressed format.
-
-
-
-==== deWaC ====
-
-  * {{sorted.de.lemma.unigrams.7z|deWaC unigrams}} (lemmas)
-  * {{sorted.de.word.unigrams.7z|deWaC unigrams}} (words)
-
-==== itWaC ====
-
-  * {{sorted.it.lemma.unigrams.7z|itWaC unigrams}} (lemmas)
-  * {{sorted.it.word.unigrams.7z|itWaC unigrams}} (words)
-
-==== ukWaC ====
-
-  * {{sorted.uk.lemma.unigrams.7z|ukWaC unigrams}} (lemmas)
-  * {{sorted.uk.word.unigrams.7z|ukWaC unigrams}} (words)
-
-
-
-
-
-===== Keyword lists: ukWaC vs. the BNC =====
-
-These lists feature the words most typical of ukWaC when compared to the [[http://www.natcorp.ox.ac.uk/|British National Corpus]] and vice versa, based on the log-likelihood measure. Five distinct lists can be found for each corpus, i.e. lists of nouns, verbs, adjectives, //-ly// adverbs and function words.
-
-  * {{ukwac_vs._bnc_scored_lists.7z|Keyword list of ukWaC}} (most typical words of ukWaC vs. the BNC)
-  * {{bnc_vs._ukwac_scored_lists.7z|Keyword list of the BNC}} (most typical words of the BNC vs. ukWaC)
-
-
-
-
-
-
-===== Tools =====
-
-  * [[http://dev.sslmit.unibo.it/wac/post_processing.php|Post processing tools]]
+  * [[Seed words and tuples]]
+  * [[Seed URLs]]
+  * [[Frequency lists]]
+  * [[Keyword lists: ukWaC vs. the BNC]]
  • download.txt
  • Last modified: 2021/09/13 10:20
  • by eros