Differences
This shows you the differences between two versions of the page.
download [2008/02/19 17:49] – adriano | download [2021/09/13 10:20] (current) – [Use the corpus directly (no download necessary)] eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Download ====== | + | ===== Use the corpus directly (no download necessary) =====
+ | * The wacky corpora are available on our **official corpus repository** here: http:// | ||
- | ===== Corpora ===== | + | Other free web interfaces: |
- | If you are interested in downloading and using our [[corpora]] (for free, we're not selling them) please [[people|contact us]]. | + | * the Jožef Stefan Institute hosts a web interface where many of our corpora |
+ | * the University of Lancaster hosts (among other corpora) ItWaC and a 50% sample of UkWaC (registration is required but the service is free): http:// | ||
+ | * Charles University in Prague also hosts DeWaC, FrWaC, ItWaC and UkWaC (registration is again required, but the service is free): http:// | ||
- | ===== Seed words and tuples | + | ===== Download |
- | The first step in the creation of our Corpora was coming up with lists of basic words and mid-frequency words collected from other corpora. We then randomly combined these words in pairs and sent each pair as a query to Google in order to obtain seed URLs (which | + | **NB**: when you download |
+ | * if you are interested in **downloading** the corpora described [[corpora|here]] (for free, we're not selling them) please [[people|contact us]] | ||
- | ==== deWaC ==== | + | * the semantically and syntactically annotated Italian Wikipedia is available for direct download from here: |
+ | * [[http:// | ||
+ | * [[http:// | ||
- | * {{dewac_seed_words.zip|deWaC seed words}} | + | ===== Lists ===== |
- | * {{dewac_seed_pairs.zip|deWaC seed pairs}} | + | |
- | ==== itWaC ==== | + | |
- | + | * [[Seed URLs]] | |
- | | + | * [[Frequency lists]] |
- | * {{itwac_seed_words_pairs.zip|itWaC seed pairs}} | + | * [[Keyword lists: ukWaC vs. the BNC]] |
- | + | ||
- | ==== ukWaC ==== | + | |
- | + | ||
- | * {{ukwac_seed_words_bnc.zip|ukWaC seed words}} (collected from the BNC) | + | |
- | * {{ukwac_seed_word_pairs.zip|ukWaC seed pairs}} | + | |
- | + | ||
- | ===== Seed URLs ===== | + | |
- | + | ||
- | The URLs returned by our queries to Google for the word pairs above were used to initiate the crawls. | + | |
- | + | ||
- | * {{dewac_seed_urls.zip|deWaC seed URLs}} | + | |
- | * {{itwac_seed_urls.zip|itWaC seed URLs}} | + | |
- | * {{ukwac_seed_urls.zip|ukWaC seed URLs}} | + | |
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | ===== Frequency lists ===== | + | |
- | + | ||
- | Frequency lists of unigrams extracted from the three corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in [[http:// | + | |
- | + | ||
- | + | ||
- | + | ||
- | ==== deWaC ==== | + | |
- | + | ||
- | * {{sorted.de.lemma.unigrams.7z|deWaC unigrams}} (lemmas) | + | |
- | * {{sorted.de.word.unigrams.7z|deWaC unigrams}} (words) | + | |
- | + | ||
- | ==== itWaC ==== | + | |
- | + | ||
- | * {{sorted.it.lemma.unigrams.7z|itWaC unigrams}} (lemmas) | + | |
- | * {{sorted.it.word.unigrams.7z|itWaC unigrams}} (words) | + | |
- | + | ||
- | ==== ukWaC ==== | + | |
- | + | ||
- | * {{sorted.uk.lemma.unigrams.7z|ukWaC unigrams}} (lemmas) | + | |
- | * {{sorted.uk.word.unigrams.7z|ukWaC unigrams}} (words) | + | |
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | ===== Keyword lists: ukWaC vs. the BNC ===== | + | |
- | + | ||
- | These lists feature the words most typical of ukWaC when compared to the [[http:// | + | |
- | + | ||
- | * {{ukwac_vs._bnc_scored_lists.7z|Keyword list of ukWaC}} (most typical words of ukWaC vs. the BNC) | + | |
- | * {{bnc_vs._ukwac_scored_lists.7z|Keyword list of the BNC}} (most typical words of the BNC vs. ukWaC) | + | |
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | ===== Tools ===== | + | |
- | + | ||
- | * [[http:// | + |
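
The older revision describes the seeding step only in outline: words from the seed lists were randomly combined in pairs, and each pair was sent to a search engine as a query to obtain seed URLs. The sketch below illustrates the pairing step alone, under stated assumptions. The function names `make_seed_pairs` and `to_query`, the fixed random seed, and the choice to keep pairs unordered and unique are all hypothetical and are not taken from the original pipeline. No search engine is contacted.

```python
import random


def make_seed_pairs(words, n_pairs, seed=0):
    """Draw n_pairs distinct, unordered pairs of distinct words at random.

    Hypothetical helper: the source only says that words were
    "randomly combined in pairs"; uniqueness and ordering are assumptions.
    """
    rng = random.Random(seed)            # fixed seed makes the draw reproducible
    vocab = sorted(set(words))           # drop duplicate words, fix the order
    max_pairs = len(vocab) * (len(vocab) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError(
            f"only {max_pairs} distinct pairs can be built from {len(vocab)} words"
        )
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(vocab, 2)      # two different words, no replacement
        pairs.add(tuple(sorted((a, b)))) # store unordered, so (a, b) == (b, a)
    return sorted(pairs)


def to_query(pair):
    """Format a word pair as a plain two-word query string (assumed format)."""
    return " ".join(pair)


if __name__ == "__main__":
    seed_words = ["house", "water", "music", "garden"]  # illustrative words only
    for pair in make_seed_pairs(seed_words, 3, seed=1):
        print(to_query(pair))
```

Storing each pair in sorted order means the same two words are never queried twice in a different order, which keeps the query budget for distinct pairs. Whether the original procedure did this is not stated in the page.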