Differences
This shows you the differences between two versions of the page.
seed_words_and_tuples [2008/02/20 15:58] – created eros | seed_words_and_tuples [2008/02/20 16:02] (current) – eros | ||
---|---|---|---|
Line 3: | Line 3: | ||
The first step in creating our corpora was to compile lists of basic words and mid-frequency words collected from other corpora. We then randomly combined these words into pairs and sent each pair as a query to Google to obtain seed URLs. | The first step in creating our corpora was to compile lists of basic words and mid-frequency words collected from other corpora. We then randomly combined these words into pairs and sent each pair as a query to Google to obtain seed URLs. | ||
- | ==== deWaC ==== | + | ==== deWaC (German) ==== | ||
* {{dewac_seed_words.zip|deWaC seed words}} | * {{dewac_seed_words.zip|deWaC seed words}} | ||
* {{dewac_seed_pairs.zip|deWaC seed pairs}} | * {{dewac_seed_pairs.zip|deWaC seed pairs}} | ||
- | ==== itWaC ==== | + | ==== itWaC (Italian) ==== | ||
* {{itwac_seed_words.zip|itWaC seed words}} | * {{itwac_seed_words.zip|itWaC seed words}} | ||
* {{itwac_seed_words_pairs.zip|itWaC seed pairs}} | * {{itwac_seed_words_pairs.zip|itWaC seed pairs}} | ||
- | ==== ukWaC ==== | + | ==== ukWaC (English) ==== | ||
* {{ukwac_seed_words_bnc.zip|ukWaC seed words}} (collected from the BNC) | * {{ukwac_seed_words_bnc.zip|ukWaC seed words}} (collected from the BNC) | ||
* {{ukwac_seed_word_pairs.zip|ukWaC seed pairs}} | * {{ukwac_seed_word_pairs.zip|ukWaC seed pairs}} |
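
The pairing step described in the introductory paragraph (randomly combining seed words into pairs, each of which becomes one search query) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script actually used to build the corpora: the function name `make_seed_pairs`, the toy German word list, the fixed random seed, and the choice to treat pairs as unordered and non-repeating are all hypothetical. The step of sending each query to Google is deliberately left out.

```python
import random


def make_seed_pairs(words, n_pairs, seed=0):
    """Draw n_pairs distinct, unordered pairs of distinct words at random.

    Hypothetical helper: the original pairing procedure is not documented
    beyond "randomly combined these words in pairs".
    """
    vocab = sorted(set(words))  # drop duplicates, fix the order for reproducibility
    max_pairs = len(vocab) * (len(vocab) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError(f"only {max_pairs} distinct pairs are possible")

    rng = random.Random(seed)  # local generator, so results are repeatable
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(vocab, 2)          # two different words
        pairs.add(tuple(sorted((a, b))))     # unordered: (a, b) equals (b, a)
    return sorted(pairs)


# Toy word list (an assumption; the real lists are in the zip files above).
words = ["haus", "wasser", "zeit", "buch", "stadt"]

# Each pair becomes one two-word query string.
queries = [" ".join(pair) for pair in make_seed_pairs(words, 4)]
for query in queries:
    print(query)
```

Using a local `random.Random(seed)` instance keeps the sketch reproducible without touching the global random state; whether the original pairs were drawn with or without repetition is not stated on this page.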