Download
Corpora
If you are interested in downloading and using our corpora (they are free; we are not selling them), please contact us.
Seed words and tuples
The first step in building our corpora was to compile lists of basic words and mid-frequency words drawn from other corpora. We then randomly combined these words into pairs and submitted each pair as a query to Google in order to obtain seed URLs (which you'll find below).
deWaC
itWaC
ukWaC
- ukWaC seed words (collected from the BNC)
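The random pairing step described above can be sketched roughly as follows. This is only an illustration, not the script actually used to build the corpora: the function names `random_pairs` and `to_query`, the fixed random seed, and the example words are all assumptions made for the sketch.

```python
import random


def random_pairs(words, n_pairs, seed=None):
    """Return n_pairs distinct unordered pairs drawn at random from words.

    Hypothetical helper: the original pairing procedure is not documented
    in detail, so this is one plausible way to do it.
    """
    unique = sorted(set(words))
    max_pairs = len(unique) * (len(unique) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough words to build that many distinct pairs")

    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        # sample() picks two different words, so a word is never paired with itself
        first, second = rng.sample(unique, 2)
        # sort inside the pair so (a, b) and (b, a) count as the same pair
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


def to_query(pair):
    """Join a word pair into a plain two-word search query string."""
    return " ".join(pair)


if __name__ == "__main__":
    # Made-up seed words, for illustration only
    seed_words = ["house", "water", "garden", "school"]
    for pair in random_pairs(seed_words, 3, seed=42):
        print(to_query(pair))
```

Each printed line is a two-word query of the kind that would be sent to a search engine; how the queries were actually submitted and how many results were kept per query is not covered here.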
Seed URLs
The URLs returned by our queries to Google for the word pairs above were used to initiate the crawls.
Frequency lists
This section provides frequency lists of unigrams extracted from the three corpora. For each corpus there is a list of words and a list of lemmas, both sorted by frequency. All the lists are compressed in '.7z' format.
deWaC
- deWaC unigrams (lemmas)
- deWaC unigrams (words)
itWaC
- itWaC unigrams (lemmas)
- itWaC unigrams (words)
ukWaC
- ukWaC unigrams (lemmas)
- ukWaC unigrams (words)
Keyword lists: ukWaC vs. the BNC
These lists feature the words most typical of ukWaC when compared to the British National Corpus (BNC) and vice versa, based on the log-likelihood measure. Five distinct lists are available for each corpus: nouns, verbs, adjectives, -ly adverbs and function words.
- Keyword list of ukWaC (most typical words of ukWaC vs. the BNC)
- Keyword list of the BNC (most typical words of the BNC vs. ukWaC)
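As a rough illustration of how a log-likelihood keyness score can be computed for a single word, here is a minimal sketch. It assumes the common two-corpus formulation, in which observed frequencies are compared with the frequencies expected if the word were equally common in both corpora. The exact variant used for the lists above is not stated on this page, and the function name and the numbers in the example are made up.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness score for one word in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B
    size_a, size_b: total number of tokens in corpus A and corpus B
    Assumed formulation: 2 * sum(observed * ln(observed / expected)).
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Frequencies expected if the word had the same relative frequency in both corpora
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size

    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


if __name__ == "__main__":
    # Invented counts, for illustration only
    print(log_likelihood(freq_a=150, freq_b=30, size_a=1_000_000, size_b=1_000_000))
```

The score itself does not say which corpus the word is typical of; comparing the relative frequencies `freq_a / size_a` and `freq_b / size_b` gives the direction, and ranking words by score within each direction yields keyword lists of the kind offered above.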