Download


If you would like to download and use our corpora (they are free; we do not sell them), please contact us.

The first step in creating our corpora was to compile lists of basic words and mid-frequency words collected from other corpora. We then randomly combined these words into pairs and sent each pair as a query to Google in order to obtain seed URLs (which you will find below).

ukWaC seed words (collected from the BNC)
ukWaC seed pairs

The URLs returned by our queries to Google for the word pairs above were used to initiate the crawls.
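The random pairing step described above can be illustrated with a minimal sketch. It is not the script actually used to build the corpora: the function name `make_seed_pairs`, the fixed random seed and the tiny example word list are all assumptions made for illustration, and sending the pairs to a search engine is left out entirely.

```python
import random


def make_seed_pairs(words, n_pairs, seed=0):
    """Randomly combine words into unique, unordered pairs.

    Hypothetical helper for illustration only; the real seed lists
    were built from much larger word lists.
    """
    rng = random.Random(seed)          # fixed seed so the output is reproducible
    vocab = sorted(set(words))         # drop duplicate words
    max_pairs = len(vocab) * (len(vocab) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough distinct words for that many pairs")

    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(vocab, 2)    # two different words
        pairs.add(tuple(sorted((a, b))))  # treat (a, b) and (b, a) as the same pair
    return sorted(pairs)


if __name__ == "__main__":
    # Example word list is made up; it is not taken from the ukWaC seed words.
    for first, second in make_seed_pairs(["house", "water", "garden", "paper"], 3):
        print(first, second)           # each line would be one search query
```

Each printed line corresponds to one two-word query; the URLs returned for such queries are what the page calls seed URLs.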

Frequency lists of unigrams extracted from the three corpora are available below. Lists of words and lemmas are provided for each corpus, sorted by frequency. All the lists are compressed in '.7z' format.

deWaC unigrams (lemmas)
deWaC unigrams (words)

itWaC unigrams (lemmas)
itWaC unigrams (words)

ukWaC unigrams (lemmas)
ukWaC unigrams (words)
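As a rough illustration of what a frequency-sorted unigram list is, the sketch below counts tokens and orders them by descending frequency. The function name `unigram_frequencies` and the toy token sequence are assumptions; the actual file layout of the downloadable lists is not described on this page, so no parsing of those files is attempted here.

```python
from collections import Counter


def unigram_frequencies(tokens):
    """Return (unigram, count) pairs sorted by descending count.

    Ties are broken alphabetically so the order is deterministic.
    Hypothetical helper; not the tool used to produce the lists.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy input; a real list would be built from word forms or lemmas of a corpus.
    for unigram, count in unigram_frequencies(["the", "cat", "the", "dog"]):
        print(unigram, count)
```

Passing word forms gives a word list; passing lemmas instead gives the corresponding lemma list.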

The keyword lists below feature the words most typical of ukWaC when compared to the British National Corpus (BNC), and vice versa, based on the log-likelihood measure. Five distinct lists are provided for each corpus: nouns, verbs, adjectives, -ly adverbs and function words.

Keyword list of ukWaC (most typical words of ukWaC vs. the BNC)
Keyword list of the BNC (most typical words of the BNC vs. ukWaC)
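The page does not say which variant of the log-likelihood measure was used. The sketch below assumes the two-term form that is common in corpus comparison, where a word's observed frequency in each corpus is compared with the frequency expected if the word were equally likely in both. The function name `log_likelihood` and the counts in the example are assumptions for illustration only.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-term log-likelihood score for one word across two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in corpus A and corpus B.
    Assumed variant; the source does not specify the exact formula.
    """
    total_size = size_a + size_b
    total_freq = freq_a + freq_b
    # Expected counts if the word had the same relative frequency in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size

    score = 0.0
    if freq_a > 0:                      # a zero count contributes nothing
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2.0 * score


if __name__ == "__main__":
    # Made-up counts: a word seen 20 times in one corpus and 5 times in
    # another corpus of the same size.
    print(log_likelihood(20, 5, 1000, 1000))
```

The score itself does not indicate direction: a word is typical of the corpus in which its relative frequency is higher, so a keyword list for one corpus would keep only the words over-represented there and sort them by score.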

Post-processing tools
