Download

If you would like to download and use our corpora, please contact us. They are free of charge; we're not selling them.

The first step in creating our corpora was to compile lists of basic words and mid-frequency words drawn from other corpora. We then randomly combined these words into pairs and sent each pair as a query to Google to obtain seed URLs (which you'll find below).

The URLs that Google returned for these word pairs were used to initiate the crawls.
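
The pairing step described above can be sketched roughly as follows. This is only an illustration, not the script we actually ran: the function name `make_query_pairs`, the `seed` parameter and the sample words are assumptions made up for the example, and the step that sends each query to the search engine is left out.

```python
import random

def make_query_pairs(words, n_pairs, seed=None):
    """Build n_pairs two-word queries by randomly pairing entries of a word list.

    Hypothetical helper: it assumes the word list has no duplicates, so the
    two words in each query are always different.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    queries = []
    for _ in range(n_pairs):
        first, second = rng.sample(words, 2)  # two distinct list positions
        queries.append(f"{first} {second}")
    return queries

# Made-up sample words, not taken from the real lists
sample_words = ["house", "water", "green", "walk", "paper"]
for query in make_query_pairs(sample_words, 3, seed=1):
    print(query)
```

Each resulting string would then be submitted as one search query, and the URLs returned would be collected as seeds.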

Frequency lists of unigrams extracted from the three corpora are available. Lists of both words and lemmas are provided, sorted by frequency. All lists are compressed in '.7z' format.
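
For readers who want to build a comparable list from their own data, here is a minimal sketch of a unigram frequency list sorted by frequency. It assumes the text is already tokenised (or lemmatised); the function name `frequency_list` and the alphabetical tie-breaking are choices made for the example, not a description of how the distributed lists were produced.

```python
from collections import Counter

def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first.

    Ties are broken alphabetically so the output order is deterministic.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy input: a pre-tokenised sentence
tokens = "the cat saw the dog and the dog saw the cat".split()
for word, count in frequency_list(tokens):
    print(f"{count}\t{word}")
```

Running the same function over lemmas instead of word forms gives the lemma version of the list.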

These lists contain the words most typical of ukWaC when compared with the British National Corpus, and vice versa, as ranked by the log-likelihood measure. There are five separate lists for each corpus: nouns, verbs, adjectives, -ly adverbs and function words.
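
As a rough illustration of the comparison, the sketch below computes a log-likelihood score for one word from its frequency in two corpora. It uses the two-corpus formulation common in corpus linguistics (observed frequencies against the frequencies expected if the word were equally common in both corpora); the exact variant used for the lists above is not stated here, so treat this as an assumption. The function name and the figures in the example are invented.

```python
import math

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood (G2) score for a word seen freq_a times in corpus A
    (size_a tokens) and freq_b times in corpus B (size_b tokens).

    Assumed formulation: G2 = 2 * sum(observed * ln(observed / expected)).
    """
    total_size = size_a + size_b
    total_freq = freq_a + freq_b
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    score = 0.0
    # A zero observed frequency contributes nothing to the sum
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score

# Invented counts: a word occurring 150 times per million tokens in A, 40 in B
print(log_likelihood(150, 40, 1_000_000, 1_000_000))
```

The score itself is symmetric, so it does not say which corpus the word is typical of. To split the results into two lists, compare the relative frequencies `freq_a / size_a` and `freq_b / size_b` and assign the word to the corpus where it is proportionally more frequent.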

  • Last modified: 2008/02/19 17:49 by adriano