Corpora

The first resources we built are three very large corpora, comparable in size, sampling strategy and format. See the Publications section for further details on the construction procedure and an evaluation of the resources, and the Use the corpus directly (no download necessary) section for information on how to access them:

  • deWaC: a 1.7 billion word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger.
  • itWaC: a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and lemmatized using the Morph-it! lexicon.
  • ukWaC: a 2 billion word corpus constructed from the Web by limiting the crawl to the .uk domain and using medium-frequency words from the BNC as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger.
  • frWaC: on the way! We are currently post-processing data harvested in a crawl of pages in the .fr domain.
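
Since the corpora are POS-tagged and lemmatized, a short sketch of reading such annotation may help. It assumes the common TreeTagger-style layout of one token per line with tab-separated word, POS tag and lemma, and sentences wrapped in <s> and </s> lines; the actual layout of the distributed files is not described on this page and should be checked against the files themselves. The names Token and read_sentences are hypothetical, and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Tokens per <s>...</s> block.

    Assumes (not confirmed by this page) one token per line in the form
    word<TAB>pos<TAB>lemma, with markup lines such as <s> and </s>.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            # Structural markup line: only </s> closes a sentence.
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        sentence.append(Token(*parts))
    if sentence:
        yield sentence  # input ended without a closing </s>


# Made-up sample in the assumed format, for illustration only.
sample = [
    "<s>\n",
    "The\tDT\tthe\n",
    "corpora\tNNS\tcorpus\n",
    "are\tVBP\tbe\n",
    "large\tJJ\tlarge\n",
    "</s>\n",
]
sentences = list(read_sentences(sample))
lemmas = [token.lemma for token in sentences[0]]
```

Because read_sentences accepts any iterable of lines, an open file handle can be passed in directly, so a multi-billion-word corpus can be streamed sentence by sentence rather than loaded into memory.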

If you want your webpage to be removed from our corpora, please contact us.

  • Last modified: 2009/11/18 15:39
  • by adriano