Corpora

The resources below are several very large corpora, comparable in size, sampling strategy and format. See the Publications section for details on the construction procedure and an evaluation of the resources, and the "Use the corpus directly (no download necessary)" section for information on how to access them:

  • deWaC: a 1.7 billion word corpus constructed from the Web, limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger.
  • sdeWaC: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. sdeWaC comes in two versions: a POS-tagged and lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate).
  • frWaC: a 1.6 billion word corpus constructed from the Web, limiting the crawl to the .fr domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger.
  • itWaC: a 2 billion word corpus constructed from the Web, limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and lemmatized using the Morph-it! lexicon.
  • ukWaC: a 2 billion word corpus constructed from the Web, limiting the crawl to the .uk domain and using medium-frequency words from the BNC as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger.
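The POS-tagged and lemmatized corpora above are typically distributed in a "vertical" text format. As a rough illustration only (the exact column layout and structural markup of the released files may differ), here is a minimal Python sketch that assumes one tab-separated token per line (word, POS tag, lemma) with XML-like sentence tags on their own lines:

```python
# Sketch of reading a vertical corpus file, assuming tab-separated
# word/POS/lemma columns and <s>...</s> sentence markup.
# This layout is an assumption for illustration, not the official spec.

def read_sentences(lines):
    """Yield sentences as lists of (word, pos, lemma) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<"):              # structural markup, e.g. <s>, </s>
            if line.startswith("</s") and sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) >= 3:
            sentence.append((fields[0], fields[1], fields[2]))
    if sentence:                               # tolerate a missing final </s>
        yield sentence

# Hypothetical input fragment:
sample = [
    "<s>",
    "Dogs\tNNS\tdog",
    "bark\tVVP\tbark",
    ".\tSENT\t.",
    "</s>",
]
sents = list(read_sentences(sample))
```

In practice you would stream the (very large) corpus file line by line rather than holding it in memory, which is why the sketch is written as a generator.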

Other resources we created:

  • PukWaC: the same as ukWaC, but with a further layer of annotation added, namely a full dependency parse. The parsing was performed with the MaltParser.
  • WaCkypedia_EN: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the MaltParser). The texts were extracted from the dump and cleaned using the Wikipedia extractor.
  • Quite a few things going on at the moment, but none we can disclose ;-) Stay tuned.
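The dependency layer in PukWaC and WaCkypedia_EN adds per-token columns on top of the word/POS/lemma annotation. Purely as an illustration (the column order here is an assumption; check the released files before relying on it), a CoNLL-style line with token index, head index and dependency relation could be read like this:

```python
# Hypothetical sketch of reading a dependency-annotated token line,
# assuming six tab-separated columns:
# word, lemma, POS, token index, head index, dependency relation.
# The actual column order in the released corpora may differ.

from collections import namedtuple

Token = namedtuple("Token", "word lemma pos idx head deprel")

def parse_sentence(block):
    """Turn a list of tab-separated token lines into Token tuples."""
    tokens = []
    for line in block:
        word, lemma, pos, idx, head, rel = line.split("\t")[:6]
        tokens.append(Token(word, lemma, pos, int(idx), int(head), rel))
    return tokens

# Invented two-token example; head index 0 marks the sentence root.
block = [
    "Dogs\tdog\tNNS\t1\t2\tSBJ",
    "bark\tbark\tVVP\t2\t0\tROOT",
]
toks = parse_sentence(block)
root = [t for t in toks if t.head == 0][0]
```

Representing each token as a named tuple keeps the head/dependent links easy to follow when reconstructing the parse tree.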

If you want your webpage to be removed from our corpora, please contact us.

Last modified: 2011/12/01 16:02 by eros