Corpora

The resources below are large corpora built by downloading text from the web. See the Publications section for further details, and the "Use the corpus directly (no download necessary)" section for information on how to obtain them:

English

  • PukWaC: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the MaltParser.
  • ukWaC: a 2 billion word corpus constructed from the Web by limiting the crawl to the .uk domain and using medium-frequency words from the BNC as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger; more information is available here.
  • WaCkypedia_EN: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the MaltParser). The texts were extracted from the dump and cleaned using the Wikipedia extractor.

French

  • frWaC: a 1.6 billion word corpus constructed from the Web by limiting the crawl to the .fr domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger; more information is available here.

German

  • deWaC: a 1.7 billion word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger; more information is available here.
  • SdeWaC: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged and lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate). More information on SdeWaC.

Italian

  • itWaC: a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and lemmatized using the Morph-it! lexicon; more information is available here.
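
As a rough illustration of working with the POS-tagged and lemmatized corpora listed above, the following Python sketch reads sentences from a tagged file. The file layout is an assumption, not something this page specifies: one token per line with tab-separated word, POS tag and lemma columns, and sentences closed by a `</s>` line. The names `Token` and `read_sentences` and the sample data are hypothetical and for illustration only; check the documentation of the specific corpus for its actual format.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Token.

    ASSUMED layout (not confirmed by this page): one token per line as
    word<TAB>pos<TAB>lemma, with a "</s>" line closing each sentence.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "</s>":
            # End of sentence: emit what has been collected so far.
            if sentence:
                yield sentence
            sentence = []
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            # Blank lines and markup lines (e.g. <s>, <text ...>) have no
            # tab-separated columns and are skipped.
            continue
        sentence.append(Token(fields[0], fields[1], fields[2]))
    if sentence:
        # Emit a trailing sentence that was not closed by "</s>".
        yield sentence


# Hypothetical sample data in the assumed layout.
sample = [
    '<text id="hypothetical-doc-1">',
    "<s>",
    "The\tDT\tthe",
    "corpora\tNNS\tcorpus",
    "are\tVBP\tbe",
    "large\tJJ\tlarge",
    "</s>",
    "</text>",
]

sentences = list(read_sentences(sample))
lemmas = [tok.lemma for tok in sentences[0]]
print(lemmas)
```

The same function can be fed an open text file handle instead of a list, so a multi-gigabyte corpus file is streamed line by line rather than loaded into memory. The character encoding to pass when opening the file depends on the corpus and is not stated on this page.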

If you want your webpage to be removed from our corpora, please contact us.

  • corpora.1363167709.txt.gz
  • Last modified: 2013/03/13 10:41
  • by eros