The first resources we built are three very large corpora, comparable in terms of size, sampling strategy and format. See the publications section for further details on the construction procedure and an evaluation of the resources, and the download section for information on how to get them:
deWaC: a 1.7 billion word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger.
itWaC: a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger and lemmatized using the Morph-it! lexicon.
ukWaC: a 2 billion word corpus constructed from the Web by limiting the crawl to the .uk domain and using medium-frequency words from the BNC as seeds. The corpus was POS-tagged and lemmatized with the TreeTagger.
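As a rough illustration of the two ideas shared by the three corpora (seeding from medium-frequency words and restricting the crawl to one top-level domain), here is a minimal Python sketch. It is not the pipeline actually used to build the corpora: the function names, the rank cutoffs and the toy data are all assumptions made for the example, and the real procedure is described in the publications.

```python
from collections import Counter
from urllib.parse import urlparse


def medium_frequency_words(tokens, low_rank=1000, high_rank=5000):
    """Return words whose frequency rank lies in [low_rank, high_rank).

    The rank cutoffs are hypothetical defaults, not the values used
    for deWaC, itWaC or ukWaC.
    """
    counts = Counter(tokens)
    # most_common() sorts by descending frequency.
    ranked = [word for word, _ in counts.most_common()]
    return ranked[low_rank:high_rank]


def in_domain(url, tld):
    """Return True if the URL's host name ends in the given TLD (e.g. '.uk')."""
    host = urlparse(url).hostname or ""
    return host.endswith("." + tld.lstrip("."))


if __name__ == "__main__":
    # Toy reference corpus: 'a' is the most frequent word, 'e' the rarest.
    toy_tokens = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
    print(medium_frequency_words(toy_tokens, low_rank=1, high_rank=3))
    print(in_domain("http://www.example.co.uk/page", ".uk"))
    print(in_domain("http://uk.example.com/", ".uk"))
```

Skipping the top ranks leaves out function words that appear on almost every page, while the upper cutoff leaves out rare words that return few pages; where exactly to place the two cutoffs is a tuning decision.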
We've recently added two new resources:
PukWaC: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the MaltParser.
WaCkypedia_EN: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information as well as a full dependency parse (parsing performed with the MaltParser).
If you want your webpage to be removed from our corpora, please contact us.