Differences
This shows you the differences between two versions of the page.
corpora [2010/04/11 15:55] – adriano | corpora [2013/10/11 10:34] – eros | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Corpora ====== | ====== Corpora ====== | ||
- | The resources below are four very large corpora, comparable in terms of size, sampling strategy and format. See the [[publications]] section for further details | + | The resources below are large corpora |
- | * **deWaC**: a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http:// | + | ===== English ===== |
- | * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus | + | * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing |
- | * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http://sslmit.unibo.it/repubblica|Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http:// | + | * **ukWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged |
- | * **ukWaC**: a 2 billion word corpus constructed from the Web limiting | + | * **WaCkypedia_EN**: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, |
- | Other resources we created: | + | ===== French ===== |
- | * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing | + | * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus |
- | | + | ===== German ===== |
+ | |||
+ | | ||
+ | * **SdeWaC**: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged and lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate). {{: | ||
+ | ===== Italian ===== | ||
- | ===== Work in progress ===== | + | * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http:// |
- | * Quite a few things going on at the moment, but none we can disclose ;-) Stay tuned. | + | * semantically and syntactically annotated **Italian Wikipedia**: |
+ | * [[http:// | ||
+ | * [[http:// | ||
===== Opt-out ===== | ===== Opt-out ===== | ||
If you want your webpage to be removed from our corpora, please [[people|contact us]]. | If you want your webpage to be removed from our corpora, please [[people|contact us]]. |