Differences
This shows you the differences between two versions of the page.
corpora [2013/03/20 12:25] – [English] eros | corpora [2013/10/11 10:34] – eros | ||
---|---|---|---|
Line 17: | Line 17: | ||
===== German ===== | ===== German ===== | ||
- | * **deWaC**: a 1.7 billion word corpus constructed from the Web by limiting the crawl to the **.de** domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http:// | + | * **deWaC**: a 1.7 billion word corpus constructed from the Web by limiting the crawl to the **.de** domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http:// |
* **SdeWaC**: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, either POS-tagged and lemmatized or in a one-sentence-per-line format, each supplemented with metadata (e.g. parse error rate). {{: | * **SdeWaC**: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, either POS-tagged and lemmatized or in a one-sentence-per-line format, each supplemented with metadata (e.g. parse error rate). {{: | ||
Line 23: | Line 23: | ||
===== Italian ===== | ===== Italian ===== | ||
- | * **itWaC**: a 2 billion word corpus constructed from the Web by limiting the crawl to the **.it** domain and using medium-frequency words from the [[http:// | + | * **itWaC**: a 2 billion word corpus constructed from the Web by limiting the crawl to the **.it** domain and using medium-frequency words from the [[http:// |
* semantically and syntactically annotated **Italian Wikipedia**: | * semantically and syntactically annotated **Italian Wikipedia**: |