Differences
This shows you the differences between two versions of the page.
| corpora [2013/03/13 10:41] – eros | corpora [2013/11/04 15:37] (current) – [English] eros | ||
|---|---|---|---|
| Line 3: | Line 3: | ||
| The resources below are large corpora built by downloading text from the web. See the [[publications]] section for further details, and the [[download]] section for information on how to get them: | The resources below are large corpora built by downloading text from the web. See the [[publications]] section for further details, and the [[download]] section for information on how to get them: | ||
| - | === English === | + | ===== English |
| - | * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the [[http:// | + | * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the [[http:// |
| - | * **ukWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http:// | + | * **ukWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http:// |
| * **WaCkypedia_EN**: | * **WaCkypedia_EN**: | ||
| - | + | ===== French | |
| - | === French === | + | |
| * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http:// | * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http:// | ||
| - | === German === | + | ===== German |
| - | * **deWaC**: a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http:// | + | * **deWaC**: a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http:// |
| * **SdeWaC**: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged / lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate). {{: | * **SdeWaC**: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged / lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate). {{: | ||
| - | === Italian === | + | ===== Italian |
| - | * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http:// | + | * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http:// |
| + | * semantically and syntactically annotated **Italian Wikipedia**: | ||
| + | * [[http:// | ||
| + | * [[http:// | ||
| ===== Opt-out ===== | ===== Opt-out ===== | ||
| If you want your webpage to be removed from our corpora, please [[people|contact us]]. | If you want your webpage to be removed from our corpora, please [[people|contact us]]. | ||