corpora

Differences

This shows you the differences between two versions of the page.

corpora [2011/12/01 16:02] – [Corpora] eros
corpora [2013/11/04 15:37] (current) – [English] eros
Line 1:
 ====== Corpora ======
  
-The resources below are four very large corpora, comparable in terms of size, sampling strategy and format. See the [[publications]] section for further details on the construction procedure and an evaluation of the resources, and the [[download]] section for information on how to get them:
+The resources below are large corpora built by downloading text from the web. See the [[publications]] section for further details, and the [[download]] section for information on how to get them:
  
-  * **deWaC**: a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the SudDeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]].
+===== English =====
  
-  * **sdewac** an 0.88 billion word corpus derived from deWaC, duplicate sentences and some noise have been removed. The corpus has Unicode encoding. SdeWaC comes in two versions, in POS-tagged / lemmatized version or as one sentence per line format, each supplemented with metadata (e.g. parse error rate).
+  * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the [[http://maltparser.org/|MaltParser]]. Some useful information about the dependency relations used in PukWaC can be found on pp. 6 and 7 of {{:papers:conll-syntax.pdf|this article}}.
  
-  * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]].
+  * **ukWaC**: a billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available {{:tagsets:ukwac_tagset.txt|here}}; more information can be found in this {{:papers:wacky_2008.pdf|paper}}.
  
-  * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http://sslmit.unibo.it/repubblica|Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]], and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon.
+  * **WaCkypedia_EN**: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the [[http://maltparser.org/|MaltParser]]). The texts were extracted from the dump and cleaned using the [[http://medialab.di.unipi.it/wiki/Wikipedia_extractor|Wikipedia extractor]].
 +===== French =====
  
-  * **ukWaC**: a billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]].
+  * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. More information is available {{:papers:wacky_2008.pdf|here}}.
  
-Other resources we created:
+===== German =====
  
-  * **PukWaC**: the same as ukWaC, but with further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the [[http://maltparser.org/|MaltParser]].
+  * **deWaC**: a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the SudDeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/stts_guide.pdf|tagset]]. More information is available {{:papers:wacky_2008.pdf|here}}.
-
-  * **WaCkypedia_EN**: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the [[http://maltparser.org/|MaltParser]]). The texts were extracted from the dump and cleaned using the [[http://medialab.di.unipi.it/wiki/Wikipedia_extractor|Wikipedia extractor]].
  
 +  * **SdeWaC**: a 0.88 billion word corpus derived from deWaC; duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged / lemmatized version and a one-sentence-per-line format, each supplemented with metadata (e.g. parse error rate). {{:papers:sdewac-description.pdf|More information on sdewac}}.
  
 +===== Italian =====
  
-===== Work in progress =====
+  * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http://sslmit.unibo.it/repubblica|Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[http://sslmit.unibo.it/~baroni/collocazioni/itwac.tagset.txt|tagset]], and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon. More information is available {{:papers:wacky_2008.pdf|here}}.
  
-  * Quite a few things going on at the moment, but none we can disclose ;-) Stay tuned.
+  * semantically and syntactically annotated **Italian Wikipedia**:
 +    * [[http://medialab.di.unipi.it/Project/QA/wikiCoNLL.bz2|CoNLL format]] ([[http://medialab.di.unipi.it/wiki/Tanl_Tagsets|tagset]]) 
 +    * [[http://medialab.di.unipi.it/Project/QA/wikiMT.bz2|MultiTag format]]
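
This page does not describe the layout of the CoNLL-format Italian Wikipedia release listed above. The following is only a rough Python sketch, assuming the file follows the ten-column, tab-separated CoNLL-X layout with a blank line between sentences; that layout is an assumption, not something this page confirms. The function name ''read_conll'' and the sample tokens and tags are hypothetical and serve only as illustration.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X layout. Assumption: the corpus file uses
# this layout; check the actual data before relying on it.
CONLL_X_FIELDS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        # zip() stops at the shorter sequence, so rows with fewer
        # columns than expected still produce a partial token.
        sentence.append(dict(zip(CONLL_X_FIELDS, columns)))
    if sentence:
        # Flush the last sentence if the input has no trailing blank line.
        yield sentence


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the actual corpus.
    sample = [
        "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n",
        "2\tgatto\tgatto\tS\tS\t_\t3\tsubj\t_\t_\n",
        "3\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_\n",
        "\n",
        "1\tPiove\tpiovere\tV\tV\t_\t0\tROOT\t_\t_\n",
    ]
    for tokens in read_conll(sample):
        print([(t["form"], t["lemma"], t["deprel"]) for t in tokens])
```

To read a compressed download directly, the standard-library call ''bz2.open(path, "rt", encoding="utf-8")'' returns a line iterator that can be passed to ''read_conll''; the UTF-8 encoding is likewise an assumption.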
  
 ===== Opt-out =====
  
 If you want your webpage to be removed from our corpora, please [[people|contact us]].
  • corpora.txt
  • Last modified: 2013/11/04 15:37
  • by eros