corpora

Differences

This shows you the differences between two versions of the page.

corpora [2010/04/11 15:55]
adriano
corpora [2013/10/11 10:34]
eros
Line 1: Line 1:
 ====== Corpora ======
  
-The resources below are four very large corpora, comparable in terms of size, sampling strategy and format. See the [[publications]] section for further details on the construction procedure and an evaluation of the resources, and the [[download]] section for information on how to get them:
+The resources below are large corpora built by downloading text from the web. See the [[publications]] section for further details, and the [[download]] section for information on how to get them:
  
-  * **deWaC**: a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]].
+===== English =====
  
-  * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]].
+  * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the [[http://maltparser.org/|MaltParser]].
  
-  * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http://sslmit.unibo.it/repubblica|Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]], and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon.
+  * **ukWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available {{:tagsets:ukwac_tagset.txt|here}}; more information can be found in this {{:papers:wacky_2008.pdf|paper}}.
  
-  * **ukWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]].
+  * **WaCkypedia_EN**: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the [[http://maltparser.org/|MaltParser]]). The texts were extracted from the dump and cleaned using the [[http://medialab.di.unipi.it/wiki/Wikipedia_extractor|Wikipedia extractor]].
  
-Other resources we created:
+===== French =====
  
-  * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the [[http://maltparser.org/|MaltParser]].
+  * **frWaC**: a 1.6 billion word corpus constructed from the Web limiting the crawl to the **.fr** domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]; more information is available {{:papers:wacky_2008.pdf|here}}.
  
-  * **WaCkypedia_EN**: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the [[http://maltparser.org/|MaltParser]]). The texts were extracted from the dump and cleaned using the [[http://medialab.di.unipi.it/wiki/Wikipedia_extractor|Wikipedia extractor]].
+===== German =====
 + 
 +  * **deWaC**: a 1.7 billion word corpus constructed from the Web limiting the crawl to the **.de** domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/stts_guide.pdf|tagset]]; more information is available {{:papers:wacky_2008.pdf|here}}.
  
 +  * **SdeWaC**: a 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged and lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate). {{:papers:sdewac-description.pdf|More information on SdeWaC}}.
  
 +===== Italian =====
  
-===== Work in progress =====
+  * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http://sslmit.unibo.it/repubblica|Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[http://sslmit.unibo.it/~baroni/collocazioni/itwac.tagset.txt|tagset]], and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon; more information is available {{:papers:wacky_2008.pdf|here}}.
  
-  * Quite a few things going on at the moment, but none we can disclose ;-) Stay tuned.
+  * semantically and syntactically annotated **Italian Wikipedia**:
 +    * [[http://medialab.di.unipi.it/Project/QA/wikiCoNLL.bz2|CoNLL format]] ([[http://medialab.di.unipi.it/wiki/Tanl_Tagsets|tagset]]) 
 +    * [[http://medialab.di.unipi.it/Project/QA/wikiMT.bz2|MultiTag format]]
  
 ===== Opt-out =====
  
 If you want your webpage to be removed from our corpora, please [[people|contact us]].
  • corpora.txt
  • Last modified: 2013/11/04 15:37
  • by eros