===== English =====
  
  * **PukWaC**: the same as ukWaC, but with a further layer of annotation: a full dependency parse. The parsing was performed with the [[http://maltparser.org/|MaltParser]]. Some useful information about the dependency relations used in PukWaC can be found on pp. 6 and 7 of {{:papers:conll-syntax.pdf|this article}}. A minimal sketch of reading this format appears after the list below.
  
  * **ukWaC**: a 2 billion word corpus constructed from the Web, limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available {{:tagsets:ukwac_tagset.txt|here}}; more information can be found in this {{:papers:wacky_2008.pdf|paper}}.
  
  * **WaCkypedia_EN**: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the [[http://maltparser.org/|MaltParser]]). The texts were extracted from the dump and cleaned using the [[http://medialab.di.unipi.it/wiki/Wikipedia_extractor|Wikipedia extractor]].
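
For anyone processing the downloaded files, here is a minimal reading sketch. It assumes the usual WaCky vertical layout: one token per line with tab-separated word, POS, and lemma fields, to which PukWaC and WaCkypedia_EN add a token index, head index, and dependency relation, with sentences delimited by ''<s>'' ... ''</s>'' markers. The filename in the usage comment is a placeholder.

<code python>
from collections import namedtuple

Token = namedtuple("Token", "word pos lemma index head deprel")

def read_sentences(path):
    """Yield one sentence (a list of Token tuples) at a time."""
    sentence = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("<"):            # <text>, <s>, </s> markup lines
                if line.startswith("</s") and sentence:
                    yield sentence
                    sentence = []
                continue
            fields = line.split("\t")
            if len(fields) >= 6:                # PukWaC/WaCkypedia_EN: dependency columns
                word, pos, lemma, idx, head, deprel = fields[:6]
                sentence.append(Token(word, pos, lemma, int(idx), int(head), deprel))
            elif len(fields) >= 3:              # ukWaC: word/POS/lemma only
                word, pos, lemma = fields[:3]
                sentence.append(Token(word, pos, lemma, None, None, None))
    if sentence:                                # sentence left open at end of file
        yield sentence

# Example: print each noun together with its syntactic head.
# Head indices are 1-based within the sentence; 0 marks the root, so it is skipped.
# for sent in read_sentences("pukwac.001.txt"):
#     for tok in sent:
#         if tok.pos.startswith("NN") and tok.head:
#             print(tok.word, "<-" + tok.deprel + "-", sent[tok.head - 1].word)
#     break
</code>
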
===== French =====
  