===== English =====

  * **PukWaC**: the same as ukWaC, but with a further layer of annotation added, namely a full dependency parse. The parsing was performed with [[http://maltparser.org/|MaltParser]]. Useful information about the dependency relations used in PukWaC can be found on pp. 6 and 7 of {{:papers:conll-syntax.pdf|this article}}; a minimal sketch of how the annotated format can be read appears after this list.

  * **ukWaC**: a 2 billion word corpus constructed from the Web by limiting the crawl to the **.uk** domain and using medium-frequency words from the [[http://www.natcorp.ox.ac.uk/|BNC]] as seeds. The corpus was POS-tagged and lemmatized with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]]. The tagset is available {{:tagsets:ukwac_tagset.txt|here}}; more information can be found in this {{:papers:wacky_2008.pdf|paper}}.

  * **WaCkypedia_EN**: a 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information as well as a full dependency parse (performed with [[http://maltparser.org/|MaltParser]]). The texts were extracted from the dump and cleaned using the [[http://medialab.di.unipi.it/wiki/Wikipedia_extractor|Wikipedia extractor]].

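The corpora above are distributed as vertical text: one token per line with tab-separated annotation fields, inside XML-like <text> and <s> markers. As a rough illustration, here is a minimal Python sketch for reading a PukWaC-style file. The exact column order (word, POS, lemma, token index, head index, dependency relation) and the file name are assumptions made for the example; verify them against the documentation linked above before relying on them.

<code python>
# Minimal sketch for reading a PukWaC-style vertical file.
# ASSUMED column layout: word, POS, lemma, token index, head index,
# dependency relation -- check the linked article for the real format.

def read_pukwac(path):
    """Yield one sentence at a time as a list of token tuples."""
    sentence = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line.startswith("<"):        # <text>, <s>, </s> markers
                if line == "</s>" and sentence:
                    yield sentence
                    sentence = []
                continue
            fields = line.split("\t")
            if len(fields) == 6:
                word, pos, lemma, idx, head, rel = fields
                sentence.append((word, pos, lemma, int(idx), int(head), rel))

# Usage: print (word, dependency relation, head index) for the first sentence.
for sent in read_pukwac("pukwac_sample.txt"):   # hypothetical file name
    for word, pos, lemma, idx, head, rel in sent:
        print(word, rel, head)
    break
</code>

For ukWaC, which lacks the dependency columns, the same loop applies with three tab-separated fields (word, POS, lemma) instead of six.
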
===== French =====
| |