corpora

Differences

This shows you the differences between two versions of the page.

corpora [2013/03/20 12:25] – [English] eros
corpora [2013/10/11 09:13] eros
Line 23:
 ===== Italian =====

-  * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http://sslmit.unibo.it/repubblica|Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]], and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon, more information available {{:papers:wacky_2008.pdf|here}}.
+  * **itWaC**: a 2 billion word corpus constructed from the Web limiting the crawl to the **.it** domain and using medium-frequency words from the [[http://sslmit.unibo.it/repubblica|Repubblica]] corpus and basic Italian vocabulary lists as seeds. The corpus was POS-tagged with the [[http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/|TreeTagger]] using this [[http://sslmit.unibo.it/~baroni/collocazioni/itwac.tagset.txt|tagset]], and lemmatized using the [[http://sslmit.unibo.it/morphit|Morph-it!]] lexicon, more information available {{:papers:wacky_2008.pdf|here}}.

   * semantically and syntactically annotated **Italian Wikipedia**:
  • corpora.txt
  • Last modified: 2013/11/04 15:37
  • by eros