Use the corpus directly (no download necessary) [WaCKy]



The Jožef Stefan Institute hosts a web interface where many of our corpora can be queried directly, free of charge: http://nl.ijs.si/noske/wacs.cgi/first_form
The University of Lancaster hosts ItWaC and a sample of UkWaC (registration is required, but the service is free): http://cqpweb.lancs.ac.uk/

If you are interested in downloading and using the corpora described here, please contact us. They are free; we are not selling them.

The semantically and syntactically annotated Italian Wikipedia is available for direct download in two formats:
- CoNLL format (tagset)
- MultiTag format

This is an incomplete list of tools used to build corpora from the web:

BootCaT toolkit – bootstrap specialized corpora and terms from the web

Shared ngram collector – Perl script useful for near-duplicate detection
Onion – a tool for removing duplicate parts from large collections of texts
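
To give a rough idea of what n-gram-based near-duplicate detection involves, here is a minimal Python sketch. It is not the algorithm used by the Shared ngram collector or by Onion. The function names (`word_ngrams`, `shared_ngram_ratio`, `near_duplicates`), the 5-word n-gram size and the 0.5 threshold are all assumptions made for illustration.

```python
def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(text_a, text_b, n=5):
    """Share of n-grams in common, relative to the smaller of the two n-gram sets."""
    grams_a, grams_b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0  # texts shorter than n words have no n-grams to compare
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


def near_duplicates(docs, n=5, threshold=0.5):
    """Return pairs of document ids whose shared n-gram ratio reaches the threshold.

    `docs` maps a document id to its text. The threshold is an arbitrary
    illustrative value, not one taken from any of the tools listed here.
    """
    ids = sorted(docs)
    pairs = []
    for i, id_a in enumerate(ids):
        for id_b in ids[i + 1:]:
            if shared_ngram_ratio(docs[id_a], docs[id_b], n) >= threshold:
                pairs.append((id_a, id_b))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the old mill",
    "c": "completely different text about corpus linguistics and web crawling methods today",
}
print(near_duplicates(docs))
```

The pairwise comparison is quadratic in the number of documents, so it only suits small collections. Tools built for web-scale corpora typically hash and index n-grams instead.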

jusText – a tool for removing boilerplate content
WebContentExtractor – a tool for content extraction from web pages for building web corpora
PotaModule – a Perl module for “boilerplate” stripping and other forms of HTML document filtering and extraction, available in the BootCaT toolkit (listed above)

Last modified: 2013/03/14 10:26
by eros