Corpora availability
Free web interfaces
- the Jožef Stefan Institute hosts a web interface where many of our corpora can be used directly for free: http://nl.ijs.si/noske/wacs.cgi/first_form
- the University of Lancaster hosts ItWaC and a sample of UkWaC (registration is required but the service is free): http://cqpweb.lancs.ac.uk/
Download
- if you are interested in downloading and using the corpora described here (for free; we're not selling them), please contact us.
- the semantically and syntactically annotated Italian Wikipedia is available for direct download from here:
Lists
Corpus building tools
- BootCaT toolkit – bootstrap specialized corpora and terms from the web
- Shared n-gram collector – Perl script useful for near-duplicate detection
- NB: if you're looking for the PotaModule (a Perl module for “boilerplate” stripping and other forms of HTML document filtering and extraction), download the BootCaT toolkit (see link above).