  • If you are interested in downloading and using the corpora described here (they are free; we are not selling them), please contact us.
  • BootCaT toolkit – bootstrap specialized corpora and terms from the web
  • Shared ngram collector – Perl script useful for near-duplicate detection
  • NB: if you're looking for the PotaModule (a Perl module intended to perform “boilerplate” stripping and other forms of HTML document filtering and extraction), download the BootCaT toolkit (see link above).
  • Last modified: 2013/03/13 10:57 by eros