

  • If you are interested in downloading and using the corpora described here (they are free; we are not selling them), please contact us.

This is an incomplete list of tools used to build corpora from the web:

  • Shared ngram collector – a Perl script useful for near-duplicate detection
  • Onion – a tool for removing duplicate parts from large collections of texts
  • jusText – a tool for removing boilerplate content
  • WebContentExtractor – a tool for content extraction from web pages for building web corpora
  • PotaModule – a Perl module intended to perform “boilerplate” stripping and other forms of HTML document filtering and extraction; it is available in the BootCaT toolkit (see link above)
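
To illustrate what near-duplicate detection based on shared n-grams can look like, here is a minimal Python sketch. It is not the code of the Perl script or of Onion listed above; the function names, the whitespace tokenisation, the n-gram size and the threshold are all assumptions chosen for the example.

```python
def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text (naive whitespace tokenisation)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(a, b, n=5):
    """Share of n-grams the two texts have in common, relative to the
    text with fewer distinct n-grams. Returns 0.0 if either text is
    too short to yield any n-gram."""
    grams_a = word_ngrams(a, n)
    grams_b = word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


def is_near_duplicate(a, b, n=5, threshold=0.5):
    """Flag a pair as near-duplicates when the shared ratio reaches the
    threshold (0.5 is an arbitrary illustrative value)."""
    return shared_ngram_ratio(a, b, n) >= threshold
```

Comparing every pair of documents this way scales quadratically with the size of the collection, so a real web corpus would need some form of indexing or hashing of the n-grams; the sketch only shows the comparison step.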
  • Last modified: 2013/03/14 10:26
  • by eros