This is an incomplete list of tools you can use to build corpora from the web.

  • Shared ngram collector – Perl script useful for near-duplicate detection
  • Onion – a tool for removing duplicate parts from large collections of texts
  • jusText – a tool for removing boilerplate content
  • WebContentExtractor – a tool for content extraction from web pages for building web corpora
  • PotaModule – a Perl module for “boilerplate” stripping and other forms of HTML document filtering and extraction, available in the BootCaT toolkit (see link above)
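
Several of the tools above detect near-duplicate texts by looking at the n-grams that two documents share. The general idea can be sketched in Python as follows. This is only an illustration, not the algorithm used by the Shared ngram collector or Onion: the function names, the whitespace tokenisation, the 5-gram size, and the 0.5 threshold are all assumptions chosen for the example.

```python
def word_ngrams(text, n=5):
    """Return the set of word n-grams (as tuples) found in text.

    Tokenisation is a plain lowercase whitespace split (an assumption
    for this sketch; real tools may tokenise differently).
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(a, b, n=5):
    """Jaccard overlap of the two texts' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        # A text shorter than n tokens has no n-grams to compare.
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_near_duplicate(a, b, n=5, threshold=0.5):
    """Flag a pair as near-duplicate when the overlap reaches threshold.

    The threshold value is hypothetical and would need tuning per corpus.
    """
    return shared_ngram_ratio(a, b, n) >= threshold
```

Comparing every pair of documents this way is quadratic in the number of documents, so a sketch like this is suitable only for small collections.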

Last modified: 2013/03/14 10:34 by eros