This is an incomplete list of tools you can use to build corpora from the web.

Complete pipelines

  • BootCaT – bootstrap specialized corpora and terms from the web


Duplicate removal

  • Shared ngram collector – a Perl script useful for near-duplicate detection
  • Onion – a tool for removing duplicate parts from large collections of texts
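
As a rough illustration of how shared n-grams can flag near-duplicates, here is a minimal Python sketch. It is not the algorithm used by the Shared ngram collector or by Onion; the function names, the 5-word n-gram size and the 0.5 threshold are all assumptions chosen for the example.

```python
def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(a, b, n=5):
    """Share of n-grams common to both texts, relative to the smaller n-gram set."""
    grams_a = word_ngrams(a, n)
    grams_b = word_ngrams(b, n)
    if not grams_a or not grams_b:
        # Texts shorter than n words have no n-grams to compare.
        return 0.0
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


def drop_near_duplicates(docs, n=5, threshold=0.5):
    """Keep a document only if it stays below the threshold against every kept one."""
    kept = []
    for doc in docs:
        if all(shared_ngram_ratio(doc, k, n) < threshold for k in kept):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "the quick brown fox jumps over the lazy dog near the old mill",
        "corpora built from the web need cleaning before any linguistic analysis",
    ]
    for doc in drop_near_duplicates(docs):
        print(doc)
```

This pairwise comparison is quadratic in the number of documents, so it only suits small collections. Tools built for large web corpora use more scalable strategies.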

Boilerplate removal

  • jusText – a tool for removing boilerplate content
  • Web Content Extractor by Nikola Ljubešić – a tool for extracting content from web pages
  • PotaModule – a Perl module for stripping boilerplate and for other forms of HTML document filtering and extraction, available in the BootCaT toolkit (see above)
tools.txt · Last modified: 2016/02/25 15:20 by eros