Tools
This is an incomplete list of tools used to build corpora from the web.
Complete pipelines
- BootCaT toolkit – bootstrap specialized corpora and terms from the web
De-duplication
- Shared ngram collector – Perl script useful for near-duplicate detection
- Onion – a tool for removing duplicate parts from large collections of texts
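To illustrate the general idea behind n-gram based near-duplicate detection, here is a minimal Python sketch. It is not the algorithm used by the Perl script or by Onion. It assumes whitespace tokenization, word 3-grams, and Jaccard overlap as the similarity measure, and the function names `word_ngrams` and `shared_ngram_ratio` are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(text_a, text_b, n=3):
    """Jaccard overlap of the two n-gram sets: shared n-grams / all distinct n-grams."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        # A text shorter than n tokens has no n-grams to compare.
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
print(shared_ngram_ratio(doc_a, doc_b))
```

Document pairs whose ratio exceeds a chosen threshold could be treated as near-duplicates. Both the threshold and the n-gram length are tuning choices that depend on the corpus.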
Boilerplate removal
- jusText – a tool for removing boilerplate content
- WebContentExtractor – a tool for content extraction from web pages for building web corpora
- PotaModule – a Perl module for boilerplate stripping and other forms of HTML document filtering and extraction; available in the BootCaT toolkit (see Complete pipelines above)