This shows you the differences between two versions of the page.
tools [2013/03/14 10:34] – eros | tools [2013/03/15 09:14] – [Boilerplate removal] eros |
---|
===== Complete pipelines ===== | ===== Complete pipelines ===== |
| |
* [[http://bootcat.sslmit.unibo.it/|BootCaT toolkit]] -- bootstrap specialized corpora and terms from the web | * [[http://bootcat.sslmit.unibo.it/|BootCaT]] -- bootstrap specialized corpora and terms from the web |
| |
===== De-duplication ===== | ===== De-duplication ===== |
| |
* [[http://code.google.com/p/justext/|jusText]] -- a tool for removing boilerplate content | * [[http://code.google.com/p/justext/|jusText]] -- a tool for removing boilerplate content |
* [[http://www.nljubesic.net/resources/tools/webcontentextractor/|WebContentExtractor]] -- a tool for content extraction from web pages for building web corpora | * [[http://www.nljubesic.net/resources/tools/webcontentextractor/|WebContentExtractor]] -- a tool for extracting content from web pages |
* the **PotaModule**, a Perl module for "boilerplate" stripping and other forms of HTML document filtering and extraction, is available in the BootCaT toolkit (see link above). | * the **PotaModule**, a Perl module for "boilerplate" stripping and other forms of HTML document filtering and extraction, is available in the BootCaT toolkit (see link above). |
| |