tools [2016/02/25 15:20] (current) -- [Boilerplate removal] eros (page created [2013/03/14 10:32] by eros)

====== Tools ======

This is an **incomplete** list of tools you can use to build corpora from the web.

===== Complete pipelines =====

  * [[http://bootcat.sslmit.unibo.it/|BootCaT]] -- bootstrap specialized corpora and terms from the web

===== De-duplication =====

  * {{:shared_ngram_collector.tgz|Shared ngram collector}} -- Perl script useful for near-duplicate detection
  * [[http://code.google.com/p/onion/|Onion]] -- a tool for removing duplicate parts from large collections of texts.

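To give a rough idea of how shared n-grams can flag near-duplicates, here is a minimal Python sketch. It is not the Perl script or Onion listed above, and it does not claim to reproduce their algorithms. The function names, the word-level tokenisation, the Jaccard overlap measure, the n-gram length and the threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def word_ngrams(text, n=5):
    """Return the set of word n-grams (as tuples) found in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(a, b, n=5):
    """Jaccard overlap of the two n-gram sets (0.0 if both sets are empty)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def near_duplicates(docs, n=5, threshold=0.5):
    """Return pairs of document ids whose n-gram overlap reaches the threshold.

    `docs` maps a document id to its plain text. All pairs are compared,
    so this is only practical for small collections.
    """
    pairs = []
    for id_a, id_b in combinations(sorted(docs), 2):
        if shared_ngram_ratio(docs[id_a], docs[id_b], n) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river side",
    "c": "completely different text about building corpora from the web with tools",
}
print(near_duplicates(docs, n=3, threshold=0.5))
```

Comparing every pair of documents grows quadratically with the size of the collection, so a real web corpus would need an index of n-grams or a hashing scheme instead of this brute-force loop.
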
===== Boilerplate removal =====

  * [[http://code.google.com/p/justext/|jusText]] -- a tool for removing boilerplate content
  * [[http://metashare.elda.org/repository/browse/web-content-extractor/9e14ee4a663d11e28a985ef2e4e6c59e51a55e76bd4b47f39338db609624ff54/|Web Content Extractor]] by Nikola Ljubešić -- a tool for extracting content from web pages
  * the **PotaModule** (a Perl module that is intended to perform "boilerplate" stripping and other forms of HTML document filtering and extraction) is available in the BootCaT toolkit (see link above).
