tools [2013/03/14 10:32] – [Tools] eros | tools [2013/03/20 09:40] – [Tools] eros |
---|
===== Tools ===== | ====== Tools ====== |
| |
This is an incomplete list of tools you can use to build corpora from the web. | This is an **incomplete** list of tools you can use to build corpora from the web. |
| |
==== Complete pipelines ==== | ===== Complete pipelines ===== |
| |
* [[http://bootcat.sslmit.unibo.it/|BootCaT toolkit]] -- bootstrap specialized corpora and terms from the web | * [[http://bootcat.sslmit.unibo.it/|BootCaT]] -- bootstrap specialized corpora and terms from the web |
| |
==== De-duplication ==== | ===== De-duplication ===== |
| |
* {{:shared_ngram_collector.tgz|Shared ngram collector}} -- Perl script useful for near-duplicate detection | * {{:shared_ngram_collector.tgz|Shared ngram collector}} -- Perl script useful for near-duplicate detection |
* [[http://code.google.com/p/onion/|Onion]] -- a tool for removing duplicate parts from large collections of texts. | * [[http://code.google.com/p/onion/|Onion]] -- a tool for removing duplicate parts from large collections of texts. |
| |
==== Boilerplate removal ==== | ===== Boilerplate removal ===== |
| |
* [[http://code.google.com/p/justext/|jusText]] -- a tool for removing boilerplate content | * [[http://code.google.com/p/justext/|jusText]] -- a tool for removing boilerplate content |
* [[http://www.nljubesic.net/resources/tools/webcontentextractor/|WebContentExtractor]] -- a tool for content extraction from web pages for building web corpora | * [[http://www.nljubesic.net/resources/tools/webcontentextractor/|WebContentExtractor]] -- a tool for extracting content from web pages |
* **PotaModule** -- a Perl module for "boilerplate" stripping and other forms of HTML document filtering and extraction, available in the BootCaT toolkit (see link above) | * **PotaModule** -- a Perl module for "boilerplate" stripping and other forms of HTML document filtering and extraction, available in the BootCaT toolkit (see link above) |
| |