tools

Differences

This shows you the differences between two versions of the page.

tools [2013/03/14 10:32] – [Tools] eros
tools [2016/02/25 15:20] (current) – [Boilerplate removal] eros
Line 1: Line 1:
-===== Tools =====
+====== Tools ======
  
-This is an incomplete list of tools you can use to build corpora from the web.
+This is an **incomplete** list of tools you can use to build corpora from the web.
  
-==== Complete pipelines ====
+===== Complete pipelines =====
  
-  * [[http://bootcat.sslmit.unibo.it/|BootCaT toolkit]] -- bootstrap specialized corpora and terms from the web
+  * [[http://bootcat.sslmit.unibo.it/|BootCaT]] -- bootstrap specialized corpora and terms from the web
  
-==== De-duplication ====
+===== De-duplication =====
  
   * {{:shared_ngram_collector.tgz|Shared ngram collector}} -- Perl script useful for near-duplicate detection
   * [[http://code.google.com/p/onion/|Onion]] -- a tool for removing duplicate parts from large collections of texts.
  
-==== Boilerplate removal ====
+===== Boilerplate removal =====
  
   * [[http://code.google.com/p/justext/|jusText]] -- a tool for removing boilerplate content
-  * [[http://www.nljubesic.net/resources/tools/webcontentextractor/|WebContentExtractor]] -- a tool for content extraction from web pages for building web corpora
+  * [[http://metashare.elda.org/repository/browse/web-content-extractor/9e14ee4a663d11e28a985ef2e4e6c59e51a55e76bd4b47f39338db609624ff54/|Web Content Extractor]] by Nikola Ljubešić -- a tool for extracting content from web pages
   * the **PotaModule** (a Perl module that is intended to perform "boilerplate" stripping and other forms of HTML document filtering and extraction) is available in the BootCaT toolkit (see link above).
- 
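
The De-duplication entries above mention near-duplicate detection based on shared n-grams. As a rough illustration of the general idea only (this is not the Perl script's or Onion's actual algorithm; the function names, the word-trigram choice, and the 0.5 threshold are all assumptions made for this sketch), two documents can be compared by the overlap of their word n-gram sets:

```python
from itertools import combinations


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngram_ratio(a, b, n=3):
    """Jaccard overlap of the two documents' n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # a text shorter than n words has no n-grams to share
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def near_duplicates(docs, n=3, threshold=0.5):
    """Return index pairs (i, j), i < j, whose overlap meets the threshold.

    The threshold is a hypothetical default, not a value from any listed tool.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if shared_ngram_ratio(docs[i], docs[j], n) >= threshold
    ]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "completely different text about web corpora",
]
print(near_duplicates(docs))
```

The pairwise comparison is quadratic in the number of documents, so it only suits small collections; tools built for large web corpora generally rely on more scalable indexing schemes.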
  • tools.1363253553.txt.gz
  • Last modified: 2013/03/14 10:32
  • by eros