tools

tools [2013/03/14 10:32] – [Tools] eros
tools [2013/03/14 10:34] – [Complete pipelines] eros
Line 1:
-===== Tools =====
+====== Tools ======

 This is an incomplete list of tools you can use to build corpora from the web.

-==== Complete pipelines ====
+===== Complete pipelines =====

-  * [[http://bootcat.sslmit.unibo.it/|BootCaT toolkit]] -- bootstrap specialized corpora and terms from the web
+  * [[http://bootcat.sslmit.unibo.it/|BootCaT]] -- bootstrap specialized corpora and terms from the web

-==== De-duplication ====
+===== De-duplication =====

   * {{:shared_ngram_collector.tgz|Shared ngram collector}} -- Perl script useful for near-duplicate detection
   * [[http://code.google.com/p/onion/|Onion]] -- a tool for removing duplicate parts from large collections of texts.

-==== Boilerplate removal ====
+===== Boilerplate removal =====

   * [[http://code.google.com/p/justext/|jusText]] -- a tool for removing boilerplate content
  • tools.txt
  • Last modified: 2016/02/25 15:20
  • by eros