WaCKy

Corpora

Anonymous (anonymous@undisclosed.example.com) — 2013-11-04T14:37:43+00:00

The resources below are large corpora built by downloading text from the web. See the Publications section for further details, and the "Use the corpus directly (no download necessary)" section for information on how to get them:

English

* PukWaC: the same as ukWaC, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the

Use the corpus directly (no download necessary)

Anonymous (anonymous@undisclosed.example.com) — 2021-09-13T08:20:06+00:00

* The WaCky corpora are available on our official corpus repository here:

Other free web interfaces:

* The Jožef Stefan Institute hosts a web interface where many of our corpora can be used directly for free:

Frequency lists

Anonymous (anonymous@undisclosed.example.com) — 2014-03-27T09:45:38+00:00

These frequency lists were extracted from the WaCky corpora. Lists of words and lemmas are provided, sorted by frequency. All the lists are in '.7z' compressed format.

* Unigram lists. These are the complete lists, i.e. we did not perform any post-processing on them.
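As a rough illustration of what a unigram list sorted by frequency looks like, here is a minimal Python sketch. It is not the script used to produce the WaCky lists: the function name `unigram_frequency_list`, the whitespace tokenization and the sample sentence are all assumptions made for the example.

```python
from collections import Counter


def unigram_frequency_list(tokens):
    """Return (word, count) pairs, most frequent first.

    Ties are broken alphabetically so the output is deterministic.
    No post-processing (lowercasing, filtering) is applied.
    """
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical sample text; a real corpus would be read from disk.
    sample_tokens = "the cat sat on the mat and the dog sat too".split()
    for word, count in unigram_frequency_list(sample_tokens):
        print(f"{count}\t{word}")
```

The same function could be applied to lemmas instead of word forms, provided the input tokens have already been lemmatized.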

LiMiNe

Anonymous (anonymous@undisclosed.example.com) — 2009-02-03T11:24:39+00:00

The LiMiNe (Linguistic Mining of the Net) project intends to set up a European network of linguists and Information Technology specialists for the development of methodologies and resources for the assembly, annotation and indexing of several language-specific subsets of the web to be used in the analysis and teaching of (general and specialised) languages.

People

Anonymous (anonymous@undisclosed.example.com) — 2013-03-20T09:02:12+00:00

Contacts

Contact the Wacky Bunch at

Members

Name, current affiliation:

* Giuseppe Attardi, University of Pisa
* Marco Baroni, University of Trento
* Silvia Bernardini, University of Bologna (Forlì)
* Gabriele “Bilo” Carioli, University of Bologna (Forlì)

Publications

Anonymous (anonymous@undisclosed.example.com) — 2022-07-11T08:05:00+00:00

* M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3): 209-226 ([PDF]).
* M. Baroni and S. Bernardini (eds.). 2006.

Seed URLs

Anonymous (anonymous@undisclosed.example.com) — 2008-02-20T14:58:55+00:00

The URLs returned by our queries to Google for the word pairs above were used to initiate the crawls.

* [deWaC seed URLs]
* [itWaC seed URLs]
* [ukWaC seed URLs]

Seed words and tuples

Anonymous (anonymous@undisclosed.example.com) — 2008-02-20T15:02:28+00:00

The first step in the creation of our corpora was coming up with lists of basic words and mid-frequency words collected from other corpora. We then randomly combined these words in pairs and sent each pair as a query to Google in order to obtain seed URLs.
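The pairing step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the WaCky crawls: the function name `make_seed_pairs`, the fixed random seed and the sample word list are hypothetical, and the sketch stops at building query strings rather than contacting any search engine.

```python
import random


def make_seed_pairs(words, n_pairs, seed=0):
    """Randomly combine distinct words into n_pairs unique unordered pairs."""
    vocab = sorted(set(words))
    max_pairs = len(vocab) * (len(vocab) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError(f"only {max_pairs} distinct pairs are possible")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pairs = set()
    while len(pairs) < n_pairs:
        first, second = rng.sample(vocab, 2)  # two different words
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical word list; the real lists were collected from other corpora.
    seed_words = ["house", "water", "school", "music", "garden", "letter"]
    for pair in make_seed_pairs(seed_words, n_pairs=5):
        print(" ".join(pair))  # one query string per pair
```

Storing each pair in sorted order means that "water house" and "house water" count as the same query, which avoids sending duplicates.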

WaCky - The Web-As-Corpus Kool Yinitiative

Anonymous (anonymous@undisclosed.example.com) — 2022-12-05T10:57:49+00:00

Welcome to WaCky! We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, and index and search them.

Tools

Anonymous (anonymous@undisclosed.example.com) — 2016-02-25T14:20:37+00:00

This is an incomplete list of tools you can use to build corpora from the web.

Complete pipelines

* BootCaT -- bootstrap specialized corpora and terms from the web

De-duplication

* [Shared ngram collector] -- Perl script useful for near-duplicate detection
* Onion -- a tool for removing duplicate parts from large collections of texts.
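To give a feel for how shared n-grams can signal near-duplicate documents, here is a minimal Python sketch. It does not reproduce the Perl script or Onion listed above; it uses a generic technique (Jaccard overlap of word n-gram sets), and the function names, the n-gram size of 3 and the sample sentences are assumptions chosen for illustration.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(text_a, text_b, n=3):
    """Jaccard overlap of the two n-gram sets: 0.0 (disjoint) to 1.0 (identical)."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    if not grams_a or not grams_b:
        return 0.0  # a text shorter than n words has no n-grams to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumps over the lazy cat"
    print(ngram_overlap(doc_a, doc_b))
```

A pair of documents whose overlap exceeds some chosen threshold could then be flagged as near-duplicates; the right threshold depends on the collection and is not specified here.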