WaCky - The Web-As-Corpus Kool Yinitiative

Welcome to WaCky!

We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that allow linguists to crawl a section of the web, process the data, and then index and search the results.

We try to keep everything very laid-back and flexible (minimal constraint on data representation, programming language, etc.) to make it easier for people with different backgrounds and goals to use our resources and/or contribute to the project.

We have built a few corpora that you can download or use directly. The procedure we followed to create our first corpora (DeWaC, UkWaC and ItWaC) is described in detail in the paper:

M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43 (3): 209-226.

There, we also present a qualitative evaluation of our resources. Please cite this article when using our corpora, or any of the materials we are making available in the Download/Use section.

The project (including this website) is currently sponsored by the LiMiNe project.
