We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, and index and search it.

We try to keep everything very laid-back and flexible (minimal constraint on data representation, programming language, etc.) to make it easier for people with different backgrounds and goals to use our resources and/or contribute to the project.

We built a few corpora you can download, and in the near future we'll have a web interface for direct online use of the corpora. While we wait for that (and the documentation), we have described in great detail the procedure we followed to create our corpora in the paper:

M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43 (3): 209-226. (PDF).

There, we also present a qualitative evaluation of our resources. Please cite this article when using our corpora, or any of the materials we are making available in the download section.

NEW: we've recently finished dependency-parsing ukWaC (with the MaltParser). This new version of the corpus is now available for download.

The project (including this website) is currently being sponsored by the LiMiNe project.

