The LiMiNe (Linguistic Mining of the Net) project aims to set up a European network of linguists and Information Technology specialists to develop methodologies and resources for assembling, annotating and indexing several language-specific subsets of the web, for use in the analysis and teaching of (general and specialised) languages.
LiMiNe builds on the work done by the WaCKy community and is funded by the University of Bologna strategic research programme (2006, Canale Giovani).
From the 1970s to the 1990s, linguists constructed and queried language corpora to draw generalisations about a variety of aspects of several languages (most notably English) and to put these generalisations (and/or the corpora themselves) to use in a number of practical applications (e.g. language teaching, self-access language learning, natural language processing, machine translation, lexicography — see e.g. McEnery & Wilson 2001, Manning & Schütze 1999). This body of work has contributed to reshaping the descriptive, methodological and theoretical bases of language studies, and has had crucial knock-on effects on a number of sister disciplines.
In the late 1990s, with the advent of the World Wide Web, a number of researchers worldwide started to investigate the potential and the limits of this resource as an alternative to “traditional” corpora. Advocates of the web as “the corpus of the new millennium” have pointed to the immense quantity and variety of language data readily available, to the potential for studying new genres (e.g. blogs, forums, emails), and to the possibility of observing language change as it takes place (see e.g. Kilgarriff & Grefenstette 2003, Bergh & Zanchetta in press). Opponents have objected that the scientific study of language requires a more finely-tuned model, allowing control over sources and replicability (hence stability; e.g. Sinclair 2004).
While the debate is still open at the theoretical level, a number of researchers have started to investigate the issue empirically. The results provided by corpora have been compared with those provided by the web for the same language phenomena or NLP tasks (e.g., Turney 2001, Keller & Lapata 2003); the usability of current indexing and search engines for linguistic purposes has been questioned (e.g., Lüdeling et al. in press); and the legal implications and technical prerequisites for using the web as a corpus have been discussed (e.g., Fletcher 2004, Bernardini et al. 2006). Within this body of work, approaches vary substantially: from using Google counts to derive data about the use of certain words or structures, to adding (simple) concordance-like features to search engines; from downloading quick-and-dirty specialised corpora for documentation purposes, to building register-controlled corpora and/or evaluating the language of the web (see, e.g., the papers in Baroni & Bernardini 2006). While these efforts have provided interesting tools and new insights, none so far has succeeded in making available to the scientific community a tool that combines the advantages of the web in terms of size and variety with the advantages of corpora in terms of stability, control and annotation. Nor have they made available a resource that allows one to study the (Italian, English, Japanese, etc.) web itself, i.e. to find out what kinds of texts and textual genres are available, and to study the characteristics of the more innovative among these.
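To illustrate the first of these approaches — deriving usage data from search-engine hit counts, as in Turney (2001) — the sketch below estimates the pointwise mutual information (PMI) of two word pairs from page counts. The counts and the index size are invented for illustration; a real study would query a live search engine rather than a hard-coded table.

```python
import math

# Hypothetical page-hit counts standing in for search-engine results;
# the figures below are invented, not real engine counts.
TOTAL_PAGES = 1_000_000_000  # assumed size of the engine's index
hits = {
    "strong": 20_000_000,
    "powerful": 15_000_000,
    "tea": 5_000_000,
    ("strong", "tea"): 200_000,    # pages containing both words
    ("powerful", "tea"): 10_000,
}

def pmi(w1, w2, counts, total):
    """Pointwise mutual information estimated from page counts:
    PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    p_joint = counts[(w1, w2)] / total
    p1 = counts[w1] / total
    p2 = counts[w2] / total
    return math.log2(p_joint / (p1 * p2))

# With these counts, the conventional collocation "strong tea"
# scores higher than the unidiomatic "powerful tea".
print(pmi("strong", "tea", hits, TOTAL_PAGES))
print(pmi("powerful", "tea", hits, TOTAL_PAGES))
```

The instability that critics point to is visible here: the same query repeated against a live engine returns different counts over time, so the PMI values are not replicable — which is precisely the gap between web counts and a stable, annotated corpus.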
The present project aims to fill this gap by building on work already completed in this area by members of the research group, and by taking full advantage of our central position within an internationally renowned community of researchers (linguists, IT specialists and cognitive scientists) interested in contributing their interdisciplinary expertise to the project.
In the medium term we foresee the following milestones: