=======================================
MOST TYPICAL WORDS OF THE BNC vs. ukWaC
=======================================

The wordlists in this archive contain the words (divided into nouns, verbs, adjectives, -ly adverbs and function words) that are most characteristic of the BNC when compared to ukWaC, according to the log-likelihood measure.
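
As a rough illustration of how a log-likelihood score can be computed for a single word, the sketch below uses the two-term log-likelihood (G2) statistic commonly applied to frequency comparisons between two corpora. This archive does not state which exact variant was used, so the formula is an assumption, and the function and parameter names are hypothetical.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-term G2 score for one word (assumed variant, see text above).

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in corpus A and corpus B.
    """
    combined = freq_a + freq_b
    total = size_a + size_b
    # Frequencies expected if the word were equally common in both corpora.
    expected_a = size_a * combined / total
    expected_b = size_b * combined / total
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2.0 * score
```

The score is unsigned: it is 0 when the word has the same relative frequency in both corpora and grows as the frequencies diverge. To decide which corpus a word is characteristic of, the relative frequencies (freq_a / size_a and freq_b / size_b) also have to be compared.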

The lists were created from POS-tagged and lemmatized versions of the two corpora. POS-tagging and lemmatization were performed with the TreeTagger. To reduce noise in the lists, all words containing non-alphabetic characters (such as the at sign, slashes or word-interior hyphens) were discarded. All results were lowercased and sorted by their log-likelihood score.
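
The filtering, lowercasing and sorting steps described above can be sketched as follows. This is a minimal illustration rather than the script actually used: the function name and the (lemma, score) input format are hypothetical, and descending order (highest score first) is an assumption.

```python
def build_wordlist(scored_lemmas):
    """Filter, lowercase and sort (lemma, score) pairs.

    scored_lemmas: iterable of (lemma, log_likelihood_score) tuples
    (hypothetical input format).
    """
    # str.isalpha() is False for any string containing digits, hyphens,
    # slashes, the at sign or other non-letter characters, and for "".
    kept = [
        (lemma.lower(), score)
        for lemma, score in scored_lemmas
        if lemma.isalpha()
    ]
    # Assumed ordering: most characteristic words (highest score) first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


example = [
    ("Parliament", 12.5),
    ("e-mail", 40.0),
    ("user@host", 99.0),
    ("whilst", 30.1),
]
wordlist = build_wordlist(example)
```

With the made-up example scores above, "e-mail" and "user@host" are discarded because they contain non-alphabetic characters, and the remaining lemmas are returned lowercased with "whilst" ahead of "parliament".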

For further details on the wordlist creation procedure, see: Ferraresi, A. (2007) Building a very large corpus of English obtained by Web crawling: ukWaC [Available online at: http://wacky.sslmit.unibo.it/doku.php?id=publications].