The first step in building our corpora was to compile lists of basic words and mid-frequency words drawn from other corpora. We then combined these words at random into pairs and submitted each pair as a query to Google to obtain seed URLs.
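The random pairing step might look roughly like the sketch below. It is an illustration only, not the script used for these corpora: the function name `make_query_pairs`, the example word list, the fixed random seed, and the choice to discard duplicate pairs are all assumptions. The search engine step is left out; each returned string would simply be submitted as one query.

```python
import random


def make_query_pairs(words, n_pairs, seed=0):
    """Return n_pairs distinct two-word queries drawn at random from words.

    Assumption: a pair counts as a duplicate regardless of word order,
    so "house water" and "water house" are treated as the same query.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    unique = sorted(set(words))
    if len(unique) < 2:
        raise ValueError("need at least two distinct words")

    max_pairs = len(unique) * (len(unique) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough words for that many distinct pairs")

    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(unique, 2)  # two different words, no replacement
        pairs.add(tuple(sorted((a, b))))

    # One query string per pair, e.g. "garden window"
    return [" ".join(pair) for pair in sorted(pairs)]


# Hypothetical word list; the real lists came from other corpora.
example_words = ["house", "water", "garden", "letter", "window", "music"]
queries = make_query_pairs(example_words, n_pairs=5, seed=42)
for query in queries:
    print(query)
```

Sampling without replacement inside each pair guarantees that a query never repeats the same word twice; whether the original procedure also filtered duplicate pairs is not stated here.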

deWaC (German)

itWaC (Italian)

ukWaC (English)

  • seed_words_and_tuples.txt
  • Last modified: 2008/02/20 16:02
  • by eros