Seed words and tuples
The first step in building our corpora was to compile lists of basic words and mid-frequency words drawn from other corpora. We then combined these words at random into pairs and submitted each pair to Google as a query in order to obtain seed URLs.
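The random pairing step can be illustrated with a minimal sketch. This is not the script we actually used: the function names `make_seed_pairs` and `to_query`, the sample word list, and the choice to treat pairs as unordered and non-repeating are all assumptions made for illustration. The sketch only builds query strings and does not contact any search engine.

```python
import random


def make_seed_pairs(words, n_pairs, seed=None):
    """Draw n_pairs distinct unordered word pairs at random (hypothetical helper)."""
    rng = random.Random(seed)
    # Deduplicate and sort so that results are reproducible for a given seed.
    unique = sorted(set(words))
    max_pairs = len(unique) * (len(unique) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("not enough distinct words for the requested number of pairs")
    pairs = set()
    while len(pairs) < n_pairs:
        # sample() never returns the same word twice within one pair.
        a, b = rng.sample(unique, 2)
        # Store each pair in sorted order so (a, b) and (b, a) count as one pair.
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


def to_query(pair):
    """Join a pair into a plain two-word query string (assumed query format)."""
    return " ".join(pair)


if __name__ == "__main__":
    # Illustrative words only, not taken from the real seed lists.
    sample_words = ["house", "water", "green", "table", "river"]
    for pair in make_seed_pairs(sample_words, 4, seed=42):
        print(to_query(pair))
```

Fixing the `seed` argument makes a run repeatable, which helps when the same set of queries needs to be regenerated later.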
deWaC (German)
itWaC (Italian)
ukWaC (English)
- ukWaC seed words (collected from the BNC)