Seed words and tuples
The first step in building our corpora was to compile lists of basic words and mid-frequency words drawn from other corpora. We then randomly combined these words into pairs and submitted each pair as a query to Google in order to obtain seed URLs.
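The random pairing step can be sketched roughly as follows. This is a minimal illustration and not the script actually used for the corpora: the function name `make_seed_tuples`, its parameters, and the sample word list are all hypothetical, and the step of sending each query to the search engine is left out.

```python
import math
import random


def make_seed_tuples(words, n_tuples, tuple_size=2, seed=None):
    """Return n_tuples distinct random word combinations as query strings.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    rng = random.Random(seed)          # a local RNG keeps runs reproducible
    vocab = sorted(set(words))         # drop duplicate seed words
    if len(vocab) < tuple_size:
        raise ValueError("not enough distinct words to build a tuple")
    if n_tuples > math.comb(len(vocab), tuple_size):
        raise ValueError("more tuples requested than distinct combinations")

    seen = set()
    queries = []
    while len(queries) < n_tuples:
        combo = rng.sample(vocab, tuple_size)   # distinct words, random order
        key = frozenset(combo)                  # treat (a, b) and (b, a) as equal
        if key not in seen:
            seen.add(key)
            queries.append(" ".join(combo))
    return queries


if __name__ == "__main__":
    # Placeholder words, not taken from the real seed lists.
    sample_words = ["house", "water", "train", "paper", "garden"]
    for query in make_seed_tuples(sample_words, n_tuples=4, seed=1):
        print(query)
```

Each returned string is one query; rejecting repeated combinations avoids sending the same pair twice when the word list is small.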
deWaC
itWaC
ukWaC
- ukWaC seed words (collected from the BNC)