Seed words and tuples

The first step in creating our corpora was to compile lists of basic words and mid-frequency words drawn from other corpora. We then combined these words at random into pairs and submitted each pair as a query to Google in order to obtain seed URLs.
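The random pairing step can be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function names `make_seed_pairs` and `to_query`, the choice to treat pairs as unordered and to avoid duplicates, and the sample words are all assumptions. The sketch stops at building query strings and does not contact any search engine.

```python
import random


def make_seed_pairs(words, n_pairs, seed=None):
    """Draw n_pairs distinct two-word combinations at random.

    Hypothetical helper: pairs are treated as unordered and are never
    repeated, which is an assumption rather than a documented detail.
    """
    rng = random.Random(seed)
    vocab = sorted(set(words))  # deduplicate; sort for reproducibility
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words")
    max_pairs = len(vocab) * (len(vocab) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("only %d distinct pairs are possible" % max_pairs)

    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(vocab, 2)  # two different words
        pairs.add(tuple(sorted((a, b))))  # store each pair in one canonical order
    return sorted(pairs)


def to_query(pair):
    """Join a word pair into a plain query string (hypothetical format)."""
    return " ".join(pair)


if __name__ == "__main__":
    # Illustrative words only, not taken from the real seed lists.
    sample_words = ["haus", "baum", "stadt", "wasser", "zeit"]
    for pair in make_seed_pairs(sample_words, 4, seed=1):
        print(to_query(pair))
```

Passing a fixed `seed` makes the selection repeatable, which helps if the same set of queries has to be regenerated later.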

deWaC (German)

itWaC (Italian)

ukWaC (English)

