BiWeC: Big Web Corpora

In the BiWeC project we aim to create text corpora for various languages, with target sizes of around 10 billion words. We use the World Wide Web as the source of the texts. The project is run by the Natural Language Processing Centre at Masaryk University in Brno.

Web crawling

We run a web crawler to download texts from the Web. The software we use is SpiderLing, developed by the Natural Language Processing Centre.

What do we do with the downloaded data?

We are interested in language use rather than in the content of the downloaded texts. The retrieved text will be cleaned, annotated with morphological information (POS tags, lemmas) and used with the Sketch Engine software for various kinds of computational linguistic research.

What if I don't want my website to be crawled?

Our crawler adheres to the Robots Exclusion Protocol. You can restrict access to some or all of the pages on your website either by creating a robots.txt file or by adding robots meta tags to the HTML of your web pages. The user-agent identification of our crawler is SpiderLing. To prevent our crawler from crawling your website, include the following in your robots.txt:

    User-agent: SpiderLing
    Disallow: /
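
If you want to check the effect of these rules before publishing them, a minimal sketch such as the following may help. It uses the robots.txt parser from the Python standard library, which is not necessarily the parser SpiderLing itself uses. The helper name is_allowed and the example.com URL are illustrative assumptions only.

```python
from urllib.robotparser import RobotFileParser

# The rules shown above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: SpiderLing
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for local testing; it parses the text directly
    and makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page.html"  # placeholder URL
    print("SpiderLing allowed:", is_allowed(ROBOTS_TXT, "SpiderLing", page))
    print("OtherBot allowed:", is_allowed(ROBOTS_TXT, "OtherBot", page))
```

With the rules above, the check should report that SpiderLing is blocked from every path, while crawlers with other user-agent names are unaffected because no rule addresses them.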