A collection of web corpora from 2009 based on data from the ClueWeb09 dataset. The corpora were

  • encoded in UTF-8,
  • cleaned (boilerplate removed using jusText),
  • deduplicated (near-duplicate paragraphs removed using onion),
  • tokenized by unitok (unless stated otherwise),
  • tagged by a state-of-the-art morphological analyzer (as of 2013).
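The deduplication step above can be illustrated with a minimal sketch of near-duplicate paragraph removal based on word n-gram shingles. This is only a toy demonstration of the general idea behind tools like onion, not its actual algorithm; the shingle size and novelty threshold below are assumptions chosen for the example.

```python
# Toy sketch of shingle-based near-duplicate paragraph removal.
# NOT the actual onion implementation; n and threshold are illustrative.

def shingles(text, n=5):
    """Return the set of word n-grams (shingles) of a paragraph."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def deduplicate(paragraphs, threshold=0.5, n=5):
    """Keep a paragraph only if most of its shingles were not seen before."""
    seen = set()
    kept = []
    for p in paragraphs:
        sh = shingles(p, n)
        novel = sh - seen
        if len(novel) / len(sh) > threshold:
            kept.append(p)
        seen |= sh
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "completely different paragraph about morphological tagging of corpora",
]
print(deduplicate(docs))  # the second paragraph is dropped as a near duplicate
```

A real deduplicator operating on web-scale corpora would hash the shingles and stream paragraphs from disk, but the keep/drop decision follows the same novelty-ratio logic.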

German ‒ deClueWeb09

The corpus consists of 49,814,309 pages.
Tagged by RFTagger and TreeTagger.

English ‒ enClueWeb09

Tagged by TreeTagger.

Spanish ‒ esClueWeb09

The corpus consists of 79,333,950 pages.
Tagged by FreeLing.

French ‒ frClueWeb09

The corpus consists of 50,883,172 pages.
Tagged by TreeTagger.

Japanese ‒ jaClueWeb09

The corpus consists of 67,337,717 pages.
Tokenized and tagged by MeCab + Unidic 2 + Comainu (long unit words).

Chinese (simplified, traditional) ‒ zhClueWeb09

The corpus consists of 177,489,357 pages.
Tokenized and tagged by Stanford Segmenter and Stanford POS tagger.

Last modified on Feb 3, 2014, 11:41:40 AM