A collection of web corpora from 2009 based on data from the ClueWeb09 dataset. The corpora were

  • encoded in UTF-8,
  • cleaned (boilerplate removed using jusText),
  • deduplicated (near-duplicate paragraphs removed using onion),
  • tokenized by unitok (unless stated otherwise),
  • tagged by a state-of-the-art morphological analyzer (as of 2013).
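The deduplication step above can be illustrated with a minimal sketch of near-duplicate paragraph removal based on word n-gram shingles. This is only a toy demonstration of the general idea behind tools like onion, not its actual algorithm; the shingle size and novelty threshold below are assumptions chosen for the example.

```python
# Toy sketch of shingle-based near-duplicate paragraph removal.
# NOT the actual onion implementation; n and threshold are illustrative.

def shingles(text, n=5):
    """Return the set of word n-grams (shingles) of a paragraph."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def deduplicate(paragraphs, threshold=0.5, n=5):
    """Keep a paragraph only if most of its shingles were not seen before."""
    seen = set()
    kept = []
    for p in paragraphs:
        sh = shingles(p, n)
        novel = sh - seen
        if len(novel) / len(sh) > threshold:
            kept.append(p)
        seen |= sh
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "completely different paragraph about morphological tagging of corpora",
]
print(deduplicate(docs))  # the second paragraph is dropped as a near duplicate
```

A real deduplicator operating on web-scale corpora would hash the shingles and stream paragraphs from disk, but the keep/drop decision follows the same novelty-ratio logic.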

German ‒ deClueWeb09

The corpus consists of 49,814,309 pages.
Tagged by RFTagger and TreeTagger.

English ‒ enClueWeb09

Tagged by TreeTagger.

Spanish ‒ esClueWeb09

The corpus consists of 79,333,950 pages.
Tagged by FreeLing.

French ‒ frClueWeb09

The corpus consists of 50,883,172 pages.
Tagged by TreeTagger.

Japanese ‒ jaClueWeb09

The corpus consists of 67,337,717 pages.
Tokenized and tagged by MeCab + Unidic 2 + Comainu (long unit words).

Chinese (simplified, traditional) ‒ zhClueWeb09

The corpus consists of 177,489,357 pages.
Tokenized and tagged by Stanford Segmenter and Stanford POS tagger.

Last modified on Feb 3, 2014, 11:41:40 AM