== ClueWeb09 ==
A collection of web corpora from 2009, based on data from the [http://lemurproject.org/clueweb09/ ClueWeb09 dataset].
The corpora were:
  - encoded in UTF-8,
  - cleaned (boilerplate removed using [http://nlp.fi.muni.cz/projekty/justext/ jusText]),
  - deduplicated (near-duplicate paragraphs removed using [http://nlp.fi.muni.cz/projects/onion/ Onion]),
  - tokenized by unitok (unless stated otherwise),
  - tagged in 2013 by a morphological analyzer that was state of the art at the time.
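
The deduplication step above can be illustrated with a minimal sketch. It is a simplified illustration of paragraph-level near-duplicate removal based on token n-gram overlap, not Onion's actual implementation. The function names, the n-gram length `n` and the `threshold` value are assumptions chosen for the example, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return the set of token n-grams (shingles) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def drop_near_duplicates(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph only if less than `threshold` of its n-grams were seen before.

    Hypothetical helper for illustration; the parameter values are assumptions.
    """
    seen = set()   # n-grams from all paragraphs processed so far
    kept = []
    for paragraph in paragraphs:
        # Whitespace split is a stand-in for a real tokenizer.
        shingles = ngrams(paragraph.split(), n)
        if not shingles:
            # Fewer than n tokens: too short to judge, so keep it.
            kept.append(paragraph)
            continue
        duplicated = len(shingles & seen) / len(shingles)
        if duplicated < threshold:
            kept.append(paragraph)
        seen |= shingles
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about web corpora",
]
result = drop_near_duplicates(paragraphs)
```

In this sketch the second paragraph shares six of its seven trigrams with the first and is dropped, while the other two are kept. A production tool would work on properly tokenized text and would need a memory-efficient way to store the seen n-grams at web scale.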

=== German ‒ deClueWeb09 ===
The corpus consists of 49,814,309 pages.\\
Tagged by RFTagger and [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== English ‒ enClueWeb09 ===
Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Spanish ‒ esClueWeb09 ===
The corpus consists of 79,333,950 pages.\\
Tagged by [http://nlp.lsi.upc.edu/freeling/ FreeLing].

=== French ‒ frClueWeb09 ===
The corpus consists of 50,883,172 pages.\\
Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Japanese ‒ jaClueWeb09 ===
The corpus consists of 67,337,717 pages.\\
Tokenized and tagged by [http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html MeCab] + UniDic 2 + Comainu (long unit words).

=== Chinese (simplified, traditional) ‒ zhClueWeb09 ===
The corpus consists of 177,489,357 pages.\\
Tokenized and tagged by [http://nlp.stanford.edu/software/segmenter.shtml Stanford Segmenter] and [http://nlp.stanford.edu/software/tagger.shtml Stanford POS tagger].