== ClueWeb09 ==

A collection of web corpora from 2009 based on data from [[http://lemurproject.org/clueweb09/]]. The corpora were
 - encoded in UTF-8,
 - cleaned (boilerplate removed using [http://nlp.fi.muni.cz/projekty/justext/ Justext]),
 - deduplicated (near-duplicate paragraphs removed using [http://nlp.fi.muni.cz/projects/onion/ Onion]),
 - tokenized by unitok (unless stated otherwise),
 - tagged by a state-of-the-art morphological analyzer in 2013.

=== German ‒ deClueWeb09 ===

The corpus consists of 49,814,309 pages.\\
Tagged by RFTagger and [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== English ‒ enClueWeb09 ===

Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Spanish ‒ esClueWeb09 ===

The corpus consists of 79,333,950 pages.\\
Tagged by [http://nlp.lsi.upc.edu/freeling/ FreeLing].

=== French ‒ frClueWeb09 ===

The corpus consists of 50,883,172 pages.\\
Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Japanese ‒ jaClueWeb09 ===

The corpus consists of 67,337,717 pages.\\
Tokenized and tagged by [http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html MeCab] + Unidic 2 + Comainu (long unit words).

=== Chinese (simplified, traditional) ‒ zhClueWeb09 ===

The corpus consists of 177,489,357 pages.\\
Tokenized and tagged by [http://nlp.stanford.edu/software/segmenter.shtml Stanford Segmenter] and [http://nlp.stanford.edu/software/tagger.shtml Stanford POS tagger].
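
As an illustration of the deduplication step listed in the ClueWeb09 overview, the following is a minimal sketch of near-duplicate paragraph removal based on word n-grams. It is not Onion's actual implementation: the function names, the whitespace tokenization, the n-gram length of 5 and the threshold of 0.5 are all assumptions chosen for this example.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Drop paragraphs whose n-grams were mostly seen in earlier paragraphs.

    Hypothetical helper for illustration only; n and threshold are
    assumed values, not settings taken from the corpus pipeline.
    """
    seen = set()   # every n-gram encountered so far
    kept = []
    for paragraph in paragraphs:
        # Assumption: tokens are separated by whitespace.
        grams = ngrams(paragraph.split(), n)
        if not grams:
            # Too short to form a single n-gram: keep it unchanged.
            kept.append(paragraph)
            continue
        duplicated = len(grams & seen) / len(grams)
        if duplicated <= threshold:
            kept.append(paragraph)
        # Record the n-grams either way so later copies are detected.
        seen |= grams
    return kept


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog today",
        "the quick brown fox jumps over the lazy dog tonight",
        "an entirely different paragraph about web corpora and taggers",
    ]
    for paragraph in deduplicate(sample):
        print(paragraph)
```

In this sketch the second sample paragraph shares five of its six 5-grams with the first, so it exceeds the assumed threshold and is dropped, while the unrelated third paragraph is kept.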