ClueWeb09
A collection of web corpora from 2009 based on data from http://lemurproject.org/clueweb09/. The corpora were
- encoded in UTF-8,
- cleaned (boilerplate removed using jusText),
- deduplicated (near-duplicate paragraphs removed using onion),
- tokenized by unitok (unless stated otherwise),
- tagged by a morphological analyzer that was state of the art as of 2013.
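To illustrate the deduplication step listed above, here is a minimal sketch of paragraph-level near-duplicate removal based on shared token n-grams. It is not the actual tool's implementation: the function names, the whitespace tokenization (a stand-in for a real tokenizer), the n-gram size, and the threshold are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple


def shingles(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all consecutive n-token windows of a token list.

    Paragraphs shorter than n tokens yield a single shorter window.
    """
    if len(tokens) < n:
        return [tuple(tokens)] if tokens else []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def drop_near_duplicates(
    paragraphs: Iterable[str],
    n: int = 3,              # hypothetical n-gram size
    threshold: float = 0.5,  # hypothetical duplicate-ratio cutoff
) -> List[str]:
    """Keep a paragraph only if at most `threshold` of its n-grams
    have already been seen in earlier paragraphs."""
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for para in paragraphs:
        # Whitespace splitting is an assumption; a real pipeline would
        # run a proper tokenizer before this step.
        grams = shingles(para.lower().split(), n)
        if not grams:
            continue  # skip empty paragraphs
        dup_ratio = sum(g in seen for g in grams) / len(grams)
        if dup_ratio <= threshold:
            kept.append(para)
        # Record n-grams of every paragraph, kept or dropped.
        seen.update(grams)
    return kept


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "an entirely different paragraph about corpora",
    ]
    for paragraph in drop_near_duplicates(sample):
        print(paragraph)
```

In this sketch the second sample paragraph shares six of its seven trigrams with the first, so it is dropped, while the first and third are kept. Lowering the threshold makes the filter more aggressive.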
German ‒ deClueWeb09
The corpus consists of 49,814,309 pages.
Tagged by RFTagger and TreeTagger.
English ‒ enClueWeb09
Tagged by TreeTagger.
Spanish ‒ esClueWeb09
The corpus consists of 79,333,950 pages.
Tagged by FreeLing.
French ‒ frClueWeb09
The corpus consists of 50,883,172 pages.
Tagged by TreeTagger.
Japanese ‒ jaClueWeb09
The corpus consists of 67,337,717 pages.
Tokenized and tagged by MeCab + UniDic 2 + Comainu (long unit words).
Chinese (simplified, traditional) ‒ zhClueWeb09
The corpus consists of 177,489,357 pages.
Tokenized and tagged by Stanford Segmenter and Stanford POS tagger.
Last modified on Feb 3, 2014, 11:41:40 AM