== ClueWeb09 ==
A collection of web corpora from 2009 based on data from [http://lemurproject.org/clueweb09/ ClueWeb09].
The corpora were
- encoded in UTF-8,
- cleaned (boilerplate removed using [http://nlp.fi.muni.cz/projekty/justext/ jusText]),
- deduplicated (near-duplicate paragraphs removed using [http://nlp.fi.muni.cz/projects/onion/ Onion]),
- tokenized by unitok (unless stated otherwise),
- tagged by a state-of-the-art morphological analyzer in 2013.
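
The deduplication step listed above can be illustrated with a minimal sketch. This is not Onion's implementation: the function name `dedup_paragraphs`, the n-gram length `n=5`, and the `threshold=0.5` cutoff are assumptions chosen for illustration, and tokenization here is plain whitespace splitting rather than unitok.

```python
def dedup_paragraphs(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph only if most of its token n-grams are new.

    Hypothetical helper for illustration; the parameters are assumed,
    not taken from Onion's configuration.
    """
    seen = set()   # all n-grams encountered so far
    kept = []
    for para in paragraphs:
        tokens = para.lower().split()
        # A paragraph shorter than n tokens yields one (short) n-gram.
        count = max(len(tokens) - n + 1, 1)
        ngrams = [tuple(tokens[i:i + n]) for i in range(count)]
        duplicated = sum(1 for gram in ngrams if gram in seen)
        # Drop the paragraph when too large a share was already seen.
        if duplicated / len(ngrams) <= threshold:
            kept.append(para)
        # Record n-grams from every paragraph, kept or dropped.
        seen.update(ngrams)
    return kept


paras = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different paragraph about web corpora and tagging",
]
unique = dedup_paragraphs(paras)  # keeps the first and third paragraphs
```

The second paragraph shares four of its five 5-grams with the first, so it exceeds the assumed threshold and is dropped. A production tool would also need memory-efficient n-gram storage, which this sketch ignores.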

=== German ‒ deClueWeb09 ===
The corpus consists of 49,814,309 pages.\\
Tagged by RFTagger and [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== English ‒ enClueWeb09 ===
Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Spanish ‒ esClueWeb09 ===
The corpus consists of 79,333,950 pages.\\
Tagged by [http://nlp.lsi.upc.edu/freeling/ FreeLing].

=== French ‒ frClueWeb09 ===
The corpus consists of 50,883,172 pages.\\
Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Japanese ‒ jaClueWeb09 ===
The corpus consists of 67,337,717 pages.\\
Tokenized and tagged by [http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html MeCab] + UniDic 2 + Comainu (long unit words).

=== Chinese (simplified, traditional) ‒ zhClueWeb09 ===
The corpus consists of 177,489,357 pages.\\
Tokenized and tagged by [http://nlp.stanford.edu/software/segmenter.shtml Stanford Segmenter] and [http://nlp.stanford.edu/software/tagger.shtml Stanford POS tagger].