Changes between Initial Version and Version 1 of corpora/ClueWeb09

Timestamp: Feb 3, 2014, 11:41:40 AM
Author: xsuchom2
Comment: created

+== ClueWeb09 ==
+A collection of web corpora from 2009 based on data from [[http://lemurproject.org/clueweb09/]].
+The corpora were
+  - encoded in UTF-8,
+  - cleaned (boilerplate removed using [http://nlp.fi.muni.cz/projekty/justext/ jusText]),
+  - deduplicated (near-duplicate paragraphs removed using [http://nlp.fi.muni.cz/projects/onion/ Onion]),
+  - tokenized by unitok (unless stated otherwise),
+  - tagged in 2013 by a state-of-the-art morphological analyzer.
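The deduplication step in the list above can be illustrated with a minimal sketch. It is not Onion's actual implementation: it assumes that a paragraph counts as a near duplicate when at least half of its word 5-grams have already been seen, and it uses whitespace splitting as a stand-in for unitok. The function names `shingles` and `drop_near_duplicates`, the n-gram length, and the threshold are all hypothetical choices made for illustration.

```python
def shingles(tokens, n=5):
    """Return the set of word n-grams (as tuples) in a token list."""
    if not tokens:
        return set()
    if len(tokens) < n:
        # Paragraph shorter than n tokens: treat it as a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def drop_near_duplicates(paragraphs, n=5, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-grams is below threshold.

    Assumption: whitespace tokenization stands in for a real tokenizer,
    and the threshold value is illustrative only.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph.split(), n)
        if not grams:
            continue  # skip empty paragraphs
        duplicate_ratio = len(grams & seen) / len(grams)
        if duplicate_ratio < threshold:
            kept.append(paragraph)
        # Remember the n-grams of every paragraph, kept or dropped.
        seen |= grams
    return kept


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog today",
        "the quick brown fox jumps over the lazy dog tonight",
        "a completely different paragraph about web corpora and tagging",
    ]
    for paragraph in drop_near_duplicates(sample):
        print(paragraph)
```

In this sample the second paragraph shares five of its six 5-grams with the first, so it is dropped, while the third shares none and is kept.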
+=== German ‒ deClueWeb09 ===
+The corpus consists of 49,814,309 pages.\\
+Tagged by RFTagger and [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].
+=== English ‒ enClueWeb09 ===
+Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].
+=== Spanish ‒ esClueWeb09 ===
+The corpus consists of 79,333,950 pages.\\
+Tagged by [http://nlp.lsi.upc.edu/freeling/ FreeLing].
+=== French ‒ frClueWeb09 ===
+The corpus consists of 50,883,172 pages.\\
+Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].
+=== Japanese ‒ jaClueWeb09 ===
+The corpus consists of 67,337,717 pages.\\
+Tokenized and tagged by [http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html MeCab] + Unidic 2 + Comainu (long unit words).
+=== Chinese (simplified, traditional) ‒ zhClueWeb09 ===
+The corpus consists of 177,489,357 pages.\\
+Tokenized and tagged by [http://nlp.stanford.edu/software/segmenter.shtml Stanford Segmenter] and [http://nlp.stanford.edu/software/tagger.shtml Stanford POS tagger].