Changes between Initial Version and Version 1 of corpora/ClueWeb09


Timestamp: Feb 3, 2014, 11:41:40 AM
Author: xsuchom2
Comment: created

== ClueWeb09 ==
A collection of web corpora from 2009 based on data from [[http://lemurproject.org/clueweb09/]].
The corpora were
  - encoded in UTF-8,
  - cleaned (boilerplate removed using [http://nlp.fi.muni.cz/projekty/justext/ jusText]),
  - deduplicated (near-duplicate paragraphs removed using [http://nlp.fi.muni.cz/projects/onion/ Onion]),
  - tokenized by unitok (unless stated otherwise),
  - tagged by a morphological analyzer that was state of the art in 2013.
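
As an illustration of the deduplication step, the following Python sketch drops paragraphs whose word n-grams have mostly been seen earlier in the stream. It is not Onion's implementation: the function names, the whitespace tokenization, the n-gram length of 3 and the 0.5 threshold are all assumptions chosen for the example.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous token n-grams of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def drop_near_duplicates(paragraphs: Iterable[str], n: int = 3,
                         threshold: float = 0.5) -> List[str]:
    """Keep a paragraph unless more than `threshold` of its n-grams were seen before.

    Hypothetical helper for illustration only; parameters are assumed values.
    """
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.split(), n)
        if not grams:
            # Too short to compare; keep it unchanged.
            kept.append(paragraph)
            continue
        seen_share = sum(g in seen for g in grams) / len(grams)
        if seen_share <= threshold:
            kept.append(paragraph)
        # Record the n-grams either way so later copies are detected too.
        seen.update(grams)
    return kept
```

Calling `drop_near_duplicates` on a list of paragraphs returns the paragraphs in their original order with the near-duplicate ones left out; raising `threshold` makes the filter more permissive.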

=== German ‒ deClueWeb09 ===
The corpus consists of 49,814,309 pages.\\
Tagged by RFTagger and [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== English ‒ enClueWeb09 ===
Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Spanish ‒ esClueWeb09 ===
The corpus consists of 79,333,950 pages.\\
Tagged by [http://nlp.lsi.upc.edu/freeling/ FreeLing].

=== French ‒ frClueWeb09 ===
The corpus consists of 50,883,172 pages.\\
Tagged by [http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ TreeTagger].

=== Japanese ‒ jaClueWeb09 ===
The corpus consists of 67,337,717 pages.\\
Tokenized and tagged by [http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html MeCab] + UniDic 2 + Comainu (long unit words).

=== Chinese (simplified, traditional) ‒ zhClueWeb09 ===
The corpus consists of 177,489,357 pages.\\
Tokenized and tagged by [http://nlp.stanford.edu/software/segmenter.shtml Stanford Segmenter] and [http://nlp.stanford.edu/software/tagger.shtml Stanford POS tagger].