Corpus of contemporary blogs


Description

At the NLP Centre, text is currently divided into sentences by a rule-based tool. To obtain enough training data for machine learning, annotators split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators.

The corpus was created from ten contemporary blogs:

  • hintzu.otaku.cz
  • modnipeklo.cz
  • bloc.cz
  • alenaprokopova.blogspot.com
  • blog.aktualne.cz
  • fuchsova.blog.ona.idnes.cz
  • havlik.blog.idnes.cz
  • blog.aktualne.centrum.cz
  • klusak.blogspot.cz
  • myego.cz

Publisher

Natural Language Processing Centre (NLP Centre)
Faculty of Informatics, Masaryk University
Botanicka 68a, 602 00 Brno, Czech Republic
nlp.fi.muni.cz

Format

Each line consists of one sentence in XML format
(starting with the tag <s> and ending with the tag </s>).
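
A minimal sketch of how such a file could be read, assuming only what is stated above: one sentence per line, wrapped in <s> and </s>. The function name read_sentences and the sample sentences are hypothetical and not part of the corpus distribution; lines that do not match the pattern are simply skipped.

```python
import re

# Assumed line shape: <s>sentence text</s>. Attributes on <s> are tolerated
# in case the real files carry any (the description does not say).
SENTENCE_LINE = re.compile(r"^\s*<s(?:\s[^>]*)?>(.*)</s>\s*$")


def read_sentences(lines):
    """Yield the text of every line that is wrapped in <s>...</s>."""
    for line in lines:
        match = SENTENCE_LINE.match(line)
        if match:
            yield match.group(1).strip()


# Hypothetical sample lines, not taken from the corpus.
sample = ["<s>Dnes je hezky.</s>\n", "<s>Kam jdeš?</s>\n"]
print(list(read_sentences(sample)))
```

Because read_sentences accepts any iterable of strings, an open file handle can be passed to it directly.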

Language

Czech language

Licence

Annotators adhered to the following rules for sentence tagging:

  • an intuitive view: a sentence begins with a capital letter (if punctuation precedes it) and ends with a period, exclamation mark, question mark, three dots or quotation marks (thus a simple sentence or a complex sentence, an autonomous unit)
  • the sentence does not include initial "mess" such as *, 1), ...
  • special cases:
    • TITLES
      • they stand alone, therefore they are separate sentences
      • they are usually not separated from the text by punctuation, but they are on a separate line
      • sometimes a clause separated by a colon can be considered a title
    • CONTENT OF BRACKETS
      • a semicolon marks a deeper division than a comma; to some extent it isolates, and therefore it separates two sentences
      • an exception: when there is an enumeration (list) and the semicolon stands between coordinated terms (dividing them into groups), the whole enumeration is taken as part of the clause
    • COLON, ENUMERATIONS
      • non-sentential: the items are part of a sentence (along with the introductory sentence before the colon)
      • sentential: items that are clauses count as sentences if they are separated by periods; when the colon separates larger individual parts, both parts are sentences (i.e., when the colon can be replaced by a full stop or a semicolon)
    • DASHES
      • one dash - similar to a colon: it separates sentences when it can be replaced by a period or when the parts can stand separately
      • sometimes dashes are used as commas (or possibly as parentheses)

NOTE: some punctuation marks in the text may appear in parentheses - then they have a function other than separation
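
The intuitive view at the top of the list can be approximated in code. The sketch below is only an illustration of that single rule, not the annotators' procedure and not the rule-based tool used at the NLP Centre; the name split_intuitive, the list of capital letters and the sample text are assumptions. It does not handle abbreviations or the special cases (titles, semicolons, colons, dashes), which required human judgement.

```python
import re

# Capital letters that may open a Czech sentence (assumed list).
UPPER = "A-ZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"

# A boundary: end punctuation (three dots, ellipsis character, period,
# exclamation mark or question mark), an optional closing quote, whitespace,
# and then a capital letter or an opening quote.
BOUNDARY = re.compile(
    r'((?:\.\.\.|[.!?…])["“”]?)\s+(?=[' + UPPER + r'„])'
)


def split_intuitive(text):
    """Split text at boundaries that match the intuitive rule only."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        # The sentence ends right after the punctuation (group 1).
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Hypothetical sample text, not taken from the corpus.
print(split_intuitive("Prší. Venku je zima! Co teď?"))
```

A splitter of this kind would wrongly break after an abbreviation followed by a capitalised word, which is one reason manually annotated data are useful for training.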