Corpus of contemporary blogs


At the NLP Centre, text is currently divided into sentences by a rule-based tool. To obtain enough training data for machine learning, annotators split a corpus of contemporary text (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators.

The corpus was created from ten contemporary blogs:



Natural Language Processing Centre (NLP Centre)
Faculty of Informatics Masaryk University
Botanicka 68a, 602 00 Brno, Czech Republic


Each line consists of one sentence in XML format
(starting with the tag <s> and ending with </s>)
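
The line format described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that the <s> tag carries no attributes and that each sentence sits on a single line, neither of which is confirmed here. The names SENTENCE_LINE and read_sentences are hypothetical, and XML entity escaping is not handled.

```python
import re

# Assumed line shape: optional whitespace, <s>, sentence text, </s>.
SENTENCE_LINE = re.compile(r"^\s*<s>(.*)</s>\s*$")


def read_sentences(lines):
    """Yield the text of every line of the form <s>...</s>; skip other lines."""
    for line in lines:
        match = SENTENCE_LINE.match(line)
        if match:
            yield match.group(1).strip()


# Illustrative input only; these are not lines taken from the corpus.
sample = ["<s>Dnes je hezky.</s>\n", "<s>Půjdeme ven?</s>\n", "\n"]
sentences = list(read_sentences(sample))
```

For a corpus file, the same function can be given an open file object, since iterating over a file yields its lines.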


Czech language


Annotators adhered to the following rules for sentence tagging:

  • an intuitive view: the sentence begins with a capital letter (if a punctuation mark precedes it) and ends with a period, exclamation mark, question mark, three dots or quotation marks (this covers both simple and complex sentences, each an autonomous unit)
  • the sentence does not include initial "mess" such as *, 1), ...
  • special cases:
    • TITLES
      • stand alone, therefore they are separate sentences
      • they are usually not separated from the text by punctuation, but they stand on a separate line
      • sometimes, a clause separated by a colon can be considered as a title
    • SEMICOLONS
      • a deeper division than a comma: to some extent it isolates, and therefore it separates two sentences
      • an exception: when there is an enumeration (list) and the semicolon stands between coordinated terms (dividing them into groups), the whole enumeration is taken as part of the clause
    • COLONS
      • non-sentential text after the colon: it is part of a sentence (along with the introductory sentence before the colon)
      • sentential text after the colon: simple sentences count as clauses, but if they are separated by periods, they are sentences; when the colon separates larger independent parts, both parts are sentences (i.e., when the colon can be replaced by a full stop or a semicolon)
    • DASHES
      • a single dash is similar to a colon: it separates sentences when it can be replaced by a period
      • sometimes dashes are used as commas (or possibly as parentheses)

NOTE: some punctuation marks in the text may appear in parentheses - then they have a function other than separation
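
The "intuitive view" and the rule about initial "mess" can be approximated by a naive baseline, sketched below in Python. It is not the NLP Centre tool and not the annotators' procedure: it ignores titles, semicolons, colons, dashes and punctuation in parentheses, and it will split wrongly after abbreviations. The names BOUNDARY, LEADING_MESS, split_sentences and strip_leading_mess are hypothetical.

```python
import re

# Boundary: final punctuation (. ! ? …), an optional closing quote, whitespace,
# then an uppercase letter, optionally preceded by an opening quote.
BOUNDARY = re.compile(r'[.!?…]+["“”]?(\s+)(?=["„]?[A-ZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ])')

# Initial "mess" such as "*" or "1)" at the start of a sentence.
LEADING_MESS = re.compile(r"^(?:\*|\d+\))\s*")


def split_sentences(text):
    """Split text at punctuation followed by whitespace and a capital letter."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # The sentence ends just before the separating whitespace.
        sentences.append(text[start:match.start(1)].strip())
        start = match.end(1)
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def strip_leading_mess(sentence):
    """Remove an initial '*' or 'N)' marker, which is not part of the sentence."""
    return LEADING_MESS.sub("", sentence)


# Illustrative text only; it is not taken from the corpus.
parts = split_sentences("Dnes je hezky. Půjdeme ven? Ano... Tak jo!")
```

Comparing the output of such a baseline with the annotated files is one way to see where the special cases (titles, colons, dashes) matter most.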