Czech grammar agreement task and data set

# AGREE

The *Czech grammar agreement* (AGREE) task and data set were created as a benchmark for evaluating language models. They were introduced in my doctoral thesis *Byte level language models*.

The task was inspired by the [Microsoft Research Sentence Completion Challenge task](http://research.microsoft.com/en-us/projects/scc/), where the goal for language models is to fill in one missing word in a sentence, given five semantically similar words. The goal of the AGREE task is similar but at a lower language level: to fill in the missing grammar suffixes of past tense verbs.

In Czech, the forms of these verbs must agree with the gender and number of their subjects. For past tense verbs, this is expressed by the grammar suffixes *a*, *o*, *i*, *y* and an empty suffix (*null* or *zero morpheme*). These correspond to the following gender and number categories:

+ -a, e.g. žila (she lived or they lived): subject is feminine singular, f. plural, neuter pl.,
+ -o, e.g. žilo (it lived): n. sg.,
+ -i, e.g. žili (they lived): masculine pl., or mixed f. and m. subjects,
+ -y, e.g. žily (they lived): f. pl., n. pl.,
+ - (empty suffix), e.g. žil (he lived): m. sg.

Because Czech has relatively free word order, subjects and predicates may be far apart.

Although the task operates at the morphological level, it is technically word-based, since the five suffixes yield five different word forms. The procedure is therefore the same as in the Sentence Completion Challenge (SCC) task. The main difference is that in AGREE, a sentence may contain more than one past tense verb.

In SCC there are 1,040 test sentences, i.e. 5,200 sentences after expanding the five options. In AGREE, there are 996 sentences, which expand to 17,940 sentences.

## Evaluation of the task

Language models assign scores (sums of log probabilities) to the expanded sentences, and for each original sentence the expansion with the highest score is selected.
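
The expand-then-select procedure described above can be sketched as follows. This is a minimal illustration, not the released `expand.py` or `bestof.py`: the `_***` placeholder follows the `.q` files of the data set, while `expand`, `best_candidate`, `toy_score`, `KNOWN_FORMS` and the shortened example sentence are hypothetical and introduced only for this sketch. Keeping the `***` marker in the expanded sentences is also an assumption.

```python
from itertools import product

# Zero suffix plus the four overt suffixes listed in the task description.
SUFFIXES = ["", "a", "o", "i", "y"]
# Placeholder used in the .q files, e.g. "vytáhl_***".
PLACEHOLDER = "_***"
MARK = "***"


def expand(sentence):
    """Return every suffix combination for a sentence with placeholders."""
    parts = sentence.split(PLACEHOLDER)
    n_verbs = len(parts) - 1
    candidates = []
    for combo in product(SUFFIXES, repeat=n_verbs):
        pieces = [parts[0]]
        for suffix, rest in zip(combo, parts[1:]):
            pieces.append(suffix + MARK + rest)
        candidates.append("".join(pieces))
    return candidates


def best_candidate(candidates, score):
    """Pick the candidate with the highest score."""
    return max(candidates, key=score)


# Hypothetical stand-in for a language model: it only rewards word forms
# from a tiny hand-made set. A real run would use model scores instead.
KNOWN_FORMS = {"otočila***", "slíbil***"}


def toy_score(sentence):
    return sum(token in KNOWN_FORMS for token in sentence.split())


if __name__ == "__main__":
    question = "Jenomže mince se otočil_*** a Lyon Barošovi slíbil_*** ."
    candidates = expand(question)
    print(len(candidates))  # two verbs, five suffixes each: 25
    print(best_candidate(candidates, toy_score))
```

A sentence with k marked verbs expands to 5^k candidates, which is why the 996 test sentences grow to many more lines than five per sentence.
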
Auxiliary scripts (see below) count verb accuracy (the share of verbs selected correctly) and sentence accuracy (the share of sentences in which every past tense verb has the right suffix).

For the results of various language models, see the [dissertation thesis](https://is.muni.cz/th/139654/fi_d/).

## Data description

The data set consists of 10 million Czech sentences from a [Czech Web Corpus](https://www.sketchengine.co.uk/cztenten-corpus/) with marked verbs in past tense. It is split into three parts:

+ TRAIN with 9,900,000 sentences (`agree.train`)
+ VALID with 99,000 sentences (`agree.valid`, `agree.valid.q`)
+ TEST with 996 sentences (`agree.eval`, `agree.eval.q`, `agree.eval.expanded.char.lower.txt`)

Each sentence is on a separate line and is tokenized in a similar way to other standard language modeling data sets.

The TRAIN part contains sentences with marked past tense verbs. The VALID and TEST parts come in two variants: the first (`agree.valid`, `agree.eval`) has the verbs marked as in TRAIN, the second (`agree.valid.q`, `agree.eval.q`) has the suffixes replaced by an underscore.

An additional file, `agree.eval.expanded.char.lower.txt`, contains the sentences in a special format required by the SRILM and RNN toolkits: each space is replaced by an underscore and the characters are separated by spaces. The sentences in this file are already expanded, so it can be used directly as an input file for testing.

### Data preparation

The source corpus was morphologically tagged, so the past tense verbs could be marked. The pipeline used to prepare the data is released so that it can be reused. Sentences were included if they met these criteria:

+ they contain past tense verbs,
+ their length is between 40 and 120 characters, and
+ they start with an uppercase letter.

Dashes and quotes have been normalized. Sentences have been shuffled.

### Example of TRAIN sentences

+ Tento večer a noc byli\*\*\* opravdu zajímavé .
+ Ráno jsme se postavili\*\*\* na místa a začala\*\*\* nekonečná anabáze srandy heců na všechny co jsme tam znali\*\*\* .
+ Jenomže mince se otočila\*\*\* a Lyon Barošovi slíbil\*\*\* , že v Ligue 1 má budoucnost .
+ Veškeré myšlenky na Jesseho ofenzívu Desmond rozprášil\*\*\* dalším útokem na oči a Corner Uppercutem .
+ Nejen nepříjemnost vyšetření byla\*\*\* neveselou věcí úterního dne .

### Example of VALID sentences with replaced suffixes

+ Zjistil\_\*\*\* jsem , že u většiny psů je sed klidová pozice , zatímco stoj pohotovostní pozice .
+ A jiní zase vytáhl\_\*\*\* drakovy vlajky a jásal\_\*\*\* s mávátky u příkopů , že se Drak vrátil\_\*\*\* .
+ Během jejich vystoupení se několikrát konal\_\*\*\* Circle pit a já měl\_\*\*\* nehoráznou touhu jít tam taky : D ach .
+ Popravdě by mu většina z nás s chutí plivl\_\*\*\* do tváře , než aby ho uctivě vítal\_\*\*\* .
+ Také je třeba padák podrovnat , tedy zatáhnout , aby i vertikální rychlost přistání byl\_\*\*\* skoro nulová .

## Auxiliary scripts

Auxiliary scripts are released to ease the process of language model evaluation.

`make.sh` contains the commands for creating the three parts of the AGREE data set. It requires the original Czech corpus, so it will not run as-is; it is included to document how the data was made.

`filter.py` reads the special [vertical](https://www.sketchengine.co.uk/preparing-a-text-corpus-for-the-sketch-engine-overview/) format of the original Czech web corpus and selects only the sentences containing past tense verbs. It marks these verbs with \*\*\* for later processing and outputs one sentence per line.

`expand.py` reads standard input and expects data where suffixes are replaced by an underscore (the VALID and TEST parts, files with the `.q` extension). For each sentence it outputs all combinations of suffixes, one sentence per line. The output is then fed to a language model, which assigns each sentence a score / probability.

`baseline.py` implements the baseline method from the thesis.
It needs the [manatee](https://nlp.fi.muni.cz/trac/noske) package to access indexed corpora and get raw counts of the word forms from the test sentences. The most frequent word form is selected. The script uses the Czech corpus czTenTen for the frequency statistics, so again, it will not work out of the box. The output is generated so that it can be evaluated directly with the `eval.py` script (see below).

`bestof.py [-r]` reads standard input with two columns separated by a tab character. The first column contains a sentence and the second its score assigned by a language model. The script selects the sentence with the highest score and outputs it (just the sentence, not the score). With the `-r` parameter, the sentence is selected randomly instead of by score. If several expansions of the same original sentence share the maximum score, a random one of them is output. The same holds for the two following variants of this script.

Since the SRILM and RNN toolkits used for evaluating n-gram and RNN language models produce different output in the testing phase, two other scripts were written.

`bestof_rnn.py SCORES [-c]` reads standard input with expanded sentences and an additional file with scores on separate lines. If a test sentence contains an OOV word, the RNN cannot assign it a score / probability. In that case, the sentence is assigned a very low probability so that sentences without OOV words are preferred. If all expansions of a sentence contain OOV words, a random one is output. Character RNN models work with a different tokenization, where spaces are replaced by underscores and the characters are separated by white space. When the `-c` parameter is used, the script converts this format back into the original format compatible with the AGREE data set so that the output can be evaluated.

`bestof_srilm.py SCORES [-c]` reads standard input with sentences and scores.
The SRILM toolkit outputs a sentence followed, on the next line, by its score and other statistics, and this script copes with that output format. OOV sentences are treated the same way as in the previous script.

`eval.sh` will not work out of the box because it contains paths specific to our faculty servers. It is included to document what parameters were used for the various models from the SRILM and RNN toolkits.

`eval.py EVAL TEST` is the main script for evaluation. It compares two files: `EVAL`, which must be the test file from the data set (`agree.eval`), and `TEST`, the file to be evaluated. An example command is `python eval.py agree.eval model.eval.scores.txt`. The file `model.eval.scores.txt` must contain the same number of lines as `agree.eval`, otherwise the accuracy cannot be computed. The output of the script is in this format:

```
1368 past tense verbs in 14088 words in 996 sentences.
480 good answers in 261 good sentences.
Verb accuracy: 35.0877
Sent accuracy: 26.2048
```

## Download

+ [AGREE data set v1.0](agree.data.tar.gz)
+ [AGREE scripts v1.0](agree.scripts.tar.gz)

## Changelog

+ 12/6/2016 v1.0

## Licence

The data is released under [Creative Commons BY-NC 2.0](https://creativecommons.org/licenses/by-nc/2.0/). The scripts are released under [GNU GPL](https://www.gnu.org/licenses/gpl-3.0.txt) v3.0.

********

Copyright 2016 [Vít Baisa](http://nlp.fi.muni.cz/~xbaisa/)