= Evaluation of the output of GPT-2 abstract summarization =

[[Image(VyhodnoceniSumarizaceManual:sum_anot2.png,width=50%,right)]]

== Annotation Manual ==

The goal is to find and classify errors in machine-generated summaries of Czech newspaper articles. We are therefore not evaluating the quality of a summary in the sense of conciseness; we are only concerned with the mechanism and nature of any error it contains.

=== Technical assumptions ===

The annotation is performed on the Qualtrics questionnaire platform (usable on both desktop and mobile devices).

Each assignment consists of the sections Input Text, Gold, and Generated. We evaluate only the Generated summary in relation to the Input Text. The Gold summary can provide context for better understanding, but it played no part in producing the Generated summary, so it must not influence the evaluation.

The answer table contains four columns and one or three rows (depending on whether a title or an abstract is being generated). The rows "Sentence1", "Sentence2", ... refer to the corresponding sentences in the Generated section, each marked with "•".

! Each column (Special cases, Mapping, Meaning) may have at most one checkbox checked (e.g. OK or Repetitive or Sentence missing) !

! For each sentence, fill in either the first column (Special cases) OR the remaining ones (Mapping, Meaning) !

(Unfortunately, the form cannot enforce this behaviour, so please be careful; otherwise the answer will not be valid.)

After processing all the texts, submit the result to the system using the [[Image(VyhodnoceniSumarizaceManual:button.png)]] button.

=== Explanation of annotation values ===

`Special cases`:
 - used when specifying an error does not make sense
 - ! if a value is checked here, the other columns must be left empty for that sentence
 1. `OK`: we found no grammatical or factual error in the sentence, given the Input Text and the rest of the Generated summary.
 2. `Repetitive`: the sentence has already occurred in the Generated summary, or one of the previous sentences of the summary had exactly the SAME meaning. Apart from the repetition, the sentence contains no errors of fact or grammar.
 3. `Sentence missing`: the Generated summary has the wrong number of sentences (e.g. the abstract has only two sentences (•) => the row for Sentence3 is marked with the special case `Sentence missing`).

`Mapping`:
 - helps to detect the CAUSE of the error
 - surface level: how the summarizer uses words and sentences to create errors in the abstract
 1. `Omission`: copying a sentence/phrase but omitting a word/phrase
   - e.g.:
     - Input: (...) ''Trenér Nigel Pearson se obává dalších zranění, zatímco jeho mužstvo pokračuje v boji o přežití **v Premiere League**.'' (...)
     - Generated: ''Trenér Nigel Pearson se obává dalších zranění, zatímco jeho mužstvo pokračuje v boji o přežití.''
 2. `Wrong combination`: copying parts of several different sentences and combining them incorrectly
   - e.g.:
     - Input: (...) ''Hráči musí házet jídlo na dívku, která se objeví v jedné z devíti děr, a následně zmizí. Pokud hráč dívku mine, začne dívka ztrácet na váze, až nakonec zemře.'' (...)
     - Generated: ''Hráči musí házet jídlo na dívku, která se objeví v jedné z devíti děr, a následně **zemře**.''
 3. `Fabrication`: adding one or more new words that cause an error (the words do not appear in the Input Text, so it is not a `Wrong combination`)
   - e.g.:
     - Input: (...) ''Mauresmo, která by měla v srpnu porodit, bude zhruba v osmém měsíci během Wimbledonu toto léto.'' (...)
     - Generated: ''Mauresmo bude v osmém měsíci těhotenství **se svým prvním dítětem**.''
 4. `Lack of rewriting`: incorrect rewriting of sentences (e.g. insufficient context, incorrect substitution of a referring phrase with a non-original object)
   - e.g.:
     - Input: (...) ''**Ukázalo se, že korporace může být skutečně stíhána jako osoba.** Je to praxe, kterou Nejvyšší soud prosazuje již více než století.'' (...)
     - Generated: ''Je to praxe, kterou Nejvyšší soud prosazuje již více než století.''

`Meaning`:
 - the EFFECT of the error
 - ! `Malformed` takes precedence over `Misleading` (it is less common)
 - categories and types:
 1. `Malformed`: the reader is puzzled by the quality of the sentence, but the sentence is neither misleading nor false
   a. `Ungrammatical`: a syntactically damaged/unnatural sentence; a speaker would not say it that way
   b. `Semantically implausible`: a semantically nonsensical/unnatural sentence
   c. `No meaning can be inferred`:
     - a grammatically correct sentence to which no meaning can be assigned
     - usually associated with `Lack of rewriting`: context is missing and the sentence loses its meaning
     - e.g.:
       - Input: (...) ''**Ukázalo se, že korporace může být skutečně stíhána jako osoba.** Je to praxe, kterou Nejvyšší soud prosazuje již více než století.'' (...)
       - Generated: ''Je to praxe, kterou Nejvyšší soud prosazuje již více než století.''
 2. `Misleading`: the sentence may induce incorrect beliefs that cannot be inferred from the article
   a. `Meaning changed, not entailed`: the meaning of the sentence cannot be inferred from the article (in the context of summarization)
     - e.g.:
       - Input: (...) ''Mauresmo, která by měla v srpnu porodit, bude zhruba v osmém měsíci během Wimbledonu toto léto.'' (...)
       - Generated: ''Mauresmo bude v osmém měsíci těhotenství **se svým prvním dítětem**.''
   b. `Meaning changed, contradiction`: the meaning of the sentence is reversed, or is a DIFFERENT meaning from the one we infer from the article (in the context of summarization)
     - e.g.:
       - Input: (...) ''Hráči musí házet jídlo na dívku, která se objeví v jedné z devíti děr, a následně zmizí. Pokud hráč dívku mine, začne dívka ztrácet na váze, až nakonec zemře.'' (...)
       - Generated: ''Hráči musí házet jídlo na dívku, která se objeví v jedné z devíti děr, a následně **zemře**.''
   c. `Pragmatic meaning changed`: the sentence takes on a PRAGMATIC meaning that is not present in the article, or a PRAGMATIC meaning present in the article disappears (in the context of summarization); e.g. a figurative sentence was used and its meaning changed or disappeared in the summary (it sounds as if it were meant literally)
     - e.g.:
       - Input: (...) ''Trenér Nigel Pearson se obává dalších zranění, zatímco jeho mužstvo pokračuje v boji o přežití **v Premiere League**.'' (...)
       - Generated: ''Trenér Nigel Pearson se obává dalších zranění, zatímco jeho mužstvo pokračuje v boji o přežití.''

`Mistake explanation`:
 - a free-text field for a more detailed description of the error, used to compare how individual annotators approach the evaluation
 - it is not machine-checked, but it helps us substantially in assessing the consistency of the responses
 - e.g. for the sentence about coach Nigel Pearson (above):
   - Mistake explanation: omitting the words "v Premiere League" (in the Premier League) changes the meaning of the phrase "boj o přežití" (struggle to survive).

More practical examples can be found in the [https://aclanthology.org/2020.eval4nlp-1.1.pdf original article].

== Possible problems ==

We have encountered the following problems while filling in the form:
 - Check that the displayed questionnaire has checkboxes in the shape of a SQUARE and not a CIRCLE (i.e. multiple-answer and not single-answer questions).
   => SOLUTION: if they are displayed incorrectly, use a browser other than Chrome. Mozilla Firefox should work, and the mobile version of Chrome worked for me too. I have no way to test the bug further.
 - Although completing **Mistake explanation** is not mandatory (e.g. when the sentence contains no error), the system requires it for some questions and refuses to let the user proceed (observed for INPUT 575).
   => SOLUTION: if this happens, fill in the text field with any text (e.g. OK for error-free sentences); we will deal with it during the evaluation.
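
Because the form cannot enforce the two validity rules from the Technical assumptions section (at most one checked value per column; Special cases OR Mapping/Meaning, never both), submitted rows could be checked after collection. The following is only a minimal sketch of such a check. It assumes, hypothetically, that each sentence row is available as a dictionary mapping a column name to the list of values checked in that column, and that the Meaning column lists the six types directly; neither the data layout nor the function name `validate_row` comes from Qualtrics or from the annotation setup itself.

```python
# Hypothetical layout: one dict per sentence row, mapping each column name
# to the list of values checked in that column.
SPECIAL = {"OK", "Repetitive", "Sentence missing"}
MAPPING = {"Omission", "Wrong combination", "Fabrication", "Lack of rewriting"}
# Assumption: the Meaning column offers the six types, not the two categories.
MEANING = {
    "Ungrammatical",
    "Semantically implausible",
    "No meaning can be inferred",
    "Meaning changed, not entailed",
    "Meaning changed, contradiction",
    "Pragmatic meaning changed",
}
ALLOWED = {"Special cases": SPECIAL, "Mapping": MAPPING, "Meaning": MEANING}


def validate_row(row):
    """Return a list of rule violations for one sentence row (empty if valid)."""
    problems = []
    # Rule 1: at most one checked value per column, and only known values.
    for column, allowed in ALLOWED.items():
        checked = row.get(column, [])
        if len(checked) > 1:
            problems.append(f"{column}: more than one value checked")
        for value in checked:
            if value not in allowed:
                problems.append(f"{column}: unknown value {value!r}")
    # Rule 2: either Special cases or Mapping/Meaning, never both, never neither.
    special = bool(row.get("Special cases"))
    other = bool(row.get("Mapping")) or bool(row.get("Meaning"))
    if special and other:
        problems.append("Special cases is filled in together with Mapping/Meaning")
    if not special and not other:
        problems.append("row is empty")
    return problems


if __name__ == "__main__":
    example = {"Special cases": ["OK"], "Mapping": ["Omission"]}
    for problem in validate_row(example):
        print(problem)
```

A row for which the function returns an empty list satisfies both rules; the free-text Mistake explanation field is deliberately not checked, since it is not mandatory.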