Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2023/AutomaticCorrection


Timestamp: Sep 3, 2024, 2:52:02 PM (11 months ago)
Author: Ales Horak
Comment: copied from private/NlpInPracticeCourse/AutomaticCorrection

= Automatic language correction =
[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák

Prepared by: Aleš Horák, Ján Švec

== State of the Art ==
Language correction nowadays has many potential applications for the large amounts of informal and unedited text generated online: web forums, tweets, blogs, or emails. Automatic language correction can include several tasks: spell checking, grammar checking and word completion.

In the theoretical lesson, we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting, and sometimes suggesting corrections for, incorrectly spelled words in a text. The lesson will also focus on grammar checking, which deals with the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.

The lesson will also answer the question "How difficult is it to develop a spell checker?" and present a tool that performs spell checking and autocorrection.
=== References ===
 1. Gupta, Prabhakar. "A context-sensitive real-time Spell Checker with language adaptability." 2020 IEEE 14th International Conference on Semantic Computing (ICSC). IEEE, 2020. [https://ieeexplore.ieee.org/document/9031515 link]
 1. Rothe, Sascha, et al. "A simple recipe for multilingual grammatical error correction." ACL-IJCNLP 2021. [https://aclanthology.org/2021.acl-short.89 link]
 1. Didenko, Bohdan, and Andrii Sameliuk. "!RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans." Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP). 2023. [https://aclanthology.org/2023.unlp-1.15/ link]
 1. Náplava, Jakub, et al. "Czech grammar error correction with a large and diverse corpus." Transactions of the Association for Computational Linguistics 10 (2022): 452-467. [https://aclanthology.org/2022.tacl-1.26/ link]


== Practical Session ==
{{{
#!div class="wiki-toc" style="width: 40%"
**Note:** If you are new to the [https://en.wikipedia.org/wiki/Command-line_interface command line interface] via a [https://en.wikipedia.org/wiki/Terminal_emulator terminal window], you may find the **[https://ubuntu.com/tutorials/command-line-for-beginners#3-opening-a-terminal tutorial for working in terminal]** useful.
}}}

There are two tasks; you may choose one or both:
 1. [wiki:/en/NlpInPracticeCourse/AutomaticCorrection#task1 statistical spell checker for English]
 2. [wiki:/en/NlpInPracticeCourse/AutomaticCorrection#task2 rule-based grammar checker (punctuation) for Czech]

== Task 1: Statistical spell checker for English == #task1

In the theoretical lesson, we became acquainted with various approaches to how spelling correctors work. Now we will see how a simple spellchecker based on '''edit distance''' works.

The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in Python. The spelling corrector is trained on a large text file consisting of about one million words.

We will test this tool with prepared data. Your goal will be to enhance the spellchecker's accuracy.

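Before walking through the script, here is a minimal illustration (not part of `spell.py`) of what ''edit distance'' means: the number of single-character insertions, deletions or substitutions needed to turn one word into another. The spellchecker below additionally counts a transposition of two adjacent letters as a single edit.
{{{
def levenshtein(a, b):
    # classic dynamic-programming edit distance (insertions, deletions, substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein('speling', 'spelling'))    # 1 - one missing letter
print(levenshtein('korrecter', 'corrector')) # 2 - two wrong letters
}}}
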
 1. Download [htdocs:bigdata/task_ia161-spell.zip task_ia161-spell.zip] with the prepared script `spell.py` and the training data collection `big.txt`. Unzip it and change to the extracted directory.
 {{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/task_ia161-spell.zip
unzip task_ia161-spell.zip
cd task_ia161-spell
}}}
 1. Test the script by running
 {{{
python spell.py
}}}
 1. Open `spell.py` in your favourite editor; we will walk through its functionality.


=== Spellchecker functionality with examples ===

1. The spellchecker is '''trained''' on the file `big.txt`, which is a concatenation of several public domain books from '''Project Gutenberg''' and lists of the most frequent words from '''Wiktionary''' and the '''British National Corpus'''. The function `train` counts how many times each word occurs in the text file; `NWORDS[w]` holds the number of times the word '''w''' has been seen.
 {{{
import re
import collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # every word gets a default count of 1, so unseen words are not impossible
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
}}}
1. '''Edit distance 1''' is implemented by the function `edits1` - it generates every string obtained by a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another), or an insertion (add a letter). For a word of length '''n''', there are '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''' candidates. Example: `len(edits1('something')) = 494` words (fewer than 54*9+25 = 511, because the returned `set` removes duplicates).
 {{{
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)
}}}
1. '''Edit distance 2''' (`edits2`) applies `edits1()` to all the results of `edits1()`. Example: `len(edits2('something')) = 114 324` words, which is a very large set. To keep things fast, we only retain the candidates that are actually known words (`known_edits2()`). Now `known_edits2('something')` is a set of just 4 words: `{'smoothing', 'seething', 'something', 'soothing'}`.
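 A sketch of these two helpers, following Norvig's article (the code in `spell.py` may differ slightly):
 {{{
def edits2(word):
    # every string reachable by two single edits - a very large set
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

def known_edits2(word):
    # the same, but keeping only candidates that are known words
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
}}}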
1. The function `correct()` chooses as candidates the first non-empty set with the '''shortest edit distance''' to the original word (the word itself if it is known, otherwise known words at distance 1, otherwise at distance 2) and returns the candidate with the highest count in `NWORDS`.
 {{{
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    # prefer the closest non-empty candidate set, then pick the most frequent word
    candidates = known([word]) or known(edits1(word)) or \
        known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
}}}
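 For example (examples taken from Norvig's article):
 {{{
>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'
}}}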
1. For '''evaluation''' there are two test sets prepared: a development set (`tests1`) and a final test set (`tests2`).
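 A simplified sketch of what such an evaluation loop can look like (the actual evaluation code in `spell.py` reports more statistics; following Norvig's article, a test set maps each correct word to a space-separated string of its misspellings):
 {{{
def spelltest_simple(tests):
    # tests: dict mapping the correct word to a string of misspellings
    n = bad = 0
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            if correct(wrong) != target:
                bad += 1
    # pct = percentage of misspellings corrected to the expected word
    return dict(bad=bad, n=n, pct=int(100. - 100. * bad / n))
}}}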

=== Task 1 ===
 1. Create a text file named `spell.txt`.

 2. Run `spell.py` with the development and final test sets (`tests1` and `tests2` within the script) and write the results to `spell.txt`.

 3. Explain the results in a few words and write the explanation to `spell.txt`.

 4. Modify the code of `spell.py` to increase the accuracy (`pct`) on `tests2` by 10 %. You may take inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html Norvig's article]; one possible direction is sketched below this list. Describe your changes and write your new accuracy results to `spell.txt`.

 5. Upload `spell.txt` and the edited `spell.py` to the [wiki:en/NlpInPracticeCourse homework vault (odevzdávárna)].
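
For item 4, one possible direction (a hypothetical, untested sketch - whether it actually raises `pct` has to be verified) is to score candidates from all edit distances instead of always taking the closest non-empty set, discounting more distant candidates with a made-up `penalty` factor:
{{{
def correct_scored(word, penalty=50):
    # consider candidates at edit distance 0, 1 and 2 together and discount more
    # distant ones, so that a very frequent distance-2 word can still beat a rare
    # distance-1 word
    scores = {}
    for dist, cands in ((0, known([word])),
                        (1, known(edits1(word))),
                        (2, known_edits2(word))):
        for c in cands:
            # keep the score of the smallest distance at which c was found
            scores.setdefault(c, NWORDS[c] / float(penalty ** dist))
    return max(scores, key=scores.get) if scores else word
}}}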

== Task 2: Rule-based grammar checker (punctuation) for Czech == #task2

The second task consists in adapting a specific syntactic grammar of Czech to improve the results of ''punctuation detection'', i.e. the placement of ''commas'' at the required positions in a sentence.

=== Task 2 ===

1. login to asteria04: `ssh asteria04`
1. download [htdocs:bigdata/task_ia161-grammar.zip task_ia161-grammar.zip] containing:
   1. `punct.set` - the syntactic grammar for punctuation detection for the [http://nlp.fi.muni.cz/projects/set SET parser]
   1. `test-nopunct.txt` - testing text with no commas
   1. `eval-gold.txt` - evaluation text with correct punctuation
   1. `evalpunct_robust.py` - evaluation script which computes recall and precision by comparing the two texts
 {{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/task_ia161-grammar.zip
unzip task_ia161-grammar.zip
cd task_ia161-grammar
}}}
1. run the parser to fill in punctuation in the testing text
 {{{
cat test-nopunct.txt | sed 's/^\s*$/<\/s><s>/' \
    | /nlp/projekty/set/unitok.py \
    | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh -skipdis \
    | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \
    > test.txt
}}}
 It takes several seconds to finish and nothing is printed; the output is stored in `test.txt`.
1. evaluate the result
 {{{
python evalpunct_robust.py eval-gold.txt test.txt > results.txt; \
cat results.txt
}}}
1. edit the grammar `punct.set` and add 1-2 rules to increase the F-score (the combination of recall and precision) by 10 %.

 You may need to go through the general information about the [https://nlp.fi.muni.cz/trac/set/wiki/documentation#Rulesstructure SET grammar format]. Information about adapting the grammar to the task of ''punctuation detection'' can be found in this [raw-attachment:tsd2014.pdf published paper].

 The current best results achieved with an extended grammar are 91.2 % precision and 55 % recall, i.e. an F-score of 68.6 %. A rough sketch of how precision, recall and F-score over comma positions can be computed is shown below this list.
6. upload the modified `punct.set` and the respective `results.txt`.
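
If you want to sanity-check the reported numbers yourself, the following rough sketch compares comma positions in two texts and computes precision, recall and F-score. It assumes both files are plain tokenised text with commas as separate tokens, which may not match the actual formats of `eval-gold.txt` and `test.txt` (the real `evalpunct_robust.py` is more robust):
{{{
def comma_positions(path):
    # return the set of token indices after which a comma appears
    tokens = open(path, encoding='utf-8').read().replace(',', ' , ').split()
    positions, idx = set(), 0
    for tok in tokens:
        if tok == ',':
            positions.add(idx)   # comma after the idx-th non-comma token
        else:
            idx += 1
    return positions

gold = comma_positions('eval-gold.txt')
test = comma_positions('test.txt')
tp = len(gold & test)
precision = tp / float(len(test)) if test else 0.0
recall = tp / float(len(gold)) if gold else 0.0
f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print('P = %.1f %%, R = %.1f %%, F = %.1f %%' % (100 * precision, 100 * recall, 100 * f_score))
}}}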

Do not forget to upload your resulting files to the [wiki:en/NlpInPracticeCourse homework vault (odevzdávárna)].