= Automatic language correction =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák

Prepared by: Aleš Horák, Ján Švec

== State of the Art ==

Language correction has many potential applications nowadays, given the large amounts of informal and unedited text generated online: web forums, tweets, blogs, or emails. Automatic language correction can include several tasks: spell checking, grammar checking and word completion.

In the theoretical lesson, we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting incorrectly spelled words in a text and, optionally, offering spelling suggestions for them. The lesson will also focus on grammar checking, which deals with the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.

The lesson will also answer the question "How difficult is it to develop a spell-checker?" and present a tool that performs spell-checking and autocorrection.
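To make the edit-distance idea behind these methods concrete, here is a minimal sketch of the classic Levenshtein distance, which counts the smallest number of single-letter edits turning one word into another. It is our own illustration (the function name and example words are not part of the lesson materials):

{{{
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))   # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein('speling', 'spelling'))     # 1 -- one missing letter
print(levenshtein('korrectud', 'corrected'))  # 2 -- two wrong letters
}}}

A spell checker built on this idea proposes, for a misspelled word, the known words with the smallest distance to it; the spellchecker in Task 1 below implements exactly this strategy, restricted to distances 1 and 2.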
=== References ===

 1. Gupta, Prabhakar. "A context-sensitive real-time Spell Checker with language adaptability." 2020 IEEE 14th International Conference on Semantic Computing (ICSC). IEEE, 2020. [https://ieeexplore.ieee.org/document/9031515 link]
 1. Rothe, Sascha, et al. "A simple recipe for multilingual grammatical error correction." ACL-IJCNLP 2021. [https://aclanthology.org/2021.acl-short.89 link]
 1. Didenko, Bohdan, and Andrii Sameliuk. "!RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans." Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP). 2023. [https://aclanthology.org/2023.unlp-1.15/ link]
 1. Náplava, Jakub, et al. "Czech grammar error correction with a large and diverse corpus." Transactions of the Association for Computational Linguistics 10 (2022): 452-467. [https://aclanthology.org/2022.tacl-1.26/ link]

== Practical Session ==

{{{
#!div class="wiki-toc" style="width: 40%"
**Note:** If you are new to the [https://en.wikipedia.org/wiki/Command-line_interface command line interface] via a [https://en.wikipedia.org/wiki/Terminal_emulator terminal window], you may find the **[https://ubuntu.com/tutorials/command-line-for-beginners#3-opening-a-terminal tutorial for working in terminal]** useful.
}}}

There are 2 tasks, you may choose one or both:
 1. [wiki:/en/NlpInPracticeCourse/AutomaticCorrection#task1 statistical spell checker for English]
 2. [wiki:/en/NlpInPracticeCourse/AutomaticCorrection#task2 rule-based grammar checker (punctuation) for Czech]

== Task 1: Statistical spell checker for English == #task1

In the theoretical lesson, we became acquainted with various approaches to how spelling correctors work. Now we will see how a simple spellchecker based on '''edit distance''' works.

The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in Python. The spelling corrector is trained on a large text file consisting of about one million words. We will test this tool with prepared data. Your goal will be to enhance the spellchecker's accuracy.

 1. Download [htdocs:bigdata/task_ia161-spell.zip task_ia161-spell.zip] with the prepared script `spell.py` and the training data collection `big.txt`. Unzip it and change to the contained directory.
 {{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/task_ia161-spell.zip
unzip task_ia161-spell.zip
cd task_ia161-spell
 }}}
 1. Test the script by running
 {{{
python spell.py
 }}}
 1. Open `spell.py` in your favourite editor and we will walk through its functionality.

=== Spellchecker functionality with examples ===

 1. The spellchecker is '''trained''' from the file `big.txt`, which is a concatenation of several public domain books from '''Project Gutenberg''' and lists of the most frequent words from '''Wiktionary''' and the '''British National Corpus'''. The function `train` stores how many times each word occurs in the text file: `NWORDS[w]` holds the count of how many times the word '''w has been seen'''.
 {{{
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
 }}}
 1. '''Edit distance 1''' is implemented by the function `edits1` - it covers deletion (remove one letter), transposition (swap two adjacent letters), alteration (change one letter to another), and insertion (add a letter). For a word of length '''n''', there are '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''' candidates, a few of which are typically duplicates. Example: `len(edits1('something')) = 494` words.
 {{{
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
 }}}
 1. '''Edit distance 2''' (`edits2`) applies `edits1()` to all the results of `edits1()`. Example: `len(edits2('something')) = 114,324` words, which is a large number. To speed things up, we keep only those candidates that are actually known words (`known_edits2()`). Now `known_edits2('something')` is a set of just 4 words: `{'smoothing', 'seething', 'something', 'soothing'}`.
 1. The function `correct()` takes as candidates the known words with the '''shortest edit distance''' to the original word (the word itself if known, then edit distance 1, then edit distance 2) and returns the most frequent candidate according to `NWORDS`; see the usage sketch below this list.
 {{{
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
                 known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
 }}}
 1. For '''evaluation''' there are two prepared test sets: a development set (`tests1`) and a final test set (`tests2`).
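Before running the evaluation, you can try the corrector interactively to see the pieces working together. This is a small usage sketch assuming the unmodified script trained on `big.txt` (both example corrections are taken from Norvig's article; `python -i` keeps the interpreter open after the script finishes):

{{{
$ python -i spell.py
>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'
}}}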
=== Task 1 ===

 1. Create a text file named `spell.txt`.
 2. Run `spell.py` with the development and the final test sets (`tests1` and `tests2` within the script) and write the results to `spell.txt`.
 3. Explain the results in a few words and write the explanation to `spell.txt`.
 4. Modify the code of `spell.py` to increase the accuracy (`pct`) on `tests2` by 10 %. You may take inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html Norvig's article]. Describe your changes and write your new accuracy results to `spell.txt`.
 5. Upload `spell.txt` and the edited `spell.py` to the [wiki:en/NlpInPracticeCourse homework vault (odevzdávárna)].

== Task 2: Rule-based grammar checker (punctuation) for Czech == #task2

The second task consists of adapting a specific syntactic grammar of Czech to improve the results of ''punctuation detection'', i.e. the placement of ''commas'' at the requested positions in a sentence.

=== Task 2 ===

 1. login to asteria04: `ssh asteria04`
 1. download [htdocs:bigdata/task_ia161-grammar.zip task_ia161-grammar.zip] containing:
   1. `punct.set`, the syntactic grammar for punctuation detection for the [http://nlp.fi.muni.cz/projects/set SET parser]
   1. `test-nopunct.txt`, a testing text with no commas
   1. `eval-gold.txt`, an evaluation text with correct punctuation
   1. `evalpunct_robust.py`, an evaluation script which computes recall and precision from the two texts
 {{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/task_ia161-grammar.zip
unzip task_ia161-grammar.zip
cd task_ia161-grammar
 }}}
 1. run the parser to fill punctuation into the testing text
 {{{
cat test-nopunct.txt | sed 's/^\s*$/<\/s>/' \
    | /nlp/projekty/set/unitok.py \
    | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh -skipdis \
    | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \
    > test.txt
 }}}
 It takes several seconds to finish and nothing is printed; the output is stored in `test.txt`.
 1. evaluate the result
 {{{
python evalpunct_robust.py eval-gold.txt test.txt > results.txt
cat results.txt
 }}}
 1. edit the grammar `punct.set` and add 1-2 rules to increase the F-score (the combination of recall and precision) by 10 %. You may need to go through the general information about the [https://nlp.fi.muni.cz/trac/set/wiki/documentation#Rulesstructure SET grammar format]. Information about adapting the grammar to the task of ''punctuation detection'' can be found in this [raw-attachment:tsd2014.pdf published paper]. The current best results achieved with an extended grammar are 91.2 % precision and 55 % recall, i.e. an F-score of 68.6 %.
 1. upload the modified `punct.set` and the respective `results.txt`.

Do not forget to upload your resulting files to the [wiki:en/NlpInPracticeCourse homework vault (odevzdávárna)].
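The F-score quoted above is the harmonic mean of precision and recall (the standard F1 measure), which you can verify with a few lines of Python; this helper is our own sketch for checking your numbers, not part of the task files:

{{{
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 measure)."""
    return 2 * precision * recall / (precision + recall)

# the best extended-grammar results quoted above: 91.2 % precision, 55 % recall
print(round(100 * f_score(0.912, 0.55), 1))   # prints 68.6
}}}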