Automatic language correction

IA161 NLP in Practice Course, Course Guarantor: Aleš Horák

Prepared by: Aleš Horák, Ján Švec

State of the Art

Language correction has many potential applications today, given the large amounts of informal and unedited text generated online: web forums, tweets, blogs, and emails. Automatic language correction can include several tasks: spell checking, grammar checking, and word completion.

In the theoretical lesson, we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting incorrectly spelled words in a text and, in some cases, suggesting corrections for them. The lesson will also focus on grammar checking, which deals with the most difficult and complex type of language errors, because grammar consists of a very large number of rules and exceptions. We will also say a few words about word completion.

The lesson will also answer the question "How difficult is it to develop a spell-checker?" and present a tool that performs spell-checking and autocorrection.

References

Gupta, Prabhakar. "A context-sensitive real-time Spell Checker with language adaptability." 2020 IEEE 14th International Conference on Semantic Computing (ICSC). IEEE, 2020.
Rothe, Sascha, et al. "A simple recipe for multilingual grammatical error correction." ACL-IJCNLP 2021.
Didenko, Bohdan, and Andrii Sameliuk. "RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans." Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP). 2023.
Náplava, Jakub, et al. "Czech grammar error correction with a large and diverse corpus." Transactions of the Association for Computational Linguistics 10 (2022): 452-467.

Practical Session

Note: If you are new to the command line interface via a terminal window, you may find the tutorial for working in the terminal useful.

There are two tasks; you may choose one or both:

Task 1: Statistical spell checker for English

In the theoretical lesson, we became acquainted with various approaches to spelling correction. Now we will look at how a simple spellchecker based on edit distance works.

The example is based on Peter Norvig's Spelling Corrector in Python. The spelling corrector will be trained with a large text file consisting of about one million words.

We will test this tool with prepared data. Your goal will be to enhance the spellchecker's accuracy.

Download task_ia161-spell.zip with a prepared script spell.py and a training data collection big.txt. Unzip it and change to the contained directory.
```
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/task_ia161-spell.zip
unzip task_ia161-spell.zip
cd task_ia161-spell
```
Test the script by running
```
python spell.py
```
Open spell.py in your favourite editor and we will walk through its functionality.

Spellchecker functionality with examples

The spellchecker is trained on the file big.txt, which is a concatenation of several public domain books from Project Gutenberg and lists of the most frequent words from Wiktionary and the British National Corpus. The function train stores how many times each word occurs in the text file. NWORDS[w] holds the number of times the word w has been seen plus one, because the model starts every word at a default count of 1 as a simple form of smoothing.
```
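import collections
import re

def words(text):
    # Lowercase the text and split it into runs of the letters a-z.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Every word starts at a default count of 1 (simple smoothing).
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# open() works in Python 2 and 3; the file() builtin exists only in Python 2.
NWORDS = train(words(open('big.txt').read()))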
```
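
As a small illustration of what train computes, the sketch below runs the same two functions on a short in-memory string instead of big.txt. The sample sentence is made up. It also shows a side effect of the defaultdict that is easy to overlook: indexing an unseen word inserts it into the model, whereas the in operator and .get() do not.

```python
import collections
import re

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

sample = "The cat saw the other cat."   # made-up text, not from big.txt
counts = train(words(sample))

print(counts['the'])      # 3: seen twice, plus the default of 1
print(counts['saw'])      # 2: seen once, plus the default of 1
print('dog' in counts)    # False: never seen
print(counts.get('dog'))  # None: .get() does not trigger the default
print(counts['dog'])      # 1: indexing returns the default ...
print('dog' in counts)    # True: ... and also inserts the key
```

This is why the corrector tests membership with in and ranks candidates with NWORDS.get rather than indexing the model directly.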

Edit distance 1 is implemented by the function edits1. It returns every string that can be obtained from the word by one deletion (remove one letter), transposition (swap adjacent letters), alteration (change one letter to another), or insertion (add a letter). For a word of length n, there are n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 strings, some of which are duplicates. Example: len(edits1('something')) = 494 distinct strings.

```
# The 26 lowercase letters used for alterations and insertions.
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
```
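
The counts quoted above can be checked directly. The sketch below repeats the logic of edits1 but returns the four lists separately, so each term of 54n+25 is visible. The helper name edit_groups is hypothetical and not part of spell.py.

```python
import string

alphabet = string.ascii_lowercase  # the 26 letters assumed by the formula

def edit_groups(word):
    """Return deletes, transposes, replaces, inserts as separate lists."""
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return deletes, transposes, replaces, inserts

word = 'something'
n = len(word)
deletes, transposes, replaces, inserts = edit_groups(word)
raw_total = len(deletes) + len(transposes) + len(replaces) + len(inserts)
unique = set(deletes + transposes + replaces + inserts)

print(len(deletes), len(transposes), len(replaces), len(inserts))  # 9 8 234 260
print(raw_total)       # 511, i.e. 54*n + 25
print(len(unique))     # 494 once duplicates are merged
print(word in unique)  # True: replacing a letter by itself returns the word
```

The 17 duplicates for this word come from two sources: replacing each of the 9 letters by itself yields the original word 9 times, and inserting a letter next to an identical letter produces the same string from two positions, once for each of the 9 letters.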

Edit distance 2 (edits2) applies edits1() to all the results of edits1(). Example: len(edits2('something')) = 114,324 strings, which is a lot of candidates to score. To speed things up, we keep only the candidates that are actually known words (known_edits2()). Now known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
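
The definition of known_edits2() is not listed on this page. One way to write it, following the description above, is sketched below; the version in spell.py may differ in detail. To keep the example self-contained, a tiny hypothetical vocabulary VOCAB stands in for NWORDS, so the result depends on that vocabulary and not on big.txt.

```python
import string

alphabet = string.ascii_lowercase

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical stand-in for NWORDS; the real model is trained on big.txt.
VOCAB = {'something', 'soothing', 'smoothing', 'seething', 'nothing'}

def known_edits2(word, vocab=VOCAB):
    """Known words reachable by applying edits1 twice."""
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in vocab)

print(sorted(known_edits2('something')))
# ['seething', 'smoothing', 'something', 'soothing']
```

The word 'nothing' is in the toy vocabulary but is not returned, because it is three edits away from 'something'. Filtering inside the generator means the full set of distance-2 strings is never stored.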

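The function correct() takes as candidates the known words at the shortest edit distance from the original word (distance 0, then 1, then 2) and returns the candidate with the highest count in NWORDS. If no known word is found within distance 2, it returns the original word unchanged.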

```
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
        known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
```
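
The order of the or chain matters: the first non-empty candidate set wins, regardless of word frequency. The sketch below illustrates this with a small hypothetical dictionary COUNTS in place of NWORDS; the words and counts are made up for the example.

```python
import string

alphabet = string.ascii_lowercase

# Hypothetical counts standing in for NWORDS.
COUNTS = {'the': 100, 'them': 30, 'then': 20, 'thew': 1}

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return set(w for w in words if w in COUNTS)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in COUNTS)

def correct(word):
    # Each set is tried in turn; an empty set is falsy, so `or` moves on.
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=COUNTS.get)

print(correct('thew'))   # thew: a known word is never changed
print(correct('thej'))   # the: the most frequent known word at distance 1
print(correct('qqqqq'))  # qqqqq: no known candidate, returned unchanged
```

The first call shows a limitation of this design: a rare but known word such as 'thew' is kept even when a much more frequent word is one edit away.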

For evaluation, two test sets are prepared: a development set (tests1) and a final test set (tests2).
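
Accuracy here can be read as the percentage of misspellings that correct() maps back to the intended word. A minimal sketch of such a measurement follows. It assumes a test set shaped as a dictionary from each correct word to a space-separated string of its misspellings; this layout is an assumption, so check how tests1 and tests2 are actually defined in spell.py. The names accuracy, toy_tests, toy_fixes and toy_correct are hypothetical.

```python
def accuracy(correct, tests):
    """Percentage of misspellings that `correct` maps to the expected word.

    Assumed format: {expected_word: 'misspelling1 misspelling2 ...'}.
    """
    total = hits = 0
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            total += 1
            hits += (correct(wrong) == target)
    return 100.0 * hits / total if total else 0.0

# Made-up miniature test set and a trivial stand-in corrector.
toy_tests = {'access': 'acess', 'address': 'adres adress'}
toy_fixes = {'acess': 'access', 'adress': 'address'}

def toy_correct(word):
    return toy_fixes.get(word, word)

print(round(accuracy(toy_correct, toy_tests), 1))  # 66.7 (2 of 3 corrected)
```

Because the function takes the corrector as an argument, the same measurement can be applied before and after a change to compare the two versions on identical data.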

Task 1

Create a text file named spell.txt.

Run spell.py with the development and final test sets (tests1 and tests2 within the script) and write the results in spell.txt.

Explain the given results in a few words and write the explanation in spell.txt.

Modify the code of spell.py to increase accuracy (pct) on tests2 by 10 %. You may take inspiration from the Future work section of Norvig's article. Describe your changes and write your new accuracy results to spell.txt.

Upload spell.txt and the edited spell.py to the homework vault (odevzdávárna).

Task 2: Rule based grammar checker (punctuation) for Czech

The second task consists of adapting a specific syntactic grammar of Czech to improve the results of punctuation detection, i.e. the placement of commas in the required positions in a sentence.

Task 2

Log in to asteria04: ssh asteria04
Download task_ia161-grammar.zip containing:
1. punct.set - the syntactic grammar for punctuation detection for the SET parser
2. test-nopunct.txt - testing text with no commas
3. eval-gold.txt - evaluation text with correct punctuation
4. evalpunct_robust.py - evaluation script which computes recall and precision from both texts
```
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/task_ia161-grammar.zip
unzip task_ia161-grammar.zip
cd task_ia161-grammar
```

Run the parser to insert punctuation into the testing text:

```
cat test-nopunct.txt | sed 's/^\s*$/<\/s><s>/' \
    | /nlp/projekty/set/unitok.py \
    | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh -skipdis \
    | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \
    > test.txt
```

It takes several seconds to finish and prints nothing. The output is stored in test.txt.

Evaluate the result:

```
python evalpunct_robust.py eval-gold.txt test.txt > results.txt; \
cat results.txt
```
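
To make the two reported measures concrete, the sketch below shows one simple way comma precision and recall could be computed. It is not a description of evalpunct_robust.py, whose internals are not covered here. It assumes the gold and predicted texts contain exactly the same words and differ only in commas. The function names and the example sentence are hypothetical.

```python
def comma_slots(text):
    """Indices of the words that are directly followed by a comma."""
    slots = set()
    index = -1
    for token in text.replace(',', ' , ').split():
        if token == ',':
            if index >= 0:
                slots.add(index)
        else:
            index += 1
    return slots

def precision_recall(gold_text, predicted_text):
    gold = comma_slots(gold_text)
    pred = comma_slots(predicted_text)
    hits = len(gold & pred)                       # commas placed correctly
    precision = hits / len(pred) if pred else 0.0  # share of predictions that are right
    recall = hits / len(gold) if gold else 0.0     # share of gold commas found
    return precision, recall

gold = "Řekl, že přijde, až skončí."
pred = "Řekl, že přijde až skončí."
print(precision_recall(gold, pred))  # (1.0, 0.5)
```

In the example the single predicted comma is correct, so precision is 1.0, but only one of the two gold commas was found, so recall is 0.5. A grammar rule that adds commas cautiously tends to raise recall without lowering precision much.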

Edit the grammar punct.set and add 1-2 rules to increase the F-score (combined recall and precision) by 10 %. You may consult e.g. the comma rules at LocalLingo.

You may need to go through general information about the SET grammar format. Information about adapting the grammar for the task of punctuation detection can be found in this published paper.

The current best results achieved with an extended grammar are 91.2 % precision and 55 % recall, i.e. an F-score of 68.6 %.
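
The quoted F-score is consistent with the usual harmonic mean of precision and recall (F1), which can be checked with a few lines. The function name f_score is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 91.2 % and recall 55 % from the text above.
print(round(100 * f_score(0.912, 0.55), 1))  # 68.6
```

Because the harmonic mean is dominated by the smaller value, the low recall is what holds the score down, so rules that find more missing commas are the most direct way to improve it.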

Upload the modified punct.set and the respective results.txt.

Do not forget to upload your resulting files to the homework vault (odevzdávárna).