= Automatic language correction =
[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák

Prepared by: Aleš Horák, Ján Švec

== State of the Art ==
With the large amounts of informal and unedited text generated online (web forums, tweets, blogs, or emails), language correction nowadays has many potential applications. Automatic language correction can include several tasks: spell checking, grammar checking and word completion.

In the theoretical lesson, we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting incorrectly spelled words in a text and, optionally, providing suggestions for them. The lesson will also focus on grammar checking, which deals with the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.

The lesson will also answer the question "How difficult is it to develop a spell-checker?" and present a tool that performs spell-checking and autocorrection.

=== References ===
 1. Choudhury, Monojit, et al. "How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach." Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, 2007. [[http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf|Source]]
 1. Sakaguchi, Keisuke, et al. "Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, No. 1, 2017. [[https://ojs.aaai.org/index.php/AAAI/article/view/10970/10829|Source]]
 1. Hladek, Daniel, Stas, Jan, and Juhar, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering 11 (5), pages 392–397, 2013. [[http://advances.utc.sk/index.php/AEEE/article/view/898|Source]]
 1. Grundkiewicz, Roman, and Marcin Junczys-Dowmunt. "Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation." arXiv preprint arXiv:1804.05945, 2018. [[https://arxiv.org/pdf/1804.05945|Source]]


== Practical Session ==

There are two tasks; you may choose one or both:
 1. [wiki:/en/NlpInPracticeCourse/AutomaticCorrection#task1 statistical spell checker for English]
 2. [wiki:/en/NlpInPracticeCourse/AutomaticCorrection#task2 rule-based grammar checker (punctuation) for Czech]

== Task 1: Statistical spell checker for English == #task1

In the theoretical lesson, we have become acquainted with various approaches to how spelling correctors work. Now we will get to know how a simple spellchecker based on '''edit distance''' works.

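Before diving into the script, it may help to see what "edit distance" means computationally. The function below is a generic dynamic-programming sketch for illustration only; it is not part of `spell.py` (the script generates candidate edits instead of computing distances directly).

 {{{
# Minimal Levenshtein (edit) distance between two strings,
# computed by dynamic programming; illustrative only.
def edit_distance(a, b):
    # dist[i][j] = number of edits needed to turn a[:i] into b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i          # i deletions
    for j in range(len(b) + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[len(a)][len(b)]

print(edit_distance('speling', 'spelling'))  # 1 (one insertion)
}}}

For example, `edit_distance('kitten', 'sitting')` is 3: two substitutions and one insertion.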
The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in Python. The spelling corrector is trained on a large text file consisting of about one million words.

We will test this tool with prepared data. Your goal will be to enhance the spellchecker's accuracy.


 1. Download the prepared script [[raw-attachment:spell.py|spell.py]] and the training data collection [[raw-attachment:big.txt|big.txt]].
 1. Test the script by running `python ./spell.py` in your working directory.
 1. Open it in your favourite editor and we will walk through its functionality.


=== Spellchecker functionality with examples ===

1. The spellchecker is '''trained''' on the file `big.txt`, which is a concatenation of several public domain books from '''Project Gutenberg''' and lists of the most frequent words from '''Wiktionary''' and the '''British National Corpus'''. The function `train` stores how many times each word occurs in the text file: `NWORDS[w]` holds the count of how many times the word '''w has been seen'''.
 {{{
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
}}}
1. '''Edit distance 1''' is implemented by the function `edits1`. It generates all candidates obtainable by a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another), or an insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''' candidates (before removing duplicates). Example: `len(edits1('something')) = 494` words.
 {{{
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)
}}}
1. '''Edit distance 2''' (`edits2`) applies `edits1()` to all the results of `edits1()`. Example: `len(edits2('something')) = 114,324` words, which is a high number. To enhance speed, we can keep only those candidates that are actually known words (`known_edits2()`). Now `known_edits2('something')` is a set of just 4 words: `{'smoothing', 'seething', 'something', 'soothing'}`.
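 `known_edits2()` itself is not shown in the excerpt above; in Norvig's original it is a single generator expression over edits of edits. The sketch below reproduces it together with `edits1` and a toy `NWORDS` dictionary (an assumption standing in for the counts trained from `big.txt`) so that it runs standalone.
 {{{
import string

alphabet = string.ascii_lowercase

# repeated from spell.py so the snippet is self-contained
def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# toy vocabulary standing in for the counts trained from big.txt
NWORDS = {'something': 5, 'soothing': 2, 'smoothing': 1, 'seething': 1}

def known_edits2(word):
    # generate edits of edits, but keep only known words
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

print(sorted(known_edits2('something')))
}}}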
1. The function `correct()` chooses as the candidate set the first non-empty set of words with the '''shortest edit distance''' to the original word, and returns its most frequent member.
 {{{
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
        known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
}}}
1. For '''evaluation''', two test sets are prepared: a development set (`tests1`) and a final test set (`tests2`).


=== Task 1 ===
 1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-13.txt`, where UCO is your university ID.

 2. Run `spell.py` with the development and the final test sets (`tests1` and `tests2` within the script) and write the results to `<YOUR_FILE>`.

 3. Explain the obtained results in a few words and write the explanation to `<YOUR_FILE>`.

 4. Modify the code of `spell.py` to increase the accuracy (`pct`) on `tests2` by 10 %. You may take inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html Norvig's article]. Describe your changes and write your new accuracy results to `<YOUR_FILE>`.
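 One direction from Norvig's ''Future work'' notes is a better error model. The sketch below is a hypothetical illustration, not a drop-in patch for `spell.py`: the toy `NWORDS` and the `PENALTY` value are assumptions. Instead of the strict or-chain of `correct()`, it scores all candidates at once and penalizes edit-distance-2 candidates.
 {{{
import string

alphabet = string.ascii_lowercase
# toy counts; in spell.py these come from training on big.txt
NWORDS = {'spelling': 50, 'spewing': 2}
PENALTY = 0.01   # assumed weight for edit-distance-2 candidates

# repeated from spell.py so the snippet is self-contained
def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct_scored(word):
    if word in NWORDS:
        return word
    scores = {}
    for c in edits1(word):              # edit distance 1: full weight
        if c in NWORDS:
            scores[c] = NWORDS[c]
    for e1 in edits1(word):             # edit distance 2: penalized
        for c in edits1(e1):
            if c in NWORDS:
                scores.setdefault(c, NWORDS[c] * PENALTY)
    return max(scores, key=scores.get) if scores else word
}}}
 With this scoring, `correct_scored('speling')` still returns `'spelling'`, but a very frequent word two edits away can now outrank a rare word one edit away, which the original or-chain never allows.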


=== Upload `<YOUR_FILE>` and the edited `spell.py` ===

== Task 2: Rule-based grammar checker (punctuation) for Czech == #task2

The second task consists in adapting a specific syntactic grammar of Czech to improve the results of ''punctuation detection'', i.e. the placement of ''commas'' at the requested positions in a sentence.

=== Task 2 ===

1. log in to aurora: `ssh aurora`
1. download:
   1. the [raw-attachment:punct.set syntactic grammar] for punctuation detection for the [http://nlp.fi.muni.cz/projects/set SET parser]
   1. the [raw-attachment:test-nopunct.txt testing text with no commas]
   1. the [raw-attachment:eval-gold.txt evaluation text with correct punctuation]
   1. the [raw-attachment:evalpunct_robust.py evaluation script], which computes recall and precision from the two texts
{{{
mkdir ia161-grammar
cd ia161-grammar
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/AutomaticCorrection/punct.set
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/AutomaticCorrection/test-nopunct.txt
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/AutomaticCorrection/eval-gold.txt
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/AutomaticCorrection/evalpunct_robust.py
}}}
1. run the parser to fill in punctuation in the testing text
 {{{
cat test-nopunct.txt | sed 's/^\s*$/<\/s><s>/' \
    | /nlp/projekty/set/unitok.py \
    | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh -skipdis \
    | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \
    > test.txt
}}}
 (this takes a while, about 30 s)
1. evaluate the result
 {{{
PYTHONIOENCODING=UTF-8 python evalpunct_robust.py eval-gold.txt test.txt > results.txt; \
cat results.txt
}}}
1. edit the grammar `punct.set` and add 1-2 rules to increase the F-score (the combination of recall and precision) by 10 %.

 You may need to go through the general information about the [https://nlp.fi.muni.cz/trac/set/wiki/documentation#Rulesstructure SET grammar format]. Information about adapting the grammar to the task of ''punctuation detection'' can be found in this [raw-attachment:tsd2014.pdf published paper].

 The current best results achieved with an extended grammar are 91.2 % precision and 55 % recall, i.e. an F-score of 68.6 %.
1. upload the modified `punct.set` and the respective `results.txt`.
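The precision/recall/F-score combination used above can be illustrated with a short sketch over sets of comma positions. This is a hypothetical illustration of the metric; the actual `evalpunct_robust.py` may compute it differently (e.g. with alignment-based robustness).

 {{{
# Precision, recall and F-score over predicted comma positions
# (positions are, e.g., token indices after which a comma should follow).
def prf(gold_positions, predicted_positions):
    gold, pred = set(gold_positions), set(predicted_positions)
    tp = len(gold & pred)                       # correctly placed commas
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# e.g. the best extended grammar mentioned above: P=0.912, R=0.55
# gives F = 2*0.912*0.55/(0.912+0.55) = 0.686
}}}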


Do not forget to upload your resulting files to the [/en/NlpInPracticeCourse homework vault (odevzdávárna)].