= Automatic language correction =
[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák

Prepared by: Aleš Horák, Ján Švec

== State of the Art ==
Language correction has many potential applications nowadays, given the large amounts of informal and unedited text generated online: web forums, tweets, blogs, or emails. Automatic language correction covers several tasks: spell checking, grammar checking and word completion.

In the theoretical lesson, we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting, and sometimes providing suggestions for, incorrectly spelled words in a text. The lesson will also focus on grammar checking, the most difficult and complex type of language error, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.

The lesson will also answer the question "How difficult is it to develop a spell-checker?" and present a tool that performs spell checking and autocorrection.

=== References ===
1. Choudhury, Monojit, et al. "How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach." Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, 2007. [[http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf|Source]]
1. Sakaguchi, Keisuke, et al. "Robsut wrod reocginiton via semi-character recurrent neural network." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, No. 1, 2017. [[https://ojs.aaai.org/index.php/AAAI/article/view/10970/10829|Source]]
1. Hládek, Daniel, Staš, Ján, Juhár, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering 11 (5), pages 392–397, 2013. [[http://advances.utc.sk/index.php/AEEE/article/view/898|Source]]
1. Grundkiewicz, Roman, and Marcin Junczys-Dowmunt. "Near human-level performance in grammatical error correction with hybrid machine translation." arXiv preprint arXiv:1804.05945, 2018. [[https://arxiv.org/pdf/1804.05945|Source]]


== Practical Session ==

There are two tasks; you may choose one or both:
1. [wiki:/en/AdvancedNlpCourse/AutomaticCorrection#task1 statistical spell checker for English]
2. [wiki:/en/AdvancedNlpCourse/AutomaticCorrection#task2 rule-based grammar checker (punctuation) for Czech]

== Task 1: Statistical spell checker for English == #task1

In the theoretical lesson, we became acquainted with various approaches to how spelling correctors work. Now we will look at how a simple spellchecker based on '''edit distance''' works.

The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in Python. The spelling corrector is trained on a large text file consisting of about one million words.

We will test this tool with prepared data. Your goal will be to enhance the spellchecker's accuracy.


1. Download the prepared script [[raw-attachment:spell.py|spell.py]] and the training data collection [[raw-attachment:big.txt|big.txt]].
1. Test the script by running `python ./spell.py` in your working directory.
1. Open it in your favourite editor and we will walk through its functionality.


=== Spellchecker functionality with examples ===

1. The spellchecker is '''trained''' on the file `big.txt`, which is a concatenation of several public-domain books from '''Project Gutenberg''' and lists of the most frequent words from '''Wiktionary''' and the '''British National Corpus'''. The function `train` stores how many times each word occurs in the text file. `NWORDS[w]` holds a count of how many times the word '''w has been seen'''.
{{{
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
}}}
1. '''Edit distance 1''' is implemented by the function `edits1` - it covers deletion (remove one letter), transposition (swap adjacent letters), alteration (change one letter to another) and insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''' candidates (a few of which are duplicates). Example: `len(edits1('something')) = 494` distinct words.
{{{
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
}}}
1. '''Edit distance 2''' (`edits2`) applies `edits1()` to all the results of `edits1()`. Example: `len(edits2('something')) = 114 324` words, which is a high number. To enhance speed, we can keep only the candidates that are actually known words (`known_edits2()`). Now `known_edits2('something')` is a set of just 4 words: `{'smoothing', 'seething', 'something', 'soothing'}`.
1. The function `correct()` chooses as the set of candidate words the set with the '''shortest edit distance''' to the original word.
{{{
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
                 known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
}}}
1. For '''evaluation''', two test sets are prepared - a development set (`test1`) and a final test set (`test2`).


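The helper `known_edits2()` is referenced above but not listed. A minimal self-contained sketch of the whole pipeline, following Norvig's original essay and using a tiny stand-in corpus instead of `big.txt`, might look like this:

```python
import re
import collections

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # every word starts with a pseudo-count of 1 (smoothing for unseen words)
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# tiny stand-in corpus; the real script trains on big.txt
NWORDS = train(words('the spelling of a word is checked against known words '
                     'the most frequent known candidate wins'))

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # distance-2 candidates, generated lazily and filtered to known words
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(ws): return set(w for w in ws if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
                 known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

print(correct('speling'))        # -> 'spelling' (one insertion away)
print(len(edits1('something')))  # -> 494 distinct candidates
```

Note how the `or` chain in `correct()` encodes the distance preference: a known word beats any distance-1 candidate, which beats any distance-2 candidate, regardless of frequency.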
=== Task 1 ===
1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-13.txt`, where UCO is your university ID.

2. Run `spell.py` with the development and final test sets (`tests1` and `tests2` within the script) and write the results to `<YOUR_FILE>`.

3. Explain the given results in a few words and write them in `<YOUR_FILE>`.

4. Modify the code of `spell.py` to increase the accuracy (`pct`) on `tests2` by 10 %. You may take inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html Norvig's article]. Describe your changes and write your new accuracy results to `<YOUR_FILE>`.
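One possible direction (a sketch only, not the required solution): instead of the strict known/`edits1`/`edits2` cascade, score every candidate by its corpus count weighted by a crude error-model prior that penalises larger edit distances, in the spirit of the ''Future work'' section of Norvig's article. The name `correct_weighted` and the weights `1e6`/`1e3`/`1` are illustrative assumptions, not tuned values:

```python
import re
import collections

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# tiny stand-in corpus; the real script trains on big.txt
NWORDS = train(words('the cat sat on the mat and the dog lay on the rug'))

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(ws): return set(w for w in ws if w in NWORDS)

def correct_weighted(word):
    # score = corpus count * prior for the edit distance; a smaller edit
    # distance gets a (hypothetical) much larger prior, so a distance-2
    # candidate can only win when no closer candidate exists
    scores = {}
    for c in known([word]):
        scores[c] = NWORDS[c] * 1e6            # distance 0
    for c in known(edits1(word)):
        scores.setdefault(c, NWORDS[c] * 1e3)  # distance 1
    for e1 in edits1(word):
        for c in known(edits1(e1)):
            scores.setdefault(c, NWORDS[c])    # distance 2
    return max(scores, key=scores.get) if scores else word
```

Unlike the original `correct()`, this variant always collects candidates at all distances, which makes it easy to experiment with different priors (or with a real error model learned from a corpus of spelling mistakes).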


=== Upload `<YOUR_FILE>` and the edited `spell.py` ===

== Task 2: Rule-based grammar checker (punctuation) for Czech == #task2

The second task consists of adapting a specific syntactic grammar of Czech to improve the results of ''punctuation detection'', i.e. the placement of ''commas'' at the requested positions in a sentence.

=== Task 2 ===

1. log in to aurora: `ssh aurora`
1. download:
   1. the [raw-attachment:punct.set syntactic grammar] for punctuation detection for the [http://nlp.fi.muni.cz/projects/set SET parser]
   1. the [raw-attachment:test-nopunct.txt testing text with no commas]
   1. the [raw-attachment:eval-gold.txt evaluation text with correct punctuation]
   1. the [raw-attachment:evalpunct_robust.py evaluation script], which computes recall and precision from the two texts
{{{
mkdir ia161-grammar
cd ia161-grammar
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/punct.set
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/test-nopunct.txt
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/eval-gold.txt
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/evalpunct_robust.py
}}}
1. run the parser to fill in punctuation in the testing text
{{{
cat test-nopunct.txt | sed 's/^\s*$/<\/s><s>/' \
  | /nlp/projekty/set/unitok.py \
  | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh -skipdis \
  | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \
  > test.txt
}}}
(this takes a while, about 30 seconds)
1. evaluate the result
{{{
PYTHONIOENCODING=UTF-8 python evalpunct_robust.py eval-gold.txt test.txt > results.txt; \
cat results.txt
}}}
1. edit the grammar `punct.set` and add 1-2 rules to increase the F-score (combining recall and precision) by 10 %.

You may need to go through the general information about the [https://nlp.fi.muni.cz/trac/set/wiki/documentation#Rulesstructure SET grammar format]. Information about adapting the grammar to the task of ''punctuation detection'' can be found in this [raw-attachment:tsd2014.pdf published paper].

The current best results achieved with an extended grammar are 91.2 % precision and 55 % recall, i.e. an F-score of 68.6 %.
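For reference, the F-score quoted above is the harmonic mean of precision and recall, which you can check yourself:

```python
def f_score(precision, recall):
    # harmonic mean of precision and recall (balanced F1 measure)
    return 2 * precision * recall / (precision + recall)

print(round(f_score(91.2, 55.0), 1))  # 68.6
```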
6. upload the modified `punct.set` and the respective `results.txt`.


Do not forget to upload your resulting files to the [/en/AdvancedNlpCourse homework vault (odevzdávárna)].