Changes between Version 37 and Version 38 of private/NlpInPracticeCourse/AutomaticCorrection
Timestamp: Jan 13, 2021, 6:29:10 PM (3 years ago)
private/NlpInPracticeCourse/AutomaticCorrection
--- v37
+++ v38

 == State of the Art ==
-Language correction nowadays has many potential applications on large amount of informal and unedited text generated online, among other things: web forums, tweets, blogs, and email. Automatic language correction can consist of many areas including: spell checking, grammar checking and word completion.
+Language correction nowadays has many potential applications with large amounts of informal and unedited text generated online: web forums, tweets, blogs, or emails. Automatic language correction can include several tasks: spell checking, grammar checking and word completion.

-In the theoretical lesson we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting and sometimes providing spelling suggestions for incorrectly spelled words in a text. The lesson will also focus on grammatical checking problems, which are the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.
+In the theoretical lesson, we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting and sometimes providing spelling suggestions for incorrectly spelled words in a text. The lesson will also focus on grammatical checking problems, which are the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.

-The lesson will also answer a question "How difficult is to develop a spell-checker?". And also describe a system that performs spell-checking and autocorrection.
+The lesson will also answer a question "How difficult it is to develop a spell-checker?" and present a tool that performs spell-checking and autocorrection.

 === References ===
…
 == Task 1: Statistical spell checker for English == #task1

-In theoretical lesson we have become acquainted with various approaches how spelling correctors work. Now we will get to know how a simple spellchecker based on '''edit distance''' works.
+In the theoretical lesson, we have become acquainted with various approaches how spelling correctors work. Now we will get to know how a simple spellchecker based on '''edit distance''' works.

-The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in python. The spelling corrector will be trained with a large text file consisting of about a million words.
+The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in Python. The spelling corrector will be trained with a large text file consisting of about one million words.

-We will test this tool on prepared data. Your goal will be to enhance spellchecker's accuracy.
+We will test this tool with prepared data. Your goal will be to enhance the spellchecker's accuracy.


-1. Download prepared script [[raw-attachment:spell.py|spell.py]] and training data collection [[raw-attachment:big.txt|big.txt]].
-1. Test the script `python ./spell.py` in your working directory.
+1. Download the prepared script [[raw-attachment:spell.py|spell.py]] and the training data collection [[raw-attachment:big.txt|big.txt]].
+1. Test the script by running `python ./spell.py` in your working directory.
 1. Open it in your favourite editor and we will walk through its functionality.

…
 NWORDS = train(words(file('big.txt').read()))
 }}}
-1. '''Edit distance 1''' is represented as function `edits1` - it represents deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25'''. Example: len(edits1('something')) = 494 words.
+1. '''Edit distance 1''' is represented as function `edits1` - it represents deletion (remove one letter), transposition (swap adjacent letters), alteration (change one letter to another), or an insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25'''. Example: `len(edits1('something')) = 494` words.
 {{{
 def edits1(word):
…
     return set(deletes + transposes + replaces + inserts)
 }}}
-1. '''Edit distance 2'''(`edits2`) - applied edits1 to all the results of edits1. Example: len(edits2('something')) = 114 324 words, which is a high number. To enhance speed we can only keep the candidates that are actually known words (`known_edits2`). Now known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
-1. The function `correct` chooses as the set of candidate words the set with the '''shortest edit distance''' to the original word.
+1. '''Edit distance 2'''(`edits2`) - applies `edits1()` to all the results of `edits1()`. Example: `len(edits2('something')) = 114 324` words, which is a high number. To enhance speed we can only keep the candidates that are actually known words (`known_edits2()`). Now `known_edits2('something')` is a set of just 4 words: `{'smoothing', 'seething', 'something', 'soothing'}`.
+1. The function `correct()` chooses as the set of candidate words the set with the '''shortest edit distance''' to the original word.
 {{{
 def known(words): return set(w for w in words if w in NWORDS)
…
     return max(candidates, key=NWORDS.get)
 }}}
-1. For '''evaluation''' there are prepared two test sets - development(`test1`) and final test set(`test2`).
+1. For '''evaluation''' there are two test sets prepared - development(`test1`) and final test set(`test2`).


…
 3. Explain the given results in few words and write it in `<YOUR_FILE>`.

-4. Modify the code of `spell.py` to increase accuracy at tests2 by 10 %. You may take an inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html the Norvig's article]. Describe your changes and write your new accuracy results to `<YOUR_FILE>`.
+4. Modify the code of `spell.py` to increase accuracy at `tests2` by 10 %. You may take an inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html the Norvig's article]. Describe your changes and write your new accuracy results to `<YOUR_FILE>`.


-=== Upload `<YOUR_FILE>` and edited `spell.py` ===
+=== Upload `<YOUR_FILE>` and the edited `spell.py` ===

 == Task 2: Rule based grammar checker (punctuation) for Czech == #task2
…
 > test.txt
 }}}
-(takes a long time, about 30 s)
+(it takes a long time, about 30 s)
 1. evaluate the result
 {{{
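
The edit-distance-1 counts quoted in Task 1 (n deletions, n-1 transpositions, 26n alterations, 26(n+1) insertions, 54n+25 in total) can be checked with a small sketch. This is not the wiki's `spell.py`, which is only shown in fragments here and targets Python 2. It is a Python 3 illustration in which `edit_lists`, the toy `COUNTS` corpus and the two-argument `correct` are assumptions made for the example. It also shows why `len(edits1('something'))` is 494 rather than 511: the raw lists contain duplicates that the set removes.

```python
import string
from collections import Counter

ALPHABET = string.ascii_lowercase  # 26 letters, as assumed by the 54n+25 formula


def edit_lists(word):
    """Return the four raw edit lists (duplicates kept) for one word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]                       # n
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]  # n-1
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]     # 26n
    inserts = [a + c + b for a, b in splits for c in ALPHABET]               # 26(n+1)
    return deletes, transposes, replaces, inserts


def edits1(word):
    """All distinct strings at edit distance 1 (the word itself included)."""
    return {edit for group in edit_lists(word) for edit in group}


def correct(word, counts):
    """Pick the most frequent known candidate at the shortest edit distance."""
    def known(words):
        return {w for w in words if w in counts}

    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}
    )
    return max(candidates, key=lambda w: counts.get(w, 0))


# Hypothetical toy corpus standing in for word counts trained on big.txt.
COUNTS = Counter("something soothing something smoothing seething".split())

raw_total = sum(len(group) for group in edit_lists("something"))
print(raw_total)                     # 54 * 9 + 25 = 511 raw edits
print(len(edits1("something")))      # 494 distinct strings
print(correct("somthing", COUNTS))   # something
```

With the toy counts, both `something` and `soothing` are one edit away from `somthing`, and the more frequent word wins, which mirrors the `max(candidates, key=NWORDS.get)` line in the diff.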