Changes between Version 37 and Version 38 of private/AdvancedNlpCourse/AutomaticCorrection


Timestamp: Jan 13, 2021, 6:29:10 PM (8 months ago)
Author: Ales Horak
Comment: --

  • private/AdvancedNlpCourse/AutomaticCorrection

    v37 v38  
    55
    66== State of the Art ==
    7 Language correction nowadays has many potential applications on large amount of informal and unedited text generated online, among other things: web forums, tweets, blogs, and email. Automatic language correction can consist of many areas including: spell checking, grammar checking and word completion.
     7Language correction nowadays has many potential applications with large amounts of informal and unedited text generated online: web forums, tweets, blogs, or emails. Automatic language correction can include several tasks: spell checking, grammar checking and word completion.
    88
    9 In the theoretical lesson we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting and sometimes providing spelling suggestions for incorrectly spelled words in a text. The lesson will also focus on  grammatical checking problems, which are the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.
     9In the theoretical lesson, we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting incorrectly spelled words in a text and, in some cases, suggesting corrections for them. The lesson will also focus on grammar checking, which deals with the most difficult and complex type of language errors, because grammar consists of a very large number of rules and exceptions. We will also say a few words about word completion.
    1010
    11 The lesson will also answer a question "How difficult is to develop a spell-checker?". And also describe a system that performs spell-checking and autocorrection.
     11The lesson will also answer the question "How difficult is it to develop a spell-checker?" and present a tool that performs spell-checking and autocorrection.
    1212
    1313=== References ===
     
    2626== Task 1: Statistical spell checker for English == #task1
    2727
    28 In theoretical lesson we have become acquainted with various approaches how spelling correctors work. Now we will get to know how a simple spellchecker based on '''edit distance''' works.
     28In the theoretical lesson, we have become acquainted with various approaches to spelling correction. Now we will get to know how a simple spellchecker based on '''edit distance''' works.
    2929
    30 The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in python. The spelling corrector will be trained with a large text file consisting of about a million words.
     30The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in Python. The spelling corrector will be trained with a large text file consisting of about one million words.
    3131
    32 We will test this tool on prepared data. Your goal will be to enhance spellchecker's accuracy.
     32We will test this tool with prepared data. Your goal will be to enhance the spellchecker's accuracy.
    3333
    3434
    35  1. Download prepared script  [[raw-attachment:spell.py|spell.py]] and training data collection  [[raw-attachment:big.txt|big.txt]].
    36  1. Test the script ` python ./spell.py ` in your working directory.
     35 1. Download the prepared script  [[raw-attachment:spell.py|spell.py]] and the training data collection  [[raw-attachment:big.txt|big.txt]].
     36 1. Test the script by running `python ./spell.py` in your working directory.
    3737 1. Open it in your favourite editor and we will walk through its functionality.
    3838
     
    5252NWORDS = train(words(file('big.txt').read()))
    5353}}}
    54 1. '''Edit distance 1''' is represented as function `edits1` - it represents deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25'''. Example: len(edits1('something')) = 494 words.
     541. '''Edit distance 1''' is implemented by the function `edits1`. It generates every string that differs from the word by one deletion (remove one letter), transposition (swap adjacent letters), alteration (change one letter to another), or insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''' strings, some of which are duplicates. Example: `len(edits1('something'))` is 494, because `edits1` returns a set and the duplicates among the 511 generated strings are dropped.
    5555 {{{
    5656def edits1(word):
     
    6262   return set(deletes + transposes + replaces + inserts)
    6363}}}
    64 1. '''Edit distance 2'''(`edits2`) - applied edits1 to all the results of edits1. Example: len(edits2('something')) = 114 324 words, which is a high number. To enhance speed we can only keep the candidates that are actually known words (`known_edits2`). Now known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
    65 1. The function `correct` chooses as the set of candidate words the set with the '''shortest edit distance''' to the original word.
     641. '''Edit distance 2''' (`edits2`) applies `edits1()` to all the results of `edits1()`. Example: `len(edits2('something'))` is 114 324, which is a large number of candidates. To speed things up, we can keep only the candidates that are actually known words (`known_edits2()`). Now `known_edits2('something')` is a set of just 4 words: `{'smoothing', 'seething', 'something', 'soothing'}`.
     651. The function `correct()` takes as candidates the set of words with the '''shortest edit distance''' to the original word and returns the candidate that is most frequent in the training data (`max(candidates, key=NWORDS.get)`).
    6666 {{{
    6767def known(words): return set(w for w in words if w in NWORDS)
     
    7272    return max(candidates, key=NWORDS.get)
    7373}}}
    74 1. For '''evaluation''' there are prepared two test sets - development(`test1`) and final test set(`test2`).
     741. For '''evaluation''', there are two prepared test sets: a development set (`test1`) and a final test set (`test2`).
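
The edit counts quoted above can be checked with a small standalone sketch. It is not the course's `spell.py`: it is a Python 3 approximation of `edits1` written for illustration, it assumes a lowercase English alphabet, and the names `ALPHABET`, `raw_edit_count`, and `toy_vocabulary` are hypothetical. The helper `known` here takes the vocabulary as an explicit argument instead of using the global `NWORDS`.

```python
import string

# Assumption: only lowercase English letters are used for edits.
ALPHABET = string.ascii_lowercase


def edits1(word):
    """Return the set of all strings one edit away from `word`."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    # The set removes duplicates, e.g. "ssomething" produced by two inserts.
    return set(deletes + transposes + replaces + inserts)


def raw_edit_count(n):
    """Number of generated strings before deduplication: 54n + 25."""
    return n + (n - 1) + 26 * n + 26 * (n + 1)


def known(words, vocabulary):
    """Keep only the candidates that occur in the vocabulary."""
    return {w for w in words if w in vocabulary}


word = "something"
candidates = edits1(word)
print(raw_edit_count(len(word)))  # 511 strings generated
print(len(candidates))            # 494 distinct strings remain

# Hypothetical toy vocabulary, standing in for the model trained on big.txt.
toy_vocabulary = {"something", "somethings", "soothing"}
print(sorted(known(candidates, toy_vocabulary)))
```

With the toy vocabulary, only `something` and `somethings` survive the filter. `soothing` needs two edits, so it would only appear among the edit-distance-2 candidates.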
    7575
    7676
     
    8282 3. Explain the given results in a few words and write your explanation to `<YOUR_FILE>`.
    8383
    84  4. Modify the code of `spell.py` to increase accuracy at tests2 by 10 %. You may take an inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html the Norvig's article]. Describe your changes and write your new accuracy results to `<YOUR_FILE>`.
     84 4. Modify the code of `spell.py` to increase the accuracy on `tests2` by 10 %. You may take inspiration from the ''Future work'' section of [http://norvig.com/spell-correct.html Norvig's article]. Describe your changes and write your new accuracy results to `<YOUR_FILE>`.
    8585
    8686
    87 === Upload `<YOUR_FILE>` and edited `spell.py` ===
     87=== Upload `<YOUR_FILE>` and the edited `spell.py` ===
    8888
    8989== Task 2: Rule based grammar checker (punctuation) for Czech == #task2
     
    115115    > test.txt
    116116}}}
    117  (takes a long time, about 30 s)
     117 (it takes a long time, about 30 s)
    1181181. evaluate the result
    119119 {{{