Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2015/AutomaticCorrection


Ignore:
Timestamp:
Sep 11, 2017, 4:38:48 PM (7 years ago)
Author:
Ales Horak
Comment:

copied from private/AdvancedNlpCourse/AutomaticCorrection

Legend:

Unmodified
Added
Removed
Modified
  • en/AdvancedNlpCourse2015/AutomaticCorrection

    v1 v1  
     1= Automatic language correction =
     2[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák
     3
     4Prepared by: Ján Švec
     5
     6== State of the Art ==
     7Language correction nowadays has many potential applications on large amount of informal and unedited text generated online, among other things: web forums, tweets, blogs, and email. Automatic language correction can consist of many areas including: spell checking, grammar checking and word completion.
     8
     9In the theoretical lesson we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting and sometimes providing spelling suggestions for incorrectly spelled words in a text. The lesson will also focus on  grammatical checking problems, which are the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.
     10
     11The lesson will also answer a question "How difficult is to develop a spell-checker?". And also describe a system that performs spell-checking and autocorrection.
     12
     13=== References ===
     14 1. CHOUDHURY, Monojit, et al. "How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach" Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, 2007. [[http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=10.1.1.146.4390&rep=rep1&type=pdf|Source]]
     15 1. WHITELAW, Casey, et al. "Using the Web for Language Independent Spellchecking and Autocorrection" Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899, Singapore, 2009. [[http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36180.pdf|Source]]
     16 1. GUPTA, Neha, MATHUR, Pratistha. "Spell Checking Techniques in NLP: A Survey" International Journal of Advanced Research in Computer Science and Software Engineering, volume 2, issue 12, pages 217-221, 2012. [[http://www.ijarcsse.com/docs/papers/12_December2012/Volume_2_issue_12_December2012/V2I12-0164.pdf|Source]]
     17 1. HLADEK, Daniel, STAS, Jan, JUHAR, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering 11 (5), pages 392-397, 2013.  [[http://advances.utc.sk/index.php/AEEE/article/view/898|Source]]
     18
     19== Slides ==
     20[http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-14-AutomaticCorrection.pdf]
     21
     22== Practical Session ==
     23In theoretical lesson we have become acquainted with various approaches how spelling correctors work. Now we will get to know how a simple spellchecker based on '''edit distance''' works.
     24
     25The example is based on Peter Norvig's [[http://norvig.com/spell-correct.html|Spelling Corrector]] in python. The spelling corrector will be trained with a large text file consisting of about a million words.
     26
     27We will test this tool on prepared data. Your goal will be to enhance spellchecker's accuracy. If you finish early, there is a bonus question in the `task` section. 
     28
     29
     30 1. Download prepared script  [[https://nlp.fi.muni.cz/trac/research/attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/spell.py|spell.py]] and training data collection  [[https://nlp.fi.muni.cz/trac/research/attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/big.txt|big.txt]].
     31 1. Test the script ` python ./spell.py ` in your working directory.
     32 1. Open it in your favourite editor and we will walk through its functionality.
     33
     34
     35=== Spellchecker functionality with examples ===
     36
     371. Spellchecker is '''trained''' from file `big.txt` which is a concatenation of several public domain books from '''Project Gutenberg''' and lists of most frequent words from '''Wiktionary''' and the '''British National Corpus'''. Function `train` stores how many times each word occurs in the text file. `NWORDS[w]` holds a count of how many times the word '''w has been seen'''. 
     38
     39{{{
     40def words(text): return re.findall('[a-z]+', text.lower())
     41
     42def train(features):
     43    model = collections.defaultdict(lambda: 1)
     44    for f in features:
     45        model[f] += 1
     46    return model
     47
     48NWORDS = train(words(file('big.txt').read()))
     49}}}
     50
     512. '''Edit distance 1''' is represented as function `edits1` - it represents deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25'''. Example: len(edits1('something')) = 494 words.
     52
     53{{{
     54def edits1(word):
     55   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
     56   deletes    = [a + b[1:] for a, b in splits if b]
     57   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
     58   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
     59   inserts    = [a + c + b     for a, b in splits for c in alphabet]
     60   return set(deletes + transposes + replaces + inserts)
     61}}}
     62
     63
     64
     653. '''Edit distance 2'''(`edits2`) - applied edits1 to all the results of edits1. Example: len(edits2('something')) = 114 324 words, which is a high number. To enhance speed we can only keep the candidates that are actually known words (`known_edits2`). Now known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
     66
     674. The function `correct` chooses as the set of candidate words the set with the '''shortest edit distance''' to the original word.
     68{{{
     69def known(words): return set(w for w in words if w in NWORDS)
     70
     71def correct(word):
     72    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
     73    return max(candidates, key=NWORDS.get)
     74}}}
     75
     765. For '''evaluation''' there are prepared two test sets - developement(`test1`) and final test set(`test2`).
     77
     78
     79
     80
     81=== Task ===
     82 1. Create `<YOUR_FILE>`, a text file named ia161-UCO-14.txt where UCO is your university ID.
     83
     84 2. Run `spell.py` with developement and final test sets (test1 and test2), write the results in `<YOUR_FILE>`.
     85
     86 3. Explain the given results in few words and write it in `<YOUR_FILE>`.
     87
     88 4. Modify the code of `spell.py` to increase accuraccy by 10 %. Write your new accuracy results to `<YOUR_FILE>`.
     89
     90 5. Run the script with `verbose=True` and examine given results. Try to suggest at least one adjustment how to enhance spellchecker's accuracy. Write your suggestions to `<YOUR_FILE>`.
     91
     92 -Bonus question- How could you make the implementation faster without changing the results? Write your suggestions to `<YOUR_FILE>`.
     93
     94=== Upload `<YOUR_FILE>` and edited `spell.py` ===
     95Do not forget to upload your resulting files to the [https://is.muni.cz/auth/el/1433/podzim2015/IA161/ode/59241116/ homework vault (odevzdávárna)].