Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2019/AutomaticCorrection

Oct 1, 2020, 3:34:37 PM (23 months ago)
Ales Horak

copied from private/AdvancedNlpCourse/AutomaticCorrection


= Automatic language correction =
[[|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Aleš Horák, Ján Švec
== State of the Art ==

Language correction has many potential applications to the large amounts of informal and unedited text generated online, including web forums, tweets, blogs, and email. Automatic language correction covers several areas, including spell checking, grammar checking, and word completion.

In the theoretical lesson we will introduce and compare various methods for automatically proposing and choosing a correction for an incorrectly written word. Spell checking is the process of detecting incorrectly spelled words in a text and, in some cases, suggesting corrections. The lesson will also cover grammar checking. Grammatical errors are the most difficult and complex type of language error, because grammar consists of a very large number of rules and exceptions. We will also say a few words about word completion.

The lesson will also answer the question "How difficult is it to develop a spell-checker?" and describe a system that performs spell checking and autocorrection.
=== References ===
 1. CHOUDHURY, Monojit, et al. "How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach." Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, 2007. [[;jsessionid=52A3B869596656C9DA285DCE83A0339F?doi=|Source]]
 1. WHITELAW, Casey, et al. "Using the Web for Language Independent Spellchecking and Autocorrection." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899, Singapore, 2009. [[|Source]]
 1. GUPTA, Neha, MATHUR, Pratistha. "Spell Checking Techniques in NLP: A Survey." International Journal of Advanced Research in Computer Science and Software Engineering, volume 2, issue 12, pages 217–221, 2012. [[|Source]]
 1. HLADEK, Daniel, STAS, Jan, JUHAR, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering 11 (5), pages 392–397, 2013. [[|Source]]
== Practical Session ==

There are two tasks; you may choose one or both:
 1. [wiki:/en/AdvancedNlpCourse/AutomaticCorrection#task1 statistical spell checker for English]
 2. [wiki:/en/AdvancedNlpCourse/AutomaticCorrection#task2 rule based grammar checker (punctuation) for Czech]
== Task 1: Statistical spell checker for English == #task1

In the theoretical lesson we became acquainted with various approaches to spelling correction. Now we will look at how a simple spellchecker based on '''edit distance''' works.

The example is based on Peter Norvig's [[|Spelling Corrector]] in Python. The spelling corrector is trained on a large text file consisting of about a million words.

We will test this tool on prepared data. Your goal will be to improve the spellchecker's accuracy.

 1. Download the prepared script [[|]] and the training data collection [[raw-attachment:big.txt|big.txt]].
 1. Test the script ` python ./ ` in your working directory.
 1. Open it in your favourite editor and we will walk through its functionality.
=== Spellchecker functionality with examples ===

1. The spellchecker is '''trained''' from the file `big.txt`, which is a concatenation of several public domain books from '''Project Gutenberg''' and lists of the most frequent words from '''Wiktionary''' and the '''British National Corpus'''. The function `train` counts how many times each word occurs in the text file: `NWORDS[w]` holds the number of times the word '''w''' has been seen, plus one, because every count starts at 1 so that unseen words never get a zero count.
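 {{{
import re
import collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Every count starts at 1, so words never seen in training still get a non-zero count
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# open() replaces the Python 2-only file() builtin
with open('big.txt') as f:
    NWORDS = train(words(f.read()))
 }}}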
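To see what `train` produces, here is a minimal sketch that runs the same counting on a short in-memory string instead of `big.txt`. The sample sentence is made up for illustration. Because the default value is 1, each stored count is the number of occurrences plus one.

```python
import re
import collections

def words(text):
    # Lowercase the text and split it into runs of the letters a-z.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Every count starts at 1 so that unseen words never get a zero count.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical sample text, used here instead of big.txt.
sample = "The cat saw the dog. The dog ran."
model = train(words(sample))

print(model['the'])       # 3 occurrences + 1
print(model['dog'])       # 2 occurrences + 1
print(model.get('bird'))  # None: .get() does not trigger the default value
```

Note that `.get()` bypasses the default, which matters later because `correct` ranks candidates with `NWORDS.get`.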
1. '''Edit distance 1''' is implemented by the function `edits1`. It returns all strings that can be produced from a word by a single edit: a deletion (remove one letter), a transposition (swap two adjacent letters), an alteration (change one letter to another), or an insertion (add a letter). For a word of length '''n''', there are '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''', some of which are duplicates. Example: for 'something' (n = 9) the total is 511, and `len(edits1('something'))` = 494 distinct strings.
 {{{
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Every way of cutting the word into a head and a tail
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
 }}}
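The counts for edit distance 1 can be checked directly. The sketch below uses a variant of `edits1` that returns the four lists separately, so the raw total 54n+25 can be compared with the number of distinct strings. The helper name `edit_lists` is an assumption made for this illustration and is not part of the script.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_lists(word):
    # Hypothetical helper: the same edits as edits1, kept as four separate lists.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return deletes, transposes, replaces, inserts

word = 'something'
n = len(word)
deletes, transposes, replaces, inserts = edit_lists(word)
total = len(deletes) + len(transposes) + len(replaces) + len(inserts)
distinct = set(deletes + transposes + replaces + inserts)

print(len(deletes), len(transposes), len(replaces), len(inserts))  # 9 8 234 260
print(total == 54 * n + 25)  # True: 511 strings for n = 9
print(len(distinct))         # 494 distinct strings
```

For this word the 17 duplicates come from replacing a letter by itself (which always gives back the original word) and from inserting a letter next to an identical one.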
1. '''Edit distance 2''' (`edits2`) applies `edits1` to every result of `edits1`. Example: `len(edits2('something'))` = 114,324 strings, which is a large number. To speed things up, we keep only the candidates that are actually known words (`known_edits2`). Now `known_edits2('something')` is a set of just 4 words: `{'smoothing', 'seething', 'something', 'soothing'}`.
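The body of `known_edits2` is not reproduced on this page. A minimal sketch in the style of Norvig's article could look as follows. It assumes a small toy dictionary `NWORDS` with made-up counts in place of the model trained on `big.txt`.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Toy stand-in for the trained model; the counts are invented for illustration.
NWORDS = {'something': 10, 'nothing': 8, 'soothing': 3,
          'smoothing': 2, 'seething': 1}

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Walk through the distance-2 strings one at a time and keep only
    # dictionary words, so the full distance-2 set is never stored.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

print(sorted(known_edits2('something')))
# ['seething', 'smoothing', 'something', 'soothing']
```

The toy entry 'nothing' is filtered out because it is three edits away from 'something'.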
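1. The function `correct` takes as candidates the set of known words with the '''shortest edit distance''' to the original word (the word itself if it is known, otherwise known words at distance 1, otherwise known words at distance 2) and returns the candidate that was seen most often in training.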
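 {{{
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer the word itself if known, then known words at distance 1, then at distance 2
    candidates = known([word]) or known(edits1(word)) or \
        known_edits2(word) or [word]
    # Among the candidates, pick the one with the highest training count
    return max(candidates, key=NWORDS.get)
 }}}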
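The following self-contained sketch shows how the candidate order in `correct` plays out. It assumes a toy `NWORDS` dictionary with invented counts, and the `known_edits2` used here is a minimal version in the style of Norvig's article, not necessarily the one in the script.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Toy model with invented counts (assumption, not taken from big.txt).
NWORDS = {'the': 500, 'then': 80, 'spell': 10, 'spelling': 5}

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return set(w for w in words if w in NWORDS)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)

print(correct('then'))     # 'then': a known word is kept, even though 'the' is more frequent
print(correct('thn'))      # 'the': both 'the' and 'then' are one edit away, 'the' has the higher count
print(correct('speling'))  # 'spelling': the only known word one edit away
print(correct('qwertyz'))  # 'qwertyz': no known candidate, so the input is returned unchanged
```

The first two calls illustrate the main design choice: edit distance always takes priority over frequency, and frequency only breaks ties within one distance level.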
1. For '''evaluation''', two test sets are prepared: a development set (`tests1`) and a final test set (`tests2`).
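As a rough sketch of how accuracy on such a test set can be measured: the code below assumes that a test set maps each correct word to a space-separated string of its misspellings, as in Norvig's article. The function name `accuracy`, the mini test set, and the stub corrector are all hypothetical and only serve the illustration.

```python
def accuracy(tests, correct):
    """Return the fraction of misspellings that `correct` maps to the target word."""
    n, good = 0, 0
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            if correct(wrong) == target:
                good += 1
    return good / float(n) if n else 0.0

# Hypothetical mini test set: correct word -> its misspellings.
toy_tests = {'access': 'acess acces', 'spelling': 'speling'}

# Stub corrector that only knows two fixes; stands in for the real `correct`.
lookup = {'acess': 'access', 'speling': 'spelling'}

def stub_correct(word):
    return lookup.get(word, word)

# 2 of the 3 misspellings are fixed by the stub.
print(round(100 * accuracy(toy_tests, stub_correct), 1), '%')
```

Passing the corrector in as a parameter makes it easy to compare the original `correct` with a modified version on the same test set.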
=== Task 1 ===
 1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-13.txt`, where UCO is your university ID.
 2. Run `` with the development and final test sets (`tests1` and `tests2` within the script) and write the results to `<YOUR_FILE>`.
 3. Explain the given results in a few words and write the explanation to `<YOUR_FILE>`.
 4. Modify the code of `` to increase the accuracy on `tests2` by 10 %. You may take inspiration from the ''Future work'' section of [ Norvig's article]. Describe your changes and write your new accuracy results to `<YOUR_FILE>`.

=== Upload `<YOUR_FILE>` and the edited `` ===
== Task 2: Rule based grammar checker (punctuation) for Czech == #task2

The second task consists in adapting a specific syntactic grammar of Czech to improve the results of ''punctuation detection'', i.e. the placement of ''commas'' at the correct positions in a sentence.
=== Task 2 ===

1. Log in to aurora: `ssh aurora`
1. Download:
   1. the [raw-attachment:punct.set syntactic grammar] for punctuation detection for the [ SET parser]
   1. the [raw-attachment:test-nopunct.txt testing text with no commas]
   1. the [raw-attachment:eval-gold.txt evaluation text with correct punctuation]
   1. the [ evaluation script], which computes recall and precision from both texts
 {{{
mkdir ia161-grammar
cd ia161-grammar
 }}}
1. Run the parser to fill in the punctuation in the testing text:
 {{{
cat test-nopunct.txt \
    | /nlp/projekty/set/ \
    | /nlp/projekty/rule_ind/stat/ -skipdis \
    | /nlp/projekty/set/set/ --commas --grammar=punct.set \
    > test.txt
 }}}
 (this takes a while, about 30 s)
1. Evaluate the result:
 {{{
PYTHONIOENCODING=UTF-8 python eval-gold.txt test.txt > results.txt
cat results.txt
 }}}
1. Edit the grammar `punct.set` and add 1-2 rules to increase the F-score (which combines recall and precision) by 10 %.

 You may need to go through the general information about the [ SET grammar format]. Information about adapting the grammar for the task of ''punctuation detection'' can be found in this [raw-attachment:tsd2014.pdf published paper].

 The current best results achieved with an extended grammar are 91.2 % precision and 55 % recall, i.e. an F-score of 68.6 %.
6. Upload the modified `punct.set` and the respective `results.txt`.
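To make the evaluation measures concrete, here is a minimal sketch of how precision, recall, and F-score can be computed for comma placement. It is not the course's evaluation script. It assumes that both texts contain the same words in the same order, that a comma is attached to the word it follows, and that the F-score is the usual harmonic mean of precision and recall (which reproduces the quoted 68.6 %). All function names and the English toy sentences are hypothetical.

```python
def comma_positions(text):
    # Indices of the whitespace-separated tokens that end with a comma.
    return set(i for i, token in enumerate(text.split()) if token.endswith(','))

def f_score(precision, recall):
    # Harmonic mean of precision and recall (F1).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(gold_text, test_text):
    gold = comma_positions(gold_text)
    predicted = comma_positions(test_text)
    correct = len(gold & predicted)
    precision = correct / float(len(predicted)) if predicted else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    return precision, recall, f_score(precision, recall)

# Toy example: the gold text has two commas, the parser output found one of them.
gold = "Yes, I know, that he left."
test = "Yes, I know that he left."
print(evaluate(gold, test))  # precision 1.0, recall 0.5

# The best results quoted in the task: 91.2 % precision, 55 % recall.
print(round(100 * f_score(0.912, 0.55), 1))  # 68.6
```

Because the F-score is a harmonic mean, it is pulled towards the lower of the two values, so new grammar rules that raise recall without hurting precision much are the most effective way to improve it.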

Do not forget to upload your resulting files to the [/en/AdvancedNlpCourse homework vault (odevzdávárna)].