Changes between Version 18 and Version 19 of private/NlpInPracticeCourse/AutomaticCorrection


Timestamp: Dec 8, 2017, 6:46:36 PM
Author: Ales Horak

--

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák

Prepared by: Aleš Horák, Ján Švec

== State of the Art ==
     
 1. HLADEK, Daniel, STAS, Jan, JUHAR, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering 11 (5), pages 392-397, 2013. [[http://advances.utc.sk/index.php/AEEE/article/view/898|Source]]

== Practical Session ==

There are two tasks; you may choose one or both:
 1. [#task1 statistical spell checker for English]
 2. [#task2 rule-based grammar checker (punctuation) for Czech]

=== Statistical spell checker for English === #task1

In the theoretical lesson we became acquainted with various approaches to how spelling correctors work. Now we will see how a simple spellchecker based on '''edit distance''' works.
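
For illustration, the edit distance between two words can be computed with the standard dynamic-programming (Levenshtein) algorithm; the following is a minimal sketch, not part of the spellchecker code below:
{{{
def edit_distance(a, b):
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                 # i deletions
    for j in range(len(b) + 1):
        d[0][j] = j                 # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,       # deletion
                          d[i][j-1] + 1,       # insertion
                          d[i-1][j-1] + cost)  # substitution
    return d[len(a)][len(b)]

print(edit_distance('somthing', 'something'))  # 1 (one missing letter)
}}}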

==== Spellchecker functionality with examples ====

 1. Spellchecker is '''trained''' from the file `big.txt`, which is a concatenation of several public-domain books from '''Project Gutenberg''' and lists of the most frequent words from '''Wiktionary''' and the '''British National Corpus'''. The function `train` stores how many times each word occurs in the text file; `NWORDS[w]` holds the count of how many times the word '''w''' has been seen.
 {{{
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)  # unseen words get a default count of 1
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))   # Python 2; in Python 3 use open('big.txt')
}}}
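 After training, the counts can be inspected directly (a usage sketch; exact numbers depend on `big.txt`):
 {{{
print(NWORDS['the'])         # large count for a frequent word
print(NWORDS['qwertyuiop'])  # 1: the default count for unseen words
}}}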
 1. '''Edit distance 1''' is represented by the function `edits1`: it generates every word obtainable by one deletion (remove one letter), transposition (swap adjacent letters), alteration (change one letter to another) or insertion (add a letter). For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', '''54n+25''' candidates in total. Example: len(edits1('something')) = 494 words.
 {{{
def edits1(word):
   alphabet   = 'abcdefghijklmnopqrstuvwxyz'  # a module-level constant in the original spell.py
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)
}}}
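 A quick sanity check of the '''54n+25''' arithmetic (the distinct set is smaller because duplicates collapse in the `set`):
 {{{
word = 'something'            # n = 9
print(54 * len(word) + 25)    # 511 generated candidates
print(len(edits1(word)))      # 494 distinct words
}}}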
 1. '''Edit distance 2''' (`edits2`) applies `edits1` to all the results of `edits1`. Example: len(edits2('something')) = 114,324 words, which is a high number. To enhance speed we can keep only the candidates that are actually known words (`known_edits2`). Then known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
 1. The function `correct` chooses as candidates the known words with the '''shortest edit distance''' to the original word and returns the most frequent of them.
 {{{
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
}}}
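 The helper `known_edits2` referenced above is not shown in this excerpt; in Norvig's original spell.py it interleaves the distance-2 generation with the dictionary check, roughly:
 {{{
def known_edits2(word):
    # keep only real words among the edits of the edits
    return set(e2 for e1 in edits1(word)
                  for e2 in edits1(e1) if e2 in NWORDS)
}}}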
 1. For '''evaluation''' there are two prepared test sets: a development set (`test1`) and a final test set (`test2`).
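 Evaluation then amounts to counting how many misspellings get corrected to the expected word. A minimal sketch, assuming a test set given as `(wrong, right)` pairs (the actual format in `spell.py` may differ):
 {{{
def spelltest(tests):
    # tests: list of (misspelling, expected correction) pairs -- assumed format
    good = sum(1 for wrong, right in tests if correct(wrong) == right)
    return float(good) / len(tests)

print(spelltest([('speling', 'spelling'), ('korrecter', 'corrector')]))
}}}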

==== Task 1 ====
 1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-14.txt`, where UCO is your university ID.

     
 ''Bonus question:'' How could you make the implementation faster without changing the results? Write your suggestions to `<YOUR_FILE>`.

==== Upload `<YOUR_FILE>` and edited `spell.py` ====

=== Rule-based grammar checker (punctuation) for Czech === #task2

The second task consists in adapting a specific syntactic grammar of Czech to improve the results of ''punctuation detection'', i.e. the placement of ''commas'' at the required positions in a sentence.

==== Task 2 ====

 1. log in to aurora: `ssh aurora`
 1. download:
   1. [raw-attachment:punct.set syntactic grammar] for punctuation detection with the [http://nlp.fi.muni.cz/projects/set SET parser]
   1. [raw-attachment:test-nopunct.txt testing text with no commas]
   1. [raw-attachment:eval-gold.txt evaluation text with correct punctuation]
   1. [raw-attachment:evalpunct_robust.py evaluation script], which computes recall and precision from the two texts (a conceptual sketch of such scoring follows this list)
 1. run the parser to fill in punctuation in the testing text
 {{{
cat test-nopunct.txt \
    | /nlp/projekty/set/unitok.py \
    | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh \
    | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \
    > test.txt
}}}
 (takes a long time, about 30 s)
 1. evaluate the result
 {{{
./evalpunct_robust.py eval-gold.txt test.txt > results.txt
cat results.txt
}}}
 1. edit the grammar `punct.set` and add 1-2 rules to increase the coverage by 10%
 1. upload the modified `punct.set` and the corresponding `results.txt`.
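
For intuition, comma placement can be scored by aligning the gold and test texts token by token and comparing where commas occur. A minimal sketch of such scoring (a simplification assuming identical tokenization apart from commas; the real `evalpunct_robust.py` is more robust):
{{{
def comma_positions(tokens):
    # set of word indices after which a comma appears
    positions, idx = set(), 0
    for tok in tokens:
        if tok == ',':
            positions.add(idx)   # comma right after word number idx
        else:
            idx += 1
    return positions

def score(gold_text, test_text):
    gold = comma_positions(gold_text.split())
    test = comma_positions(test_text.split())
    tp = len(gold & test)                             # correctly placed commas
    precision = float(tp) / len(test) if test else 0.0
    recall    = float(tp) / len(gold) if gold else 0.0
    return precision, recall

print(score('Vím , že přijde .', 'Vím , že , přijde .'))  # (0.5, 1.0)
}}}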

Do not forget to upload your resulting files to the [/en/AdvancedNlpCourse homework vault (odevzdávárna)].