50 | | |
51 | | 2. '''Edit distance 1''' is implemented by the function `edits1`. A single edit is a deletion (remove one letter), a transposition (swap two adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). For a word of length '''n''', there are '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''' edits, a few of which are typically duplicates. Example: `len(edits1('something'))` = 494 distinct strings (fewer than 54×9+25 = 511 because duplicates are merged in the set). |
52 | | |
53 | | {{{ |
| 54 | 1. '''Edit distance 1''' is implemented by the function `edits1`. A single edit is a deletion (remove one letter), a transposition (swap two adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). For a word of length '''n''', there are '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25''' edits, a few of which are typically duplicates. Example: `len(edits1('something'))` = 494 distinct strings (fewer than 54×9+25 = 511 because duplicates are merged in the set). |
| 55 | {{{ |
62 | | |
63 | | |
64 | | |
65 | | 3. '''Edit distance 2''' (`edits2`) applies `edits1` to every result of `edits1`. Example: `len(edits2('something'))` = 114,324 strings, which is too many to evaluate quickly. To speed things up, we keep only the candidates that are actually known words (`known_edits2`). Now `known_edits2('something')` is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}. |
66 | | |
67 | | 4. The function `correct` chooses as its candidates the set of known words with the '''shortest edit distance''' to the original word. |
68 | | {{{ |
| 64 | 1. '''Edit distance 2''' (`edits2`) applies `edits1` to every result of `edits1`. Example: `len(edits2('something'))` = 114,324 strings, which is too many to evaluate quickly. To speed things up, we keep only the candidates that are actually known words (`known_edits2`). Now `known_edits2('something')` is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}. |
| 65 | 1. The function `correct` chooses as its candidates the set of known words with the '''shortest edit distance''' to the original word. |
| 66 | {{{ |
94 | | === Upload `<YOUR_FILE>` and edited `spell.py` === |
95 | | Do not forget to upload your resulting files to the [https://is.muni.cz/auth/el/1433/podzim2015/IA161/ode/59241116/ homework vault (odevzdávárna)]. |
| 89 | ==== Upload `<YOUR_FILE>` and edited `spell.py` ==== |
| 90 | |
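Before uploading the edited `spell.py`, the counts quoted for `edits1` can be sanity-checked with a short sketch. It follows the description of the four edit types given earlier and assumes a 26-letter lowercase alphabet; the helper `edit_lists` is a hypothetical name introduced here for illustration, and the `edits1` in your `spell.py` may be written differently.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: plain 26-letter alphabet


def edit_lists(word):
    """Return the four lists of single-edit strings (duplicates kept)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return deletes, transposes, replaces, inserts


def edits1(word):
    """All distinct strings one edit away from `word`."""
    return set().union(*edit_lists(word))


word = "something"
raw_counts = [len(edits) for edits in edit_lists(word)]
print(raw_counts, sum(raw_counts))  # n, n-1, 26n, 26(n+1) and their sum 54n+25
print(len(edits1(word)))            # distinct strings after merging duplicates
```

For `'something'` (n = 9) the raw counts add up to 511, while the set holds 494 strings: 9 alterations reproduce the original word, and 9 insertions coincide because inserting a letter before or after an identical letter gives the same string.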
| 91 | === Rule-based grammar checker (punctuation) for Czech === #task2 |
| 92 | |
| 93 | The second task option is to adapt a specific syntactic grammar of Czech to improve the results of ''punctuation detection'', i.e. the placement of ''commas'' at the correct positions in a sentence. |
| 94 | |
| 95 | ==== Task 2 ==== |
| 96 | |
| 97 | 1. log in to aurora: `ssh aurora` |
| 98 | 1. download: |
| 99 | 1. [raw-attachment:punct.set syntactic grammar] for punctuation detection for the [http://nlp.fi.muni.cz/projects/set SET parser] |
| 100 | 1. [raw-attachment:test-nopunct.txt testing text with no commas] |
| 101 | 1. [raw-attachment:eval-gold.txt evaluation text with correct punctuation] |
| 102 | 1. [raw-attachment:evalpunct_robust.py evaluation script], which computes recall and precision by comparing the two texts |
| 103 | 1. run the parser to fill in the punctuation in the testing text |
| 104 | {{{ |
| 105 | cat test-nopunct.txt \ |
| 106 | | /nlp/projekty/set/unitok.py \ |
| 107 | | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh \ |
| 108 | | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \ |
| 109 | > test.txt |
| 110 | }}} |
| 111 | (this takes a while, about 30 s) |
| 112 | 1. evaluate the result |
| 113 | {{{ |
| 114 | ./evalpunct_robust.py eval-gold.txt test.txt > results.txt |
| 115 | cat results.txt |
| 116 | }}} |
| 117 | 1. edit the grammar `punct.set` and add 1-2 rules to increase the coverage by 10% |
| 118 | 1. upload the modified `punct.set` and the respective `results.txt`. |
| 119 | |
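To make the reported numbers easier to interpret, here is a minimal sketch of how precision and recall of comma placement could be computed. It is not the actual `evalpunct_robust.py`: the function names are hypothetical, and it assumes both texts contain the same words in the same order and differ only in commas, which the real script may handle more robustly.

```python
def comma_positions(text):
    """Return the set of word indices that are followed by a comma."""
    positions = set()
    index = 0  # number of words seen so far
    for token in text.replace(",", " , ").split():
        if token == ",":
            positions.add(index)
        else:
            index += 1
    return positions


def precision_recall(gold_text, test_text):
    """Compare comma positions in the parser output against the gold text."""
    gold = comma_positions(gold_text)
    test = comma_positions(test_text)
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = "Myslím, že přijde, ale nevím kdy."
test = "Myslím, že přijde ale, nevím kdy."
print(precision_recall(gold, test))
```

In this example one of the two predicted commas is in the right place and one of the two gold commas was found, so both precision and recall are 0.5. A new grammar rule that adds correct commas raises recall; a rule that also fires in wrong places lowers precision.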
| 120 | |
| 121 | Do not forget to upload your resulting files to the [/en/AdvancedNlpCourse homework vault (odevzdávárna)]. |