| 34 | |
| 35 | === Spellchecker functionality with examples === |
| 36 | |
| 37 | 1. Spellchecker is '''trained''' from file `big.txt` which is a concatenation of several public domain books from '''Project Gutenberg''' and lists of most frequent words from '''Wiktionary''' and the '''British National Corpus'''. Function `train` stores how many times each word occurs in the text file. `NWORDS[w]` holds a count of how many times the word '''w has been seen'''. |
| 38 | |
| 39 | {{{ |
| 40 | def words(text): return re.findall('[a-z]+', text.lower()) |
| 41 | |
| 42 | def train(features): |
| 43 | model = collections.defaultdict(lambda: 1) |
| 44 | for f in features: |
| 45 | model[f] += 1 |
| 46 | return model |
| 47 | |
| 48 | NWORDS = train(words(file('big.txt').read())) |
| 49 | }}} |
| 50 | |
| 51 | 2. '''Edit distance 1''' is represented as function `edits1` - it represents deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). |
| 52 | |
| 53 | {{{ |
| 54 | def edits1(word): |
| 55 | splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] |
| 56 | deletes = [a + b[1:] for a, b in splits if b] |
| 57 | transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1] |
| 58 | replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b] |
| 59 | inserts = [a + c + b for a, b in splits for c in alphabet] |
| 60 | return set(deletes + transposes + replaces + inserts) |
| 61 | }}} |
| 62 | |
| 63 | For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25'''. Example: len(edits1('something')) = 494 words. |
| 64 | |
| 65 | 3. '''Edit distance 2'''(`edits2`) - applied edits1 to all the results of edits1. Example: len(edits2('something')) = 114 324 words, which is a high number. |
| 66 | |
| 67 | To enhance speed we can only keep the candidates that are actually known words (`known_edits2`). Now known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}. |
| 68 | |
| 69 | 4. The function `correct` chooses as the set of candidate words the set with the '''shortest edit distance''' to the original word. |
| 70 | {{{ |
| 71 | def known(words): return set(w for w in words if w in NWORDS) |
| 72 | |
| 73 | def correct(word): |
| 74 | candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] |
| 75 | return max(candidates, key=NWORDS.get) |
| 76 | }}} |
| 77 | |
| 78 | 5. '''Result of the spellchecker''' is, that it takes a word as input and returns a likely correction of that word. |
| 79 | {{{ |
| 80 | >>> correct('speling') |
| 81 | 'spelling' |
| 82 | >>> correct('korrecter') |
| 83 | 'corrector' |
| 84 | }}} |
| 85 | |
| 86 | |