AutomaticCorrection

Timestamp:: Dec 17, 2015, 11:31:16 PM (10 years ago)
Author:: xsvec3
Comment:: --

private/NlpInPracticeCourse/AutomaticCorrection

-                      v16
+                      v17
 . Open it in your favourite editor and we will walk through its functionality.
+=== Spellchecker functionality with examples ===
+. The spellchecker is '''trained''' on the file `big.txt`, a concatenation of several public domain books from '''Project Gutenberg''' and lists of the most frequent words from '''Wiktionary''' and the '''British National Corpus'''. The function `train` counts how often each word occurs in that file. Because every count starts at 1 (a simple form of smoothing), `NWORDS[w]` is one more than the number of times the word '''w has been seen'''.
+{{{
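+import re, collections
+# Lowercase the text and split it into runs of the letters a-z.
+def words(text): return re.findall('[a-z]+', text.lower())
+# Count the words; every count starts at 1, so unseen words are not treated as impossible.
+def train(features):
+    model = collections.defaultdict(lambda: 1)
+    for f in features:
+        model[f] += 1
+    return model
+# open() works in Python 2 and 3; the file() builtin exists only in Python 2.
+NWORDS = train(words(open('big.txt').read()))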
+}}}
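To see what `train` produces without needing `big.txt`, the following minimal Python 3 sketch runs the same counting logic on a short inline string. The sample text and the name `sample` are made up for illustration and are not part of the course material.

```python
import collections
import re

def words(text):
    # Lowercase, then keep only runs of the letters a-z.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Every count starts at 1, so a word seen k times is stored as k + 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical stand-in for the contents of big.txt.
sample = "The cat saw the other cat. The end."
NWORDS = train(words(sample))

print(NWORDS['the'])      # seen 3 times, stored as 4
print('dog' in NWORDS)    # False: a membership test does not create an entry
print(NWORDS.get('dog'))  # None: .get() bypasses the default factory
```

Note that indexing with `NWORDS['dog']` would silently insert the word with count 1, which is why the spellchecker code tests membership with `in` and looks counts up with `.get`.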
+. '''Edit distance 1''' is implemented by the function `edits1`. It returns every string that differs from the word by a single deletion (remove one letter), transposition (swap two adjacent letters), alteration (change one letter to another) or insertion (add a letter).
+{{{
+alphabet = 'abcdefghijklmnopqrstuvwxyz'
+def edits1(word):
+   # every way to cut the word into a (prefix, suffix) pair
+   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
+   deletes    = [a + b[1:] for a, b in splits if b]
+   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
+   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
+   inserts    = [a + c + b     for a, b in splits for c in alphabet]
+   # the set removes duplicate strings
+   return set(deletes + transposes + replaces + inserts)
+}}}
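+For a word of length '''n''', there are '''n deletions''', '''n-1 transpositions''', '''26n alterations''' and '''26(n+1) insertions''', for a '''total of 54n+25''' generated strings. Some of them are duplicates, which the `set` removes. Example: for 'something' (n = 9) the total is 511, but `len(edits1('something'))` = 494 words.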
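The counts can be checked directly. The sketch below (Python 3) uses a hypothetical helper `edit_lists`, which builds the same four lists as `edits1` but returns them separately so that their sizes can be compared with the formula before duplicates are removed.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edit_lists(word):
    # Same four edit types as edits1, returned separately and not deduplicated.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return deletes, transposes, replaces, inserts

word = 'something'
n = len(word)
parts = edit_lists(word)
sizes = [len(p) for p in parts]
total = sum(sizes)
distinct = set().union(*parts)

print(sizes)          # [9, 8, 234, 260]
print(total)          # 511, i.e. 54*9 + 25
print(len(distinct))  # 494
```

The 17 lost strings are duplicates: replacing a letter with itself reproduces the original word nine times, and inserting a letter next to an identical letter gives the same string from two neighbouring positions.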
+. '''Edit distance 2''' (`edits2`) applies `edits1` to every result of `edits1`. Example: `len(edits2('something'))` = 114 324 words, which is a large number.
+To speed this up, we keep only the candidates that are actually known words (`known_edits2`). Now `known_edits2('something')` is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
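A minimal Python 3 sketch of the two functions follows. It assumes a tiny hand-made vocabulary, here called `VOCAB`, in place of the trained `NWORDS`, so the filtered result depends only on the words listed in the code.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # All strings reachable by two successive single edits (unfiltered).
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

# Hypothetical vocabulary standing in for the keys of NWORDS.
VOCAB = {'something', 'smoothing', 'seething', 'soothing', 'nothing'}

def known_edits2(word):
    # Same enumeration as edits2, but only known words are kept.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in VOCAB)

print(sorted(known_edits2('something')))
# ['seething', 'smoothing', 'something', 'soothing']
```

'nothing' is left out because it is three edits away from 'something'. Filtering inside the generator keeps the result set small, although every distance-2 string is still generated once.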
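+. The function `correct` takes as its candidates the known words with the '''shortest edit distance''' to the original word: the word itself if it is known, otherwise the known words at distance 1, otherwise the known words at distance 2, otherwise the unchanged word. From these candidates it returns the one with the highest count in `NWORDS`.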
+{{{
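+def known(words): return set(w for w in words if w in NWORDS)
+# Known words at edit distance 2.
+def known_edits2(word):
+    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
+def correct(word):
+    # The first non-empty set wins: distance 0, then 1, then 2, then the word itself.
+    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
+    return max(candidates, key=NWORDS.get)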
+}}}
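The effect of the `or` chain is easiest to see on a toy model. The sketch below (Python 3) assumes a small hand-written `NWORDS` dictionary; its words and counts are invented for illustration and do not come from `big.txt`.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical counts; a real model would come from train().
NWORDS = {'the': 80000, 'they': 4000, 'thaw': 50, 'spelling': 10}

def known(words):
    return set(w for w in words if w in NWORDS)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def correct(word):
    # An empty set is falsy, so `or` falls through to the next distance.
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)

print(correct('thaw'))     # 'thaw': a known word is never changed
print(correct('thew'))     # 'the': most frequent of the distance-1 candidates
print(correct('thax'))     # 'thaw': distance 1 beats the far more frequent 'the' at distance 2
print(correct('speling'))  # 'spelling'
print(correct('zzzzzz'))   # 'zzzzzz': nothing known within distance 2
```

Word frequency is therefore only a tie-breaker within one distance level; a rare word at distance 1 always wins over a common word at distance 2.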
+. '''Result of the spellchecker''': it takes a word as input and returns a likely correction of that word.
+{{{
+>>> correct('speling')
+'spelling'
+>>> correct('korrecter')
+'corrector'
+}}}
 === Task ===
 . Create `<YOUR_FILE>`, a text file named ia161-UCO-14.txt where UCO is your university ID.