Changes between Version 16 and Version 17 of private/NlpInPracticeCourse/AutomaticCorrection


Ignore:
Timestamp:
Dec 17, 2015, 11:31:16 PM (8 years ago)
Author:
xsvec3
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • private/NlpInPracticeCourse/AutomaticCorrection

    v16 v17  
    3232 1. Open it in your favourite editor and we will walk through its functionality.
    3333
     34
     35=== Spellchecker functionality with examples ===
     36
     371. Spellchecker is '''trained''' from file `big.txt` which is a concatenation of several public domain books from '''Project Gutenberg''' and lists of most frequent words from '''Wiktionary''' and the '''British National Corpus'''. Function `train` stores how many times each word occurs in the text file. `NWORDS[w]` holds a count of how many times the word '''w has been seen'''. 
     38
     39{{{
     40def words(text): return re.findall('[a-z]+', text.lower())
     41
     42def train(features):
     43    model = collections.defaultdict(lambda: 1)
     44    for f in features:
     45        model[f] += 1
     46    return model
     47
     48NWORDS = train(words(file('big.txt').read()))
     49}}}
     50
     512. '''Edit distance 1''' is represented as function `edits1` - it represents deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter).
     52
     53{{{
     54def edits1(word):
     55   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
     56   deletes    = [a + b[1:] for a, b in splits if b]
     57   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
     58   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
     59   inserts    = [a + c + b     for a, b in splits for c in alphabet]
     60   return set(deletes + transposes + replaces + inserts)
     61}}}
     62
     63For a word of length '''n''', there will be '''n deletions''', '''n-1 transpositions''', '''26n alterations''', and '''26(n+1) insertions''', for a '''total of 54n+25'''. Example: len(edits1('something')) = 494 words.
     64
     653. '''Edit distance 2'''(`edits2`) - applied edits1 to all the results of edits1. Example: len(edits2('something')) = 114 324 words, which is a high number.
     66
     67To enhance speed we can only keep the candidates that are actually known words (`known_edits2`). Now known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
     68
     694. The function `correct` chooses as the set of candidate words the set with the '''shortest edit distance''' to the original word.
     70{{{
     71def known(words): return set(w for w in words if w in NWORDS)
     72
     73def correct(word):
     74    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
     75    return max(candidates, key=NWORDS.get)
     76}}}
     77
     785. '''Result of the spellchecker''' is, that it takes a word as input and returns a likely correction of that word.
     79{{{
     80>>> correct('speling')
     81'spelling'
     82>>> correct('korrecter')
     83'corrector'
     84}}}
     85
     86
    3487=== Task ===
    3588 1. Create `<YOUR_FILE>`, a text file named ia161-UCO-14.txt where UCO is your university ID.