Automatic language correction
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Aleš Horák, Ján Švec
State of the Art
Language correction nowadays has many potential applications for the large amounts of informal and unedited text generated online: web forums, tweets, blogs, and e-mail, among others. Automatic language correction comprises several areas, including spell checking, grammar checking and word completion.
In the theoretical lesson we will introduce and compare various methods to automatically propose and choose a correction for an incorrectly written word. Spell checking is the process of detecting, and sometimes providing suggestions for, incorrectly spelled words in a text. The lesson will also focus on grammar checking, which deals with the most difficult and complex type of language errors, because grammar is made up of a very extensive number of rules and exceptions. We will also say a few words about word completion.
The lesson will also answer the question "How difficult is it to develop a spell-checker?" and describe a system that performs spell-checking and autocorrection.
References
- CHOUDHURY, Monojit, et al. "How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach." Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, 2007. Source
- WHITELAW, Casey, et al. "Using the Web for Language Independent Spellchecking and Autocorrection." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899, Singapore, 2009. Source
- GUPTA, Neha, MATHUR, Pratistha. "Spell Checking Techniques in NLP: A Survey." International Journal of Advanced Research in Computer Science and Software Engineering, volume 2, issue 12, pages 217–221, 2012. Source
- HLADEK, Daniel, STAS, Jan, JUHAR, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering, volume 11, issue 5, pages 392–397, 2013. Source
Practical Session
There are two tasks; you may choose one or both:
Task 1: Statistical spell checker for English
In the theoretical lesson we became acquainted with various approaches to how spelling correctors work. Now we will see how a simple spellchecker based on edit distance works.
The example is based on Peter Norvig's Spelling Corrector in Python. The spelling corrector is trained on a large text file consisting of about a million words.
We will test this tool on prepared data. Your goal will be to enhance the spellchecker's accuracy.
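Under the hood this is a simple noisy-channel model, as described in Norvig's article: for a typed word w the corrector chooses the candidate correction c maximizing P(c|w) ∝ P(w|c) · P(c), where the language model P(c) is estimated from word counts in the training file, and the error model P(w|c) is approximated very crudely by always preferring candidates with a smaller edit distance.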
- Download the prepared script spell.py and the training data collection big.txt.
- Test the script in your working directory:

python ./spell.py

- Open it in your favourite editor and we will walk through its functionality.
Spellchecker functionality with examples
- The spellchecker is trained from the file big.txt, which is a concatenation of several public domain books from Project Gutenberg and lists of the most frequent words from Wiktionary and the British National Corpus. The function train stores how many times each word occurs in the text file; NWORDS[w] holds the count of how many times the word w has been seen.

import re, collections

def words(text):
    # lowercase the text and extract consecutive runs of letters
    return re.findall('[a-z]+', text.lower())

def train(features):
    # smoothing: a word never seen in training still gets a count of 1
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
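A quick sanity check of the trained model (exact counts depend on big.txt, so only relative behaviour is shown):

>>> NWORDS['the'] > 10000   # 'the' is among the most frequent words
True
>>> NWORDS['floccinaucinihilipilification']   # unseen words default to the smoothed count 1
1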
- Edit distance 1 is represented by the function edits1: it generates every string obtainable by a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 candidates (including duplicates). Example: len(edits1('something')) = 494 distinct words.

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # all (prefix, suffix) splits of the word, including empty prefix/suffix
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
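The four edit types can be checked interactively:

>>> e = edits1('the')
>>> 'he' in e     # deletion
True
>>> 'hte' in e    # transposition
True
>>> 'thy' in e    # alteration
True
>>> 'then' in e   # insertion
True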
- Edit distance 2 (edits2) applies edits1 to all the results of edits1. Example: len(edits2('something')) = 114 324 words, which is a high number. To enhance speed we can keep only those candidates that are actually known words (known_edits2). Then known_edits2('something') is a set of just 4 words: {'smoothing', 'seething', 'something', 'soothing'}.
- The function correct takes as candidates the known words with the shortest edit distance to the original word (the word itself if it is known, otherwise known words at distance 1, otherwise at distance 2, otherwise the word itself) and returns the candidate with the highest count in NWORDS.

def known(words):
    return set(w for w in words if w in NWORDS)

def known_edits2(word):
    # generate distance-2 candidates, but keep only known words
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or \
                 known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
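Example corrections taken from Norvig's article (actual outputs depend on the word counts learned from big.txt):

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'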
- For evaluation there are two prepared test sets: a development set (tests1 in the script) and a final test set (tests2).
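Each test set maps a correct word to a space-separated string of its misspellings. A minimal sketch of what such an evaluation loop might look like (the prepared spell.py contains its own, more elaborate evaluation code; the names and data below are illustrative):

def evaluate(tests):
    # tests: dict of correct word -> string of misspellings,
    # e.g. {'access': 'acess', 'supposedly': 'supposidly'}
    n = bad = 0
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            if correct(wrong) != target:
                bad += 1
    return {'n': n, 'accuracy': 1.0 - float(bad) / n}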
Task 1
- Create <YOUR_FILE>, a text file named ia161-UCO-13.txt, where UCO is your university ID.
- Run spell.py with the development and the final test sets (tests1 and tests2 within the script) and write the results to <YOUR_FILE>.
- Explain the obtained results in a few words and write the explanation to <YOUR_FILE>.
- Modify the code of spell.py to increase the accuracy on tests2 by 10 %. You may take inspiration from the Future work section of Norvig's article (see the sketch below for one possible direction). Describe your changes and write your new accuracy results to <YOUR_FILE>.
Upload <YOUR_FILE> and the edited spell.py.
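One possible direction for the modification (a sketch only: the idea follows the article's discussion of a better error model; correct2 and the weight 100 are hypothetical and would need tuning on tests1):

def correct2(word):
    # score distance-1 and distance-2 candidates together instead of
    # stopping at the first non-empty candidate set
    if word in NWORDS:
        return word
    scores = {}
    for w in known(edits1(word)):
        scores[w] = NWORDS[w] * 100       # assumed prior: distance-1 errors are much more likely
    for w in known_edits2(word):
        scores.setdefault(w, NWORDS[w])   # distance-2 candidates keep their raw count
    if not scores:
        return word
    return max(scores, key=scores.get)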
Task 2: Rule based grammar checker (punctuation) for Czech
The second task consists in adapting a specific syntactic grammar of Czech to improve the results of punctuation detection, i.e. the placement of commas at the required positions in a sentence.
Task 2
- log in to aurora:

ssh aurora
- download:
  - the syntactic grammar for punctuation detection for the SET parser
  - a testing text with no commas
  - an evaluation text with correct punctuation
  - an evaluation script which computes recall and precision from the two texts

mkdir ia161-grammar
cd ia161-grammar
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/punct.set
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/test-nopunct.txt
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/eval-gold.txt
wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/AutomaticCorrection/evalpunct_robust.py
- run the parser to fill punctuation into the testing text (takes a long time, about 30 s):

cat test-nopunct.txt \
    | /nlp/projekty/set/unitok.py \
    | /nlp/projekty/rule_ind/stat/desamb.utf8.majka.sh -skipdis \
    | /nlp/projekty/set/set/set.py --commas --grammar=punct.set \
    > test.txt

- evaluate the result:

PYTHONIOENCODING=UTF-8 python evalpunct_robust.py eval-gold.txt test.txt > results.txt
cat results.txt
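Precision here is the fraction of commas inserted by the parser that also appear in the gold standard, and recall is the fraction of gold-standard commas that the parser found. A minimal sketch of such a comparison (evalpunct_robust.py itself is more robust to tokenization differences; all names below are illustrative):

def comma_positions(tokens):
    # indices of words that are followed by a comma
    positions, idx = set(), 0
    for tok in tokens:
        if tok == ',':
            positions.add(idx)
        else:
            idx += 1
    return positions

def precision_recall(gold_tokens, test_tokens):
    gold = comma_positions(gold_tokens)
    test = comma_positions(test_tokens)
    hits = len(gold & test)
    precision = float(hits) / len(test) if test else 0.0
    recall = float(hits) / len(gold) if gold else 0.0
    return precision, recall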
- edit the grammar punct.set and add 1-2 rules to increase the F-score (the harmonic mean of precision and recall) by 10 %.
You may need to go through general information about the SET grammar format. Information about adapting the grammar to the task of punctuation detection can be found in this published paper.
Current best results achieved with an extended grammar are 91.2 % precision and 55 % recall, i.e. an F-score of 68.6 %.
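(The F-score is computed as F = 2·P·R / (P + R); here 2 × 0.912 × 0.55 / (0.912 + 0.55) ≈ 0.686.)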
- upload the modified punct.set and the respective results.txt.
Do not forget to upload your resulting files to the homework vault (odevzdávárna).