Automatic language correction
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Ján Švec
State of the Art
Language correction now has many potential applications to the large amounts of informal, unedited text generated online, including web forums, tweets, blogs, and email. Automatic language correction covers several areas, including spell checking, grammar checking, and word completion.
In the theoretical lesson we will introduce and compare various methods for automatically proposing and choosing a correction for an incorrectly written word. Spell checking is the process of detecting incorrectly spelled words in a text and, in some cases, suggesting corrections. The lesson will also cover grammar checking, which deals with the most difficult and complex type of language errors, because grammar consists of a very large number of rules and exceptions. We will also say a few words about word completion.
The lesson will also address the question "How difficult is it to develop a spell-checker?" and describe a system that performs spell-checking and autocorrection.
References
- CHOUDHURY, Monojit, et al. "How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach." Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, 2007.
- WHITELAW, Casey, et al. "Using the Web for Language Independent Spellchecking and Autocorrection." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899, Singapore, 2009.
- GUPTA, Neha, MATHUR, Pratistha. "Spell Checking Techniques in NLP: A Survey." International Journal of Advanced Research in Computer Science and Software Engineering, volume 2, issue 12, pages 217–221, 2012.
- HLADEK, Daniel, STAS, Jan, JUHAR, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering 11 (5), pages 392–397, 2013.
Slides
Practical Session
In the theoretical lesson we became acquainted with various approaches to spelling correction. Now we will look at how a simple spellchecker based on edit distance works.
The example is based on Peter Norvig's Spelling Corrector in Python. The spelling corrector will be trained on a large text file of about a million words.
We will test this tool on prepared data. Your goal will be to improve the spellchecker's accuracy. If you finish early, there is a bonus question in the Task section.
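The edit-distance approach described above can be sketched roughly as follows. This is a minimal illustration in the spirit of Norvig's corrector, not the contents of the course's spell.py: the function names (train, edits1, known, correct) and the tiny inline corpus standing in for big.txt are assumptions made for the example.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def train(text):
    """Count how often each lowercase word occurs in the training text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def edits1(word):
    """Return all strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(candidates, counts):
    """Keep only the candidates that were seen in the training text."""
    return {w for w in candidates if w in counts}


def correct(word, counts):
    """Pick the most frequent known word at the smallest edit distance (0, 1, 2)."""
    candidates = (
        known([word], counts)
        or known(edits1(word), counts)
        or known((e2 for e1 in edits1(word) for e2 in edits1(e1)), counts)
        or [word]  # nothing known within two edits: leave the word unchanged
    )
    return max(candidates, key=lambda w: counts[w])


# Toy corpus used only for illustration; the session trains on big.txt instead.
corpus = "the spelling corrector corrects spelling errors in the text"
counts = train(corpus)
print(correct("speling", counts))  # spelling
print(correct("teh", counts))      # the
```

The design is deliberately simple: candidates are ranked only by their frequency in the training text, and any known word at a smaller edit distance always wins over words further away.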
- Download the prepared script spell.py and the training data collection big.txt.
- Test the script by running python ./spell.py in your working directory.
- Open it in your favourite editor and we will walk through its functionality.
Task
- Create <YOUR_FILE>, a text file named ia161-UCO-14.txt where UCO is your university ID.
- Run spell.py with the development and final test sets (test1 and test2) and write the results to <YOUR_FILE>.
- Explain the results in a few words and write the explanation to <YOUR_FILE>.
- Modify the code of spell.py to increase accuracy by 10 %. Write your new accuracy results to <YOUR_FILE>.
- Run the script with verbose=True and examine the results. Suggest at least one adjustment that would improve the spellchecker's accuracy. Write your suggestions to <YOUR_FILE>.
- Bonus question: How could you make the implementation faster without changing the results? Write your suggestions to <YOUR_FILE>.
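To make the accuracy figures in the tasks above concrete, here is a small sketch of how accuracy on a test set could be computed. It assumes each line of a test set has the layout "right: wrong1 wrong2 ...", as in Norvig's published test sets; the actual layout of the course files and the evaluation code inside spell.py may differ. The helper names (parse_testset, accuracy) and the lookup-table corrector are hypothetical and used only for illustration.

```python
def parse_testset(lines):
    """Turn 'right: wrong1 wrong2' lines into (right, wrong) pairs."""
    pairs = []
    for line in lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        right, wrongs = line.split(":", 1)
        for wrong in wrongs.split():
            pairs.append((right.strip(), wrong))
    return pairs


def accuracy(pairs, correct):
    """Fraction of misspellings that `correct` maps back to the right word."""
    if not pairs:
        return 0.0
    good = sum(1 for right, wrong in pairs if correct(wrong) == right)
    return good / len(pairs)


# Illustrative data and a stand-in corrector; replace both with the real ones.
sample_lines = ["access: acess acces", "believe: beleive"]
lookup = {"acess": "access", "beleive": "believe"}


def toy_correct(word):
    """Hypothetical corrector: fixes only the words listed in `lookup`."""
    return lookup.get(word, word)


pairs = parse_testset(sample_lines)
print(f"{accuracy(pairs, toy_correct):.1%} of {len(pairs)} words corrected")
```

Measuring accuracy this way before and after each change to the corrector makes it easy to see whether a modification actually helps on both test sets.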
Upload <YOUR_FILE> and the edited spell.py to the homework vault (odevzdávárna).
Attachments (10)
- IMG_20151126_112640.jpg (1.6 MB) - added by 9 years ago.
- test-nopunct.txt (55.1 KB) - added by 7 years ago.
- tsd2014.pdf (131.3 KB) - added by 7 years ago.
- spell-testset1.txt (3.7 KB) - added by 7 years ago.
- spell-testset2.txt (7.3 KB) - added by 7 years ago.
- evalpunct_robust.py (1.8 KB) - added by 7 years ago.
- big.txt (6.2 MB) - added by 7 years ago.
- spell.py (15.8 KB) - added by 3 years ago. Spelling corrector
- eval-gold.txt (56.5 KB) - added by 3 years ago.
- punct.set (3.4 KB) - added by 3 years ago.