
Version 16 (modified by xsvec3, 8 years ago)


Automatic language correction

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Ján Švec

State of the Art

Language correction now has many potential applications to the large amounts of informal and unedited text generated online, including web forums, tweets, blogs, and email. Automatic language correction covers several areas, including spell checking, grammar checking, and word completion.

In the theoretical lesson we will introduce and compare various methods for automatically proposing and choosing a correction for an incorrectly written word. Spell checking is the process of detecting incorrectly spelled words in a text and, in some cases, suggesting corrections. The lesson will also cover grammar checking, which deals with the most difficult and complex type of language errors, because grammar consists of a very large number of rules and exceptions. We will also say a few words about word completion.

The lesson will also answer the question "How difficult is it to develop a spell-checker?" and describe a system that performs spell-checking and autocorrection.

References

  1. CHOUDHURY, Monojit, et al. "How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach." Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, 2007.
  2. WHITELAW, Casey, et al. "Using the Web for Language Independent Spellchecking and Autocorrection." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 890–899, Singapore, 2009.
  3. GUPTA, Neha, MATHUR, Pratistha. "Spell Checking Techniques in NLP: A Survey." International Journal of Advanced Research in Computer Science and Software Engineering, volume 2, issue 12, pages 217–221, 2012.
  4. HLADEK, Daniel, STAS, Jan, JUHAR, Jozef. "Unsupervised Spelling Correction for the Slovak Text." Advances in Electrical and Electronic Engineering, volume 11, issue 5, pages 392–397, 2013.

Slides

http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-14-AutomaticCorrection.pdf

Practical Session

In the theoretical lesson we became acquainted with various approaches to spelling correction. Now we will look at how a simple spellchecker based on edit distance works.

The example is based on Peter Norvig's Spelling Corrector in Python. The spelling corrector will be trained on a large text file of about a million words.
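As a rough orientation before opening spell.py, the sketch below follows the approach of Norvig's published corrector: count word frequencies in a training text, generate every string within one or two edits of the input, and return the known candidate with the highest count. The short inline TRAINING_TEXT is only a stand-in for big.txt, and the function names are assumptions that may differ from those in the spell.py provided for this session.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in corpus; the course script trains on big.txt.
TRAINING_TEXT = "the spelling corrector corrects the spelling of a word in the text"
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Word frequencies act as a simple language model: frequent words win.
WORD_COUNTS = Counter(words(TRAINING_TEXT))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(candidates):
    """Keep only the candidates seen in the training text."""
    return {w for w in candidates if w in WORD_COUNTS}


def known_edits2(word):
    """Known words exactly two single edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in WORD_COUNTS}


def correct(word):
    """Prefer the word itself, then distance 1, then distance 2, else give up."""
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or {word})
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

With this toy corpus, correct("speling") returns "spelling" and correct("teh") returns "the". Note that candidates with equal counts are not ranked against each other here, which is one of the places where a real corrector can be improved.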

We will test this tool on prepared data. Your goal will be to improve the spellchecker's accuracy. If you finish early, there is a bonus question in the Task section.
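Accuracy here can be read as the share of misspellings that the corrector maps back to the intended word. The following is a minimal sketch of such a measurement, assuming the test sets are dictionaries from a correct word to a space-separated string of its misspellings, as in Norvig's original essay; the provided test1 and test2 may be organised differently. The function name evaluate, the toy data, and the toy corrector are all hypothetical.

```python
def evaluate(tests, correct, verbose=False):
    """Return (accuracy, number of cases) for a corrector function.

    tests maps each target word to a space-separated string of misspellings.
    """
    total = 0
    hits = 0
    for target, misspellings in tests.items():
        for wrong in misspellings.split():
            total += 1
            guess = correct(wrong)
            if guess == target:
                hits += 1
            elif verbose:
                # Listing the failures helps to see which error types dominate.
                print(f"correct({wrong!r}) => {guess!r}; expected {target!r}")
    accuracy = hits / total if total else 0.0
    return accuracy, total


# Hypothetical toy data and corrector, only to show the call.
TOY_TESTS = {"access": "acess", "address": "adress adres"}
TOY_FIXES = {"acess": "access", "adress": "address"}


def toy_correct(word):
    """Look the word up in a fixed table; unknown words are returned unchanged."""
    return TOY_FIXES.get(word, word)


accuracy, n_cases = evaluate(TOY_TESTS, toy_correct)
```

Here two of the three toy misspellings are corrected, so accuracy is 2/3. Reporting the number of cases alongside the accuracy makes it easier to compare results between the development and final test sets.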

  1. Download the prepared script spell.py and the training data collection big.txt.
  2. Test the script by running python ./spell.py in your working directory.
  3. Open it in your favourite editor and we will walk through its functionality.

Task

  1. Create <YOUR_FILE>, a text file named ia161-UCO-14.txt where UCO is your university ID.
  2. Run spell.py with the development and final test sets (test1 and test2) and write the results to <YOUR_FILE>.
  3. Explain the results in a few words and write the explanation to <YOUR_FILE>.
  4. Modify the code of spell.py to increase accuracy by 10 %. Write your new accuracy results to <YOUR_FILE>.
  5. Run the script with verbose=True and examine the results. Suggest at least one adjustment that could improve the spellchecker's accuracy. Write your suggestions to <YOUR_FILE>.

-Bonus question- How could you make the implementation faster without changing the results? Write your suggestions to <YOUR_FILE>.

Upload <YOUR_FILE> and the edited spell.py.

Do not forget to upload your resulting files to the homework vault (odevzdávárna).
