Version 12 (modified by Vít Baisa, 5 years ago)

Machine translation

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Vít Baisa

State of the Art

Statistical machine translation (SMT) consists of two main parts: a language model for the target language, which is responsible for fluent, natural-sounding output sentences, and a translation model, which maps source words and phrases into the target language. Both models are probability distributions: the language model can be built from a monolingual corpus and the translation model from a parallel corpus.
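
To make the interplay of the two models concrete, the following minimal sketch scores candidate translations by the product P(e) · P(f | e), a common noisy-channel formulation that this page does not spell out. All probabilities, sentences, and function names here are invented for illustration; a real system would estimate the models from corpora.

```python
import math

# Toy probabilities, invented purely for illustration.
# Language model P(e): how fluent is the English candidate?
language_model = {
    "the house is small": 0.02,
    "small the house is": 0.0001,
}

# Translation model P(f | e): how well does the candidate explain
# the Czech source sentence?
translation_model = {
    ("dům je malý", "the house is small"): 0.3,
    ("dům je malý", "small the house is"): 0.3,
}


def score(source, candidate):
    """Return log(P(e) * P(f | e)) for one candidate translation."""
    return (math.log(language_model[candidate])
            + math.log(translation_model[(source, candidate)]))


def best_translation(source, candidates):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda e: score(source, e))
```

Both candidates are equally good word-for-word translations in this toy setup, so the language model alone decides in favour of the fluent word order.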

References

Approximately three current papers (preferably from the best NLP conferences and journals, e.g. the ACL Anthology) used as sources for the one-hour lecture:

  1. Koehn, Philipp, et al. "Moses: Open source toolkit for statistical machine translation." Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007.
  2. Koehn, Philipp, Franz Josef Och, and Daniel Marcu. "Statistical phrase-based translation." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003.
  3. Denkowski, Michael, and Alon Lavie. "Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.

Workshop: generating translation dictionary from parallel data

Basic instructions

  • download ia161_mt.tar.gz, which contains the scripts and training data
  • extract it into your home directory with tar xzf ia161_mt.tar.gz
  • a new subdirectory will be created: ia161_mt

Files in the archive

czech.words      100,000 sentences from Czech part of DGT-TM
czech.lemmas     100,000 sentences (lemmas) from Czech part of DGT
english.words    100,000 sentences from English DGT
english.lemmas   100,000 sentences (lemmas) from EN DGT
eval.py          a script for evaluation of coverage and precision of a generated dictionary in comparison with a small English-Czech dictionary
gnudfl.txt       a small English-Czech dictionary containing only one-word items and words from the train data
make_dict.py     a script for generating dictionary based on co-occurrences and frequency lists
Makefile         a file with rules for building the dictionary based on the train data
par2items.py     a file for generating pairs of words (lemmas) from the parallel data

Description of make

make dict

  • the command uses 1,000 lines from the training data and generates a dictionary based on word forms (files czech.words and english.words)
  • alternative files with lemmas can be used via the parameters L1DATA and L2DATA
  • the number of lines used for the computation can also be changed (parameter LIMIT)
  • in general: make dict [L1DATA=<file>] [L2DATA=<file>] [LIMIT=<number of lines>]
  • e.g.: make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000
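
As a rough illustration of the kind of counts a dictionary built from co-occurrences and frequency lists relies on, here is a minimal sketch. It is not the code from par2items.py or make_dict.py; the function name `count_parallel` and the choice to count each word once per sentence are assumptions made for this example.

```python
from collections import Counter
from itertools import islice, product


def count_parallel(l1_lines, l2_lines, limit=1000):
    """Count word frequencies and sentence-level co-occurrences.

    l1_lines and l2_lines are iterables of aligned sentences, one
    sentence per item. Only the first `limit` sentence pairs are used,
    which mirrors the role of the LIMIT parameter.
    """
    l1_freq, l2_freq, cofreq = Counter(), Counter(), Counter()
    for l1_sent, l2_sent in islice(zip(l1_lines, l2_lines), limit):
        # Sets: a word repeated inside one sentence is counted once.
        l1_words = set(l1_sent.split())
        l2_words = set(l2_sent.split())
        l1_freq.update(l1_words)
        l2_freq.update(l2_words)
        # Every L1 word co-occurs with every L2 word of the aligned sentence.
        cofreq.update(product(l1_words, l2_words))
    return l1_freq, l2_freq, cofreq
```

With real data the two iterables could simply be open file objects for the files passed as L1DATA and L2DATA.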

make eval

  • once a dictionary has been generated, you can measure its precision and coverage with the script eval.py: make eval
  • if you used the parameters L1DATA and L2DATA for make dict, you must repeat them for make eval
  • e.g.: make eval L1DATA=english.lemmas L2DATA=czech.lemmas
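
The exact definitions used by eval.py are not given on this page. The sketch below shows one plausible way to compute precision, coverage, and their harmonic mean (f-score) against a reference dictionary; the function `evaluate`, the dictionary representation, and both definitions are assumptions and may differ from what eval.py actually does.

```python
def evaluate(generated, reference):
    """Return (precision, coverage, f_score) for a generated dictionary.

    Both arguments map a source word to a set of target words.
    Assumed definitions (eval.py may differ):
      coverage  = share of reference headwords that received a translation
      precision = share of those translated headwords for which at least
                  one generated translation is confirmed by the reference
    """
    translated = [word for word in reference if generated.get(word)]
    if not translated:
        return 0.0, 0.0, 0.0
    correct = sum(1 for word in translated if generated[word] & reference[word])
    precision = correct / len(translated)
    coverage = len(translated) / len(reference)
    # Harmonic mean; coverage > 0 here, so the denominator is non-zero.
    f_score = 2 * precision * coverage / (precision + coverage)
    return precision, coverage, f_score
```

Under these definitions a dictionary that translates few words very reliably and one that translates many words poorly can end up with a similar f-score, which is why both values are worth watching while tuning.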

make clean

  • after each change to the input files, the scripts, or the parameters, remove the temporary files with make clean

Detailed description of the scripts and generated data

  • Run make dict with the default settings and look at the results:
    • czech.words.freq
    • english.words.freq
    • english.words-czech.words.cofreq
    • english.words-czech.words.dict (the resulting dictionary)
  • Look at the sizes of the output files (how many lines they contain) and at their contents.
  • Look at the script make_dict.py, which generates the dictionary: the key places are marked with TODO.
  • At those places you can change the script, add heuristics, change conditions, etc., so that the final f-score is as high as possible.
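
One heuristic that could be tried at such a place is an association score over the frequency and co-occurrence counts. The sketch below uses the Dice coefficient with two thresholds. It is only an example of a possible heuristic, not the logic in make_dict.py; the function name, the dictionary-based inputs, and the threshold values are assumptions.

```python
def build_dictionary(l1_freq, l2_freq, cofreq, min_cofreq=2, min_dice=0.5):
    """Pick, for each source word, its best-scoring target word.

    l1_freq and l2_freq map words to frequencies; cofreq maps
    (source_word, target_word) pairs to co-occurrence counts.
    Dice coefficient: 2 * cofreq(a, b) / (freq(a) + freq(b)).
    """
    best = {}  # source word -> (target word, score)
    for (src, tgt), together in cofreq.items():
        if together < min_cofreq:
            continue  # too rare to trust
        dice = 2 * together / (l1_freq[src] + l2_freq[tgt])
        if dice >= min_dice and dice > best.get(src, (None, 0.0))[1]:
            best[src] = (tgt, dice)
    return {src: tgt for src, (tgt, _) in best.items()}
```

Raising the thresholds tends to trade coverage for precision, so the values are worth tuning against make eval rather than fixing in advance.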

Assignment

  1. Modify the key places in the scripts par2items.py and make_dict.py to achieve the highest possible f-score (see make eval).
  2. Upload all the scripts to the vault as a single archive file.
  3. You can create the archive like this: tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py
