Prepared by: Vít Baisa
State of the Art
Statistical machine translation consists of two main components: a language model for the target language, which is responsible for fluent, natural-looking output sentences, and a translation model, which translates source words and phrases into the target language. Both models are probability distributions: the language model is built from a monolingual corpus, the translation model from a parallel corpus.
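Both Koehn papers below build on the standard noisy-channel formulation: for a source sentence f, the decoder searches for the target sentence e that maximizes the product of the two models,
\hat{e} = \arg\max_{e} P(e) \cdot P(f \mid e)
where P(e) is the language model and P(f | e) is the translation model.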
Approximately three current papers (preferably from leading NLP conferences and journals, e.g. the ACL Anthology) that will be used as sources for the one-hour lecture:
- Koehn, Philipp, et al. "Moses: Open source toolkit for statistical machine translation." Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007.
- Koehn, Philipp, Franz Josef Och, and Daniel Marcu. "Statistical phrase-based translation." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003.
- Denkowski, Michael, and Alon Lavie. "Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
Workshop: generating a translation dictionary from parallel data
- download ia161_mt.tar.gz with the scripts and training data
- unpack it with
tar xzf ia161_mt.tar.gz
- a directory ia161_mt will be created, containing these files:
| File | Description |
|---|---|
| czech.words | 100,000 sentences from the Czech part of DGT-TM |
| czech.lemmas | 100,000 sentences (lemmatized) from the Czech part of DGT-TM |
| english.words | 100,000 sentences from the English part of DGT-TM |
| english.lemmas | 100,000 sentences (lemmatized) from the English part of DGT-TM |
| eval.py | a script for evaluating the coverage and precision of a generated dictionary against a small English-Czech dictionary |
| gnudfl.txt | a small English-Czech dictionary containing only single-word entries, restricted to words that occur in the training data |
| make_dict.py | a script for generating a translation dictionary based on co-occurrences and frequency lists |
| Makefile | rules for building the dictionary from the training data |
| par2items.py | a script for generating pairs of words (lemmas) from the parallel data |
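The .words and .lemmas files are sentence-aligned: line i of czech.words is the translation of line i of english.words. A minimal sketch of the pair extraction that par2items.py is described as doing (the actual script may differ; the function name here is illustrative):

import itertools

def word_pairs(l1_path, l2_path, limit=1000):
    """Yield (l1_word, l2_word) pairs from two sentence-aligned files."""
    with open(l1_path, encoding="utf-8") as f1, open(l2_path, encoding="utf-8") as f2:
        for i, (l1_line, l2_line) in enumerate(zip(f1, f2)):
            if i >= limit:
                break
            # pair every word of the source sentence with every word
            # of the aligned target sentence
            for w1, w2 in itertools.product(l1_line.split(), l2_line.split()):
                yield (w1, w2)

Counting how often each pair occurs, relative to the frequencies of the individual words, is the raw material a co-occurrence-based dictionary builder works with.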
Description of make
- by default, the command uses 1,000 lines of the training data and generates a dictionary based on word forms (files czech.words and english.words)
- it is possible to use the alternative files with lemmas via the parameters L1DATA and L2DATA
- it is also possible to change the number of lines used for the computation (parameter LIMIT)
- in general:
make dict [L1DATA=<file>] [L2DATA=<file>] [LIMIT=<number of lines>]
- for example:
make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000
- the default of 1,000 lines is there for the sake of speed
- once the dictionary is generated, you can measure its precision and coverage using the script eval.py
- if you used the parameters L1DATA and L2DATA, you must repeat them:
make dict L1DATA=english.lemmas L2DATA=czech.lemmas
- after each change to the input files, the scripts, or the parameters, clean up the temporary files first
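For orientation, the metrics relate roughly as follows; a minimal sketch in Python, assuming eval.py treats the generated and reference dictionaries as sets of word pairs (its exact definitions may differ):

def evaluate(generated, reference):
    """Precision, coverage and F-score of a generated dictionary.

    Both arguments are sets of (english_word, czech_word) pairs.
    """
    correct = generated & reference
    precision = len(correct) / len(generated) if generated else 0.0
    coverage = len(correct) / len(reference) if reference else 0.0
    if precision + coverage == 0:
        return precision, coverage, 0.0
    f_score = 2 * precision * coverage / (precision + coverage)
    return precision, coverage, f_score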
Detailed description of the scripts and generated data
- Run the default make dict and look at the results:
- english.words-czech.words.dict (the resulting dictionary)
- Look at the sizes of the output files (how many lines they contain) and at their contents.
- Look at the script make_dict.py, which generates the dictionary: at key places it contains comments
- there you can change the script, add heuristics, change conditions, etc., so that the final F-score is as high as possible (one such heuristic is sketched at the end of this section)
- Change the key places of the script make_dict.py to achieve the highest possible F-score (see the evaluation with eval.py above)
- Upload all the scripts into the vault in one archive file.
- You can create it like this:
tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py
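As an illustration of the kind of heuristic mentioned in the make_dict.py bullet above: one common way to turn co-occurrence counts and frequency lists into a dictionary is the Dice coefficient. This is a sketch under that assumption, not the actual content of make_dict.py:

from collections import Counter

def dice_dictionary(sentence_pairs, threshold=0.3):
    """Score candidate translation pairs with the Dice coefficient
    2 * c(s, t) / (c(s) + c(t)) and keep those above a threshold."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in sentence_pairs:
        src_words = set(src_sent.split())
        tgt_words = set(tgt_sent.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        # count each (source word, target word) co-occurrence once per sentence pair
        pair_freq.update((s, t) for s in src_words for t in tgt_words)
    dictionary = {}
    for (s, t), c in pair_freq.items():
        dice = 2 * c / (src_freq[s] + tgt_freq[t])
        if dice >= threshold:
            dictionary[(s, t)] = dice
    return dictionary

Raising the threshold typically increases precision at the cost of coverage; the F-score balances the two.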