Machine translation
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Vít Baisa
State of the Art
Statistical Machine Translation consists of two main parts: a language model for the target language, which is responsible for fluent, natural-looking output sentences, and a translation model, which translates source words and phrases into the target language. Both models are probability distributions: the language model can be built from a monolingual corpus and the translation model from a parallel corpus.
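For intuition, in the classic noisy-channel formulation the decoder searches for the target sentence e maximizing P(e|f) ∝ P(f|e) · P(e), i.e. the translation model score times the language model score. Below is a minimal illustrative sketch in Python; the probability tables are toy, hand-made numbers invented for illustration, not taken from any real corpus or from the workshop data:
import math

# Toy translation model P(f|e): probability of a Czech word given an English one.
translation_model = {("dům", "house"): 0.8, ("dům", "home"): 0.2}
# Toy language model P(e): probability of the English word.
language_model = {"house": 0.010, "home": 0.005}

def noisy_channel_score(source_word, target_word):
    # log P(f|e) + log P(e); unseen events get a small floor probability.
    tm = translation_model.get((source_word, target_word), 1e-9)
    lm = language_model.get(target_word, 1e-9)
    return math.log(tm) + math.log(lm)

# Choose the candidate translation maximizing the combined score.
print(max(["house", "home"], key=lambda t: noisy_channel_score("dům", t)))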
References
Approximately 3 current papers (preferably from the best NLP conferences/journals, e.g. the ACL Anthology) that will be used as sources for the one-hour lecture:
- Koehn, Philipp, et al. "Moses: Open source toolkit for statistical machine translation." Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007.
- Koehn, Philipp, Franz Josef Och, and Daniel Marcu. "Statistical phrase-based translation." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003.
- Denkowski, Michael, and Alon Lavie. "Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
Workshop: generating a translation dictionary from parallel data
Basic instructions
- download ia161_mt.tar.gz with the scripts and training data
- unpack it with
tar xzf ia161_mt.tar.gz
- a subdirectory ia161_mt will be created
Files
czech.words | 100,000 sentences from the Czech part of DGT-TM
czech.lemmas | 100,000 sentences (lemmas) from the Czech part of DGT-TM
english.words | 100,000 sentences from the English part of DGT-TM
english.lemmas | 100,000 sentences (lemmas) from the English part of DGT-TM
eval.py | a script for evaluating the coverage and precision of a generated dictionary against a small English-Czech dictionary
gnudfl.txt | a small English-Czech dictionary containing only single-word entries and only words present in the training data
make_dict.py | a script for generating a translation dictionary based on co-occurrences and frequency lists
Makefile | rules for building the dictionary from the training data
par2items.py | a script for generating pairs of words (lemmas) from the parallel data (a sketch of such pairing follows the table)
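The actual pairing logic lives in par2items.py; the following is only a rough sketch of the idea, under the assumption that the *.words and *.lemmas files are sentence-aligned line by line (the function name word_pairs is illustrative, not taken from the script):
from itertools import islice

def word_pairs(l1_path, l2_path, limit=1000):
    # The corpora are aligned line by line, so zip() pairs the sentences.
    # Every word of a source sentence is paired with every word of the
    # aligned target sentence; later steps filter the noise by counts.
    with open(l1_path, encoding="utf-8") as f1, open(l2_path, encoding="utf-8") as f2:
        for s1, s2 in islice(zip(f1, f2), limit):
            for w1 in s1.split():
                for w2 in s2.split():
                    yield (w1, w2)

# Usage on the workshop data:
# for w1, w2 in word_pairs("english.words", "czech.words", limit=10):
#     print(w1, w2)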
Description of the make targets
make dict
- the command uses 1,000 lines of the training data and generates a dictionary based on wordforms (files czech.words and english.words); a sketch of the kind of scoring involved follows below
- it is possible to use the alternative files with lemmas via the parameters L1DATA and L2DATA
- it is also possible to change the number of lines used for the computation (parameter LIMIT)
- in general:
make dict [L1DATA=<file>] [L2DATA=<file>] [LIMIT=<number of lines>]
- e.g.:
make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000
The default of 1,000 lines is used for the sake of speed.
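A common association measure for turning co-occurrence and frequency counts into dictionary entries is the Dice coefficient. Whether make_dict.py uses Dice or a different measure, the following hedged sketch shows the general shape of such a computation (all names and the threshold value are illustrative, not taken from the script):
def dice(cofreq, freq1, freq2):
    # 2*c(w1,w2) / (c(w1) + c(w2)); close to 1 for word pairs that
    # almost always occur in aligned sentences together.
    return 2.0 * cofreq / (freq1 + freq2)

def build_dict(cofreqs, freqs1, freqs2, threshold=0.3):
    # Keep, for every source word, the best-scoring translation
    # whose association exceeds the threshold.
    best = {}
    for (w1, w2), c in cofreqs.items():
        score = dice(c, freqs1[w1], freqs2[w2])
        if score >= threshold and score > best.get(w1, ("", 0.0))[1]:
            best[w1] = (w2, score)
    return {w1: w2 for w1, (w2, _) in best.items()}

# Toy example: "house" co-occurs 8x with "dům" but only 1x with "pes".
print(build_dict({("house", "dům"): 8, ("house", "pes"): 1},
                 {"house": 10}, {"dům": 9, "pes": 50}))  # {'house': 'dům'}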
make eval
- once the dictionary is generated, you can measure its precision and coverage using the script eval.py:
make eval
- if you used the parameters L1DATA and L2DATA, you must repeat them for make eval
- e.g.:
make eval L1DATA=english.lemmas L2DATA=czech.lemmas
make clean
- after each change to the input files, the scripts, or the parameters, clean the temporary files:
make clean
Detailed description of the scripts and generated data
- Try to run the default
make dict
and look at the results:
  - czech.words.freq
  - english.words.freq
  - english.words-czech.words.cofreq
  - english.words-czech.words.dict (the resulting dictionary)
- Look at the sizes of the output files (how many lines they contain) and at their contents.
- Look at the script make_dict.py, which generates the dictionary: at key places it contains TODO marks
- there you can change the script, add heuristics, change conditions, etc., so that the final F-score is as high as possible (a sketch of the evaluation follows below)
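eval.py compares the generated dictionary with gnudfl.txt; the exact definitions it uses may differ, but a minimal sketch of precision, coverage, and the resulting F-score could look like this (the dictionaries below are toy inputs made up for illustration):
def evaluate(generated, reference):
    # generated: {source: target}; reference: {source: set of valid targets}.
    correct = sum(1 for w, t in generated.items() if t in reference.get(w, set()))
    precision = correct / len(generated) if generated else 0.0
    coverage = correct / len(reference) if reference else 0.0  # recall
    f_score = (2 * precision * coverage / (precision + coverage)
               if precision + coverage else 0.0)
    return precision, coverage, f_score

# Toy example: one of two generated entries is correct,
# and one of three reference entries is covered.
print(evaluate({"house": "dům", "dog": "kočka"},
               {"house": {"dům"}, "dog": {"pes"}, "cat": {"kočka"}}))
# -> (0.5, 0.3333333333333333, 0.4)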
Assignment
- Change the key places of the scripts par2items.py and make_dict.py to achieve the highest possible F-score (see make eval).
- Upload all the scripts to the vault in one archive file.
- You can create it like this:
tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py
Attachments (2)
- ia161_mt.tar.gz (7.6 MB)
- ia161.pdf (240.4 KB)