Approx. 3 current papers (preferably from top NLP conferences/journals, e.g. [[https://www.aclweb.org/anthology/|ACL Anthology]]) that will be used as a source for the one-hour lecture:
 1. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
 1. Popel, Martin, et al. "Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals." Nature Communications 11.1 (2020): 1-15.
 1. Thompson, Brian, and Philipp Koehn. "Vecalign: Improved Sentence Alignment in Linear Time and Space." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
||'''File'''||'''Description'''||
||czech.words||100,000 sentences from the Czech part of DGT-TM||
||czech.lemmas||100,000 sentences (lemmas) from the Czech part of DGT-TM||
||english.words||100,000 sentences from the English part of DGT-TM||
||english.lemmas||100,000 sentences (lemmas) from the English part of DGT-TM||
||eval.py||a script that evaluates the coverage and precision of a generated dictionary against a small English-Czech dictionary||
||gnudfl.txt||a small English-Czech dictionary containing only one-word entries and only words present in the training data||
||make_dict.py||a script for generating a translation dictionary based on co-occurrences and frequency lists||
||Makefile||a file with rules for building the dictionary from the training data||
||par2items.py||a script for generating pairs of words (lemmas) from the parallel data||
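The following is a minimal sketch of the idea behind {{{par2items.py}}} (an illustration only, not the distributed script; the file handling and output format are assumptions): every pair of words from two aligned sentences is emitted as a candidate translation pair, and counting how often each pair co-occurs later drives the dictionary extraction.

{{{#!python
# pairs_sketch.py -- hypothetical illustration of the idea behind
# par2items.py, not the distributed script
import itertools
import sys

def candidate_pairs(src_path, trg_path):
    """Yield (source word, target word) pairs co-occurring in aligned sentences."""
    with open(src_path, encoding="utf-8") as src, \
         open(trg_path, encoding="utf-8") as trg:
        for src_line, trg_line in zip(src, trg):
            # any word of a sentence may translate to any word of the
            # aligned sentence; counting the pairs is left to later steps
            yield from itertools.product(src_line.split(), trg_line.split())

if __name__ == "__main__":
    for src_word, trg_word in candidate_pairs(sys.argv[1], sys.argv[2]):
        print(src_word, trg_word)
}}}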
Access the [[https://colab.research.google.com/drive/1t9y01lL6gPw8f9GU1phC5qlS8AqqdvS0?usp=sharing|Python notebook in the Google Colab environment]].
 * After each change to the input files, the scripts, or the parameters, clean the temporary files: {{{make clean}}}
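For orientation, here is a hypothetical sketch of how the {{{Makefile}}} rules may be organized; the dependencies and the scripts' command-line arguments are assumptions and can differ in the distributed file:

{{{
# hypothetical sketch of the Makefile; argument order and intermediate
# file names are assumptions (recipe lines must start with a tab)
dict:
	python3 par2items.py english.words czech.words > english.words-czech.words.cofreq
	python3 make_dict.py english.words-czech.words.cofreq > english.words-czech.words.dict

eval: dict
	python3 eval.py gnudfl.txt english.words-czech.words.dict

clean:
	rm -f *.freq *.cofreq *.dict
}}}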

== Detailed description of the scripts and generated data ==

 * Try to run the default {{{make dict}}} and look at the results:
  * czech.words.freq
  * english.words.freq
  * english.words-czech.words.cofreq
  * english.words-czech.words.dict (the resulting dictionary)
 * Look at the sizes of the output files (how many lines they contain) and at their contents.
 * Look at the script {{{make_dict.py}}}, which generates the dictionary: at key places it contains {{{TODO}}} marks
  * there you can modify the script, add heuristics, change conditions, etc., so that the final F-score is as high as possible (one possible heuristic is sketched below)
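One classic heuristic of this kind scores a candidate pair by the Dice coefficient of its co-occurrence and individual frequencies. A minimal sketch follows; the variable names and the threshold value are assumptions, not the distributed code:

{{{#!python
# dice_sketch.py -- hypothetical heuristic for the TODO spots in
# make_dict.py; names and the threshold are assumptions

def dice(cofreq, src_freq, trg_freq):
    """Dice coefficient: 2*cooc / (freq(src) + freq(trg)), in [0, 1]."""
    return 2.0 * cofreq / (src_freq + trg_freq)

def select_pairs(cofreqs, src_freqs, trg_freqs, threshold=0.3):
    """Keep candidate pairs whose Dice score reaches the threshold."""
    for (src, trg), cf in cofreqs.items():
        if dice(cf, src_freqs[src], trg_freqs[trg]) >= threshold:
            yield src, trg

# usage sketch: cofreqs maps (src_word, trg_word) -> co-occurrence count,
# src_freqs/trg_freqs map a word -> its corpus frequency
}}}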

== Assignment ==

 1. Modify the key places of the scripts {{{par2items.py}}} and {{{make_dict.py}}} to achieve the highest possible F-score (see {{{make eval}}}; a note on the F-score follows below).
 1. Upload all the scripts to the vault in a single archive file.
 1. You can create the archive like this: {{{tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py}}}
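For reference, the F-score combines the two numbers that {{{eval.py}}} reports. Assuming its coverage plays the role of recall (an assumption, check the script), the F-score is the harmonic mean of precision and coverage:

{{{#!python
# assuming eval.py's coverage is used as recall (an assumption),
# the F-score is the harmonic mean of precision and coverage
def f_score(precision, coverage):
    return 2 * precision * coverage / (precision + coverage)

print(f_score(0.5, 0.4))  # -> 0.4444...
}}}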