Machine Translation
Topics
- Statistical machine translation
- Extension of translation memories
- Domain-specific machine translation
- Machine translation between close languages
- Sub-word level machine translation
Statistical machine translation
Improving statistical machine translation
- free state-of-the-art tools available (SRILM, Moses)
- baseline SMT available for everyone
- languages with a high number of word forms need special treatment
- dělám, děláš, dělal, dělajícímu, dělaje, děláním, ...
- magas, magasabb, legmagasabb, legeslegmagasabb, ...
- language models can be enriched with linguistic knowledge
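One way linguistic knowledge can help with many word forms is to count lemmas instead of surface forms, so that related forms share statistics. The sketch below is only an illustration: the `LEMMAS` table and the `unigram_counts` helper are hypothetical, and a real system would obtain lemmas from a morphological analyser rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical hand-written lemma table built from the word forms listed above.
LEMMAS = {
    "dělám": "dělat",
    "děláš": "dělat",
    "dělal": "dělat",
    "magasabb": "magas",
    "legmagasabb": "magas",
}


def unigram_counts(tokens, lemmas=None):
    """Count tokens, optionally mapping each surface form to its lemma first."""
    if lemmas is not None:
        tokens = [lemmas.get(token, token) for token in tokens]
    return Counter(tokens)


tokens = ["dělám", "děláš", "dělal"]
surface = unigram_counts(tokens)          # three distinct entries, one count each
lemmatised = unigram_counts(tokens, LEMMAS)  # one entry with all three counts
print(len(surface), len(lemmatised))
```

With surface forms every word is seen once; after lemmatisation the three observations are pooled under a single entry, which is the effect that reduces data sparsity.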
Word alignment matrix - from words to phrases
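The step from words to phrases can be sketched as extracting every phrase pair that is consistent with the word alignment: no word inside the pair may be linked to a word outside it. This is a simplified version of the usual phrase-extraction heuristic (it does not extend phrases over unaligned target words), and the English sentence and the alignment links in the example are assumptions made for illustration.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    alignment is a set of (source_index, target_index) links.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start >= max_len:
                continue
            # Reject the pair if a target word in the span links outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


src = ["he", "plays", "the", "piano"]
tgt = ["hraje", "na", "klavír"]
# Assumed links: he-hraje, plays-hraje, piano-klavír.
alignment = {(0, 0), (1, 0), (3, 2)}
for pair in sorted(extract_phrases(src, tgt, alignment)):
    print(pair)
```

Because both "he" and "plays" link to "hraje", neither word can form a pair with it alone; only the two-word phrase "he plays" is extracted.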
Domain-specific machine translation
- straightforward way of increasing quality of MT
- domain-specific corpora can be downloaded on demand
- separate models for each domain: sports, cooking, gardening
- one sense per domain: bat
- translations of
- product details, product descriptions in e-shops,
- manuals, warranty certificates,
- user interface localizations, ...
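The "one sense per domain" idea can be sketched as a separate lexicon per domain plus a step that chooses the domain. Everything named here is hypothetical: the keyword lists, the "nature" domain, the Czech target words and the helpers `pick_domain` and `translate_word` stand in for full per-domain translation models.

```python
# Hypothetical keyword lists used to guess the domain of a text.
DOMAIN_KEYWORDS = {
    "sports": {"ball", "pitch", "player", "swing"},
    "nature": {"cave", "wings", "night", "insects"},
}

# Hypothetical per-domain lexicons: each domain keeps one sense of "bat".
DOMAIN_LEXICONS = {
    "sports": {"bat": "pálka"},
    "nature": {"bat": "netopýr"},
}


def pick_domain(text, keywords=DOMAIN_KEYWORDS):
    """Choose the domain whose keywords overlap most with the text."""
    tokens = set(text.lower().split())
    return max(keywords, key=lambda domain: len(tokens & keywords[domain]))


def translate_word(word, domain, lexicons=DOMAIN_LEXICONS):
    """Look the word up in the domain lexicon; return it unchanged if unknown."""
    return lexicons.get(domain, {}).get(word.lower(), word)


for text in ["the player swung the bat", "the bat left the cave at night"]:
    domain = pick_domain(text)
    print(domain, translate_word("bat", domain))
```

Keeping the lexicons separate means the ambiguity never has to be resolved inside a single model; it is resolved once, when the domain is chosen.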
Machine translation between close languages
- West and South Slavic languages: Czech, Slovak, Polish, Serbian, Croatian, Slovene
- MT mainly on word level, structure is very similar
- differences can be described systematically by rules: hraje na klavíri <-> hraje na klavír
- billion-word corpora available for these languages
- dictionaries can be generated semi-automatically
- -> searching for duplicates in close languages (reprinted news)
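Systematic differences of this kind can be sketched as an ordered list of rewrite rules applied to the sentence. The single rule below is a toy that covers only one direction of the example pair above; it is not a linguistically validated rule, and `RULES` and `apply_rules` are hypothetical names.

```python
import re

# Toy rule: after "hraje na", drop a final "i" from the following noun.
RULES = [
    (re.compile(r"\b(hraje na \w+?)i\b"), r"\1"),
]


def apply_rules(sentence, rules=RULES):
    """Apply each (pattern, replacement) rule to the sentence in order."""
    for pattern, replacement in rules:
        sentence = pattern.sub(replacement, sentence)
    return sentence


print(apply_rules("hraje na klavíri"))  # → hraje na klavír
```

A realistic system would combine many such rules with a bilingual dictionary and would condition them on morphological tags rather than on raw strings.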
MT quality, European languages
Sub-word level machine translation
- SMT principle applied on character level
- translation on subword level (English -> Czech)
- translation across levels
- -> translation of out-of-dictionary words
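Before an out-of-dictionary word can be handled below the word level, it has to be split into units the system knows. A minimal sketch, assuming a greedy longest-match segmentation with single characters as the fallback; the `segment` helper and the tiny unit vocabulary are hypothetical, and the slides do not say which segmentation method was used.

```python
def segment(word, vocab):
    """Split a word into the longest known units, falling back to single characters."""
    units = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                units.append(word[i:j])
                i = j
                break
        else:
            # No known unit starts here: emit one character.
            units.append(word[i])
            i += 1
    return units


vocab = {"un", "translat", "able"}
print(segment("untranslatable", vocab))  # → ['un', 'translat', 'able']
```

Each resulting unit can then be translated or copied on its own, and the character fallback guarantees that every word gets some segmentation.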
Conclusions
- generating new segments for translation memories
- domain-specific translation
- translation between close languages
- sub-word level translation