Machine Translation

Topics

Statistical machine translation
Extension of translation memories
Domain-specific machine translation
Machine translation between close languages
Sub-word level machine translation

Statistical machine translation

Improving statistical machine translation

free state-of-the-art tools available (SRILM, Moses)
baseline SMT available for everyone
languages with a high number of word forms need special treatment
dělám, děláš, dělal, dělajícímu, dělaje, děláním, ...
magas, magasabb, legmagasabb, legeslegmagasabb, ...
language models can be enriched with linguistic knowledge

Word alignment matrix - from words to phrases
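
The step from a word alignment matrix to phrases can be sketched roughly as follows. This is a minimal illustration of consistency-based phrase extraction, not the exact procedure of any particular toolkit: a phrase pair is kept only if no word inside it is aligned to a word outside it. The example sentence pair, its alignment links, and the names `extract_phrases` and `max_len` are assumptions made for illustration, and the usual extension over unaligned boundary words is left out.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    src, tgt   -- lists of tokens
    alignment  -- set of (source_index, target_index) links
    max_len    -- maximum phrase length on either side (illustrative limit)
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any source word in [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue
            # Consistency check: no target word in [j1, j2] may be
            # linked to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Hypothetical sentence pair and alignment links, for illustration only.
src = ["he", "plays", "the", "piano"]
tgt = ["hraje", "na", "klavír"]
alignment = {(0, 0), (1, 0), (3, 2)}

phrases = extract_phrases(src, tgt, alignment)
for pair in sorted(phrases):
    print(pair)
```

Under this toy alignment, "plays" alone is not extracted, because "hraje" is also linked to "he"; the two source words only form a phrase together.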

Domain-specific machine translation

straightforward way of increasing quality of MT
domain-specific corpora can be downloaded on demand
separate models for each domain: sports, cooking, gardening
one sense per domain: bat

translations of
- product details, product descriptions in e-shops,
- manuals, warranty certificates,
- user interface localizations, ...
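
The "one sense per domain" idea can be illustrated with a small sketch: if each domain has its own model, an ambiguous word such as "bat" needs only one translation per model. The lexicon entries, the "zoology" domain, and the function name `translate_word` below are illustrative assumptions; a real system would hold a full translation model per domain rather than a dictionary.

```python
# Toy per-domain lexicons (English -> Czech); the entries are illustrative.
DOMAIN_LEXICONS = {
    "sports": {"bat": "pálka"},      # the club used to hit a ball
    "zoology": {"bat": "netopýr"},   # the flying mammal
}


def translate_word(word, domain, lexicons=DOMAIN_LEXICONS):
    """Look the word up in the lexicon of the selected domain only.

    Words missing from that lexicon, and unknown domains, fall back to
    returning the word unchanged.
    """
    return lexicons.get(domain, {}).get(word, word)


print(translate_word("bat", "sports"))
print(translate_word("bat", "zoology"))
```

Because the domain is chosen before translation, no sense disambiguation is needed at lookup time.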

Machine translation between close languages

West and South Slavic languages: Czech, Slovak, Polish, Serbian, Croatian, Slovene
MT mainly on word level, structure is very similar
differences can be described systematically by rules: hraje na klavíri <-> hraje na klavír
billion-word corpora available for these languages
dictionaries can be generated semi-automatically

-> searching for duplicates in close languages (reprinted news)
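
A rough sketch of how word-level translation between close languages could feed duplicate detection: each word is replaced through a lexicon, words without an entry pass through unchanged (the structure is assumed to be shared), and the result is compared with the other text by token overlap. The single lexicon entry comes from the example above; the function names and the choice of Jaccard similarity are assumptions for illustration, not the method used in the presented system.

```python
# Toy lexicon with the one pair taken from the example above.
LEXICON = {"klavíri": "klavír"}


def translate_word_level(sentence, lexicon=LEXICON):
    """Translate word by word; words without an entry are kept as they are."""
    return " ".join(lexicon.get(word, word) for word in sentence.split())


def jaccard(a, b):
    """Token-set overlap of two sentences, between 0.0 and 1.0."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


original = "hraje na klavíri"
candidate = "hraje na klavír"

before = jaccard(original, candidate)
after = jaccard(translate_word_level(original), candidate)
print(before, after)
```

The overlap rises once the text has been mapped into the other language, which is what makes near-duplicate search across close languages feasible with simple measures.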

MT quality, European languages

Sub-word level machine translation

SMT principle applied at the character level
translation on subword level (English -> Czech)

translation across levels

-> translation of out-of-dictionary words
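
One simple way to apply word-based SMT tools at the character level is to rewrite each sentence so that every character becomes a separate token and the original spaces are replaced by a boundary symbol. The sketch below shows only this preprocessing and its inverse. The boundary symbol `_` and the function names are assumptions, and the input is assumed to contain single spaces and no underscore.

```python
BOUNDARY = "_"  # stands in for the original space; assumed absent from the text


def to_chars(sentence):
    """Turn 'na klavír' into 'n a _ k l a v í r'."""
    return " ".join(BOUNDARY if ch == " " else ch for ch in sentence)


def from_chars(tokens):
    """Inverse of to_chars: join the characters and restore the spaces."""
    return "".join(" " if tok == BOUNDARY else tok for tok in tokens.split(" "))


encoded = to_chars("na klavír")
print(encoded)
print(from_chars(encoded))
```

After this rewriting, a word that is missing from the dictionary is still a sequence of known character tokens, so the system can attempt a translation for it.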

Conclusions

generating new segments for translation memories
domain-specific translation
translation between close languages
sub-word level translation