Machine Translation

/trac/research/raw-attachment/wiki/en/MachineTranslation/error.png

Topics

  • Statistical machine translation
  • Extension of translation memories
  • Domain-specific machine translation
  • Machine translation between close languages
  • Sub-word level machine translation

Statistical machine translation

/trac/research/raw-attachment/wiki/en/MachineTranslation/trans.png

Improving statistical machine translation

  • free state-of-the-art tools available (SRILM, Moses)
  • baseline SMT available for everyone
  • languages with a high number of word forms need special treatment
    • Czech: dělám, děláš, dělal, dělajícímu, dělaje, děláním, ...
    • Hungarian: magas, magasabb, legmagasabb, legeslegmagasabb, ...
  • language models can be enriched with linguistic knowledge
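
As background for the points above, the classical noisy-channel formulation of SMT can be written as follows. The symbols are notation introduced here, not taken from the slides: f is the source sentence, e a candidate translation, P(f | e) the translation model and P(e) the language model. Current toolkits usually generalise this to a weighted combination of several models, so treat it as a sketch of the principle only.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

Both factors are estimated from counts of surface word forms. In a language where one lemma has dozens of forms, each form is seen less often, which is one reason such languages need special treatment and why enriching P(e) with linguistic knowledge can help.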

/trac/research/raw-attachment/wiki/en/MachineTranslation/kings.png

Word alignment matrix - from words to phrases

/trac/research/raw-attachment/wiki/en/MachineTranslation/word_matrix1.png

/trac/research/raw-attachment/wiki/en/MachineTranslation/word_matrix2.png

/trac/research/raw-attachment/wiki/en/MachineTranslation/word_matrix3.png
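
The step from words to phrases can be sketched as the usual consistency-based phrase extraction: a source span and a target span form a phrase pair if no alignment link leaves the block they span. The code below is a minimal sketch under assumptions. The function name `extract_phrases`, the `max_len` limit, and the English-Spanish toy sentence are illustrative and not taken from the figures, and the extension of phrases over unaligned words is left out.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return the set of phrase pairs consistent with a word alignment.

    src, tgt  -- lists of tokens
    alignment -- set of (source_index, target_index) links
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # linked to a source word outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example (illustrative data, not from the original figures).
src = ["the", "green", "house"]
tgt = ["la", "casa", "verde"]
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases(src, tgt, alignment)
```

In this toy case "green house" / "casa verde" is extracted as one phrase pair, while "the green" is rejected because its target span would also cover "casa", which is linked to "house".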

Domain-specific machine translation

  • straightforward way of increasing quality of MT
  • domain-specific corpora can be downloaded on demand
  • separate models for each domain: sports, cooking, gardening
  • one sense per domain: bat

/trac/research/raw-attachment/wiki/en/MachineTranslation/bat.png

  • translations of
    • product details, product descriptions in e-shops,
    • manuals, warranty certificates,
    • user interface localizations, ...
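
The "one sense per domain" idea can be illustrated with a lookup that prefers a domain-specific lexicon and falls back to a general one. This is a minimal sketch, assuming English to Czech translation and a tiny hand-made lexicon. The "wildlife" domain, the dictionary contents, and the function name `translate_word` are hypothetical and only serve the example; a real system would train a separate translation model per domain rather than use a dictionary.

```python
# Hypothetical per-domain lexicons (English -> Czech), for illustration only.
DOMAIN_LEXICONS = {
    "sports": {"bat": "pálka"},       # the piece of sports equipment
    "wildlife": {"bat": "netopýr"},   # the animal
}

# Hypothetical general-purpose fallback lexicon.
GENERAL_LEXICON = {"bat": "netopýr", "ball": "míč"}


def translate_word(word, domain):
    """Translate one word, preferring the lexicon of the given domain."""
    domain_lexicon = DOMAIN_LEXICONS.get(domain, {})
    if word in domain_lexicon:
        return domain_lexicon[word]
    # Fall back to the general lexicon, then to the untranslated word.
    return GENERAL_LEXICON.get(word, word)


sports_bat = translate_word("bat", "sports")
wildlife_bat = translate_word("bat", "wildlife")
```

Because each domain fixes a single sense, no word-sense disambiguation is needed at translation time: choosing the domain already chooses the sense.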

Machine translation between close languages

  • West and South Slavic languages: Czech, Slovak, Polish, Serbian, Croatian, Slovene
  • MT works mainly at the word level, because the sentence structure is very similar
  • differences can be described systematically by rules: hraje na klavíri <-> hraje na klavír
  • billion-word corpora available for these languages
  • dictionaries can be generated semi-automatically
  • -> searching for duplicates in close languages (reprinted news)
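
A rule of the kind mentioned above can be sketched as a list of rewrite patterns applied to an otherwise unchanged sentence. This is only an illustration: the single rule is taken from the example in the list, while the regular-expression encoding, the restriction to this one noun, and the names `RULES` and `apply_rules` are assumptions. A realistic rule set would generalise over word classes and be combined with a bilingual dictionary.

```python
import re

# One rewrite rule derived from the example "hraje na klavíri <-> hraje na
# klavír". Encoding it as a regular expression is an assumption of this sketch.
RULES = [
    (re.compile(r"\bna klavíri\b"), "na klavír"),
]


def apply_rules(sentence, rules=RULES):
    """Apply each rewrite rule in order; unmatched text is copied as-is."""
    for pattern, replacement in rules:
        sentence = pattern.sub(replacement, sentence)
    return sentence


converted = apply_rules("hraje na klavíri")
```

Since the structure of the two sentences is the same, everything the rules do not touch can be carried over word by word.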

MT quality, European languages

/trac/research/raw-attachment/wiki/en/MachineTranslation/lang_matrix.png

Sub-word level machine translation

  • SMT principle applied at the character level
  • translation at the sub-word level (English -> Czech)

/trac/research/raw-attachment/wiki/en/MachineTranslation/trans1.png

  • translation across levels

/trac/research/raw-attachment/wiki/en/MachineTranslation/trans2.png

  • -> translation of out-of-dictionary words
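
One common way to apply a word-based SMT pipeline at the character level is to rewrite each sentence so that every character becomes a token and the original spaces are replaced by a boundary marker. The sketch below shows that preprocessing and its inverse. The marker "_" and the function names are assumptions, the input is assumed not to contain the marker itself, and the slides do not say that this exact encoding was used.

```python
BOUNDARY = "_"  # hypothetical word-boundary marker; must not occur in the text


def to_char_tokens(sentence):
    """Turn a sentence into space-separated character tokens."""
    normalized = " ".join(sentence.split())  # collapse runs of whitespace
    return " ".join(BOUNDARY if ch == " " else ch for ch in normalized)


def from_char_tokens(tokens):
    """Invert to_char_tokens: join characters and restore word boundaries."""
    return "".join(" " if t == BOUNDARY else t for t in tokens.split())


encoded = to_char_tokens("new word")
decoded = from_char_tokens(encoded)
```

With this encoding an unseen word is still a sequence of known character tokens, which is what makes the translation of out-of-dictionary words possible.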

Conclusions

  • generating new segments for translation memories
  • domain-specific translation
  • translation between close languages
  • sub-word level translation