Introduction

Closely related languages share many characteristics, and one of them is the lexicon. Our framework Trdlo is designed for building simple dictionaries (sets of translation pairs) that contain only words which are the same or similar enough in both languages. This approach is not error-proof, but it can be very useful as a supplement to a regular dictionary.

Framework Trdlo

Framework Trdlo is fully usable and ready for download. The package is not suitable for end users, because you will have to write some rules in a programming language. You will need gcc, perl, bash and GNU make, which should be available for your system.

Tips & Hints:
  • If you don't have a reference dictionary, you can use an empty file
  • All targets are documented in the Makefile itself
  • Computing edit distances can take several hours or even days
  • The framework contains examples of transducing rules

News in the latest version (v1.0; 30 Dec 2009):
  • Data in UTF-8
  • Uses context obtained from monolingual corpora
  • Shows problematic words (words with the same translation)

Plans for v2.0
  • New methods for elimination of translation candidates
  • Parallel corpora

Ideas behind Trdlo

Framework Trdlo is suitable for languages that lack sufficient language resources. We need wordlists (with words in base form) for the source and target languages and a set of rules that describe differences at the character level (e.g. ô -> ů). These rules are written on top of our framework, and writing them is the only thing you have to do yourself :). Rules usually have high precision, but their recall is not good enough. The next step in our approach is therefore to use an edit distance (e.g. Levenshtein distance). With this method we are able to greatly increase recall without losing too much precision.
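The two building blocks described above can be illustrated with a minimal Python sketch. It is not the code shipped with Trdlo (which is built with gcc, perl, bash and GNU make); the function names `apply_rules` and `levenshtein`, the representation of a rule as a (source, target) string pair, and the example word pair are all assumptions made for illustration. Only the rule ô -> ů comes from the text.

```python
def apply_rules(word, rules):
    """Apply each (source, target) character substitution in order.

    Hypothetical rule format: a list of plain string pairs.
    """
    for src, dst in rules:
        word = word.replace(src, dst)
    return word


def levenshtein(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


# The example rule from the text; the word pair is an illustrative assumption.
RULES = [("ô", "ů")]

print(apply_rules("kôň", RULES))  # kůň
print(levenshtein("kôň", "kůň"))  # 1
```

A rule either produces the target form exactly or does nothing, which is why rules alone are precise but miss many pairs. The edit distance also scores near misses, which is where the gain in recall comes from.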

By combining these two methods we are able to get very good results. Closer languages usually mean better results, but unfortunately we do not have any measure of the closeness of languages :( This approach is suitable for building new dictionaries and extending existing ones for close languages (dialects).
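One possible way to combine the two methods is sketched below in Python. This is an assumption about how such a combination could look, not a description of what Trdlo actually does: the function `build_dictionary`, its `max_distance` threshold, the fallback order (rules first, edit distance second) and the tiny wordlists are all hypothetical.

```python
def apply_rules(word, rules):
    """Apply each (source, target) character substitution in order."""
    for src, dst in rules:
        word = word.replace(src, dst)
    return word


def levenshtein(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def build_dictionary(source_words, target_words, rules, max_distance=2):
    """Pair each source word with its most similar target words.

    Hypothetical strategy: accept an exact match after applying the rules;
    otherwise keep the nearest target words if they are close enough.
    """
    targets = set(target_words)
    pairs = {}
    for word in source_words:
        candidate = apply_rules(word, rules)
        if candidate in targets:
            pairs[word] = [candidate]  # high-precision rule match
            continue
        distances = {t: levenshtein(candidate, t) for t in targets}
        if not distances:
            continue
        best = min(distances.values())
        if best <= max_distance:  # recall-oriented fallback
            pairs[word] = sorted(t for t, d in distances.items() if d == best)
    return pairs


# Illustrative wordlists (assumed, not taken from the framework).
source = ["kôň", "mlieko", "zmrzlina"]
target = ["kůň", "mléko", "pes"]
result = build_dictionary(source, target, [("ô", "ů")])
print(result)  # {'kôň': ['kůň'], 'mlieko': ['mléko']}
```

Here "kôň" is matched by the rule, "mlieko" is recovered by the edit distance (distance 2 to "mléko"), and "zmrzlina" stays untranslated because no target word is close enough. Comparing every source word with every target word is quadratic in the wordlist sizes, which fits the remark that computing edit distances can take hours or days.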

Contact

If you encounter problems or find a way to improve the quality of the created dictionaries, don't hesitate to contact me by e-mail: xgrac@fi.muni.cz. If you would like to use this framework but don't know how to write rules, you can mail me too and we will try to solve it.

Acknowledgement

This software was developed within the projects LC536, 1ET100300419 and 2C06009. Its owner is the NLP Centre, Faculty of Informatics, Masaryk University. The licence is available here.

My Other Projects

  • Zuzana
    Simple syntactic analyzer
  • MT
    Machine translation between close languages
  • MT Blog
    My blog about machine translation

Download

Links