Morphological analyser of Czech - ajka (a Czech version is available here)
The analyser ajka is a program written in C. It is the result of my Master's thesis, "Morfologický analyzátor češtiny" ("Morphological Analyser of Czech"), and was developed in 1999. This web page contains only basic information about the analyser, together with its latest version. The analyser is still being developed.
To install the analyser correctly, keep the following facts in mind:
- All source code is written in ANSI C. The current version is ready to be compiled on Linux machines with Intel processors, and it expects the ISO 8859-2 code page for Czech. Compilation on other processor types and other operating systems has not been tested and is therefore not recommended. We are working on a Windows version of the analyser and are trying to make it as platform independent as possible.
- The programs abin and ajka are installed in the standard way for UNIX-like operating systems: typing the command make in the directory with the source code is enough.
- The program abin transforms the Czech machine dictionary (ajka.dic) and the definition file of termination sets and paradigms (ajka.par) into the binary files ajka.stm and ajka.mrf, respectively. In our case, the command abin -d ajka.dic does this work. It generates two binary files: ajka.mrf, which holds the morphological information, and ajka.stm, which holds the Czech stems stored in a trie data structure. The analyser ajka needs both files to run. For more information, type abin -h at the command line; a brief help message on using abin will appear.
- The command ajka starts the analyser in interactive mode. The analyser can also run in batch mode, which is used for analysing verticalised (one word per line) text files. The name of the file to be analysed is given as a parameter on the command line (e.g. ajka TextToBeAnalysed.txt). For further information about ajka please use the
- Details about both programs are available in my Master's thesis (in Czech).
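Batch mode expects verticalised input, so running text has to be split into one word per line first. The following is only a rough sketch of such a preprocessing step and is not part of the ajka distribution. The names is_word_byte and verticalise are hypothetical, and the rule for what counts as a word character (ASCII letters and digits plus every byte from 0x80 upwards, to catch the accented Czech letters in a single-byte encoding such as ISO 8859-2) is a simplifying assumption. Punctuation is simply dropped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumption: ASCII letters/digits and every byte >= 0x80 belong to a word.
   This is a simplification: in ISO 8859-2 a few high bytes are symbols. */
static int is_word_byte(unsigned char c)
{
    return isalnum(c) || c >= 0x80;
}

/* Copy the words of `text` into `out`, each followed by '\n', and
   NUL-terminate the result. Returns the number of words written,
   or (size_t)-1 if `out` is too small. */
size_t verticalise(const char *text, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)text;
    size_t words = 0;
    size_t used = 0;
    int in_word = 0;

    assert(text != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;

    for (; *p != '\0'; ++p) {
        if (is_word_byte(*p)) {
            if (used + 1 >= out_size)   /* keep room for the final NUL */
                return (size_t)-1;
            out[used++] = (char)*p;
            in_word = 1;
        } else if (in_word) {
            if (used + 1 >= out_size)
                return (size_t)-1;
            out[used++] = '\n';         /* a separator ends the current word */
            in_word = 0;
            ++words;
        }
    }
    if (in_word) {                      /* text ended in the middle of a word */
        if (used + 1 >= out_size)
            return (size_t)-1;
        out[used++] = '\n';
        ++words;
    }
    out[used] = '\0';
    return words;
}
```

The resulting buffer could be written to a file and that file name passed to ajka as its parameter. A real tokeniser would probably treat punctuation and numbers more carefully than this sketch does.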
Master's thesis abstract
The thesis focusses on finding an efficient data structure suitable for storing lexical items (words or stems). The relation between the trie data structure, trees and deterministic finite state machines is discussed and explained. The second part of the thesis contains documentation describing the formats of the Czech machine dictionary and of the definition file of termination sets and paradigms. The implementation of the morphological analyser and of the tool that transforms the source text versions of these two files into binary files is included as well.
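To make the relation between a trie and a deterministic finite state machine concrete, here is a minimal sketch in C. It is not the data structure used in ajka or described in the thesis. The names TrieNode, trie_new, trie_insert, trie_contains, trie_longest_stem and trie_free are hypothetical, and the 256-way child array assumes a single-byte encoding such as ISO 8859-2. Each node plays the role of an automaton state, each child pointer is a transition on one byte, and is_final marks accepting states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One trie node corresponds to one state of a deterministic automaton. */
typedef struct TrieNode {
    struct TrieNode *child[256]; /* transition for each possible byte */
    int is_final;                /* 1 if a stored stem ends in this state */
} TrieNode;

TrieNode *trie_new(void)
{
    int i;
    TrieNode *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    for (i = 0; i < 256; ++i)
        node->child[i] = NULL;
    node->is_final = 0;
    return node;
}

/* Insert a NUL-terminated stem. Returns 0 on success, -1 if out of memory. */
int trie_insert(TrieNode *root, const char *stem)
{
    const unsigned char *p = (const unsigned char *)stem;
    TrieNode *node = root;

    assert(root != NULL && stem != NULL);
    for (; *p != '\0'; ++p) {
        if (node->child[*p] == NULL) {
            node->child[*p] = trie_new();
            if (node->child[*p] == NULL)
                return -1;
        }
        node = node->child[*p];
    }
    node->is_final = 1;
    return 0;
}

/* Run the automaton over `word`; accept only if it ends in a final state. */
int trie_contains(const TrieNode *root, const char *word)
{
    const unsigned char *p = (const unsigned char *)word;
    const TrieNode *node = root;

    while (node != NULL && *p != '\0')
        node = node->child[*p++];
    return node != NULL && node->is_final;
}

/* Length of the longest stored stem that is a prefix of `word` (0 if none). */
size_t trie_longest_stem(const TrieNode *root, const char *word)
{
    const unsigned char *p = (const unsigned char *)word;
    const TrieNode *node = root;
    size_t len = 0;
    size_t best = 0;

    while (node != NULL) {
        if (node->is_final)
            best = len;
        if (*p == '\0')
            break;
        node = node->child[*p++];
        ++len;
    }
    return best;
}

void trie_free(TrieNode *node)
{
    int i;
    if (node == NULL)
        return;
    for (i = 0; i < 256; ++i)
        trie_free(node->child[i]);
    free(node);
}
```

With the made-up stems "ps" and "pes" inserted, trie_longest_stem on "psovi" returns 2, which is the kind of stem and ending split a morphological analyser needs. A full pointer array per node is wasteful in memory; it is chosen here only to keep the sketch short.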
Master's thesis and slides
alib library documentation
The author of this web page and all included files is
The last revision: 18.2.2004