== Morphological analysis for Slovak ==

Only one morphological tagger is currently used for Slovak: '''MORČE'''. '''MORČE''' is a Czech morphological tagger based on the averaged perceptron, developed in Prague, Czech Republic, in 2007. It was trained on the manually annotated Slovak corpus '''r-mak''' and is now used for Slovak morphological analysis. Nevertheless, we decided to train another morphological tagger, '''RFTagger''', developed in Germany at the University of Stuttgart. '''RFTagger''' is an HMM part-of-speech tagger that is particularly suited to POS tagsets with a large number of fine-grained tags.

'''RFTagger''' rests on three main ideas: [[BR]]
split each morphological tag into an attribute vector and determine the POS probabilities of the HMM as a product of attribute probabilities; [[BR]]
use decision trees to determine the contextual probabilities; [[BR]]
use high-order HMMs.

To disambiguate the part of speech, '''RFTagger''' uses additional attributes (such as gender and case from fine-grained tagsets) and word dependencies.

The process of determining a tag in '''RFTagger''' is as follows. The tag is decomposed into a set of simple attributes, and decision trees are used to estimate the probability of each attribute. These probabilities are then used to determine the probability of the whole tag. The model trained on a corpus learns which words can occur together based on contextual dependencies. If '''RFTagger''' then encounters a word that does not occur in the corpus, the trained model can still deduce its tag.

The tagger treats dots in POS tag labels as attribute separators. This makes '''RFTagger''' a universal tool, because you can use your own tagset as input without translating it to some specific tagset. The first attribute of a POS tag represents the main category, and the additional attributes are category-specific, which means that the case of a noun and the case of an adjective are two different attributes.
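The decomposition of a tag into attributes can be illustrated with a minimal Python sketch. It is not part of '''RFTagger''': the tag label `N.Masc.Sg.Nom`, the function names and the probability values are all made up for illustration, and the real tagger estimates each attribute probability with decision trees over the context rather than taking it as a fixed number.

```python
from math import prod


def split_tag(tag):
    """Split a dot-separated tag into (main category, list of attribute values)."""
    category, *attributes = tag.split(".")
    return category, attributes


def tag_probability(attribute_probs):
    """Combine per-attribute probabilities into one tag probability by multiplication."""
    return prod(attribute_probs)


# Hypothetical tag and made-up probabilities, for illustration only.
category, attributes = split_tag("N.Masc.Sg.Nom")
probs = [0.9, 0.8, 0.95, 0.7]  # assumed P(category), P(gender), P(number), P(case)

print(category, attributes)
print(round(tag_probability(probs), 4))
```

The point of the product form is that a rare full tag can still receive a sensible probability as long as its individual attributes were seen often enough in training.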
Two very important features of '''RFTagger''' are the lexicon and the file with possible POS tags of unknown words, both of which can be set as training parameters. These two features can increase annotation accuracy; in our case they increased accuracy by 1%.

Most POS taggers are trained on corpora with about 150 different POS tags. Such tagsets usually contain little or no morphological information. For languages like German, Slovak or Czech, which have more fine-grained tagsets, these taggers are not suitable. For this reason we decided to use '''RFTagger''', which can perform morphological analysis on languages with fine-grained tags.

Because '''RFTagger''' treats dots in POS tag labels as attribute separators and expects each category to have a fixed number of additional attributes, we had to adapt and update our program for translating tags into this formalism. We add every additional attribute that applies to the given category; if a tag does not contain such an attribute, we mark it as undefined.

For training '''RFTagger''' we use the manually annotated Slovak corpus '''r-mak 3.0''', which is translated into the tagger formalism by the program '''tag_RF_sk.py'''. Running the '''RFTagger''' training program on '''r-mak 3.0''' yields a parameter file for annotation. The training program has many parameters, such as an additional lexicon, the number of tags used as context, etc. After experimenting with '''RFTagger''' we decided to use three of these parameters: [[BR]]
a lexicon obtained from '''skTenTen''', which we annotated with '''RFTagger''' trained on '''r-mak 3.0'''; [[BR]]
a list of POS tags for unknown words (words that do not occur in the training corpus), containing all possible tags for nouns, adjectives, numerals and verbs; [[BR]]
a context of 8 preceding tags for the given word.

The parameter file produced by the training program is then used as input for the '''RFTagger''' annotation program, so annotation itself is simple.
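The translation of tags into a fixed number of dot-separated attributes per category can be sketched as follows. This is not the code of '''tag_RF_sk.py''', which is not shown here: the attribute layout per category, the placeholder `-` for an undefined attribute and the function name are assumptions chosen only to illustrate the idea.

```python
# Hypothetical attribute layout per main category; the real layout used for
# r-mak 3.0 is not given in this document.
ATTRIBUTES = {
    "N": ["gender", "number", "case"],
    "V": ["number", "person", "mood"],
}
UNDEFINED = "-"  # assumed placeholder for an attribute the tag does not specify


def to_rftagger_tag(category, values):
    """Build a dot-separated tag with a fixed number of attributes per category."""
    slots = ATTRIBUTES.get(category, [])
    parts = [category] + [values.get(name, UNDEFINED) for name in slots]
    return ".".join(parts)


# A noun tag that lacks the number attribute is padded with the placeholder.
print(to_rftagger_tag("N", {"gender": "Masc", "case": "Nom"}))  # N.Masc.-.Nom
```

Padding every tag of a category to the same length keeps each attribute at a fixed position, which is what the tagger expects.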
The annotation program takes vertical text as input and produces tagged vertical text. The only disadvantage of '''RFTagger''' is that it does not determine the lemma of a given word.

== Evaluation of '''RFTagger''' ==

Parameters used in '''RFTagger''': [[BR]]
-c 8: the 8 preceding tags are used as context [[BR]]
-o POSTag: the possible POS tags of unknown words are restricted to those listed in the file POSTag [[BR]]
-l lexicon: additional lexicon entries [[BR]]

{{{
Feature      Accuracy
kind         98.10 %
genus        93.87 %
number       98.76 %
case         93.32 %
person       96.67 %
mod          99.93 %
whole tag    92.31 %
}}}
[[BR]]
In this evaluation we do not use any parameters. We train on 80% of '''r-mak 3.0''' and annotate the remaining 20% of the corpus. We then measure how well the tags assigned by '''RFTagger''' agree with the original annotation of that 20% part of '''r-mak 3.0'''.

{{{
Feature      Accuracy
kind         98.02 %
genus        95.81 %
number       99.24 %
case         95.42 %
person       98.53 %
mod          99.92 %
whole tag    94.10 %
}}}
[[BR]]
In this evaluation we use the parameters -o POSTag -c 8 -l lexicon. We train on 80% of '''r-mak 3.0''' and annotate the remaining 20% of the corpus. We then measure how well the tags assigned by '''RFTagger''' agree with the original annotation of that 20% part of '''r-mak 3.0'''.

{{{
Feature      Accuracy
kind         97.95 %
genus        95.68 %
number       99.22 %
case         95.45 %
person       98.53 %
mod          99.898 %
whole tag    94.016 %
}}}
[[BR]]
In this evaluation we also use the lexicon. The lexicon is obtained from '''skTenTen''' annotated by '''RFTagger''' trained on '''r-mak 3.0'''. Here we divide '''r-mak 3.0''' into 5 folds and perform cross-validation: 4 folds are used for training '''RFTagger''' and 1 fold is annotated to obtain the results.
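The way per-feature and whole-tag accuracies of this kind can be computed is sketched below. This is not the evaluation script used for the tables above: the function name, the example tags and the assumption that every tag is padded to the same number of dot-separated positions are hypothetical.

```python
def accuracy_report(gold_tags, predicted_tags, attribute_names):
    """Return per-attribute and whole-tag accuracy (in percent) for aligned tag lists.

    Assumes every tag has one dot-separated value per name in attribute_names.
    """
    if len(gold_tags) != len(predicted_tags) or not gold_tags:
        raise ValueError("gold and predicted tag lists must be aligned and non-empty")
    hits = dict.fromkeys(attribute_names, 0)
    whole_tag_hits = 0
    for gold, predicted in zip(gold_tags, predicted_tags):
        whole_tag_hits += gold == predicted
        pairs = zip(attribute_names, gold.split("."), predicted.split("."))
        for name, gold_value, predicted_value in pairs:
            hits[name] += gold_value == predicted_value
    total = len(gold_tags)
    report = {name: 100.0 * count / total for name, count in hits.items()}
    report["whole tag"] = 100.0 * whole_tag_hits / total
    return report


# Made-up gold and predicted tags, for illustration only.
gold = ["N.Masc.Sg.Nom", "N.Fem.Sg.Acc", "A.Fem.Sg.Acc", "V.-.Pl.-"]
predicted = ["N.Masc.Sg.Nom", "N.Fem.Sg.Nom", "A.Fem.Sg.Acc", "V.-.Sg.-"]
print(accuracy_report(gold, predicted, ["kind", "genus", "number", "case"]))
```

Whole-tag accuracy is always at most as high as the lowest per-feature accuracy, because a tag counts as correct only when every attribute matches.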
'''To create a parameter file, use:''' [[BR]]
/nlp/projekty/syntax_sk/RFTagger/aux/createParFile brief (where brief is the corpus in brief format) [[BR]]
'''For annotation, use:''' [[BR]]
/corpora/programy/cztaggers/RFTagger/bin/rft-annotate parfile infile outfile [[BR]]
'''To get the same tag format as before, use:''' [[BR]]
cat file | /nlp/projekty/syntax_sk/RFTagger/aux/transformBack [[BR]]
'''Download:''' the Slovak parameter file is available here: [http://nlp.fi.muni.cz/projekty/syntax_sk/parameter_file.tar.gz]