Version 8 (modified by xmedved1, 11 years ago)

Morphological analysis for Slovak

For Slovak there is only one morphological tagger, MORČE. MORČE is a Czech morphological tagger based on the averaged perceptron, developed in Prague, Czech Republic, in 2007. It was trained on the manually annotated Slovak corpus r-mak and is now used for Slovak morphological analysis.

Nevertheless, we decided to train another morphological tagger, RFTagger, developed in Germany at the University of Stuttgart. RFTagger is an HMM part-of-speech tagger that is particularly suited to POS tagsets with a large number of fine-grained tags.

RFTagger is built on three main ideas:
splitting the morphological tag into an attribute vector and computing the POS probabilities of the HMM as a product of attribute probabilities,
using decision trees to determine contextual probabilities,
using high-order HMMs.

To disambiguate the part of speech, RFTagger uses additional attributes (such as gender and case from fine-grained tagsets) and word dependencies.

Tag determination in RFTagger works as follows. The tag is decomposed into a set of simple attributes, and decision trees are used to estimate the probability of each attribute. These probabilities are then used to determine the probability of the whole tag. The model trained on the corpus learns which words can occur together, based on contextual dependencies. If RFTagger then encounters a word that does not occur in the corpus, the trained model can still deduce its tag.
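
The decomposition can be illustrated with a small sketch. It is not RFTagger's implementation: the attribute probabilities are made-up numbers standing in for the estimates that the decision trees would produce, and the name tag_probability is hypothetical.

```python
from math import prod

def tag_probability(attribute_probs):
    """Probability of a whole tag as the product of its attribute probabilities."""
    return prod(attribute_probs.values())

# Assumed values for two candidate tags of the same word in the same context.
# In a real model these would come from the trained decision trees.
noun_reading = {"category=N": 0.9, "gender=Masc": 0.8, "number=Sg": 0.95, "case=Nom": 0.5}
adj_reading = {"category=A": 0.1, "gender=Masc": 0.8, "number=Sg": 0.95, "case=Nom": 0.5}

p_noun = tag_probability(noun_reading)
p_adj = tag_probability(adj_reading)

# The reading with the higher product is preferred.
best = "noun" if p_noun > p_adj else "adjective"
```

Because each attribute is estimated separately, a rare combination of attributes can still receive a sensible probability even if the complete tag was seldom seen in training.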

The tagger treats dots in POS tag labels as attribute separators. This makes RFTagger a universal tool, because you can use your own tagset as input without translating it into some specific tagset. The first attribute of a POS tag represents the main category and the additional attributes are category-specific, which means that case in nouns and case in adjectives are two different attributes.
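
A minimal sketch of how such a dot-separated tag could be read. The tag N.Masc.Sg.Nom is an invented example, not taken from the r-mak tagset, and naming attributes by category and position is only one possible way to keep them category-specific.

```python
def split_tag(tag):
    """Split a dot-separated tag into (category, category-specific attributes)."""
    category, *values = tag.split(".")
    # Prefix every attribute slot with the category, so that a slot of a noun
    # and the same slot of an adjective are treated as different attributes.
    attributes = {f"{category}.{i}": v for i, v in enumerate(values, start=1)}
    return category, attributes

# Invented example tag (assumption, for illustration only).
category, attributes = split_tag("N.Masc.Sg.Nom")
```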

Two important features of RFTagger are the lexicon and the file with possible POS tags of unknown words, both of which can be set as parameters for training. These two features can increase annotation accuracy. In our case they increased accuracy by 1%.

Most POS taggers are trained on corpora with about 150 different POS tags. These tagsets usually contain little or no morphological information. For languages such as German, Slovak or Czech, which have more fine-grained tagsets, these taggers are not suitable.

For this reason we decided to use RFTagger, which can perform morphological analysis on languages with fine-grained tags.

Because RFTagger treats dots in POS tag labels as attribute separators and expects each category to have a fixed number of additional attributes, we had to adapt and update the program that translates tags into this formalism. We add every additional attribute defined for the given category. If the tag does not contain an attribute, we set it to undefined.
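
The padding step could look roughly like the following sketch. The attribute layout in SCHEMA and the placeholder "-" are assumptions made for illustration; the actual layout and placeholder used by the translation program are not given here.

```python
# Hypothetical attribute layout per category (assumption).
SCHEMA = {
    "N": ["gender", "number", "case"],
    "V": ["person", "number", "mood"],
}
UNDEFINED = "-"  # assumed placeholder for an undefined attribute

def to_rft_tag(category, attributes):
    """Build a dot-separated tag with a fixed number of slots per category."""
    slots = SCHEMA[category]
    # Every slot of the category is emitted; missing ones become UNDEFINED.
    values = [attributes.get(name, UNDEFINED) for name in slots]
    return ".".join([category, *values])

full = to_rft_tag("N", {"gender": "Masc", "number": "Sg", "case": "Nom"})
partial = to_rft_tag("N", {"number": "Pl"})
```

With this layout every noun tag has exactly three additional attributes, regardless of how many were specified in the source tag.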

For training RFTagger we use the manually annotated Slovak corpus r-mak 3.0, which is translated into the tagger formalism by the program tag_RF_sk.py. Running the RFTagger training program on r-mak 3.0 produces a parameter file for annotation.

The training program has many parameters, such as an additional lexicon, the number of tags used as context, etc. After experimenting with RFTagger we decided to use the following three settings.

We use a lexicon obtained from skTenTen, which we annotated with RFTagger trained on r-mak 3.0. We use a list of POS tags for unknown words (words that do not occur in the training corpus). This list contains all possible tags for nouns, adjectives, numerals and verbs. We use the 8 preceding tags as context for a given word.

After running the RFTagger training program we obtain a parameter file, which is used as input for the RFTagger annotation program. Annotation is then very simple: we supply vertical text as input and obtain tagged vertical text.

The only disadvantage of RFTagger is that it does not determine the lemma of a given word.

Evaluation of RFTagger

Parameters used in RFTagger:
-c 8: the 8 preceding tags are used as context
-o POSTag: the possible POS tags of unknown words are restricted to those listed in file POSTag
-l lexicon: additional lexicon entries

Feature     Accuracy
kind        98.10 %
genus       93.87 %
number      98.76 %
case        93.32 %
person      96.67 %
mod         99.93 %
whole tag   92.31 %


In the evaluation above we do not use any parameters. As input for training we use 80% of r-mak 3.0, and we annotate the remaining 20% of the corpus. We then measure accuracy by comparing the original 20% of r-mak 3.0 with the annotation produced by RFTagger.
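
A per-feature comparison of this kind can be sketched as follows. This is an illustrative reconstruction, not the script used for the tables: the function name feature_accuracy and the representation of each token as a dictionary of features are assumptions, and a feature is counted only for tokens where the gold annotation defines it.

```python
from collections import Counter

def feature_accuracy(gold, predicted):
    """Return accuracy in percent for each feature and for the whole tag.

    gold and predicted are equally long lists of dicts, one per token,
    mapping feature names to values.
    """
    correct, total = Counter(), Counter()
    for g, p in zip(gold, predicted):
        for feature, value in g.items():
            total[feature] += 1
            correct[feature] += p.get(feature) == value
        # The whole tag is correct only if every feature matches.
        total["whole tag"] += 1
        correct["whole tag"] += g == p
    return {f: 100.0 * correct[f] / total[f] for f in total}

# Toy data (invented), two tokens.
gold = [{"kind": "N", "case": "Nom"}, {"kind": "A", "case": "Gen"}]
pred = [{"kind": "N", "case": "Acc"}, {"kind": "A", "case": "Gen"}]
scores = feature_accuracy(gold, pred)
```

Whole-tag accuracy can never exceed the accuracy of the weakest individual feature, which is consistent with the tables, where the whole-tag figure is the lowest in each.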

Feature     Accuracy
kind        98.02 %
genus       95.81 %
number      99.24 %
case        95.42 %
person      98.53 %
mod         99.92 %
whole tag   94.10 %


In the evaluation above we use the parameters -o POSTag -c 8 -l lexicon. As input for training we use 80% of r-mak 3.0, and we annotate the remaining 20% of the corpus. We then measure accuracy by comparing the original 20% of r-mak 3.0 with the annotation produced by RFTagger.

Feature     Accuracy
kind        97.95 %
genus       95.68 %
number      99.22 %
case        95.45 %
person      98.53 %
mod         99.898 % 
whole tag   94.016 %


In the evaluation above we use the lexicon as well. The lexicon is obtained from skTenTen annotated by RFTagger trained on r-mak 3.0. In this case we divide r-mak 3.0 into 5 folds and perform cross-validation: 4 folds are used for training RFTagger and 1 fold is annotated to obtain the results.
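
The fold construction can be sketched as below. How r-mak 3.0 was actually divided (contiguous blocks, interleaved sentences, or otherwise) is not stated, so the interleaved split and the name k_folds are assumptions.

```python
def k_folds(sentences, k=5):
    """Yield (train, test) pairs; every fold serves as the test set once."""
    # Interleaved split: sentence n goes to fold n % k (assumed strategy).
    folds = [sentences[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Placeholder data standing in for corpus sentences.
corpus = [f"sentence-{n}" for n in range(10)]
splits = list(k_folds(corpus))
```

Each of the five rounds would train on the train part and annotate the test part, and the reported figures would then be aggregated over the rounds.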

To create a parameter file you can use: /nlp/projekty/syntax_sk/RFTagger/aux/createParFile brief (where brief is the corpus in brief format)

For annotation use: /corpora/programy/cztaggers/RFTagger/bin/rft-annotate parfile infile outfile

To get the same tag format as before, use: cat file | /nlp/projekty/syntax_sk/RFTagger/aux/transformBack