
Changes between Version 4 and Version 5 of sk

Timestamp: Apr 11, 2013, 8:47:14 PM
Author: xmedved1
Comment: --

sk

-                      v4
+                      v5
-== Adaptation of RFTagger for Slovak ==
+RFTagger was adapted for the Slovak language. We use the r-mak 3.0 corpus for training, and the skTenTen corpus was used to create a lexicon for RFTagger.
+== Morphological analysis for Slovak ==
+For Slovak there is only one morphological tagger, called '''MORČE'''. '''MORČE''' is a Czech morphological tagger based on the averaged perceptron, developed in Prague, Czech Republic, in 2007. It was trained on the manually annotated Slovak corpus '''r-mak''' and is now used for Slovak morphological analysis.
+Nevertheless, we decided to train another morphological tagger, '''RFTagger''', developed in Germany at the University of Stuttgart. '''RFTagger''' is an HMM part-of-speech tagger that is particularly suited to POS tagsets with a large number of fine-grained tags.
+RFTagger is based on three main ideas:
 [[BR]]
+The best results:
+split the morphological tag into an attribute vector and compute the POS probabilities of the HMM as a product of attribute probabilities
+[[BR]]
+use decision trees to determine the contextual probabilities
+[[BR]]
+use high-order HMMs
+To disambiguate the part of speech, '''RFTagger''' uses additional attributes (such as gender and case from fine-grained tagsets) and word dependencies.
+The process of determining a tag in '''RFTagger''' is as follows. The tag is decomposed into a set of simple attributes, and decision trees are used to estimate the probability of each attribute. These probabilities are then used to determine the probability of the whole word tag. The model trained on the corpus learns which words can occur together based on contextual dependencies. If '''RFTagger''' then encounters a word that does not occur in the corpus, the trained model can still deduce its tag.
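The combination step described above can be illustrated with a minimal sketch. It assumes the per-attribute probabilities have already been estimated (in '''RFTagger''' this is done by decision trees); the attribute names and numbers below are made up for illustration and are not '''RFTagger''' output.

```python
import math


def tag_probability(attribute_probs):
    """Combine per-attribute probabilities into one tag probability.

    attribute_probs maps an attribute name to the probability of its
    value in the current context (hypothetical values, not real estimates).
    """
    return math.prod(attribute_probs.values())


# Hypothetical estimates for one candidate tag in one context.
candidate = {"category": 0.9, "gender": 0.8, "number": 0.95, "case": 0.5}
print(tag_probability(candidate))
```

A tag in which a single attribute is unlikely in the given context receives a low overall probability, even when its other attributes fit well.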
+The tagger treats dots in POS tag labels as attribute separators. This feature makes '''RFTagger''' a universal tool, because you can use your own tagset as input for '''RFTagger''' without translating it to some specific tagset. The first attribute of a POS tag represents the main category, and the additional attributes are category-specific, which means that the case of a noun and the case of an adjective are two different attributes.
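A small sketch of how a dot-separated tag breaks down into a main category and category-specific attributes. The tag `N.Masc.Sg.Nom` and the attribute inventory are hypothetical examples, not the actual Slovak tagset, and the function names are introduced only for illustration.

```python
def split_tag(tag):
    """Split a dot-separated tag into (main category, attribute values)."""
    category, *attributes = tag.split(".")
    return category, attributes


def qualified_attributes(tag, attribute_names):
    """Name each attribute value together with its category, e.g. 'N.case'."""
    category, values = split_tag(tag)
    names = attribute_names[category]
    return {f"{category}.{name}": value for name, value in zip(names, values)}


# Hypothetical attribute inventory; a real tagset defines its own.
ATTRIBUTE_NAMES = {
    "N": ["gender", "number", "case"],
    "A": ["gender", "number", "case"],
}

print(qualified_attributes("N.Masc.Sg.Nom", ATTRIBUTE_NAMES))
print(qualified_attributes("A.Masc.Sg.Nom", ATTRIBUTE_NAMES))
```

Because every attribute key carries its category, the case of a noun (`N.case`) and the case of an adjective (`A.case`) stay separate attributes.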
+Two very important features of '''RFTagger''' are the lexicon and the file with possible POS tags of unknown words, both of which can be passed as parameters for training. These two features can increase annotation accuracy. In our case they increased accuracy by 1%.
+Most POS taggers are trained on corpora with about 150 different POS tags. These tagsets usually contain little or no morphological features. For languages such as German, Slovak or Czech, which have more fine-grained tagsets, these taggers are not suitable.
+For this reason we decided to use '''RFTagger''', which can perform morphological analysis of languages with fine-grained tags.
+Because '''RFTagger''' treats dots in POS tag labels as attribute separators and expects each category to have a fixed number of additional attributes, we had to adapt and update the program that translates tags into this formalism. We add every additional attribute that applies to the given category. If a tag does not contain such an attribute, we set it to undefined.
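The padding step can be sketched as follows. This is not the code of '''tag_RF_sk.py'''; the slot lists, the `*` placeholder for an undefined value and the function name are assumptions chosen for illustration.

```python
UNDEFINED = "*"  # assumed placeholder for an attribute the tag does not specify

# Hypothetical fixed attribute slots per main category.
SLOTS = {
    "N": ["gender", "number", "case"],
    "V": ["person", "number", "mood"],
}


def to_dotted_tag(category, attributes):
    """Build a dot-separated tag with every slot of the category filled.

    Attributes missing from the input are written as UNDEFINED, so all
    tags of one category end up with the same number of fields.
    """
    values = [attributes.get(slot, UNDEFINED) for slot in SLOTS[category]]
    return ".".join([category, *values])


print(to_dotted_tag("N", {"gender": "Fem", "case": "Gen"}))
```

Filling every slot keeps the position of each attribute stable within a category, which is what a fixed-length attribute vector requires.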
+For training '''RFTagger''' we use the manually annotated Slovak corpus '''r-mak 3.0''', which is translated into the tagger formalism by the program '''tag_RF_sk.py'''. By running the '''RFTagger''' training program on '''r-mak 3.0''' we obtain a parameter file for annotation.
+The training program has many parameters, such as an additional lexicon, the number of tags used as context, etc. After experimenting with '''RFTagger''' we decided to use three of them.
+We use a lexicon obtained from '''skTenTen''', which we annotated with '''RFTagger''' trained on '''r-mak 3.0'''. We use a list of POS tags for unknown words (words that do not occur in the training corpus). This list contains all possible tags for nouns, adjectives, numerals and verbs. We use the 8 preceding tags as context for a given word.
+After running the '''RFTagger''' training program we obtain a parameter file, which is used as input for the '''RFTagger''' annotation program. Annotation is then very simple: we pass a vertical text as input and obtain a tagged vertical.
+The only disadvantage of '''RFTagger''' is that it does not determine the lemma of a given word.
+== Evaluation of '''RFTagger''' ==
+Parameters used in '''RFTagger''':
 [[BR]]
+kind 98.14 %
+-c 8: the 8 preceding tags are used as context
 [[BR]]
+genus 95.37 %
+-o POSTag: the possible POS tags of unknown words are restricted to those listed in file POSTag
 [[BR]]
+number 99.16 %
+-l lexicon: additional lexicon entries
 [[BR]]
+case 94.63 %
+{{{
+Feature     Accuracy
+kind        98.16 %
+genus       94.01 %
+number      98.78 %
+case        93.49 %
+person      96.85 %
+mod         99.92 %
+whole tag   89.55 %
+}}}
 [[BR]]
+person: 98.26 %
+Here we do not use any parameters. As input for training we use 90% of '''r-mak 3.0''', and we annotate the remaining 10% of the corpus. We then measure the agreement between the original 10% part of '''r-mak 3.0''' and the same part annotated by '''RFTagger'''.
+{{{
+Feature     Accuracy
+kind        97.95 %
+genus       95.68 %
+number      99.22 %
+case        95.45 %
+person      98.53 %
+mod         99.898 %
+whole tag   92.42 %
+}}}
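Per-attribute figures such as those in the tables can be computed along the following lines. This is a sketch under stated assumptions, not the script used for the evaluation: each tag is assumed to be already parsed into a dictionary of named attributes, and an attribute is counted only for tokens whose gold tag defines it.

```python
from collections import Counter


def attribute_accuracy(gold, predicted):
    """Return per-attribute and whole-tag accuracy in percent.

    gold and predicted are equally long lists of dicts that map an
    attribute name to its value, one dict per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct, total = Counter(), Counter()
    for g, p in zip(gold, predicted):
        total["whole tag"] += 1
        correct["whole tag"] += g == p
        for name, value in g.items():
            total[name] += 1
            correct[name] += p.get(name) == value
    return {name: 100.0 * correct[name] / total[name] for name in total}


# Tiny made-up example: the case of the first token is tagged wrongly.
gold = [{"kind": "N", "case": "Nom"}, {"kind": "V", "person": "3"}]
predicted = [{"kind": "N", "case": "Gen"}, {"kind": "V", "person": "3"}]
print(attribute_accuracy(gold, predicted))
```

The whole-tag score is stricter than any single attribute score, because one wrong attribute makes the entire tag count as an error.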
 [[BR]]
-mod: 99.98 %
-[[BR]]
-whole tag: 91.81 %
+In this evaluation we use the lexicon as well. The lexicon is obtained from skTenTen annotated by '''RFTagger''' trained on '''r-mak 3.0'''. Here we divide '''r-mak 3.0''' into 5 folds and perform cross-validation, where 4 folds are used for training '''RFTagger''' and 1 fold is used for annotation to obtain the results.
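The 5-fold split can be sketched like this. How the folds were actually formed is not stated; the round-robin assignment of sentences below is an assumption, and the function name is hypothetical.

```python
def k_fold_splits(sentences, k=5):
    """Yield (train, test) pairs; each fold serves as the test part exactly once."""
    # Round-robin assignment: sentence i goes to fold i % k (an assumption).
    folds = [sentences[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Ten placeholder "sentences" stand in for the corpus.
for train, test in k_fold_splits(list(range(10)), k=5):
    print(len(train), len(test))
```

Each pair would be used for one training and annotation run, and the accuracies from the five runs are then combined into the reported result.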