Changes between Version 4 and Version 5 of sk


Timestamp:
Apr 11, 2013, 8:47:14 PM (11 years ago)
Author:
xmedved1
Comment:

--

  • sk

    v4 v5  
    11
    2 == Adaptation of RFTagger for Slovak ==
    32
    4 RFTagger was adapted for the Slovak language. We use the r-mak 3.0 corpus for training. skTenTen was used to create the lexicon for RFTagger.
     3
     4== Morphological analysis for Slovak ==
     5
     6For Slovak there is only one morphological tagger, '''MORČE'''. '''MORČE''' is a Czech morphological tagger based on the averaged perceptron, developed in Prague, Czech Republic, in 2007. It was trained on the manually annotated Slovak corpus '''r-mak''' and is now used for Slovak morphological analysis.
     7
     8Nevertheless, we decided to train another morphological tagger, '''RFTagger''', developed in Germany at the University of Stuttgart. '''RFTagger''' is an HMM part-of-speech tagger that is particularly suited to POS tagsets with a large number of fine-grained tags.
     9
     10'''RFTagger''' is built on three main ideas:
    511[[BR]]
    6 The best results:
     12split the morphological tag into an attribute vector and compute the POS probabilities of the HMM as a product of attribute probabilities
     13[[BR]]
     14use decision trees to estimate contextual probabilities
     15[[BR]]
     16use high-order HMMs
    717
     18
     19To disambiguate the part of speech, '''RFTagger''' uses additional attributes (such as gender and case from fine-grained tagsets) and word dependencies.
     20
     21'''RFTagger''' determines a tag as follows. The tag is decomposed into a set of simple attributes, and decision trees are used to estimate the probability of each attribute. These probabilities are then used to compute the probability of the whole tag for a word. The model trained on the corpus learns from contextual dependencies which words can occur together. As a result, when '''RFTagger''' encounters a word that does not occur in the corpus, the trained model can still deduce its tag.
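The decomposition described above can be illustrated with a minimal Python sketch. The attribute names and probability values are made up for illustration. In '''RFTagger''' the per-attribute probabilities are estimated by decision trees and depend on the context, which this sketch does not model.

```python
from math import prod


def tag_probability(attribute_probs):
    """Return the probability of a whole tag as the product of its attribute probabilities."""
    return prod(attribute_probs.values())


# Hypothetical probabilities for the attributes of one candidate tag.
noun_attrs = {"category=N": 0.9, "gender=f": 0.8, "number=sg": 0.5}

# Probability of the full tag under this simplified model.
p = tag_probability(noun_attrs)
```

The tagger would compare such products across all candidate tags of a word and prefer the most probable tag sequence.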
     22
     23The tagger treats dots in POS tag labels as attribute separators. This makes '''RFTagger''' a universal tool, because you can use your own tagset as input for '''RFTagger''' without translating it into a specific tagset. The first attribute of a POS tag represents the main category, and the additional attributes are category-specific, which means that case in nouns and case in adjectives are two different attributes.
     24
     25Two very important features of '''RFTagger''' are the lexicon and the file with possible POS tags of unknown words, both of which can be passed as parameters for training. These two features can increase annotation accuracy. In our case they increased accuracy by 1%.
     26
     27Most POS taggers are trained on corpora with about 150 different POS tags. These tagsets usually contain little or no morphological information. Such taggers are not suitable for languages like German, Slovak or Czech, which have more fine-grained tagsets.
     28
     29For this reason we decided to use '''RFTagger''', which can perform morphological analysis for languages with fine-grained tags.
     30
     31Because '''RFTagger''' treats dots in POS tag labels as attribute separators and expects each category to have a fixed number of additional attributes, we had to adapt the tag-translation program to this formalism. For each category we add every additional attribute that applies to it. If a tag does not contain a given attribute, we set that attribute to undefined.
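A minimal sketch of this kind of tag translation is shown below. It is not the actual '''tag_RF_sk.py'''. The schema, the attribute names and the `-` placeholder for an undefined attribute are assumptions chosen only to illustrate the fixed-number-of-attributes rule.

```python
# Hypothetical schema: ordered additional attributes for each main category.
SCHEMA = {
    "N": ["gender", "number", "case"],
    "V": ["person", "number", "mood"],
}

UNDEFINED = "-"  # assumed placeholder for an attribute the source tag lacks


def to_rftagger_tag(category, attributes):
    """Build a dot-separated tag with a fixed number of attributes per category."""
    slots = SCHEMA.get(category, [])
    # Every slot is emitted; missing attributes are filled with the placeholder.
    values = [attributes.get(name, UNDEFINED) for name in slots]
    return ".".join([category] + values)


# A noun whose source tag carries no number attribute.
example = to_rftagger_tag("N", {"gender": "f", "case": "1"})
```

With this schema the example yields a noun tag in which the missing number slot is filled by the placeholder, so all noun tags end up with the same number of dot-separated fields.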
     32
     33For training '''RFTagger''' we use the manually annotated Slovak corpus '''r-mak 3.0''', which is translated into the tagger formalism by the program '''tag_RF_sk.py'''. Running the '''RFTagger''' training program on '''r-mak 3.0''' yields the parameter file used for annotation.
     34
     35The training program has many parameters, such as an additional lexicon, the number of tags used as context, etc. After experimenting with '''RFTagger''' we decided to use the following three parameters.
     36
     37We use a lexicon obtained from '''skTenTen''', which we annotated with '''RFTagger''' trained on '''r-mak 3.0'''. We use a list of POS tags for unknown words (words that do not occur in the training corpus). This list contains all possible tags for nouns, adjectives, numerals and verbs. We use the 8 preceding tags as context for a given word.
     38
     39After running the '''RFTagger''' training program we obtain a parameter file, which is used as input for the '''RFTagger''' annotation program. Annotation is then very simple: we pass a vertical text as input and obtain a tagged vertical.
     40
     41The only disadvantage of '''RFTagger''' is that it does not determine the lemma of a given word.
     42
     43
     44== Evaluation of '''RFTagger''' ==
     45
     46Parameters used in '''RFTagger''':
    847[[BR]]
    9 kind 98.14 %
     48-c 8: the 8 preceding tags are used as context
    1049[[BR]]
    11 genus 95.37 %
     50-o POSTag: the possible POS tags of unknown words are restricted to those listed in file POSTag
    1251[[BR]]
    13 number 99.16 %
     52-l lexicon: additional lexicon entries
    1453[[BR]]
    15 case 94.63 %
     54
     55
     56{{{
     57Feature     Accuracy
     58kind        98.16 %
     59genus       94.01 %
     60number      98.78 %
     61case        93.49 %
     62person      96.85 %
     63mod         99.92 %
     64whole tag   89.55 %
     65}}}
    1666[[BR]]
    17 person: 98.26 %
     67
     68   
     69We do not use any parameters. As input for training we use 90% of '''r-mak 3.0''' and annotate the remaining 10% of the corpus. We then measure accuracy by comparing the original 10% of '''r-mak 3.0''' with the same part annotated by '''RFTagger'''.
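The comparison between gold and automatically assigned tags can be sketched as follows. The source does not say how the per-feature figures were computed, so this is only one plausible way: it assumes dot-separated tags and compares a single attribute by its position, ignoring the fact that attributes are category-specific. The sample tags are invented.

```python
def accuracy(gold_tags, predicted_tags, position=None):
    """Share of tokens whose tags agree.

    With position=None whole tags are compared; otherwise only the
    dot-separated attribute at that position is compared.
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")

    def pick(tag):
        if position is None:
            return tag
        parts = tag.split(".")
        return parts[position] if position < len(parts) else None

    hits = sum(pick(g) == pick(p) for g, p in zip(gold_tags, predicted_tags))
    return hits / len(gold_tags)


# Invented gold and predicted tags for four tokens.
gold = ["N.f.sg.1", "V.3.sg.ind", "N.m.pl.4", "A.f.sg.1"]
pred = ["N.f.sg.4", "V.3.sg.ind", "N.m.pl.4", "N.f.sg.1"]

whole_tag_accuracy = accuracy(gold, pred)
category_accuracy = accuracy(gold, pred, position=0)
```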
     70
     71
     72{{{
     73Feature     Accuracy
     74kind        97.95 %
     75genus       95.68 %
     76number      99.22 %
     77case        95.45 %
     78person      98.53 %
     79mod         99.898 %
     80whole tag   92.42 %
     81
     82}}}
    1883[[BR]]
    19 mod: 99.98 %
    20 [[BR]]
    21 whole tag: 91.81 %
    2284
     85In this evaluation we also use the lexicon. The lexicon is obtained from skTenTen annotated by '''RFTagger''' trained on '''r-mak 3.0'''. In this case we divide '''r-mak 3.0''' into 5 folds and perform cross-validation, in which 4 folds are used for training '''RFTagger''' and 1 fold is annotated to obtain the results.
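The 5-fold setup can be sketched in Python as below. How the folds were actually drawn from '''r-mak 3.0''' is not stated, so the interleaved assignment of sentences to folds is an assumption, and the function name is hypothetical.

```python
def k_fold_splits(sentences, k=5):
    """Yield (train, test) pairs so that each fold is the test set exactly once."""
    if k < 2 or k > len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    # Assumed strategy: sentence i goes to fold i modulo k.
    folds = [sentences[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Ten placeholder "sentences" stand in for the corpus.
splits = list(k_fold_splits(list(range(10)), k=5))
```

Each of the five rounds would train the tagger on `train`, annotate `test`, and the reported figures would then be aggregated over the rounds.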