| 18 | |
| 19 | To disambiguate parts of speech, '''RFTagger''' uses additional attributes (such as gender and case from fine-grained tagsets) and word dependencies. |
| 20 | |
| 21 | '''RFTagger''' determines a tag as follows. The tag is decomposed into a set of simple attributes, and decision trees are used to estimate the probability of each attribute. These probabilities are then combined to determine the probability of the whole tag. The model trained on a corpus learns which words can occur together, based on contextual dependencies. As a result, when '''RFTagger''' encounters a word that does not occur in the training corpus, the trained model can still deduce its tag. |
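The combination step can be illustrated with a minimal sketch. It assumes that the tag probability is the product of the attribute probabilities; the attribute names and the numbers are hypothetical and stand in for the decision tree estimates.

```python
from math import prod


def tag_probability(attribute_probs):
    """Combine per-attribute probabilities into one tag probability.

    Assumption: the tag probability is the product of the probabilities
    estimated for its attributes.
    """
    return prod(attribute_probs.values())


# Hypothetical estimates, standing in for decision tree outputs.
attribute_probs = {"category=N", "gender=Fem", "case=Nom"}
attribute_probs = dict(zip(sorted(attribute_probs), [0.5, 0.25, 0.8]))
p = tag_probability(attribute_probs)
```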
| 22 | |
| 23 | The tagger treats dots in POS tag labels as attribute separators. This feature makes '''RFTagger''' a universal tool, because you can use your own tagset as an input for '''RFTagger''' without translating it to some specific tagset. The first attribute of a POS tag represents the main category and the additional attributes are category-specific, which means that the case of a noun and the case of an adjective are two different attributes. |
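The following sketch shows how such a dot-separated label can be read. The function names and the example tag `N.Reg.Nom.Sg.Fem` are illustrative assumptions, not part of '''RFTagger''' itself. Prefixing each attribute with its category shows why the case of a noun and the case of an adjective are distinct attributes.

```python
def split_tag(tag):
    """Split a dot-separated tag into its main category and attributes."""
    category, *attributes = tag.split(".")
    return category, attributes


def qualified_attributes(tag):
    """Name each attribute by category and position, e.g. 'N.2=Nom'."""
    category, attributes = split_tag(tag)
    return [f"{category}.{i}={value}"
            for i, value in enumerate(attributes, start=1)]


# Illustrative tag; any tagset that uses dots as separators works the same way.
category, attributes = split_tag("N.Reg.Nom.Sg.Fem")
```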
| 24 | |
| 25 | Two very important features of '''RFTagger''' are the lexicon and the file with possible POS tags of unknown words, both of which can be set as parameters for training. These two features can increase annotation accuracy; in our case they increased accuracy by 1%. |
| 26 | |
| 27 | Most POS taggers are trained on corpora with about 150 different POS tags. These tagsets usually contain few or no morphological features. Such taggers are not suitable for languages like German, Slovak or Czech, which have more fine-grained tagsets. |
| 28 | |
| 29 | For this reason we decided to use '''RFTagger''', which can perform morphological analysis on languages with fine-grained tags. |
| 30 | |
| 31 | Because '''RFTagger''' treats dots in POS tag labels as attribute separators and expects each category to have a fixed number of additional attributes, we had to adapt and update the program that translates tags into this formalism. We add every additional attribute defined for the given category. If a tag does not contain a given attribute, we mark it as undefined. |
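A minimal sketch of this padding step is shown below. It does not reproduce the actual translation program: the attribute schema, the placeholder value `x` and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical schema: the attributes each category must always carry.
ATTRIBUTE_SLOTS = {
    "N": ["gender", "number", "case"],
    "V": ["number", "person", "tense"],
}
UNDEFINED = "x"  # hypothetical placeholder for a missing attribute


def to_rf_tag(category, attributes):
    """Build a dot-separated tag with a fixed number of attributes."""
    slots = ATTRIBUTE_SLOTS[category]
    values = [attributes.get(slot, UNDEFINED) for slot in slots]
    return ".".join([category] + values)


# A noun with no number information still gets all three attribute slots.
tag = to_rf_tag("N", {"gender": "f", "case": "1"})
```

Every tag of a category then has the same number of dot-separated fields, as the tagger expects.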
| 32 | |
| 33 | To train '''RFTagger''' we use the manually annotated Slovak corpus '''r-mak 3.0''', which is translated into the tagger formalism by the program '''tag_RF_sk.py'''. By running the '''RFTagger''' training program on '''r-mak 3.0''' we obtain a parameter file for annotation. |
| 34 | |
| 35 | The training program has many parameters, such as an additional lexicon, the number of tags used as context, etc. After experimenting with '''RFTagger''' we decided to use the following three parameters. |
| 36 | |
| 37 | We use a lexicon obtained from '''skTenTen''', which we annotated with '''RFTagger''' trained on '''r-mak 3.0'''. We use a list of POS tags for unknown words (words that do not occur in the training corpus). This list contains all possible tags for nouns, adjectives, numerals and verbs. We use the 8 preceding tags as the context for a given word. |
| 38 | |
| 39 | After running the '''RFTagger''' training program we obtain a parameter file, which is used as an input for the '''RFTagger''' annotation program. The annotation itself is then very simple: we pass a vertical text to the annotation program and obtain a tagged vertical. |
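The tagged vertical can be processed further with a few lines of code. The sketch below assumes one token per line, with the word and its tag separated by a tab and sentences separated by empty lines. The function name and the sample tags are hypothetical.

```python
def read_tagged_vertical(lines):
    """Group 'word<TAB>tag' lines into sentences of (word, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.split("\t", 1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample of a tagged vertical.
sample = ["Mama\tN.f.s.1\n", "varí\tV.s.3.p\n", "\n", "Áno\tT\n"]
sentences = read_tagged_vertical(sample)
```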
| 40 | |
| 41 | The only disadvantage of '''RFTagger''' is that it does not determine the lemma of a given word. |
| 42 | |
| 43 | |
| 44 | == Evaluation of '''RFTagger''' == |
| 45 | |
| 46 | Parameters used in '''RFTagger''': |