Opened 5 years ago
Last modified 5 years ago
#31 new task
keep tags alphanumeric only
Reported by: | pary | Owned by: | xsmerk |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | majka | Version: | |
Keywords: | | Cc: | |
Description
change specification of the tagset to be alphanumeric only:
change kIx subclassification to:
xS .!? (sentence, stop)
xC ,:; (comma, colon)
xQ "’‘„“ (quotation)
xL ({[< (left)
xR )}]> (right)
xX ~$%&-_+=\|/# etc.
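
A minimal sketch of how the proposed classes could be applied, assuming the full tag is formed by prefixing the subclass with `kI` (e.g. `kIxS`) and that any symbol not listed explicitly falls into `xX`. The names `PUNCT_CLASSES`, `punct_subclass` and `is_alphanumeric_tag` are hypothetical and not part of majka:

```python
# Hypothetical lookup table built from the classes proposed in this ticket.
PUNCT_CLASSES = {
    "xS": ".!?",       # sentence, stop
    "xC": ",:;",       # comma, colon
    "xQ": '"’‘„“',     # quotation
    "xL": "({[<",      # left
    "xR": ")}]>",      # right
}

# Invert the table: one entry per punctuation character.
CHAR_TO_CLASS = {ch: cls for cls, chars in PUNCT_CLASSES.items() for ch in chars}


def punct_subclass(ch: str) -> str:
    """Return the proposed x-subclass for a single punctuation character.

    Assumption: the caller passes only punctuation/symbol characters;
    anything not listed above is treated as xX ("etc.").
    """
    return CHAR_TO_CLASS.get(ch, "xX")


def is_alphanumeric_tag(tag: str) -> bool:
    """Check the goal of the ticket: tags contain ASCII letters and digits only."""
    return tag.isascii() and tag.isalnum()


if __name__ == "__main__":
    for ch in "?;„(]~":
        tag = "kI" + punct_subclass(ch)  # assumed tag layout
        print(ch, tag, is_alphanumeric_tag(tag))
```

With this layout every punctuation tag passes the alphanumeric check, whereas a tag that embeds the character itself (such as a hypothetical `kIx~`) does not.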
remove statistical characteristic attribute (~):
it is not well defined:
- is it normalized?
- raw or logarithm?
- raw or document frequency?
- frequency of tag, word, word/tag, lemma/tag, word/lemma/tag?
- which corpus, domain, ...
statistical characteristic has nothing to do with morphology
the statistical characteristic attribute (~) is important for parsing: it allows a probability/ranking to be assigned to each word analysis. without it, "s" is equally likely to be a preposition, an abbreviation, or an interjection. its exact definition is less important, since it has only a few values/classes (0-3).
it is also not generated in the standard database (majka.w-lt), so it does not need to be removed (it is in majka.w-lt.synt).
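
A rough sketch of the ranking use case described in this comment. It assumes the attribute appears in the tag as `~` followed by one digit (0-3) and that a higher class means a more frequent analysis; neither detail is confirmed here. The tag strings and class values for "s" are invented for illustration, and `freq_class` and `rank_analyses` are hypothetical helpers:

```python
import re

# Assumed encoding: "~" followed by a single class digit 0-3.
_FREQ_RE = re.compile(r"~([0-3])")


def freq_class(tag: str) -> int:
    """Extract the frequency class from a tag; 0 if the attribute is absent."""
    match = _FREQ_RE.search(tag)
    return int(match.group(1)) if match else 0


def rank_analyses(tags):
    """Order candidate analyses from most to least frequent (assumed: higher = more frequent)."""
    return sorted(tags, key=freq_class, reverse=True)


if __name__ == "__main__":
    # Invented analyses of "s": abbreviation, preposition, interjection.
    analyses = ["kA~1", "k7c7~3", "k0~0"]
    print(rank_analyses(analyses))
```

Without the attribute all three candidates get the same class, so a parser has no basis for preferring one reading over another.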