Opened 5 years ago

Last modified 5 years ago

#31 new task

keep tags alphanumeric only

Reported by: pary
Owned by: xsmerk
Priority: major
Milestone:
Component: majka
Version:
Keywords:
Cc:

Description

change specification of the tagset to be alphanumeric only:

change kIx subclassification to:

xS .!? (sentence, stop)
xC ,:; (comma, colon)
xQ "’‘„“ (quotation)
xL ({[< (left)
xR )}]> (right)
xX ~$%&-_+=\|/# etc.

remove the statistical characteristic attribute (~), because it is not well defined:

  • is it normalized?
  • raw value or logarithm?
  • raw frequency or document frequency?
  • frequency of tag, word, word/tag, lemma/tag, word/lemma/tag?
  • what corpus, domain, ...

statistical characteristic has nothing to do with morphology

Change History (1)

comment:1 Changed 5 years ago by hales

the statistical characteristic attribute (~) is important for parsing: it allows a probability/ranking to be assigned to each word analysis. without it, "s" is equally likely to be a preposition, an abbreviation, or an interjection. its exact definition is not that important, since it has only a few values/classes (0-3).
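a rough sketch of how a parser might use such a class to rank analyses. everything here is an assumption made for illustration: the function name rank_analyses is hypothetical, the class values given to "s" are made up, and so is the idea that a higher class means a more frequent analysis and that class + 1 can serve as a weight.

```python
def rank_analyses(analyses):
    """Rank (label, freq_class) pairs, most likely first, with normalized weights.

    Assumption: freq_class is an integer 0-3 and a higher class means a more
    frequent analysis. The weight class + 1 is an arbitrary illustrative choice
    that keeps class 0 from dropping to zero probability.
    """
    weights = [(label, freq_class + 1) for label, freq_class in analyses]
    total = sum(weight for _, weight in weights)
    ranked = sorted(weights, key=lambda pair: pair[1], reverse=True)
    return [(label, weight / total) for label, weight in ranked]


# Made-up classes for the three analyses of "s" mentioned above.
example = [("abbreviation", 1), ("interjection", 0), ("preposition", 3)]

if __name__ == "__main__":
    for label, prob in rank_analyses(example):
        print(f"{label}: {prob:.2f}")
```

without the attribute, every analysis would get the same weight and the order would carry no information.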

it is also not generated in the standard database (majka.w-lt), so there is no need to remove it (it is present in majka.w-lt.synt).
