Version 4 (modified by xkocinc, 9 years ago)


Word Level Analysis



Many applications need a tool for “clustering” the word forms that appear in texts:

  • chladniček, chladničky, chladničkách, chladničce, ... <=> chladnička


  • Indexing, searching, keyword extraction, ...
  • And almost all NLP tools
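
As a rough illustration of the clustering idea for indexing, the sketch below maps each token to its lemma and collects token positions per lemma. The table `FORM_TO_LEMMA` and the function `cluster_by_lemma` are hypothetical names invented for this example; a real analyser would consult its full lexicon instead of a five-entry dictionary.

```python
from collections import defaultdict

# Toy form-to-lemma table (assumption: stands in for the real lexicon).
FORM_TO_LEMMA = {
    "chladnička": "chladnička",
    "chladniček": "chladnička",
    "chladničky": "chladnička",
    "chladničkách": "chladnička",
    "chladničce": "chladnička",
}


def cluster_by_lemma(tokens):
    """Group token positions under the lemma of each token."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        form = token.lower()
        # Unknown forms are indexed under themselves.
        lemma = FORM_TO_LEMMA.get(form, form)
        index[lemma].append(position)
    return dict(index)


index = cluster_by_lemma(["Chladničky", "v", "chladničkách"])
```

With such an index, a query for any form of chladnička can be normalised to the lemma and matched against both occurrences.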

Word Level Processing Data for Czech

For almost 12 million word forms (including colloquial forms), the data provide:

  • lemma (canonical form, dictionary form)
  • grammatical information: part of speech, number, case etc.

Word form stroj has 3 interpretations:

  • lemma stroj, nominative
  • lemma stroj, accusative
    • noun, masculine inanimate, singular (both readings of lemma stroj)
  • lemma strojit
    • verb, 2nd person, singular, imperative mood
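
One possible way to represent such ambiguous analyses in code is a list of (lemma, part of speech, features) records per word form. The class `Analysis`, the table `ANALYSES`, and the feature names are assumptions made for this sketch, not the tag set used by the actual data.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    features: dict  # hypothetical attribute names, not the real tag set


# Toy data covering only the word form "stroj".
ANALYSES = {
    "stroj": [
        Analysis("stroj", "noun", {"number": "singular", "case": "nominative"}),
        Analysis("stroj", "noun", {"number": "singular", "case": "accusative"}),
        Analysis("strojit", "verb",
                 {"person": "2", "number": "singular", "mood": "imperative"}),
    ],
}


def analyse(form):
    """Return every known interpretation of a word form (empty if unknown)."""
    return ANALYSES.get(form.lower(), [])


def lemmas(form):
    """Return the distinct lemmas of a word form, sorted alphabetically."""
    return sorted({analysis.lemma for analysis in analyse(form)})
```

Here `analyse("stroj")` yields three records, while `lemmas("stroj")` collapses them to the two distinct lemmas.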

Possible Applications

Various types of analyses:

  • word form => lemma (many types of searching/indexing)
    • nebral => brát/nebrat (úplatky)
    • nejstaršího => nejstarší/starý (člověk)
    • chladnička => chladničky (as a class)
    • bavlna => bavlněný (word derivation)
  • word form/lemma + grammatical information => word form
    • e.g. salutation generation (vocative): pane Procházko
  • word form/lemma => all word forms
  • word form => lemma + full/partial grammatical information
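
The generation direction (lemma plus grammatical information => word form) can be sketched as a lookup in a paradigm table. The table `PARADIGMS` and the functions `generate`, `all_forms`, and `salutation` are hypothetical and cover only the forms needed for the salutation example.

```python
# Toy paradigm table: (lemma, case, number) -> word form.
# Assumption: a real generator would derive forms from inflection patterns.
PARADIGMS = {
    ("pán", "vocative", "singular"): "pane",
    ("Procházka", "nominative", "singular"): "Procházka",
    ("Procházka", "vocative", "singular"): "Procházko",
}


def generate(lemma, case, number="singular"):
    """Return the requested word form, or None if it is not in the table."""
    return PARADIGMS.get((lemma, case, number))


def all_forms(lemma):
    """Return all distinct word forms stored for a lemma."""
    return sorted({form for (entry, _, _), form in PARADIGMS.items()
                   if entry == lemma})


def salutation(title_lemma, surname_lemma):
    """Build a salutation by putting both lemmas into the vocative."""
    parts = [generate(title_lemma, "vocative"),
             generate(surname_lemma, "vocative")]
    if None in parts:
        raise KeyError("vocative form missing from the toy table")
    return " ".join(parts)


greeting = salutation("pán", "Procházka")
```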

The analysis is very fast: approximately 1 million word forms per second.

Processing Unknown Words

Some word forms in processed texts are unknown:

  • technical terms (polydaktylie), neologisms (klausoviny), typos (bizardního), colloquial words (plaťáky), etc.

The ending of a word form can indicate, e.g.:

  • lemma: klausoviny => klausovina
  • grammatical information: bizardního => genitive, etc.
  • derivational relations: plaťáky => plaťákový
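
A minimal sketch of ending-based guessing, assuming a small hand-written rule list: each rule maps a word-form ending to a lemma ending and a grammatical hint, and the longest matching ending wins. The rules in `ENDING_RULES` and the functions `guess` and `derive_adjective` are invented for the three examples above; real endings are often ambiguous and would yield several candidates.

```python
# (word-form ending, lemma ending, grammatical hint); hypothetical toy rules.
ENDING_RULES = [
    ("oviny", "ovina", "noun"),
    ("ního", "ní", "genitive"),
    ("áky", "ák", "noun, plural"),
]


def guess(form):
    """Guess (lemma, hint) for an unknown form; None if no rule applies."""
    # Try longer endings first so the most specific rule is used.
    for ending, lemma_ending, hint in sorted(ENDING_RULES,
                                             key=lambda rule: -len(rule[0])):
        if form.endswith(ending) and len(form) > len(ending):
            return form[:-len(ending)] + lemma_ending, hint
    return None


def derive_adjective(lemma):
    """Toy derivation: nouns in -ák form an adjective in -ákový."""
    return lemma + "ový" if lemma.endswith("ák") else None
```

For instance, `guess("plaťáky")` proposes the lemma plaťák, from which `derive_adjective` produces plaťákový.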

Texts from a particular domain allow unknown word forms to be grouped:

  • polydaktylie, polydaktyliích, polydaktylií, ... <=> polydaktylie
  • => extension of data or more precise “guessing”
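
The grouping step could look roughly like the following sketch, which strips a candidate ending from each unknown form and clusters forms that share the remaining stem. The ending list `ENDINGS` and the function `group_unknown` are assumptions for illustration; a real system would take its endings from the inflection patterns of the language.

```python
from collections import defaultdict

# Hypothetical candidate endings, tried longest first.
ENDINGS = sorted(["ích", "í", "e"], key=len, reverse=True)


def stem_of(form):
    """Strip the longest matching ending; return the form itself otherwise."""
    for ending in ENDINGS:
        if form.endswith(ending) and len(form) > len(ending):
            return form[:-len(ending)]
    return form


def group_unknown(forms):
    """Cluster unknown forms by stem, keeping clusters of two or more forms."""
    groups = defaultdict(list)
    for form in forms:
        groups[stem_of(form)].append(form)
    return {stem: members for stem, members in groups.items()
            if len(members) > 1}


groups = group_unknown(
    ["polydaktylie", "polydaktyliích", "polydaktylií", "plaťáky"]
)
```

A cluster with several attested forms is stronger evidence for a new lemma than a single form, so it is a natural candidate for extending the data.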

Resolving Ambiguities Using Context

An extreme case: Stroj ženu holí.

  • Já stroj ženu holí, ty stroj ženu holí, ten stroj ženu holí.

A more usual case is, e.g., stát:

  • noun: Stát jsem já.
  • verb: Celá továrna musela hodinu stát.
  • at the part-of-speech level, this kind of ambiguity is a bigger problem for English

The context of the word determines its interpretation

  • rules and/or statistical data describe typical contexts of nouns, verbs, etc.
  • using such information, one can tell whether stát is a noun or a verb
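
As a toy illustration of the rule-based variant, the sketch below decides between the noun and verb readings of stát from its neighbours. The word lists `MODALS` and `FINITE_VERBS` and the two rules in `guess_pos` are hypothetical and tuned only to the two example sentences; a practical disambiguator would rely on much richer rules or on statistics gathered from annotated text.

```python
# Tiny hypothetical word lists, sufficient only for the examples above.
MODALS = {"musel", "musela", "muselo", "museli"}
FINITE_VERBS = {"jsem", "jsi", "je", "jsou"}


def guess_pos(tokens, i):
    """Return 'noun', 'verb', or None for the ambiguous token at index i."""
    left = [token.lower() for token in tokens[:i]]
    right = [token.lower() for token in tokens[i + 1:]]
    if any(token in MODALS for token in left):
        # An infinitive is typically governed by a preceding modal verb.
        return "verb"
    if right and right[0] in FINITE_VERBS:
        # A word directly followed by a finite verb is read as a noun here.
        return "noun"
    return None  # the toy rules cannot decide


noun_case = guess_pos(["Stát", "jsem", "já"], 0)
verb_case = guess_pos(["Celá", "továrna", "musela", "hodinu", "stát"], 4)
```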
