= Word Level Analysis =
== Motivation ==

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/chladnicka.png)]]


Many applications need a tool for “clustering” the word forms that appear in texts:

 * chladniček
 * chladničky
 * chladničkách     <=>   chladnička
 * chladničce
 * ...

Usage:

 * Indexing, searching, keyword extraction, ...
 * Almost all other NLP tools

== Word Level Processing Data for Czech ==
The data cover almost 12 million word forms (including colloquial forms). For each word form they provide:

 * lemma (canonical form, dictionary form)
 * grammatical information: part of speech, number, case etc.

The word form ''stroj'' has 3 interpretations:

 * lemma ''stroj'', nominative
 * lemma ''stroj'', accusative
   * both: noun, masculine inanimate, singular
 * lemma ''strojit''
   * verb, 2nd person, singular, imperative mood
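The three interpretations above can be pictured as a lookup from a word form to a list of (lemma, grammatical information) pairs. The following is only a minimal sketch: the names `Analysis`, `LEXICON` and `analyze` are hypothetical, and the one-entry dictionary stands in for the real data.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    info: str  # human-readable grammatical information


# Hypothetical toy lexicon; the real data cover almost 12 million word forms.
LEXICON = {
    "stroj": [
        Analysis("stroj", "noun, singular, nominative"),
        Analysis("stroj", "noun, singular, accusative"),
        Analysis("strojit", "verb, 2nd person, singular, imperative"),
    ],
}


def analyze(word_form: str) -> list[Analysis]:
    """Return every known interpretation of a word form (empty if unknown)."""
    return LEXICON.get(word_form.lower(), [])


for analysis in analyze("stroj"):
    print(analysis.lemma, "|", analysis.info)
```

An ambiguous form simply returns more than one entry; choosing among them is left to a later, context-aware step.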


== Possible Applications ==
Various types of analyses:
 * word form => lemma (many types of searching/indexation)
   * ''nebral => brát/nebrat (úplatky)''
   * ''nejstaršího => nejstarší/starý (člověk)''
   * ''chladnička => chladničky'' (as a class)
   * ''bavlna => bavlněný'' (word derivation)
 * word form/lemma + gram. info. => word form
   * e.g. salutation generation: ''pane Procházko''
 * word form/lemma => all word forms
 * word form => lemma + full/partial grammatical information
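The directions of analysis listed above can be illustrated with a small paradigm table and an inverted index built from it. This is a sketch under stated assumptions: `PARADIGMS`, `lemmatize`, `generate` and `all_forms` are hypothetical names, and the table holds only a few forms taken from the examples on this page.

```python
from collections import defaultdict

# Hypothetical toy paradigm table: lemma -> {grammatical tag: word form}.
PARADIGMS = {
    "chladnička": {
        "nominative singular": "chladnička",
        "genitive singular": "chladničky",
        "dative singular": "chladničce",
        "genitive plural": "chladniček",
        "locative plural": "chladničkách",
    },
    "Procházka": {
        "nominative singular": "Procházka",
        "vocative singular": "Procházko",
    },
}

# Inverted index: word form -> set of lemmas it can belong to.
FORM_TO_LEMMAS = defaultdict(set)
for lemma, forms in PARADIGMS.items():
    for form in forms.values():
        FORM_TO_LEMMAS[form].add(lemma)


def lemmatize(word_form):
    """word form => lemma(s)"""
    return sorted(FORM_TO_LEMMAS.get(word_form, ()))


def generate(lemma, tag):
    """lemma + grammatical information => word form (None if unknown)"""
    return PARADIGMS.get(lemma, {}).get(tag)


def all_forms(word_form):
    """word form => all word forms sharing a lemma with it"""
    forms = set()
    for lemma in lemmatize(word_form):
        forms.update(PARADIGMS[lemma].values())
    return sorted(forms)


print(lemmatize("chladničkách"))
print("pane", generate("Procházka", "vocative singular"))
print(all_forms("chladničce"))
```

Keeping one table and deriving the inverted index from it means analysis and generation cannot drift apart.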

The analysis is very fast: approximately 1 million word forms per second.


== Processing Unknown Words ==

Some word forms in processed texts are unknown:
 * terms ''polydaktylie'', neologisms ''klausoviny'', typos ''bizardního'', colloquial words ''plaťáky'', etc.

The ending of a word form can determine, e.g.:
 * lemma: ''klausoviny => klausovina''
 * grammatical information: ''bizardního'' => genitive, etc.
 * derivational relations: ''plaťáky => plaťákový''
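Ending-based guessing as described above can be sketched with a short list of ending rules, tried longest first. The rules, the name `ENDING_RULES` and the function `guess` are illustrative assumptions covering only two of the examples on this page; a real guesser would derive its rules from the full data.

```python
# Hypothetical ending rules: (word-form ending, lemma ending, grammatical guess).
ENDING_RULES = [
    ("oviny", "ovina", "noun"),
    ("ního", "ní", "adjective, genitive"),
]


def guess(word_form):
    """Guess (lemma, grammatical info) from the longest matching ending."""
    # Longer endings are more specific, so they are tried first.
    for ending, lemma_ending, info in sorted(ENDING_RULES, key=lambda r: -len(r[0])):
        if word_form.endswith(ending):
            stem = word_form[: len(word_form) - len(ending)]
            return stem + lemma_ending, info
    return None  # no rule applies


print(guess("klausoviny"))
print(guess("bizardního"))
```

The grammatical guess is only one plausible reading; an ending is usually compatible with several interpretations.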

Texts from a particular domain allow unknown word forms to be grouped:
 * ''polydaktylie, polydaktiliích, polydaktylií, ... <=> polydaktylie''
 * => extension of data or more precise “guessing”
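One very rough way to picture such grouping is to collect unknown forms that share a long common prefix. This is a deliberately naive sketch: the function `group_by_prefix` and the prefix length of 8 are assumptions chosen for the example, not the method used by the actual tool.

```python
from collections import defaultdict


def group_by_prefix(word_forms, prefix_len=8):
    """Group word forms sharing their first `prefix_len` characters."""
    groups = defaultdict(list)
    for form in word_forms:
        groups[form[:prefix_len].lower()].append(form)
    # Keep only prefixes shared by at least two forms.
    return {prefix: forms for prefix, forms in groups.items() if len(forms) > 1}


unknown = ["polydaktylie", "polydaktiliích", "polydaktylií", "plaťáky"]
print(group_by_prefix(unknown))
```

Once several forms fall into one group, their endings can be compared to pick the most likely lemma and paradigm.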


== Resolving Ambiguities Using Context ==

An extreme case is ''Stroj ženu holí.'', where every word is ambiguous:
 * ''Já stroj ženu holí, ty stroj ženu holí, ten stroj ženu holí.''

A more typical case is, e.g., ''stát'':
 * noun: ''Stát jsem já.''
 * verb: ''Celá továrna musela hodinu stát.''
 * at the part-of-speech level, this kind of ambiguity is a bigger problem for English

The context of a word determines its interpretation:
 * rules and/or statistical data describe typical contexts of nouns, verbs, etc.
 * using such information, one can tell whether ''stát'' is a noun or a verb
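A single hand-written context rule is enough to separate the two ''stát'' examples above. The sketch below is a toy: the cue list `MODAL_FORMS` and the function `disambiguate_stat` are hypothetical, and a real system would combine many rules or statistical data.

```python
# Hypothetical cue list: modal verb forms that typically govern an infinitive.
MODAL_FORMS = {"musel", "musela", "muselo", "musí", "může", "mohla", "chce"}


def disambiguate_stat(sentence):
    """Return 'verb' or 'noun' for the form 'stát', or None if it is absent."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    if "stát" not in tokens:
        return None
    position = tokens.index("stát")
    # Rule: a preceding modal verb suggests that 'stát' is an infinitive.
    if any(token in MODAL_FORMS for token in tokens[:position]):
        return "verb"
    return "noun"


print(disambiguate_stat("Stát jsem já."))
print(disambiguate_stat("Celá továrna musela hodinu stát."))
```

The rule falls back to the noun reading when no cue is found, which is an arbitrary default for this example.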