= Word Level Analysis =

== Motivation ==

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/chladnicka.png)]]

Many applications need a tool for “clustering” the word forms that appear in texts:
 * chladniček
 * chladničky
 * chladničkách <=> chladnička
 * chladničce
 * ...

Usage:
 * Indexing, searching, keyword extraction, ...
 * Almost all NLP tools

== Word Level Processing Data for Czech ==

For almost 12 M word forms (incl. colloquial forms), the data provide:
 * lemma (canonical form, dictionary form)
 * grammatical information: part of speech, number, case, etc.

The word form ''stroj'' has 3 interpretations:
 * lemma ''stroj'', nominative
 * lemma ''stroj'', accusative
   * noun, masculine inanimate, singular (both readings)
 * lemma ''strojit''
   * verb, 2nd person, singular, imperative mood

== Possible Applications ==

Various types of analyses:
 * word form => lemma (many types of searching/indexing)
   * ''nebral => brát/nebrat (úplatky)''
   * ''nejstaršího => nejstarší/starý (člověk)''
   * ''chladnička => chladničky'' (as a class)
   * ''bavlna => bavlněný'' (word derivation)
 * word form/lemma + gram. info. => word form
   * e.g. salutation generation: ''pane Procházko''
 * word form/lemma => all word forms
 * word form => lemma + full/partial grammatical information

The analysis is very fast: approx. 1 million word forms per second.

== Processing Unknown Words ==

Some word forms in processed texts are unknown:
 * terms (''polydaktylie''), neologisms (''klausoviny''), typos (''bizardního''), colloquial words (''plaťáky''), etc.

The ending of a word form can determine, e.g.:
 * lemma: ''klausoviny => klausovina''
 * grammatical information: ''bizardního'' => genitive, etc.
 * derivational relations: ''plaťáky => plaťákový''

Texts from a particular domain allow grouping of unknown word forms:
 * ''polydaktylie, polydaktyliích, polydaktylií, ... <=> polydaktylie''
 * => extension of the data or more precise “guessing”

== Resolving Ambiguities Using Context ==

An extreme case: ''Stroj ženu holí.''
 * ''Já stroj ženu holí, ty stroj ženu holí, ten stroj ženu holí.''

A more usual case is e.g. ''stát'':
 * noun: ''Stát jsem já.''
 * verb: ''Celá továrna musela hodinu stát.''
 * at the part-of-speech level, such ambiguity is a bigger problem for English

The context of a word determines its interpretation:
 * rules and/or statistical data describe typical contexts of nouns, verbs, etc.
 * using such information, one can tell whether ''stát'' is a noun or a verb
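
The idea of context rules can be illustrated with a minimal sketch. It is not the tool described on this page: the lexicon, the tag names and the two rules below are hypothetical and hand-written only to cover the two ''stát'' examples above. A real system would rely on full morphological data and on many more rules or on statistics.

```python
# Hypothetical toy lexicon: word form -> possible part-of-speech tags.
# The entries and tag names are assumptions made for this illustration only.
LEXICON = {
    "stát": {"noun", "verb"},   # ambiguous: "state" vs. "to stand"
    "jsem": {"verb"},
    "já": {"pronoun"},
    "celá": {"adjective"},
    "továrna": {"noun"},
    "musela": {"modal"},
    "hodinu": {"noun"},
}


def tokenize(sentence):
    """Split on whitespace, strip basic punctuation, lowercase."""
    tokens = (t.strip(".,!?").lower() for t in sentence.split())
    return [t for t in tokens if t]


def disambiguate(tokens, index):
    """Return the tags of tokens[index] that survive two toy context rules."""
    candidates = set(LEXICON.get(tokens[index], set()))
    if len(candidates) <= 1:
        return candidates

    # Rule 1 (assumed): a modal verb earlier in the sentence licenses
    # an infinitive, so the verb reading is preferred.
    has_modal_before = any("modal" in LEXICON.get(t, ()) for t in tokens[:index])
    if "verb" in candidates and has_modal_before:
        return {"verb"}

    # Rule 2 (assumed): a word directly followed by a verb is read as a noun.
    following = tokens[index + 1] if index + 1 < len(tokens) else None
    if "noun" in candidates and following is not None \
            and "verb" in LEXICON.get(following, ()):
        return {"noun"}

    # No rule applies: the word stays ambiguous.
    return candidates


def tag_word(sentence, word):
    """Disambiguate the first occurrence of `word` in `sentence`."""
    tokens = tokenize(sentence)
    return disambiguate(tokens, tokens.index(word))


if __name__ == "__main__":
    print(tag_word("Stát jsem já.", "stát"))
    print(tag_word("Celá továrna musela hodinu stát.", "stát"))
```

Without any context (for example the one-word input ''Stát''), neither rule fires and both readings are kept, which mirrors the fact that the word form alone cannot be resolved.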