= Word Level Analysis =

== Motivation ==

Many applications need a tool for “clustering” the word forms that appear in texts, i.e. for mapping inflected forms to a common base form. For example, all of the following forms <=> ''chladnička'':
 * chladniček
 * chladničky
 * chladničkách
 * chladničce
 * ...

Usage:
 * indexing, searching, keyword extraction, ...
 * almost all NLP tools

== Word Level Processing Data for Czech ==

Data are available for almost 12 million word forms (including colloquial forms). Each word form is assigned:
 * a lemma (canonical form, dictionary form)
 * grammatical information: part of speech, number, case, etc.

The word form ''stroj'' has 3 interpretations:
 * lemma ''stroj'', nominative
   * noun, masculine inanimate, singular
 * lemma ''stroj'', accusative
   * noun, masculine inanimate, singular
 * lemma ''strojit''
   * verb, 2nd person, singular, imperative mood

== Possible Applications ==

Various types of analyses:
 * word form => lemma (many types of searching/indexing)
   * nebral => brát/nebrat (úplatky)
   * nejstaršího => nejstarší/starý (člověk)
   * chladnička => chladničky (as a class)
   * bavlna => bavlněný (word derivation)
 * word form/lemma + grammatical information => word form
   * e.g. salutation generation: pane Procházko
 * word form/lemma => all word forms
 * word form => lemma + full/partial grammatical information

The analysis is very fast: approx. 1 million word forms per second.
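
The two basic directions described above (word form => lemma with grammatical information, and lemma => all word forms) can be illustrated with a minimal Python sketch. It is not the actual tool or its API: the `LEXICON` table, the `Analysis` type and the functions `analyze`, `lemmas` and `forms_of` are hypothetical names introduced only for this example, and the toy dictionary holds just a few of the word forms mentioned on this page with simplified tags.

```python
from collections import defaultdict
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    tags: str  # simplified, free-text grammatical description (assumed format)


# Toy lexicon (assumption): word form -> all of its possible interpretations.
LEXICON: dict[str, list[Analysis]] = {
    "stroj": [
        Analysis("stroj", "noun, singular, nominative"),
        Analysis("stroj", "noun, singular, accusative"),
        Analysis("strojit", "verb, 2nd person, singular, imperative"),
    ],
    "chladniček": [Analysis("chladnička", "noun")],
    "chladničky": [Analysis("chladnička", "noun")],
    "chladničkách": [Analysis("chladnička", "noun")],
    "chladničce": [Analysis("chladnička", "noun")],
}

# Inverted index for the opposite direction: lemma -> set of word forms.
FORMS_BY_LEMMA: dict[str, set[str]] = defaultdict(set)
for form, analyses in LEXICON.items():
    for analysis in analyses:
        FORMS_BY_LEMMA[analysis.lemma].add(form)


def analyze(form: str) -> list[Analysis]:
    """Return every interpretation of a word form (empty list if unknown)."""
    return LEXICON.get(form, [])


def lemmas(form: str) -> list[str]:
    """Return the distinct lemmas of a word form, sorted."""
    return sorted({a.lemma for a in analyze(form)})


def forms_of(lemma: str) -> set[str]:
    """Return all word forms in the lexicon that belong to a lemma."""
    return set(FORMS_BY_LEMMA.get(lemma, set()))


print(lemmas("stroj"))  # → ['stroj', 'strojit']
```

A plain in-memory dictionary is used here only for readability. Data on the scale of almost 12 million word forms would in practice call for a much more compact structure, which this sketch does not attempt to model.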