= Word Level Analysis =

== Motivation ==

Many applications need a tool for “clustering” the word forms that appear in texts:
 * chladniček, chladničky, chladničkách, chladničce, ... <=> chladnička

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/chladnicka.png)]]

Usage:
 * Indexing, searching, keyword extraction, ...
 * And almost all NLP tools

== Word Level Processing Data for Czech ==

For almost 12 M word forms (incl. colloquial forms):
 * lemma (canonical form, dictionary form)
 * grammatical information: part of speech, number, case etc.

The word form ''stroj'' has 3 interpretations:
 * lemma ''stroj'', nominative
 * lemma ''stroj'', accusative
   * both: noun, masculine inanimate, singular
 * lemma ''strojit''
   * verb, 2nd person, singular, imperative mood

== Possible Applications ==

Various types of analyses:
 * word form => lemma (many types of searching/indexing)
   * ''nebral => brát/nebrat (úplatky)''
   * ''nejstaršího => nejstarší/starý (člověk)''
   * ''chladnička => chladničky'' (as a class)
   * ''bavlna => bavlněný'' (word derivation)
 * word form/lemma + gram. info. => word form
   * e.g. salutation generation: ''pane Procházko''
 * word form/lemma => all word forms
 * word form => lemma + full/partial grammatical information

The analysis is very fast: approx. 1 million word forms per second.

== Processing Unknown Words ==

Some word forms in processed texts are unknown:
 * terms ''polydaktylie'', neologisms ''klausoviny'', typos ''bizardního'', colloquial words ''plaťáky'', etc.

The ending of a word form can determine e.g.
 * lemma: ''klausoviny => klausovina''
 * grammatical information: ''bizardního'' => genitive, etc.
 * derivational relations: ''plaťáky => plaťákový''

Texts from a particular domain allow grouping of unknown word forms:
 * ''polydaktylie, polydaktyliích, polydaktylií, ...
   <=> polydaktylie''
 * => extension of data or more precise “guessing”

== Resolving Ambiguities Using Context ==

An extreme case: ''Stroj ženu holí.''
 * ''Já stroj ženu holí, ty stroj ženu holí, ten stroj ženu holí.''

A more usual case is e.g. ''stát''
 * noun: ''Stát jsem já.''
 * verb: ''Celá továrna musela hodinu stát.''
 * ambiguity at the part-of-speech level is a bigger problem for English

The context of the word determines its interpretation:
 * rules and/or statistical data describe typical contexts of nouns, verbs, etc.
 * using such information one can tell whether ''stát'' is a noun or a verb

== Example of Contexts: Word Sketches ==

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/stat.png)]]

== Spellchecking and Diacritics Restoration ==

The data also allow spellchecking and diacritics restoration:

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/czAccent.png)]]

== Universality ==

All the mentioned processes can be
 * tuned for a specific domain
   * using texts from this domain
 * applied to a language other than Czech
   * (Slovak, Polish, German, English, ...)

== Latest Applications ==

Seznam.cz, Yandex.ru, Aukro.cz, Václav Havel Library
 * indexing and searching

Information System of Masaryk University
 * other universities and schools (FHS UK, JAMU, VŠFS, ...)
 * affiliate projects (theses.cz, odevzdej.cz, repozitar.cz)
 * indexing, searching and plagiarism detection

“Internetová jazyková příručka”
 * online source on Czech orthography and grammar
 * NLP Centre data were a starting point for word form tables

== Conclusions ==

Word level processing of texts allows:
 * determining various types of base word, i.e. deciding which word forms are to be grouped together
 * ambiguity resolution according to the context
 * word form generation
 * spellchecking, diacritics restoration

The tools and data can be tailored to a specific domain and to various languages.
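
As a rough illustration of the analyses summarised on this page, the following minimal Python sketch shows a word form => lemma lookup, grouping of forms under a shared lemma, and ending-based guessing for unknown words. It is not the NLP Centre implementation: the `ANALYSES` table, the `ENDING_RULES` list and all function names are hypothetical, and the toy entries only reuse examples from this page (with a single reading per ''chladnička'' form for brevity).

```python
from collections import defaultdict

# Hypothetical toy data; the real data cover almost 12 M word forms.
# Each word form maps to its possible (lemma, grammatical info) readings.
ANALYSES = {
    "stroj": [
        ("stroj", "noun, singular, nominative"),
        ("stroj", "noun, singular, accusative"),
        ("strojit", "verb, 2nd person, singular, imperative"),
    ],
    # Only one reading per form is listed here to keep the example short.
    "chladnička": [("chladnička", "noun, singular, nominative")],
    "chladničky": [("chladnička", "noun, singular, genitive")],
    "chladničce": [("chladnička", "noun, singular, dative")],
}

# Hypothetical (ending, replacement) rules for guessing lemmas of unknown words.
ENDING_RULES = [
    ("oviny", "ovina"),  # klausoviny => klausovina
    ("áky", "ák"),       # plaťáky => plaťák
]


def analyse(form):
    """Return all known (lemma, grammatical info) readings of a word form."""
    return ANALYSES.get(form, [])


def lemmas(form):
    """Return the distinct lemmas of a known word form, in first-seen order."""
    seen = []
    for lemma, _info in analyse(form):
        if lemma not in seen:
            seen.append(lemma)
    return seen


def guess_lemma(form):
    """Guess a lemma from the ending, trying the longest matching ending first."""
    for ending, replacement in sorted(ENDING_RULES, key=lambda r: -len(r[0])):
        if form.endswith(ending):
            return form[: -len(ending)] + replacement
    return form  # no rule matched: fall back to the form itself


def group_by_lemma(forms):
    """Cluster word forms under their lemmas, guessing lemmas of unknown forms."""
    groups = defaultdict(set)
    for form in forms:
        for lemma in lemmas(form) or [guess_lemma(form)]:
            groups[lemma].add(form)
    return dict(groups)
```

With this toy data, `lemmas("stroj")` returns `["stroj", "strojit"]`, which is the kind of ambiguity that context-based disambiguation then has to resolve, and `guess_lemma("klausoviny")` returns `"klausovina"`. A production system would replace the dictionary with a compact data structure and the hand-written rules with endings learned from data.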