Changes between Version 1 and Version 2 of en/WordLevelAnalysis
Timestamp: Jun 5, 2014, 10:48:28 AM
= Word Level Analysis =

== Motivation ==
Many applications need a tool for “clustering” of word forms appearing in texts:
 * chladniček
 * chladničky
 * chladničkách <=> chladnička
…

Usage:

 * Indexing, searching, keyword extraction, ...
 * And almost all NLP tools

== Word Level Processing Data for Czech ==
For almost 12 M word forms (incl. colloquial forms):
 * lemma (canonical form, dictionary form)
 * grammatical information: part of speech, number, case etc.

The word form ''stroj'' has 3 interpretations:

 * lemma ''stroj'', nominative
 * lemma ''stroj'', accusative
   * noun, masculine inanimate, singular
 * lemma ''strojit''
   * verb, 2nd person, singular, imperative mood

== Possible Applications ==
Various types of analyses:
 * word form => lemma (many types of searching/indexation)
   * nebral => brát/nebrat (úplatky)
   * nejstaršího => nejstarší/starý (člověk)
   * chladnička => chladničky (as a class)
   * bavlna => bavlněný (word derivation)
 * word form/lemma + gram. info. => word form
   * e.g. salutation generation: pane Procházko
 * word form/lemma => all word forms
 * word form => lemma + full/partial grammatical information

The analysis is very fast: approx. 1 million word forms per second.
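The analysis types listed above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the lexicon layout, the tag names and the functions `analyze`, `lemmatize`, `generate` and `all_forms` are hypothetical and are not the interface of the actual tool, which works with data for almost 12 M word forms rather than a handful of hand-written entries.

```python
from collections import defaultdict
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    tags: frozenset  # e.g. {"noun", "singular", "nominative"}


# Toy lexicon (hypothetical layout): (word form, lemma, grammatical tags).
ENTRIES = [
    ("stroj", "stroj", {"noun", "singular", "nominative"}),
    ("stroj", "stroj", {"noun", "singular", "accusative"}),
    ("stroj", "strojit", {"verb", "2nd person", "singular", "imperative"}),
    ("chladnička", "chladnička", {"noun", "singular", "nominative"}),
    ("chladniček", "chladnička", {"noun", "plural", "genitive"}),
    ("chladničkách", "chladnička", {"noun", "plural", "locative"}),
]

BY_FORM = defaultdict(list)   # word form -> list of Analysis
BY_LEMMA = defaultdict(list)  # lemma -> list of (word form, tags)
for form, lemma, tags in ENTRIES:
    BY_FORM[form].append(Analysis(lemma, frozenset(tags)))
    BY_LEMMA[lemma].append((form, frozenset(tags)))


def analyze(form):
    """word form => lemma + grammatical information (all interpretations)."""
    return list(BY_FORM.get(form, []))


def lemmatize(form):
    """word form => sorted list of distinct candidate lemmas."""
    return sorted({a.lemma for a in BY_FORM.get(form, [])})


def generate(lemma, wanted):
    """lemma + grammatical information => word forms carrying all wanted tags."""
    wanted = frozenset(wanted)
    return sorted({f for f, tags in BY_LEMMA.get(lemma, []) if wanted <= tags})


def all_forms(lemma):
    """lemma => all word forms known to the toy lexicon."""
    return sorted({f for f, _ in BY_LEMMA.get(lemma, [])})


if __name__ == "__main__":
    print(lemmatize("stroj"))  # ['stroj', 'strojit']
    for interpretation in analyze("stroj"):
        print(interpretation.lemma, sorted(interpretation.tags))
    print(generate("chladnička", {"plural", "locative"}))  # ['chladničkách']
    print(all_forms("chladnička"))
```

A plain dictionary lookup is used here only for readability; it says nothing about how the real analyser stores its data or reaches the quoted speed.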