Changes between Version 2 and Version 3 of en/WordLevelAnalysis
Timestamp: Jun 5, 2014, 10:51:18 AM
en/WordLevelAnalysis
v2  v3
32  32  Various types of analyses:
33  33   * word form => lemma (many types of searching/indexation)
34         * nebral => brát/nebrat (úplatky)
35         * nejstaršího => nejstarší/starý (člověk)
36         * chladnička => chladničky (as a class)
37         * bavlna => bavlněný (word derivation)
    34     * ''nebral => brát/nebrat (úplatky)''
    35     * ''nejstaršího => nejstarší/starý (člověk)''
    36     * ''chladnička => chladničky'' (as a class)
    37     * ''bavlna => bavlněný'' (word derivation)
38  38   * word form/lemma + gram. info. => word form
39         * e.g. salutation generation: pane Procházko
    39     * e.g. salutation generation: ''pane Procházko''
40  40   * word form/lemma => all word forms
41  41   * word form => lemma + full/partial grammatical information
42  42
43  43  The analysis is very fast: approx. 1 million word forms per second.
    44
    45
    46  == Processing Unknown Words ==
    47
    48  Some word forms in processed texts are unknown:
    49   * terms ''polydaktylie'', neologisms ''klausoviny'', typos ''bizardního'', colloquial words ''plaťáky'', etc.
    50
    51  The ending of a word form can determine, e.g.:
    52   * lemma: ''klausoviny => klausovina''
    53   * grammatical information: ''bizardního'' => genitive, etc.
    54   * derivational relations: ''plaťáky => plaťákový''
    55
    56  Texts from a particular domain allow grouping of unknown word forms:
    57   * ''polydaktylie, polydaktiliích, polydaktylií, ... <=> polydaktylie''
    58   * => extension of data or more precise “guessing”
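
The ending-based guessing described under "Processing Unknown Words" can be illustrated with a minimal sketch. It is not the analyzer's actual implementation: the rule table `ENDING_RULES` and the functions `guess` and `group_by_lemma` are hypothetical names, and the few endings listed are toy assumptions chosen only to cover the examples from the page.

```python
from collections import defaultdict

# Hypothetical toy rules: (ending, replacement that yields the lemma, grammatical hint).
# These are illustrative assumptions, not the real analyzer's data.
ENDING_RULES = [
    ("ího", "í", "adjective, possibly genitive singular"),
    ("iny", "ina", "noun in -ina"),
    ("ií", "ie", "noun in -ie, possibly genitive plural"),
    ("ie", "ie", "noun in -ie, possibly nominative singular"),
]


def guess(word_form):
    """Guess (lemma, hint) from the longest matching ending; None if no rule applies."""
    # Try longer endings first so that more specific rules win.
    for ending, replacement, hint in sorted(ENDING_RULES, key=lambda r: -len(r[0])):
        if word_form.endswith(ending):
            return word_form[: -len(ending)] + replacement, hint
    return None


def group_by_lemma(word_forms):
    """Group unknown word forms under their guessed lemma (or under themselves)."""
    groups = defaultdict(list)
    for form in word_forms:
        guessed = guess(form)
        lemma = guessed[0] if guessed else form
        groups[lemma].append(form)
    return dict(groups)


print(guess("klausoviny")[0])   # klausovina
print(guess("bizardního")[0])   # bizardní
print(group_by_lemma(["polydaktylie", "polydaktylií"]))
# {'polydaktylie': ['polydaktylie', 'polydaktylií']}
```

Grouping forms from one domain under a shared guessed lemma is what makes the guess more reliable: several different forms pointing to the same lemma are stronger evidence than a single ending match.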