Changes between Version 2 and Version 3 of en/WordLevelAnalysis


Timestamp:
Jun 5, 2014, 10:51:18 AM (7 years ago)
Author:
xkocinc
Comment:

--

  • en/WordLevelAnalysis

    v2 v3  
    3232Various types of analyses:
    3333 * word form => lemma (many types of searching/indexation)
    34    * nebral => brát/nebrat (úplatky)
    35    * nejstaršího => nejstarší/starý (člověk)
    36    * chladnička => chladničky (as a class)
    37    * bavlna => bavlněný (word derivation)
     34   * ''nebral => brát/nebrat (úplatky)''
     35   * ''nejstaršího => nejstarší/starý (člověk)''
     36   * ''chladnička => chladničky'' (as a class)
     37   * ''bavlna => bavlněný'' (word derivation)
    3838 * word form/lemma + gram. info. => word form
    39    * e.g. salutation generation: pane Procházko
     39   * e.g. salutation generation: ''pane Procházko''
    4040 * word form/lemma => all word forms
    4141 * word form => lemma + full/partial grammatical information
    4242
    4343The analysis is very fast - approx. 1 million word forms per second
     44
     45
     46== Processing Unknown Words ==
     47
     48Some word forms in processed texts are unknown:
     49 * terms ''polydaktylie'', neologisms ''klausoviny'', typos ''bizardního'', colloquial words ''plaťáky'', etc.
     50
     51The ending of a word form can determine, e.g.:
     52 * lemma: ''klausoviny => klausovina''
     53 * grammatical information: ''bizardního'' => genitive, etc.
     54 * derivational relations: ''plaťáky => plaťákový''
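A minimal sketch of ending-based guessing, using only the three examples listed above. The rule tables and the names `REWRITE_RULES`, `TAG_RULES` and `guess` are hypothetical and invented for this illustration; the page does not describe how the actual analyzer stores or applies its rules.

```python
# Hypothetical ending rules, made up for this illustration only.
# Rewrite rules: (ending, kind of guess, characters to strip, replacement)
REWRITE_RULES = [
    ("oviny", "lemma", 1, "a"),      # klausoviny -> klausovina
    ("áky", "derived", 1, "ový"),    # plaťáky -> plaťákový
]
# Tag rules: (ending, grammatical information)
TAG_RULES = [
    ("ního", "genitive"),            # bizardního -> genitive
]


def guess(word_form):
    """Return guesses for an unknown word form based only on its ending."""
    guesses = {}
    for ending, kind, strip, replacement in REWRITE_RULES:
        if word_form.endswith(ending):
            # Drop the last `strip` characters and append the replacement.
            guesses[kind] = word_form[:len(word_form) - strip] + replacement
    for ending, tag in TAG_RULES:
        if word_form.endswith(ending):
            guesses["tag"] = tag
    return guesses


if __name__ == "__main__":
    for form in ("klausoviny", "bizardního", "plaťáky"):
        print(form, guess(form))
```

A real guesser would presumably derive such rules from the endings of known words rather than list them by hand; the hand-written tables here only keep the sketch short.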
     55
     56Texts from a particular domain allow grouping of unknown word forms:
     57 * ''polydaktylie, polydaktiliích, polydaktylií, ... <=> polydaktylie''
     58 * => extension of data or more precise “guessing”
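The grouping idea can be sketched roughly as follows. Grouping by a fixed-length shared prefix is an assumption chosen for brevity, not the method the page describes, and the function name `group_unknown_forms` and the `prefix_len` parameter are hypothetical.

```python
from collections import defaultdict


def group_unknown_forms(forms, prefix_len=8):
    """Group unknown word forms that share their first `prefix_len` characters.

    A crude stand-in for real stemming: forms with the same prefix are
    assumed to belong to the same (unknown) lemma.
    """
    groups = defaultdict(set)
    for form in forms:
        groups[form[:prefix_len]].add(form)
    # Keep only prefixes shared by at least two distinct forms.
    return {prefix: sorted(members)
            for prefix, members in groups.items()
            if len(members) > 1}


if __name__ == "__main__":
    unknown = ["polydaktylie", "polydaktiliích", "polydaktylií", "klausoviny"]
    print(group_unknown_forms(unknown))
```

With the forms from the example above, the three `polydakt...` forms end up in one group while `klausoviny` stays ungrouped. Such a group could then serve as a candidate for extending the dictionary data or for narrowing down the guessed lemma.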