Changes between Version 1 and Version 2 of en/WordLevelAnalysis


Timestamp: Jun 5, 2014, 10:48:28 AM
Author: xkocinc
= Word Level Analysis =
== Motivation ==
Many applications need a tool for “clustering” of word forms appearing in texts:

 * chladniček
 * chladničky
 * chladničkách     <=>   chladnička
     
Usage:

 * Indexing, searching, keyword extraction, ...
 * And almost all NLP tools

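As a rough illustration of the indexing and searching use case, the sketch below normalises every token to its lemma before indexing, so a query in any inflected form finds the same documents. The `FORM_TO_LEMMA` table, the function names and the sample documents are hypothetical stand-ins for a real morphological analyser, not part of the tool described here.

```python
from collections import defaultdict

# Toy form -> lemma table (hypothetical; a real analyser covers millions of forms).
FORM_TO_LEMMA = {
    "chladnička": "chladnička",
    "chladniček": "chladnička",
    "chladničky": "chladnička",
    "chladničkách": "chladnička",
}


def build_index(docs):
    """Map each lemma to the set of document ids containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Unknown tokens fall back to themselves as their own lemma.
            index[FORM_TO_LEMMA.get(token, token)].add(doc_id)
    return index


def search(index, query):
    """Look up a query word by its lemma, so any inflected form matches."""
    lemma = FORM_TO_LEMMA.get(query.lower(), query.lower())
    return sorted(index.get(lemma, set()))


# Hypothetical sample documents.
docs = {1: "oprava chladniček", 2: "v chladničkách", 3: "nový stroj"}
index = build_index(docs)
print(search(index, "chladničky"))  # [1, 2]
```

The query form ''chladničky'' does not occur in any document, yet both documents containing other forms of ''chladnička'' are returned.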
== Word Level Processing Data for Czech ==
For almost 12 M word forms (incl. colloquial forms):

 * lemma (canonical form, dictionary form)
 * grammatical information: part of speech, number, case etc.

Word form ''stroj'' has 3 interpretations:

 * lemma ''stroj'', nominative
 * lemma ''stroj'', accusative
   * noun, masculine inanimate, singular
 * lemma ''strojit''
   * verb, 2nd person, singular, imperative mood

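One possible way to represent such ambiguous analyses in code is a list of (lemma, part of speech, features) records per word form. The sketch below hard-codes the three readings of ''stroj'' listed above. The `Analysis` type, the `ANALYSES` table and the feature names are illustrative assumptions, not the actual data format of the tool.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One morphological reading of a word form (hypothetical record layout)."""
    lemma: str
    pos: str
    features: dict


# Hand-written entries mirroring the three readings of "stroj";
# gender is omitted here for brevity.
ANALYSES = {
    "stroj": [
        Analysis("stroj", "noun", {"case": "nominative", "number": "singular"}),
        Analysis("stroj", "noun", {"case": "accusative", "number": "singular"}),
        Analysis("strojit", "verb",
                 {"person": "2", "number": "singular", "mood": "imperative"}),
    ],
}


def analyse(form):
    """Return all readings of a word form, or an empty list if it is unknown."""
    return ANALYSES.get(form, [])


def lemmas(form):
    """Return the distinct lemmas of a word form, in first-seen order."""
    seen = []
    for analysis in analyse(form):
        if analysis.lemma not in seen:
            seen.append(analysis.lemma)
    return seen


for reading in analyse("stroj"):
    print(reading.lemma, reading.pos, reading.features)
print(lemmas("stroj"))  # ['stroj', 'strojit']
```

Keeping every reading rather than picking one lets the caller decide how to disambiguate, for example by indexing a form under all of its lemmas.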
== Possible Applications ==
Various types of analyses:
 * word form => lemma (many types of searching/indexation)
   * nebral => brát/nebrat (úplatky)
   * nejstaršího => nejstarší/starý (člověk)
   * chladnička => chladničky (as a class)
   * bavlna => bavlněný (word derivation)
 * word form/lemma + gram. info. => word form
   * e.g. salutation generation: pane Procházko
 * word form/lemma => all word forms
 * word form => lemma + full/partial grammatical information

The analysis is very fast: approx. 1 million word forms per second.
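
The generation direction (lemma + grammatical information => word form, and lemma => all word forms) can be sketched with a small paradigm table. Everything in the block below (the `PARADIGMS` table, its key layout and the function names) is a hypothetical illustration covering only a handful of forms. It is not the interface of the tool described on this page.

```python
# Toy paradigm table keyed by (lemma, case, number); hypothetical and incomplete.
PARADIGMS = {
    ("chladnička", "nominative", "singular"): "chladnička",
    ("chladnička", "genitive", "singular"): "chladničky",
    ("chladnička", "nominative", "plural"): "chladničky",
    ("chladnička", "genitive", "plural"): "chladniček",
    ("chladnička", "locative", "plural"): "chladničkách",
    ("Procházka", "nominative", "singular"): "Procházka",
    ("Procházka", "vocative", "singular"): "Procházko",
}


def generate(lemma, case, number):
    """lemma + grammatical information => word form (None if not in the table)."""
    return PARADIGMS.get((lemma, case, number))


def all_forms(lemma):
    """lemma => all distinct word forms known for it."""
    return sorted({form for (lem, _, _), form in PARADIGMS.items() if lem == lemma})


def salutation(surname):
    """Build a salutation for a man from the vocative of his surname."""
    vocative = generate(surname, "vocative", "singular")
    return f"pane {vocative}" if vocative else None


print(salutation("Procházka"))  # pane Procházko
print(generate("chladnička", "locative", "plural"))  # chladničkách
print(all_forms("chladnička"))
```

A plain dictionary lookup is enough for this sketch. A table for millions of word forms would need a more compact representation.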