Version 2 (modified by xkocinc, 10 years ago)

Word Level Analysis

Motivation

Many applications need a tool that clusters the word forms appearing in texts under a common base form:

  • chladniček
  • chladničky
  • chladničkách
  • chladničce
  • ...

All of these forms map to the same base form: chladnička.

Usage:

  • Indexing, searching, keyword extraction, ...
  • Almost all NLP tools

Word Level Processing Data for Czech

The data covers almost 12 million word forms (including colloquial forms). For each word form it provides:

  • lemma (canonical form, dictionary form)
  • grammatical information: part of speech, number, case etc.

The word form stroj has three interpretations:

  • lemma stroj, nominative
  • lemma stroj, accusative
    • noun, masculine inanimate, singular (applies to both stroj readings)
  • lemma strojit
    • verb, 2nd person, singular, imperative mood
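
The ambiguity above can be pictured as a lookup that returns a list of analyses per word form. The following is a minimal sketch only: the `Analysis` class, the `LEXICON` table and the `analyze` function are hypothetical names chosen for illustration, and the feature labels are an ad hoc notation, not the format of the actual data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Analysis:
    """One interpretation of a word form (hypothetical structure)."""
    lemma: str
    pos: str
    features: tuple  # (name, value) pairs in an ad hoc notation

# Toy lexicon holding only the example from the text above.
LEXICON = {
    "stroj": [
        Analysis("stroj", "noun", (("gender", "masculine inanimate"),
                                   ("number", "singular"),
                                   ("case", "nominative"))),
        Analysis("stroj", "noun", (("gender", "masculine inanimate"),
                                   ("number", "singular"),
                                   ("case", "accusative"))),
        Analysis("strojit", "verb", (("person", "2"),
                                     ("number", "singular"),
                                     ("mood", "imperative"))),
    ],
}

def analyze(word_form):
    """Return every known interpretation of word_form (empty list if unknown)."""
    return LEXICON.get(word_form, [])

for analysis in analyze("stroj"):
    print(analysis.lemma, analysis.pos, dict(analysis.features))
```

Returning a list rather than a single answer keeps the ambiguity visible, so a later step (for example a tagger using context) can choose among the interpretations.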

Possible Applications

Various types of analyses:

  • word form => lemma (many types of searching/indexing)
    • nebral => brát/nebrat (úplatky)
    • nejstaršího => nejstarší/starý (člověk)
    • chladnička => chladničky (as a class)
    • bavlna => bavlněný (word derivation)
  • word form/lemma + grammatical information => word form
    • e.g. salutation generation: pane Procházko
  • word form/lemma => all word forms
  • word form => lemma + full/partial grammatical information
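
The directions of analysis listed above can be sketched with three small indexes built from one table. This is an illustrative toy, not the tool's actual interface: the `ENTRIES` table, the tag strings such as `sg.voc`, and the functions `lemmatize`, `generate`, `all_forms` and `salutation` are all assumptions made for the example.

```python
from collections import defaultdict

# Toy table of (word form, lemma, tag); the tag notation is ad hoc.
ENTRIES = [
    ("chladnička", "chladnička", "sg.nom"),
    ("chladničky", "chladnička", "sg.gen"),
    ("chladničce", "chladnička", "sg.loc"),
    ("chladniček", "chladnička", "pl.gen"),
    ("chladničkách", "chladnička", "pl.loc"),
    ("pán", "pán", "sg.nom"),
    ("pane", "pán", "sg.voc"),
    ("Procházka", "Procházka", "sg.nom"),
    ("Procházko", "Procházka", "sg.voc"),
]

form_to_lemmas = defaultdict(set)   # word form => lemmas
lemma_to_forms = defaultdict(set)   # lemma => all word forms
lemma_tag_to_form = {}              # (lemma, tag) => word form
for form, lemma, tag in ENTRIES:
    form_to_lemmas[form].add(lemma)
    lemma_to_forms[lemma].add(form)
    lemma_tag_to_form[(lemma, tag)] = form

def lemmatize(form):
    """word form => lemma(s)"""
    return sorted(form_to_lemmas.get(form, ()))

def generate(form_or_lemma, tag):
    """word form/lemma + grammatical information => word form (None if unknown)"""
    for lemma in lemmatize(form_or_lemma):
        form = lemma_tag_to_form.get((lemma, tag))
        if form is not None:
            return form
    return None

def all_forms(form_or_lemma):
    """word form/lemma => all word forms of its lemma(s)"""
    forms = set()
    for lemma in lemmatize(form_or_lemma):
        forms |= lemma_to_forms[lemma]
    return forms

def salutation(title, surname):
    """Build a vocative salutation from two nominative forms."""
    return f"{generate(title, 'sg.voc')} {generate(surname, 'sg.voc')}"

print(lemmatize("chladničkách"))
print(salutation("pán", "Procházka"))
```

Because a lemma is itself a word form in the table, `generate` and `all_forms` accept either a lemma or any inflected form as input.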

The analysis is very fast: approximately 1 million word forms per second.