Word Level Analysis
Motivation
Many applications need a tool for “clustering” of word forms appearing in texts:
- chladniček
- chladničky
- chladničkách
- chladničce
- ...
All of the forms above <=> chladnička (the lemma)
Usage:
- Indexing, searching, keyword extraction, ...
- And almost all NLP tools
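The clustering idea can be illustrated with a minimal Python sketch of lemma-based indexing. The table `FORM_TO_LEMMA` and the function `build_index` are hypothetical names invented for this illustration, and the table holds only the example forms listed above; this is not the actual tool or its data format.

```python
from collections import defaultdict

# Toy form-to-lemma table (hypothetical; it holds only the example forms).
FORM_TO_LEMMA = {
    "chladnička": "chladnička",
    "chladniček": "chladnička",
    "chladničky": "chladnička",
    "chladničkách": "chladnička",
    "chladničce": "chladnička",
}


def build_index(documents):
    """Map each lemma to the set of document ids containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            # Tokens missing from the toy table are indexed as themselves.
            lemma = FORM_TO_LEMMA.get(token, token)
            index[lemma].add(doc_id)
    return index


docs = {
    1: "oprava chladniček",
    2: "v chladničce",
    3: "nové stroje",
}
index = build_index(docs)
print(sorted(index["chladnička"]))  # [1, 2]
```

A query for the lemma chladnička then finds both documents, although neither contains that exact form.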
Word Level Processing Data for Czech
For almost 12 million word forms (including colloquial forms), the data provide:
- lemma (canonical form, dictionary form)
- grammatical information: part of speech, number, case etc.
The word form stroj has 3 interpretations:
- lemma stroj (noun, masculine inanimate, singular), nominative
- lemma stroj (noun, masculine inanimate, singular), accusative
- lemma strojit (verb, 2nd person, singular, imperative mood)
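One way to picture such ambiguous analyses is as a list of (lemma, part of speech, features) records per word form. The sketch below assumes a hypothetical in-memory lexicon, `TOY_LEXICON`, with a single entry; the names `Analysis`, `analyze` and `lemmas` are made up for illustration and do not reflect the real data format.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    features: dict


# Hypothetical toy lexicon with a single ambiguous word form.
TOY_LEXICON = {
    "stroj": [
        Analysis("stroj", "noun", {"gender": "masculine inanimate",
                                   "number": "singular",
                                   "case": "nominative"}),
        Analysis("stroj", "noun", {"gender": "masculine inanimate",
                                   "number": "singular",
                                   "case": "accusative"}),
        Analysis("strojit", "verb", {"person": "2",
                                     "number": "singular",
                                     "mood": "imperative"}),
    ],
}


def analyze(form):
    """Return every interpretation of a word form (empty list if unknown)."""
    return TOY_LEXICON.get(form.lower(), [])


def lemmas(form):
    """Return the distinct lemmas of a word form, sorted."""
    return sorted({a.lemma for a in analyze(form)})


for analysis in analyze("stroj"):
    print(analysis.lemma, analysis.pos, analysis.features)
print(lemmas("stroj"))  # ['stroj', 'strojit']
```

Returning all interpretations leaves disambiguation to the caller, for example a tagger that uses the sentence context.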
Possible Applications
Various types of analyses:
- word form => lemma (many types of searching/indexing)
  - nebral => brát/nebrat (úplatky)
  - nejstaršího => nejstarší/starý (člověk)
  - chladnička => chladničky (as a class)
  - bavlna => bavlněný (word derivation)
- word form/lemma + grammatical information => word form
  - e.g. salutation generation: pane Procházko
- word form/lemma => all word forms
- word form => lemma + full/partial grammatical information
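The generation direction (lemma + grammatical information => word form) can be sketched in the same spirit. The paradigm table `PARADIGMS` below is a hypothetical toy containing only the forms needed for the salutation example, and `generate`, `all_forms` and `salutation` are illustrative names rather than functions of the real tool.

```python
# Hypothetical toy paradigm table: lemma -> {(number, case): word form}.
PARADIGMS = {
    "pan": {
        ("singular", "nominative"): "pan",
        ("singular", "vocative"): "pane",
    },
    "Procházka": {
        ("singular", "nominative"): "Procházka",
        ("singular", "vocative"): "Procházko",
    },
}


def generate(lemma, number, case):
    """Lemma + grammatical information => word form."""
    return PARADIGMS[lemma][(number, case)]


def all_forms(lemma):
    """Lemma => all distinct word forms known to the toy table."""
    return sorted(set(PARADIGMS[lemma].values()))


def salutation(title, surname):
    """Build a salutation by putting both words into the vocative singular."""
    return " ".join(generate(w, "singular", "vocative") for w in (title, surname))


print(salutation("pan", "Procházka"))  # pane Procházko
print(all_forms("pan"))  # ['pan', 'pane']
```

A full paradigm would list every case and number; the table here is cut down to keep the example short.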
The analysis is very fast: approximately 1 million word forms per second.