
Changes between Version 1 and Version 2 of en/WordLevelAnalysis

Timestamp: Jun 5, 2014, 10:48:28 AM
Author: xkocinc
Comment: --

en/WordLevelAnalysis

-                      v1
+                      v2
+= Word Level Analysis =
+== Motivation ==
+Many applications need a tool for “clustering” the word forms that appear in texts:
+ * chladniček
  * chladničky
  * chladničkách     <=>   chladnička
 …
 Usage:
  * Indexing, searching, keyword extraction, ...
  * And almost all NLP tools
+== Word Level Processing Data for Czech ==
+For almost 12 million word forms (including colloquial forms), the data provide:
  * lemma (canonical form, dictionary form)
  * grammatical information: part of speech, number, case, etc.
 The word form ''stroj'' has 3 interpretations:
  * lemma ''stroj'', nominative
  * lemma ''stroj'', accusative
+   * noun, masculine inanimate, singular
  * lemma ''strojit''
+   * verb, 2nd person, singular, imperative mood
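
The ambiguity described above can be pictured as a mapping from one word form to several interpretations. The following is only a minimal sketch: the names `Interpretation`, `ANALYSES`, `analyze` and `lemmas` are hypothetical and are not the interface of the actual tool, and the toy lexicon holds a single entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    """One reading of a word form: a lemma plus grammatical information."""
    lemma: str
    pos: str
    features: tuple  # grammatical categories; animacy is omitted for brevity


# Hypothetical toy lexicon (assumption); the real data cover almost
# 12 million word forms.
ANALYSES = {
    "stroj": [
        Interpretation("stroj", "noun", ("masculine", "singular", "nominative")),
        Interpretation("stroj", "noun", ("masculine", "singular", "accusative")),
        Interpretation("strojit", "verb", ("2nd person", "singular", "imperative")),
    ],
}


def analyze(word_form):
    """Return every interpretation of word_form (empty list if unknown)."""
    return ANALYSES.get(word_form, [])


def lemmas(word_form):
    """Return the distinct lemmas of word_form, in first-seen order."""
    seen = []
    for interpretation in analyze(word_form):
        if interpretation.lemma not in seen:
            seen.append(interpretation.lemma)
    return seen
```

Keeping all interpretations in a list, rather than picking one, leaves disambiguation to whatever tool consumes the analysis.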
+== Possible Applications ==
+Various types of analyses:
+ * word form => lemma (many types of searching/indexation)
+   * nebral => brát/nebrat (úplatky)
+   * nejstaršího => nejstarší/starý (člověk)
+   * chladnička => chladničky (as a class)
+   * bavlna => bavlněný (word derivation)
+ * word form/lemma + grammatical information => word form
+   * e.g. salutation generation: pane Procházko
+ * word form/lemma => all word forms
+ * word form => lemma + full/partial grammatical information
+The analysis is very fast: approximately 1 million word forms per second.
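
The analysis directions listed above (word form => lemma, word form/lemma => all word forms, and word form/lemma + grammatical information => word form) can be sketched with a few lookup tables. This is an illustration under stated assumptions, not the actual implementation: the `ENTRIES` table, the tag labels such as `sg.voc`, and the function names are all hypothetical, and the toy table assumes every lemma is also listed as one of its own word forms.

```python
from collections import defaultdict

# Hypothetical toy table of (word form, lemma, tag) triples (assumption).
# The tag labels are illustrative and are not the tagset of the real tool.
ENTRIES = [
    ("chladnička", "chladnička", "sg.nom"),
    ("chladničky", "chladnička", "sg.gen"),
    ("chladniček", "chladnička", "pl.gen"),
    ("chladničkách", "chladnička", "pl.loc"),
    ("Procházka", "Procházka", "sg.nom"),
    ("Procházko", "Procházka", "sg.voc"),
]

FORM_TO_LEMMAS = defaultdict(set)   # word form -> lemmas
LEMMA_TO_FORMS = defaultdict(set)   # lemma -> all word forms
GENERATE = defaultdict(set)         # (lemma, tag) -> word forms

for form, lemma, tag in ENTRIES:
    FORM_TO_LEMMAS[form].add(lemma)
    LEMMA_TO_FORMS[lemma].add(form)
    GENERATE[(lemma, tag)].add(form)


def lemmatize(form):
    """word form => lemma(s)."""
    return sorted(FORM_TO_LEMMAS.get(form, ()))


def all_forms(form_or_lemma):
    """word form/lemma => all word forms sharing a lemma with the input."""
    forms = set()
    for lemma in FORM_TO_LEMMAS.get(form_or_lemma, ()):
        forms |= LEMMA_TO_FORMS[lemma]
    return sorted(forms)


def generate(form_or_lemma, tag):
    """word form/lemma + grammatical information => word form(s)."""
    forms = set()
    for lemma in FORM_TO_LEMMAS.get(form_or_lemma, ()):
        forms |= GENERATE.get((lemma, tag), set())
    return sorted(forms)
```

With this toy table, the salutation example corresponds to `"pane " + generate("Procházka", "sg.voc")[0]`, which yields `pane Procházko`. Plain dictionary lookups are used here only for clarity; they say nothing about how the real tool reaches its reported speed.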