Changes between Version 4 and Version 5 of en/WordLevelAnalysis


Timestamp:
Jun 5, 2014, 10:56:28 AM (10 years ago)
Author:
xkocinc

 * rules and/or statistical data describe typical contexts of nouns, verbs, etc.
 * using such information, one can tell whether ''stát'' is a noun or a verb in a given context
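
A minimal sketch of how such context information could be applied, assuming two tiny hand-written word lists (`MODAL_VERBS` and `NOUN_MODIFIERS` are illustrative assumptions, not NLP Centre rules or data). It guesses the reading of ''stát'' from the word immediately to its left:

```python
# Toy context rules; both word lists are illustrative assumptions.
MODAL_VERBS = {"musí", "může", "chce", "bude"}       # usually followed by an infinitive
NOUN_MODIFIERS = {"český", "ten", "náš", "právní"}   # usually precede a masculine noun


def guess_pos(tokens, index):
    """Guess 'noun' or 'verb' for an ambiguous token from its left neighbour."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    if previous in MODAL_VERBS:
        return "verb"
    if previous in NOUN_MODIFIERS:
        return "noun"
    return "unknown"


def tag_stat(sentence):
    """Return one guess per occurrence of 'stát' in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [guess_pos(tokens, i) for i, tok in enumerate(tokens) if tok.lower() == "stát"]
```

With these toy lists, `tag_stat("Český stát musí stát pevně")` gives `["noun", "verb"]`. A real system would draw on much richer rules or corpus statistics than a single left neighbour.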


== Example of Contexts: Word Sketches ==

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/stat.png)]]

== Spellchecking and Diacritics Restoration ==

Data also allow spellchecking and diacritics restoration:

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/czAccent.png)]]
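
As a rough illustration of dictionary-based diacritics restoration, the sketch below maps each unaccented form to its most frequent accented variant. The frequency table `WORD_COUNTS` is a made-up assumption for the example, and the approach ignores context and letter case, so it is not a description of the actual NLP Centre tool:

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, e.g. 'kůň' -> 'kun'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical word frequencies; real data would come from a large corpus.
WORD_COUNTS = {"stát": 120, "stať": 3, "příliš": 80, "žluťoučký": 5, "kůň": 40}


def build_index(word_counts):
    """Map each unaccented form to its most frequent accented variant."""
    index = {}
    for word, count in word_counts.items():
        key = strip_diacritics(word)
        best = index.get(key)
        if best is None or count > word_counts[best]:
            index[key] = word
    return index


def restore(text, index):
    """Restore diacritics token by token; unknown tokens are left unchanged."""
    return " ".join(index.get(strip_diacritics(tok), tok) for tok in text.split())
```

Here ''stát'' and ''stať'' share the unaccented form ''stat'', and the more frequent variant wins. Choosing by context instead of raw frequency would need the kind of contextual data described earlier.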


== Universality ==

All the processes mentioned above can be
 * tuned for a specific domain
   * using texts from that domain
 * applied to languages other than Czech
   * (Slovak, Polish, German, English, ...)


== Latest Applications ==

Seznam.cz, Yandex.ru, Aukro.cz, Václav Havel Library
 * indexing and searching

Information System of Masaryk University
 * other universities and schools (FHS UK, JAMU, VŠFS, ...)
 * affiliate projects (theses.cz, odevzdej.cz, repozitar.cz)
 * indexing, searching and plagiarism detection

“Internetová jazyková příručka”
 * online source on Czech orthography and grammar
 * NLP Centre data were a starting point for word form tables


== Conclusions ==

Word-level processing of texts allows:
 * various types of base word, each determining which word forms are grouped together
 * ambiguity resolution according to the context
 * word form generation
 * spellchecking and diacritics restoration
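
To illustrate the first point, the sketch below groups word forms under two different notions of base word: the lemma alone, or the lemma together with its part of speech. The `ANALYSES` table is a hypothetical hand-made lookup, standing in for a real morphological analyser:

```python
from collections import defaultdict

# Hypothetical form -> (lemma, part of speech) table for the example.
ANALYSES = {
    "státu": ("stát", "noun"),
    "státem": ("stát", "noun"),
    "státy": ("stát", "noun"),
    "stojí": ("stát", "verb"),
    "stál": ("stát", "verb"),
}


def group_by_base(forms, analyses, keep_pos=True):
    """Group word forms by (lemma, pos) or, if keep_pos is False, by lemma only."""
    groups = defaultdict(list)
    for form in forms:
        # Unknown forms fall back to themselves as their own base word.
        lemma, pos = analyses.get(form, (form, "unknown"))
        key = (lemma, pos) if keep_pos else lemma
        groups[key].append(form)
    return dict(groups)
```

With `keep_pos=True`, the noun forms and verb forms of ''stát'' end up in separate groups. With `keep_pos=False`, they are merged into one group, which may be preferable for recall-oriented searching.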

The tools and data can be domain-specific and built for various languages.
