= Language Resources =

== For NLP we need: ==

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/nlp_tools_resources_algorithms.png)]]

== Language resources ==

Similar to dictionaries but more general
 * knowledge about language
 * knowledge about the world

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/cat.png)]]

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/slovnik_spis.cestiny.png)]]

 * '''intended for humans:''' multilingual dictionaries, explanatory dictionaries, thesauri, encyclopedias 
 
 * '''intended for computer programs:''' translation memory, knowledge bases, semantic networks

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/CzechWordNet.png)]]



== Types of language resources ==

 * synonym dictionary - fuzzy searching
   * over 23000 entries, with over 56000 synonyms
   * Czech !Wordnet - 85592 words organized in 40919 synonym sets, plus grouping to domains/categories
   * thesaurus in Sketch Engine

 * translation dictionary - multilingual searching
   * Czech-English dictionary - 54000 entries
   * interconnected wordnets (!EuroWordnet, Balkanet) - Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish (at least 8500 common synonimical sets)

 * vulgar words dictionary - detection of inappropriate behavior in discussions
   * current language (April 2013), 600 manually edited words/collocations, with rules to detect masking

 * other: dictionary of toponyms? ancient surnames, genealogy? gestures, artworks...?
   * multimedial content in explanatory dictionaries (artworks, videos, recordings) for text enhancement
   * sign language dictionary with gesture videos


== !WordNets ==

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/WordNet_parallel.png)]]

 * 85,592 words organized in 40,919 synonymical sets
 
 * several relation types: subclass, part-of, translation, synonymy


== Synonyms: Dictionary vs. thesaurus ==

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/handsome_corpus.png, align=right)]]

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/handsome.png)]]

 * from the contemporary language
 * similarity score
 * available for many languages
 * for every word used in the language



== Selected language resources at NLPC ==
 * 6 dictionaries of Czech language, 512,000 of entries
 * synonyms
   * Czech synonyms (K. Pala): 23,000 entries, 56,000 synonyms
   * Czech !WordNet: 85,592 words organized in 40,919 synonymical sets
   * automatically generated thesaurus
 * translation
   * interconnected wordnets: Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish
 * specials
   * contemporary vulgar words (April 2013): 600 words/collocations + rules to detect concealing
   * sign language dictionary with gesture videos

== Tools for language resources == 

Language resources have to be
 * built and continuously maintained
 * digitalized (OCR to XML) 
 * connected with other language resources
 * shared among computer programs
 * readable for humans


== Tools for language resources processing ==

 * creating, editing, importing, connecting with other resources, visualizing


== Language resource tools: the DEB platform ==

 * platform for '''d'''ictionary '''e'''diting and '''b'''rowsing
   * strict client-server architecture
   * basically any XML data

 * server
   * server side modules
   * database backend (XML database)

 * client
   * lightweight
   * graphical interface
   * web interface

 * practically used in 22 international scientific/commercial projects

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/world.png)]]


== Conclusions ==
 * language resources:
   * dictionaries
   * corpus-based thesauri
   * semantic networks (!WordNet)
 * flexible and powerful tool for language resources processing: the DEB platform