Language Resources
For NLP we need language resources.
They are similar to dictionaries but more general:
- knowledge about language
- knowledge about the world
- intended for humans: multilingual dictionaries, explanatory dictionaries, thesauri, encyclopedias
- intended for computer programs: translation memory, knowledge bases, semantic networks
Types of language resources
- synonym dictionary - fuzzy searching
- over 23,000 entries with over 56,000 synonyms
- Czech WordNet - 85,592 words organized in 40,919 synonym sets, plus grouping into domains/categories
- thesaurus in Sketch Engine
- translation dictionary - multilingual searching
- Czech-English dictionary - 54,000 entries
- interconnected wordnets (EuroWordNet, BalkaNet) - Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish (at least 8,500 common synonym sets)
- vulgar words dictionary - detection of inappropriate behavior in discussions
- current language (April 2013), 600 manually edited words/collocations, with rules to detect masking
- other: dictionary of toponyms? ancient surnames, genealogy? gestures, artworks...?
- multimedia content in explanatory dictionaries (artworks, videos, recordings) for text enhancement
- sign language dictionary with gesture videos
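The "fuzzy searching" use of a synonym dictionary listed above can be illustrated as query expansion: each query word is allowed to match any of its synonyms. The following is a minimal sketch, assuming a tiny hand-made synonym table; the entries and the names `SYNONYMS`, `expand_query` and `matches` are hypothetical and are not taken from the dictionaries described here.

```python
# Hypothetical toy synonym dictionary; a real resource holds tens of thousands of entries.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "fast": {"quick", "rapid"},
}


def expand_query(query, synonyms=SYNONYMS):
    """Return one set of acceptable terms per query word (the word plus its synonyms)."""
    expanded = []
    for word in query.lower().split():
        expanded.append({word} | synonyms.get(word, set()))
    return expanded


def matches(document, query, synonyms=SYNONYMS):
    """A document matches if every query word, or one of its synonyms, occurs in it."""
    tokens = set(document.lower().split())
    return all(terms & tokens for terms in expand_query(query, synonyms))


if __name__ == "__main__":
    print(matches("a quick automobile", "fast car"))
    print(matches("a slow bicycle", "fast car"))
```

Whitespace tokenization is used only to keep the sketch short; for Czech, matching would normally go through lemmas rather than surface forms.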
WordNets
- Czech WordNet: 85,592 words organized in 40,919 synonym sets
- several relation types: subclass, part-of, translation, synonymy
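A wordnet can be pictured as a graph whose nodes are synonym sets and whose edges carry a relation type. The sketch below is one possible in-memory representation, not the storage format of any actual WordNet; the class `Synset`, the helper `hypernym_chain` and the sample data are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words plus typed links to other synsets."""
    id: str
    words: list
    relations: dict = field(default_factory=dict)  # relation type -> list of target ids

    def link(self, relation, target_id):
        self.relations.setdefault(relation, []).append(target_id)


def hypernym_chain(synsets, start_id):
    """Follow the first 'subclass' link upward from start_id; return the ids visited."""
    chain, current, seen = [], start_id, set()
    while current in synsets and current not in seen:
        seen.add(current)  # guards against cycles in malformed data
        chain.append(current)
        parents = synsets[current].relations.get("subclass", [])
        current = parents[0] if parents else None
    return chain


# Illustrative data only.
network = {
    "dog": Synset("dog", ["dog", "domestic dog"]),
    "canine": Synset("canine", ["canine", "canid"]),
    "animal": Synset("animal", ["animal", "beast"]),
}
network["dog"].link("subclass", "canine")
network["canine"].link("subclass", "animal")
network["dog"].link("translation", "cs:pes")  # link into another language's wordnet

if __name__ == "__main__":
    print(hypernym_chain(network, "dog"))
```

Storing relations as a dictionary keyed by relation type keeps subclass, part-of and translation links in one structure, so a new relation type needs no change to the class.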
Synonyms: Dictionary vs. thesaurus
- from the contemporary language
- similarity score
- available for many languages
- for every word used in the language
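A corpus-based thesaurus ranks candidate synonyms by a similarity score computed from the contexts two words share. The sketch below uses a plain Dice coefficient over sets of context words as a stand-in. This is an assumption made for illustration, not the scoring used by any particular tool, and the context data is invented.

```python
def dice(a, b):
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Invented context sets: words observed near each headword in some corpus.
CONTEXTS = {
    "car": {"drive", "park", "fast", "road"},
    "auto": {"drive", "park", "road", "repair"},
    "tea": {"drink", "hot", "cup"},
}


def thesaurus(word, contexts=CONTEXTS):
    """Return the other words sorted by descending similarity to `word`."""
    target = contexts[word]
    scored = [
        (other, dice(target, ctx))
        for other, ctx in contexts.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in thesaurus("car"):
        print(other, score)
```

Because the scores come from corpus data rather than manual editing, such a list can be produced for any word with enough occurrences, which is what lets a thesaurus of this kind cover contemporary vocabulary.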
Selected language resources at NLPC
- 6 dictionaries of the Czech language, 512,000 entries
- synonyms
- Czech synonyms (K. Pala): 23,000 entries, 56,000 synonyms
- Czech WordNet: 85,592 words organized in 40,919 synonym sets
- automatically generated thesaurus
- translation
- interconnected wordnets: Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish
- specials
- contemporary vulgar words (April 2013): 600 words/collocations + rules to detect masking
- sign language dictionary with gesture videos
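Rules that detect masked words can be thought of as a normalization step applied before dictionary lookup. The sketch below shows three assumed rules (look-alike character substitution, separator removal, and collapsing of repeated letters); they are illustrative guesses, not the rules shipped with the dictionary above, and the word list is a deliberately mild placeholder.

```python
import re

# Hypothetical, deliberately mild word list.
BLOCKLIST = {"idiot", "jerk"}

# Assumed masking rule 1: common look-alike character substitutions.
SUBSTITUTIONS = str.maketrans({"1": "i", "0": "o", "3": "e", "@": "a", "$": "s"})


def normalize(token):
    """Undo simple masking so the token can be looked up in the word list."""
    token = token.lower().translate(SUBSTITUTIONS)
    token = re.sub(r"[^a-z]", "", token)     # rule 2: drop separators such as '.', '-', '*'
    token = re.sub(r"(.)\1+", r"\1", token)  # rule 3: collapse repeats, 'jeeerk' -> 'jerk'
    return token


# Normalize the list itself so entries with doubled letters still match.
NORMALIZED_BLOCKLIST = {normalize(word) for word in BLOCKLIST}


def flag_masked(text):
    """Return the original tokens whose normalized form is on the word list."""
    return [tok for tok in text.split() if normalize(tok) in NORMALIZED_BLOCKLIST]


if __name__ == "__main__":
    print(flag_masked("you 1d10t what a j.e.r.k"))
```

Collapsing repeated letters also merges legitimate doubles, so a production rule set would need exceptions; the `[^a-z]` filter likewise drops Czech diacritics and would have to be widened for Czech text.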
Tools for language resources
Language resources have to be
- built and continuously maintained
- digitized (OCR to XML)
- connected with other language resources
- shared among computer programs
- readable by humans
Tools for language resources processing
- creating, editing, importing, connecting with other resources, visualizing
Language resource tools: the DEB platform
- platform for dictionary editing and browsing
- strict client-server architecture
- handles essentially any XML data
- server
- server side modules
- database backend (XML database)
- client
- lightweight
- graphical interface
- web interface
- used in practice in 22 international scientific and commercial projects
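Since the platform stores dictionary entries as XML, a client ultimately works with documents like the one below. The sketch parses a single entry with Python's standard `xml.etree.ElementTree`. The element names (`entry`, `headword`, `pos`, `sense`, `translation`) are a made-up schema for illustration and are not the DEB data format, and no DEB API is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; the schema is invented for this example.
ENTRY_XML = """
<entry id="e1">
  <headword>pes</headword>
  <pos>noun</pos>
  <sense><translation lang="en">dog</translation></sense>
  <sense><translation lang="de">Hund</translation></sense>
</entry>
"""


def parse_entry(xml_text):
    """Turn one XML dictionary entry into a plain Python dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "headword": root.findtext("headword"),
        "pos": root.findtext("pos"),
        "translations": [
            (t.get("lang"), t.text) for t in root.iter("translation")
        ],
    }


if __name__ == "__main__":
    print(parse_entry(ENTRY_XML))
```

Keeping entries as generic XML is what allows one server to host dictionaries, wordnets and other resources with different schemas, while clients decide how to present each of them.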
Conclusions
- language resources:
- dictionaries
- corpus-based thesauri
- semantic networks (WordNet)
- a flexible and powerful tool for processing language resources: the DEB platform