= Language Resources =

== For NLP we need: ==

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/nlp_tools_resources_algorithms.png)]]

== Language resources ==

Similar to dictionaries but more general:
 * knowledge about language
 * knowledge about the world

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/cat.png)]]
[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/slovnik_spis.cestiny.png)]]

 * '''intended for humans:''' multilingual dictionaries, explanatory dictionaries, thesauri, encyclopedias
 * '''intended for computer programs:''' translation memory, knowledge bases, semantic networks

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/CzechWordNet.png)]]

== Types of language resources ==

 * synonym dictionary - fuzzy searching
   * over 23,000 entries, with over 56,000 synonyms
   * Czech !WordNet - 85,592 words organized in 40,919 synonym sets, plus grouping into domains/categories
   * thesaurus in Sketch Engine
 * translation dictionary - multilingual searching
   * Czech-English dictionary - 54,000 entries
   * interconnected wordnets (!EuroWordNet, !BalkaNet) - Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish (at least 8,500 shared synonym sets)
 * vulgar words dictionary - detection of inappropriate behavior in discussions
   * current language (April 2013), 600 manually edited words/collocations, with rules to detect masking
 * other: dictionary of toponyms? ancient surnames, genealogy? gestures, artworks...?
   * multimedia content in explanatory dictionaries (artworks, videos, recordings) for text enhancement
   * sign language dictionary with gesture videos

== !WordNets ==

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/WordNet_parallel.png)]]

 * 85,592 words organized in 40,919 synonym sets
 * several relation types: subclass, part-of, translation, synonymy

== Synonyms: Dictionary vs. thesaurus ==

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/handsome_corpus.png, align=right)]]
[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/handsome.png)]]

 * from the contemporary language
 * similarity score
 * available for many languages
 * for every word used in the language

== Selected language resources at NLPC ==

 * 6 dictionaries of the Czech language, 512,000 entries
 * synonyms
   * Czech synonyms (K. Pala): 23,000 entries, 56,000 synonyms
   * Czech !WordNet: 85,592 words organized in 40,919 synonym sets
   * automatically generated thesaurus
 * translation
   * interconnected wordnets: Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish
 * specials
   * contemporary vulgar words (April 2013): 600 words/collocations + rules to detect masking
   * sign language dictionary with gesture videos

== Tools for language resources ==

Language resources have to be
 * built and continuously maintained
 * digitized (OCR to XML)
 * connected with other language resources
 * shared among computer programs
 * readable by humans

== Tools for language resources processing ==

 * creating, editing, importing, connecting with other resources, visualizing

== Language resource tools: the DEB platform ==

 * platform for '''d'''ictionary '''e'''diting and '''b'''rowsing
 * strict client-server architecture
 * handles essentially any XML data
 * server
   * server-side modules
   * database backend (XML database)
 * client
   * lightweight
   * graphical interface
   * web interface
 * used in practice in 22 international scientific/commercial projects

[[Image(/trac/research/raw-attachment/wiki/en/LanguageResources/world.png)]]

== Conclusions ==

 * language resources:
   * dictionaries
   * corpus-based thesauri
   * semantic networks (!WordNet)
 * flexible and powerful tool for processing language resources: the DEB platform
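
== Appendix: a synonym-set sketch ==

The !WordNet organization described above (words grouped into synonym sets, linked by relations such as subclass and translation) can be illustrated with a small data structure. This is only a minimal sketch under stated assumptions: the class `Synset`, the identifiers such as `cs-1`, the relation names `subclass_of` and `translation`, and the four toy entries are all hypothetical and are not taken from the Czech !WordNet or from the DEB platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """One synonym set: words sharing a meaning, plus typed links to other sets."""
    synset_id: str
    language: str
    words: List[str]
    # relation name -> list of target synset ids (hypothetical relation names)
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Toy data for illustration only; not real wordnet content.
SYNSETS = {
    "cs-1": Synset("cs-1", "cs", ["pes"],
                   {"subclass_of": ["cs-2"], "translation": ["en-1"]}),
    "cs-2": Synset("cs-2", "cs", ["zvíře", "živočich"],
                   {"translation": ["en-2"]}),
    "en-1": Synset("en-1", "en", ["dog"],
                   {"subclass_of": ["en-2"], "translation": ["cs-1"]}),
    "en-2": Synset("en-2", "en", ["animal"],
                   {"translation": ["cs-2"]}),
}


def synonyms(word: str, language: str) -> List[str]:
    """Return the other words that share a synonym set with `word`."""
    found = set()
    for synset in SYNSETS.values():
        if synset.language == language and word in synset.words:
            found.update(w for w in synset.words if w != word)
    return sorted(found)


def translations(word: str, language: str) -> List[str]:
    """Follow 'translation' links from every synonym set containing `word`."""
    found = set()
    for synset in SYNSETS.values():
        if synset.language == language and word in synset.words:
            for target_id in synset.relations.get("translation", []):
                found.update(SYNSETS[target_id].words)
    return sorted(found)


def superclasses(synset_id: str) -> List[str]:
    """Walk 'subclass_of' links upward and return the visited synset ids."""
    chain = []
    seen = {synset_id}
    current = synset_id
    while True:
        parents = SYNSETS[current].relations.get("subclass_of", [])
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)
```

With this toy data, `synonyms("zvíře", "cs")` supports the fuzzy-searching use case and `translations("pes", "cs")` the multilingual one; a real resource would store these sets in a database rather than in an in-memory dictionary.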