
Language Resources

For NLP we need:

/trac/research/raw-attachment/wiki/en/LanguageResources/nlp_tools_resources_algorithms.png

Language resources

Similar to dictionaries but more general

  • knowledge about language
  • knowledge about the world

/trac/research/raw-attachment/wiki/en/LanguageResources/cat.png

/trac/research/raw-attachment/wiki/en/LanguageResources/slovnik_spis.cestiny.png

  • intended for humans: multilingual dictionaries, explanatory dictionaries, thesauri, encyclopedias

  • intended for computer programs: translation memory, knowledge bases, semantic networks

/trac/research/raw-attachment/wiki/en/LanguageResources/CzechWordNet.png

Types of language resources

  • synonym dictionary - fuzzy searching
    • Czech synonym dictionary - over 23,000 entries, with over 56,000 synonyms
    • Czech WordNet - 85,592 words organized in 40,919 synonym sets, plus grouping into domains/categories
    • thesaurus in Sketch Engine
  • translation dictionary - multilingual searching
    • Czech-English dictionary - 54,000 entries
    • interconnected wordnets (EuroWordNet, BalkaNet) - Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish (at least 8,500 common synonym sets)
  • vulgar words dictionary - detection of inappropriate behavior in discussions
    • contemporary language (as of April 2013), 600 manually edited words/collocations, with rules to detect masking
  • other: dictionary of toponyms? ancient surnames, genealogy? gestures, artworks...?
    • multimedia content in explanatory dictionaries (artworks, videos, recordings) for text enhancement
    • sign language dictionary with gesture videos
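As an illustration of how a synonym dictionary can support fuzzy searching, here is a minimal sketch of query expansion. The `SYNONYMS` table, `expand_query` and `search` are hypothetical names with toy data, not part of any resource listed above; a real system would load its entries from the dictionary itself.

```python
from typing import Dict, List, Set

# Hypothetical toy synonym dictionary (assumption, for illustration only).
SYNONYMS: Dict[str, Set[str]] = {
    "car": {"automobile", "auto"},
    "automobile": {"car", "auto"},
}


def expand_query(terms: List[str], synonyms: Dict[str, Set[str]]) -> List[str]:
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for term in terms:
        # Keep the original term first, then its synonyms in sorted order.
        for candidate in [term, *sorted(synonyms.get(term, ()))]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


def search(documents: List[str], terms: List[str],
           synonyms: Dict[str, Set[str]]) -> List[int]:
    """Return indices of documents containing any term or one of its synonyms."""
    wanted = set(expand_query(terms, synonyms))
    return [i for i, doc in enumerate(documents)
            if wanted & set(doc.lower().split())]
```

With this sketch, a query for "car" also matches a document that only mentions "automobile". Tokenization by whitespace is a simplification; an inflected language such as Czech would also need lemmatization.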

WordNets

/trac/research/raw-attachment/wiki/en/LanguageResources/WordNet_parallel.png

  • 85,592 words organized in 40,919 synonym sets (synsets)

  • several relation types: subclass, part-of, translation, synonymy
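The relation types above can be sketched as a small data structure. This is a simplified illustration with hypothetical names (`Synset`, `NETWORK`, `translate`, `hypernym_chain`) and toy data; in particular, it stores the words of all languages in one synset, whereas interconnected wordnets typically keep one synset per language and link them through shared identifiers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Synset:
    synset_id: str
    words: Dict[str, List[str]]      # language code -> synonymous words
    hypernym: Optional[str] = None   # subclass relation: id of the more general synset
    part_of: Optional[str] = None    # part-of relation: id of the whole


# Hypothetical toy network (assumption, for illustration only).
NETWORK: Dict[str, Synset] = {
    "animal": Synset("animal", {"en": ["animal", "beast"], "cs": ["zvíře", "živočich"]}),
    "dog": Synset("dog", {"en": ["dog", "domestic dog"], "cs": ["pes"]}, hypernym="animal"),
    "tail": Synset("tail", {"en": ["tail"], "cs": ["ocas"]}, part_of="dog"),
}


def synsets_for(word: str, lang: str, network: Dict[str, Synset]) -> List[Synset]:
    """Synonymy: find every synset that lists the word in the given language."""
    return [s for s in network.values() if word in s.words.get(lang, [])]


def translate(word: str, source: str, target: str,
              network: Dict[str, Synset]) -> List[str]:
    """Translation: collect target-language words sharing a synset with the word."""
    result: List[str] = []
    for synset in synsets_for(word, source, network):
        for candidate in synset.words.get(target, []):
            if candidate not in result:
                result.append(candidate)
    return result


def hypernym_chain(synset_id: str, network: Dict[str, Synset]) -> List[str]:
    """Subclass: follow hypernym links upward until the top of the hierarchy."""
    chain: List[str] = []
    current = network[synset_id].hypernym
    while current is not None:
        chain.append(current)
        current = network[current].hypernym
    return chain
```

For example, `translate("pes", "cs", "en", NETWORK)` looks up the synset containing the Czech word and returns its English members.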

Synonyms: Dictionary vs. thesaurus

/trac/research/raw-attachment/wiki/en/LanguageResources/handsome_corpus.png

  • from the contemporary language
  • similarity score
  • available for many languages
  • for every word used in the language
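To make the idea of a similarity score concrete, here is a rough sketch of a thesaurus derived from a corpus: words that occur in similar contexts receive a high score. It uses plain cosine similarity over co-occurrence counts, which is a deliberately simple stand-in and not necessarily the measure used by any particular tool. The function names and the three-sentence corpus in the usage note are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def context_vectors(sentences: List[str], window: int = 2) -> Dict[str, Counter]:
    """Count, for each word, the words seen within `window` positions of it."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def thesaurus(word: str, vectors: Dict[str, Counter],
              top_n: int = 5) -> List[Tuple[str, float]]:
    """Return the top_n most similar words with their similarity scores."""
    target = vectors.get(word, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    # Highest score first; ties are broken alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]
```

Given a toy corpus such as `["a handsome man smiled", "a beautiful woman smiled", "the dog barked loudly"]`, "handsome" scores higher against "beautiful" than against "barked", because the first pair shares context words. A usable thesaurus needs a corpus many orders of magnitude larger.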

/trac/research/raw-attachment/wiki/en/LanguageResources/handsome.png

Selected language resources at NLPC

  • 6 dictionaries of the Czech language, with 512,000 entries
  • synonyms
    • Czech synonyms (K. Pala): 23,000 entries, 56,000 synonyms
    • Czech WordNet: 85,592 words organized in 40,919 synonym sets
    • automatically generated thesaurus
  • translation
    • interconnected wordnets: Czech, English, Dutch, Italian, Spanish, French, Greek, Polish, Romanian, Turkish
  • specials
    • contemporary vulgar words (April 2013): 600 words/collocations + rules to detect masking
    • sign language dictionary with gesture videos
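The rules for detecting masked words are not described here, so the following is only a guess at what such rules might look like: normalize each token (undo digit and symbol substitutions, drop separators, collapse repeated letters) and compare it with a word list. The `LEET` table, the function names and the harmless placeholder words are all assumptions.

```python
import re
from typing import Iterable, List

# Hypothetical substitution table: characters commonly used to mask letters.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "@": "a", "$": "s"})


def normalize(token: str) -> str:
    """Reduce a possibly masked token to a canonical lowercase form."""
    token = token.lower().translate(LEET)
    token = re.sub(r"[^a-z]", "", token)     # drop separators such as "." or "-"
    token = re.sub(r"(.)\1+", r"\1", token)  # collapse repeated letters
    return token


def find_masked(text: str, blocklist: Iterable[str]) -> List[str]:
    """Return the tokens of `text` whose normalized form is on the blocklist."""
    # The blocklist is normalized the same way, so words with double letters
    # still match after the collapse step.
    normalized_blocklist = {normalize(word) for word in blocklist}
    return [token for token in text.split()
            if normalize(token) in normalized_blocklist]
```

With the placeholder list `["heck", "darn"]`, both "h3ck" and "d.a.r.n" are flagged. The sketch handles ASCII letters only; Czech text would need diacritics handled in `normalize`, and words split by spaces would need a rule working across tokens.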

Tools for language resources

Language resources have to be

  • built and continuously maintained
  • digitized (OCR to XML)
  • connected with other language resources
  • shared among computer programs
  • readable for humans
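Storing entries as XML is one way to meet the last two requirements at once: programs parse the structure, and the same data can be rendered for human readers. The sketch below uses Python's standard `xml.etree.ElementTree`; the element names (`entry`, `headword`, `pos`, `sense`, `definition`, `synonym`) are a made-up schema for illustration, not the format of any dictionary listed here.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Hypothetical dictionary entry (assumed schema, for illustration only).
ENTRY_XML = """
<entry id="e1">
  <headword>handsome</headword>
  <pos>adjective</pos>
  <sense>
    <definition>good-looking</definition>
    <synonym>attractive</synonym>
    <synonym>good-looking</synonym>
  </sense>
</entry>
"""


def parse_entry(xml_text: str) -> Dict[str, Any]:
    """Turn one XML entry into a plain dictionary for use by programs."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "headword": root.findtext("headword"),
        "pos": root.findtext("pos"),
        "senses": [
            {
                "definition": sense.findtext("definition"),
                "synonyms": [s.text for s in sense.findall("synonym")],
            }
            for sense in root.findall("sense")
        ],
    }


def render(entry: Dict[str, Any]) -> str:
    """Format a parsed entry as plain text for human readers."""
    lines = [f"{entry['headword']} ({entry['pos']})"]
    for number, sense in enumerate(entry["senses"], start=1):
        synonyms = ", ".join(sense["synonyms"])
        lines.append(f"  {number}. {sense['definition']}; synonyms: {synonyms}")
    return "\n".join(lines)
```

Calling `render(parse_entry(ENTRY_XML))` produces a short text entry, while other programs can work with the dictionary returned by `parse_entry` directly.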

Tools for language resources processing

  • creating, editing, importing, connecting with other resources, visualizing

Language resource tools: the DEB platform

  • platform for dictionary editing and browsing
    • strict client-server architecture
    • basically any XML data
  • server
    • server side modules
    • database backend (XML database)
  • client
    • lightweight
    • graphical interface
    • web interface
  • used in practice in 22 international scientific/commercial projects

/trac/research/raw-attachment/wiki/en/LanguageResources/world.png

Conclusions

  • language resources:
    • dictionaries
    • corpus-based thesauri
    • semantic networks (WordNet)
  • flexible and powerful tool for language resources processing: the DEB platform

Last modified on Jun 6, 2014, 1:18:21 PM
