Changes between Version 1 and Version 2 of en/MainTopics


Timestamp:
May 12, 2014, 11:26:47 AM (7 years ago)
Author:
xkocinc
Comment:

--

  • en/MainTopics

    v1 v2  
    11= What do we work on in the NLP Lab? = #What_do_we_work_on_in_the_NLP_Lab.3F
    22Try some of our language tools:
     3
    34 * [http://prirucka.ujc.cas.cz/ The Online Language Handbook]
    45 * [http://nlp.fi.muni.cz/cz_accent/ CZ accent][[BR]]''for adding diacritics''
    56 * [http://nlp.fi.muni.cz/~xpopelk/xplain/ X-Plain][[BR]]''the Activity game with a computer''
    67 * [http://nlp.fi.muni.cz/projekty/wwwajka/ Ajka][[BR]]''morphological analyzer''
    7  * [http://nlp.fi.muni.cz/projekty/wwwsynt/query.cgi Synt] and [http://nlp.fi.muni.cz/projekty/set/ SET][[BR]]''syntactic analyzers''
     8 * [http://nlp.fi.muni.cz/projekty/wwwsynt/query.cgi Synt] and [http://nlp.fi.muni.cz/projekty/set/ SET][[BR]]''syntactic analyzers''
    89
    9 The [https://nlp.fi.muni.cz/en/nlplab Natural Language Processing Centre] focuses on obtaining practical results in the field of information technologies and linguistics. Results of the projects are frequently published at various conferences. The NLP Centre also cooperates with similarly oriented institutes in the Czech Republic and abroad, and offers students the possibility to participate in student exchanges with partner universities abroad.
     10The [https://nlp.fi.muni.cz/en/nlplab Natural Language Processing Centre] focuses on obtaining practical results in the field of information technologies and linguistics. Results of the projects are frequently published at various conferences. The NLP Centre also cooperates with similarly oriented institutes in the Czech Republic and abroad, and offers students the possibility to participate in student exchanges with partner universities abroad.
    1011
    1112More detailed information follows below, grouped into chapters according to their topic:
     
    1718|| [https://nlp.fi.muni.cz/en/main_topics#semant Semantics] ||
    1819
     20== Corpora == #Corpora
     21[[Image(/trac/research/raw-attachment/wiki/cs/MainTopics/corpora.png)]]
    1922
    20 == Corpora == #Corpora
    2123A corpus is a collection of text data in electronic form. As a significant source of linguistic data, corpora make it possible to investigate many frequency-related phenomena in language, and nowadays they are an indispensable tool in NLP. In addition to corpora containing general texts, corpora for specific purposes are also produced, such as annotated, domain-specific, spoken or error corpora.
    2224
    2325Corpora are used for the investigation and development of natural language grammars. They are also helpful when developing a grammar checker, choosing entries for a dictionary, or as a data source for automatic text categorization based on machine learning. Parallel corpora consist of the same texts in several languages. They are used especially in word sense disambiguation and machine translation.
    2426
    25 Nowadays the main source of corpus texts is the World Wide Web. To obtain quality data on a larger scale, pre-processing tools for filtering undesired content need to be used: notably the '''jusText''' tool for removing boilerplate, the '''onion''' tool for removing duplicate text parts, or the '''chared''' utility for detecting text encoding. The popular '''gensim''' framework for extracting semantic topics from documents is also very useful.
     27Nowadays the main source of corpus texts is the World Wide Web. To obtain quality data on a larger scale, pre-processing tools for filtering undesired content need to be used: notably the '''jusText''' tool for removing boilerplate, the '''onion''' tool for removing duplicate text parts, or the '''chared''' utility for detecting text encoding. The popular '''gensim''' framework for extracting semantic topics from documents is also very useful.
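To illustrate the kind of filtering mentioned above, here is a minimal sketch of removing duplicate text parts by comparing word n-grams. It is not the onion tool or its actual algorithm; the function names, the n-gram size and the threshold are assumptions chosen only for illustration.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all consecutive n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(paragraphs: Iterable[str], n: int = 3,
                threshold: float = 0.5) -> List[str]:
    """Keep a paragraph unless most of its n-grams were already seen.

    Hypothetical helper: `n` and `threshold` are illustrative defaults,
    not values taken from any real tool.
    """
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.lower().split(), n)
        if not grams:
            # Too short to judge, so keep it.
            kept.append(paragraph)
            continue
        duplicate_ratio = sum(g in seen for g in grams) / len(grams)
        if duplicate_ratio <= threshold:
            kept.append(paragraph)
        # Remember the n-grams of every paragraph, kept or not.
        seen.update(grams)
    return kept


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "corpora are collections of text data in electronic form",
    ]
    for paragraph in deduplicate(sample):
        print(paragraph)
```

Working on n-grams rather than whole paragraphs lets such a filter also catch near-duplicates that differ in only a few words, at the cost of keeping the set of seen n-grams in memory.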
    2628
    27 The NLP Centre has produced a complete set of tools for creating and managing corpora, the '''Corpus Architect'''. It can store and manage corpora containing 100+ billion word tokens.
     29The NLP Centre has produced a complete set of tools for creating and managing corpora, the '''Corpus Architect'''. It can store and manage corpora containing 100+ billion word tokens.
     30
     31[[Image(/trac/research/raw-attachment/wiki/cs/MainTopics/metatrans.png)]]
    2832
    2933''Related projects:''
     
    3135 * [http://nlp.fi.muni.cz/projekty/bonito/ Bonito]
    3236
    33 
    3437 * [http://ske.fi.muni.cz/ Corpus Architect]
    35 
    3638
    3739 * [http://www.sketchengine.co.uk/ Word Sketch Engine]
    3840
    39 
    4041 * [http://nlp.fi.muni.cz/projekty/cpa/ CPA]
    41 
    4242
    4343 * [http://nlp.fi.muni.cz/projekty/justext/ jusText]
    4444
    45 
    4645 * [http://code.google.com/p/onion/ onion]
    47 
    4846
    4947 * [http://code.google.com/p/chared/ chared]
    5048
    51 
    5249 * [http://radimrehurek.com/gensim/index.html Gensim]
    5350
    54 
    55 
    56 ([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    57 
     51([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    5852
    5953== Dictionaries == #Dictionaries
     54
     55[[Image(/trac/research/raw-attachment/wiki/cs/MainTopics/debII_slovniky.png, align=right)]]
     56
    6057Dictionaries have always been a fundamental part of every linguist's basic equipment. However, handling paper dictionaries is rather inconvenient. Therefore, one of the first projects of the NLP Centre was to digitize classic dictionaries of Czech and develop a set of advanced tools for processing lexicographic data, a so-called lexicographer's workbench. This term refers to a system that enables each expert user to easily access various linguistic resources and provides them with an application interface for searching and editing data.
    6158
    62 One of our projects related to dictionaries is the development of '''the DEB platform''', which offers all the above-mentioned features thanks to its client-server architecture. One of the client applications is the '''DEBDict''' dictionary viewer, which contains, apart from digitized dictionaries, several encyclopedias as well as an onomastic and a phraseological dictionary. Applications for DEB are developed in the XUL language and are available as extensions for the Firefox web browser. 
     59One of our projects related to dictionaries is the development of '''the DEB platform''', which offers all the above-mentioned features thanks to its client-server architecture. One of the client applications is the '''DEBDict''' dictionary viewer, which contains, apart from digitized dictionaries, several encyclopedias as well as an onomastic and a phraseological dictionary. Applications for DEB are developed in the XUL language and are available as extensions for the Firefox web browser.
    6360
    6461''Related projects:''
     
    6663 * [http://nlp.fi.muni.cz/projekty/deb2/ DEB II]
    6764
     65 * [http://nlp.fi.muni.cz/projekty/deb2/debdict/ DEBDict]
    6866
    69    * [http://nlp.fi.muni.cz/projekty/deb2/debdict/ DEBDict]
    70 
    71 
    72    * [http://nlp.fi.muni.cz/projekty/deb2/#debvisdic DEBVisDic]
    73 
     67 * [http://nlp.fi.muni.cz/projekty/deb2/#debvisdic DEBVisDic]
    7468
    7569 * [http://nlp.fi.muni.cz/publications/slovko2005_ydana_hales/slovko2005_ydana_hales.pdf Verbalex]
    7670
    77 
    7871 * [http://metatrans.fi.muni.cz/ MetaTrans]
    79 
    8072
    8173 * [http://nlp.fi.muni.cz/projekty/cpa/ CPA]
    8274
    83 
    84 
    85 ([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    86 
     75([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    8776
    8877== Morphology == #Morphology
     78
     79[[Image(/trac/research/raw-attachment/wiki/cs/MainTopics/majka_nlpportal.png, align=right)]]
     80
    8981Morphological analysis gives a basic insight into natural language by studying how to distinguish and generate grammatical forms of words arising through inflection (i.e. declension and conjugation). This involves considering a set of tags describing the grammatical categories of the word form concerned, most notably, its base form (lemma) and paradigm. Automatic analysis of word forms in free text can be used, for instance, in grammar checker development, and can aid corpus tagging or semi-automatic dictionary compilation.
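As a rough illustration of what such an analysis returns, the sketch below maps a word form to its possible lemmas and tags using a tiny hand-written lexicon. It does not reflect how ajka is implemented or which tagset it uses; the lexicon entries, the tag notation and the `analyze` function are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Analysis(NamedTuple):
    lemma: str  # base form of the word
    tag: str    # grammatical categories, in a made-up dotted notation


# Hypothetical toy lexicon: one Czech noun with a few of its forms.
LEXICON: Dict[str, List[Analysis]] = {
    "žena": [Analysis("žena", "noun.fem.sg.nom")],
    "ženy": [
        Analysis("žena", "noun.fem.sg.gen"),
        Analysis("žena", "noun.fem.pl.nom"),
        Analysis("žena", "noun.fem.pl.acc"),
    ],
    "ženě": [
        Analysis("žena", "noun.fem.sg.dat"),
        Analysis("žena", "noun.fem.sg.loc"),
    ],
}


def analyze(word_form: str) -> List[Analysis]:
    """Return every lemma/tag reading of a word form (empty if unknown)."""
    return LEXICON.get(word_form.lower(), [])


if __name__ == "__main__":
    for reading in analyze("Ženy"):
        print(reading.lemma, reading.tag)
```

A single form such as "ženy" has several readings, which is why an analyzer returns a list; choosing the right reading in context is a separate disambiguation step.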
    9082
    91 The NLP Centre has produced a general morphological analyzer for Czech, '''ajka''', which covers a vocabulary of over 6 million word forms. It further served as a base for a similar analyzer for Slovak, the '''fispell''' grammar checker, the '''czaccent''' converter of ASCII text to text with diacritics, and an interactive interface for the IM Jabber protocol.
     83The NLP Centre has produced a general morphological analyzer for Czech, '''ajka''', which covers a vocabulary of over 6 million word forms. It further served as a base for a similar analyzer for Slovak, the '''fispell''' grammar checker, the '''czaccent''' converter of ASCII text to text with diacritics, and an interactive interface for the IM Jabber protocol.
    9284
    9385''Related projects:''
     
    9587 * [http://nlp.fi.muni.cz/projekty/ajka/ Ajka]
    9688
    97 
    9889 * [http://nlp.fi.muni.cz/ma/free.html Fajka (the analyzer with free data)]
    99 
    10090
    10191 * [http://nlp.fi.muni.cz/cz_accent/ CZ accent]
    10292
    103 
    104 
    105 ([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    106 
     93([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    10794
    10895== Syntactic Analysis == #Syntactic_Analysis
     96
     97[[Image(/trac/research/raw-attachment/wiki/cs/MainTopics/synt_tree.png, align=right)]]
     98
    10999The goal of syntactic analysis is to determine whether the text string on input is a sentence in the given (natural) language. If it is, the result of the analysis contains a description of the syntactic structure of the sentence, for example in the form of a derivation tree. Such formalizations are aimed at making computers "understand" grammar of natural languages. Syntactic analysis can be utilized for instance when developing a punctuation corrector, dialogue systems with a natural language interface, or as a building block in a machine translation system. Czech is a language exhibiting rich inflection and free word order and thus belongs to the languages that are very hard to analyze, as it requires more grammar rules than most other languages.
    110100
    111 The NLP Centre is developing several syntactic analyzers. The '''synt''' syntactic analyzer is based on a handcrafted Czech meta-grammar enhanced by semantic actions and contextual constraints. '''SET''' is a popular lightweight syntactic analyzer based on a set of patterns. Both '''synt''' and '''SET''' perform syntactic analysis of Czech sentences with an accuracy close to 90%. For educational purposes we have a simple syntactic analyzer, '''Zuzana'''.
     101The NLP Centre is developing several syntactic analyzers. The '''synt''' syntactic analyzer is based on a handcrafted Czech meta-grammar enhanced by semantic actions and contextual constraints. '''SET''' is a popular lightweight syntactic analyzer based on a set of patterns. Both '''synt''' and '''SET''' perform syntactic analysis of Czech sentences with an accuracy close to 90%. For educational purposes we have a simple syntactic analyzer, '''Zuzana'''.
    112102
    113103''Related projects:''
     
    115105 * [http://nlp.fi.muni.cz/projekty/wwwsynt/ Synt]
    116106
    117 
    118107 * [http://nlp.fi.muni.cz/projekty/set/ SET]
    119 
    120108
    121109 * [http://nlp.fi.muni.cz/projekty/zuzana/ Zuzana]
    122110
    123 
    124 
    125 ([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    126 
     111([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
    127112
    128113== Semantics == #Semantics
     114
     115[[Image(/trac/research/raw-attachment/wiki/cs/MainTopics/dict2_small.png, align=left)]]
     116
    129117Semantic and pragmatic analysis make up the most complex phase of language processing, as they build on the results of all the above-mentioned disciplines. The ultimate touchstone on this level is machine translation, which hasn't been implemented for Czech with satisfactory results yet.
    130118
    131 One of the long-term projects of the NLP Centre is the use of '''Transparent Intensional Logic (TIL)''' as a semantic representation of knowledge and subsequently as a transfer language in automatic machine translation. At the current stage, it is realistic to process knowledge in a simpler form - considerably less complex tasks have been addressed, such as machine translation for a restricted domain (e.g. official documents and weather reports), or semi-automatic machine translation between close languages. The resources exploited in these applications are corpora, semantic nets, and electronic dictionaries.
     119One of the long-term projects of the NLP Centre is the use of '''Transparent Intensional Logic (TIL)''' as a semantic representation of knowledge and subsequently as a transfer language in automatic machine translation. At the current stage, it is realistic to process knowledge in a simpler form - considerably less complex tasks have been addressed, such as machine translation for a restricted domain (e.g. official documents and weather reports), or semi-automatic machine translation between close languages. The resources exploited in these applications are corpora, semantic nets, and electronic dictionaries.
    132120
    133 In the field of representation of meaning and knowledge, NLP Centre members made a notable contribution to the '''EuroWordNet''' and '''Balkanet''' projects, which were aimed at building a multilingual '''WordNet'''-like semantic net. 
     121In the field of representation of meaning and knowledge, NLP Centre members made a notable contribution to the '''EuroWordNet''' and '''Balkanet''' projects, which were aimed at building a multilingual '''WordNet'''-like semantic net.
    134122
    135123''Related projects:''
     
    137125 * [http://nlp.fi.muni.cz/projekty/deb2/#debvisdic DEBVisDic]
    138126
    139 
    140127 * [http://www.fi.muni.cz/~hales/disert/ Logical Analysis of Czech Sentences in TIL]
    141 
    142128
    143129 * [http://nlp.fi.muni.cz/projekty/vizualni_lexikon/ Visual Browser]
    144130
    145 
    146131 * [http://radimrehurek.com/gensim/index.html Gensim]
    147 
    148 
    149132
    150133''Animated demonstration of the Visual Browser:''
    151134
    152  * [https://nlp.fi.muni.cz/en/main_topics/VlDemoGif in GIF format (simplified)]
    153 
    154 
    155 
    156 ([https://nlp.fi.muni.cz/en/main_topics#guidepost back to the list of topics]) [[BR]]
     135[[Image(/trac/research/raw-attachment/wiki/cs/MainTopics/vl_anim.gif)]]
    157136
    158137== Further information == #Further_information
    159 
    160138 * [http://nlp.fi.muni.cz/projekty/ List of selected NLPlab projects]
    161 
    162139
    163140 * [https://nlp.fi.muni.cz/nlpis/baliky.php?lang=en&type=free Currently offered thesis topics]
    164141
    165 
    166142 * [https://nlp.fi.muni.cz/en/nlplab NLP lab homepage]