Changes between Version 9 and Version 10 of en/MainTopics


Timestamp: Feb 18, 2025, 10:10:25 AM
Author: Zuzana Nevěřilová
Comment: --

  • en/MainTopics

    v9 v10  
    1111 * [http://prirucka.ujc.cas.cz/ The Online Language Handbook]
    1212 * [http://nlp.fi.muni.cz/cz_accent/ CZ accent][[BR]]''for adding diacritics''
    13  * [http://nlp.fi.muni.cz/~xpopelk/xplain/ X-Plain][[BR]]''the Activity game with a computer''
    1413 * [http://nlp.fi.muni.cz/projekty/wwwajka/ Ajka][[BR]]''morphological analyzer''
    1514 * [http://nlp.fi.muni.cz/projekty/wwwsynt/query.cgi Synt] and [http://nlp.fi.muni.cz/projekty/set/ SET][[BR]]''syntactic analyzers''
     15 * [http://nlp.fi.muni.cz/languageservices Language Services][[BR]]''aggregate API''
    1616
    17 The [https://nlp.fi.muni.cz/en/ Natural Language Processing Centre] focuses on obtaining practical results in the field of information technologies and linguistics. Results of the projects are frequently published at various conferences, the NLP Centre also cooperates with similarly oriented institutes in Czech Republic and abroad, and offers students the possibility to participate in student exchange with partner universities abroad.
     17 The [https://nlp.fi.muni.cz/en/ Natural Language Processing Centre] focuses on obtaining practical results in the fields of language modeling, information technology, and linguistics. Results of its projects are frequently published at various conferences. The NLP Centre also cooperates with similarly oriented institutes in the Czech Republic and abroad, and offers students the opportunity to take part in exchanges with partner universities.
    1818
    1919More detailed information follows below, grouped into chapters according to their topic:
     20 * [[en/MainTopics#model| Language Modeling]]
    2021 * [[en/MainTopics#corp| Corpora]]
    2122 * [[en/MainTopics#dict| Dictionaries]]
     
    2425 * [[en/MainTopics#semant| Semantics]]
    2526
     27== Language Modeling == #model
     28
     29 Language modeling is the prevalent approach to NLP. Thanks to huge amounts of text data, language models represent natural languages accurately and can:
     30 * classify tokens or sequences of tokens,
     31 * predict new tokens based on a sequence of preceding tokens.
     32Language models are used in natural language generation (NLG), text summarization, sentiment analysis, named entity recognition, and many other NLP tasks.
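As a minimal illustration of these two uses, the sketch below classifies tokens and predicts a continuation with off-the-shelf pretrained models via the Hugging Face transformers pipelines; the library and model names are assumptions chosen for the example, not tools of the NLP Centre.

{{{#!python
# Minimal sketch of the two capabilities above, using the Hugging Face
# transformers library with publicly available pretrained models
# (assumptions for this example; not NLP Centre tools).
from transformers import pipeline

# 1) Classify tokens: here, named entity recognition.
ner = pipeline("token-classification", model="dslim/bert-base-NER")
print(ner("The Natural Language Processing Centre is located in Brno."))

# 2) Predict new tokens based on the preceding sequence.
generator = pipeline("text-generation", model="gpt2")
print(generator("Language models can", max_new_tokens=10))
}}}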
     33
     34=== !BenCzechMark ===
     35
     36 The NLP Centre participates in the !BenCzechMark project, which aims to provide a universal benchmark of large language models for Czech. The project is joint work of the Brno University of Technology, the Faculty of Informatics, Mendel University, and other institutions.
     37
     38Our contribution was to provide benchmark tasks for:
     39* [https://nlp.fi.muni.cz/trac/propaganda Propaganda] text annotation
     40* [https://nlp.fi.muni.cz/projekty/sqad/ SQAD] - question answering dataset
     41* [https://www.umimeto.org/ Umime.to] assessments
     42* [https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-5548 natural language inference]
     43
     44The !BenCzechMark leaderboard is available at [https://huggingface.co/spaces/CZLC/BenCzechMark].
     45
     46=== Slama ===
     47
     48 Is it possible to train a language model from scratch? Yes. See Slama, the Slavic Large Language Model, described in a [https://nlp.fi.muni.cz/raslan/2024/paper13.pdf RASLAN 2024 paper].
     49
    2650== Corpora == #corp
    2751[[Image(/trac/research/raw-attachment/wiki/en/MainTopics/corpora.png)]]
    2852
    29 Corpus is a collection of text data in electronic form. As a significant source of linguistic data, corpora make it possible to investigate many frequency-related phenomena in language, and nowadays they are an indispensable tool in NLP. In addition to corpora containing general texts, corpora for specific purposes are also produced, such as annotated, domain-specific, spoken or error corpora.
     53A corpus is a collection of text data in electronic form. As a significant source of linguistic data, corpora make it possible to investigate many frequency-related phenomena in language, and nowadays, they are an indispensable tool in NLP. In addition to corpora containing general texts, corpora for specific purposes are also produced, such as annotated, domain-specific, spoken, or error corpora.
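As a toy illustration of such a frequency-related investigation, the sketch below counts word-form frequencies over a few sentences with plain Python; real corpus work relies on dedicated corpus tools rather than on this snippet.

{{{#!python
# Toy illustration of a frequency query over a (tiny) corpus.
# Real corpora are processed with dedicated corpus tools; this is only a sketch.
from collections import Counter
import re

corpus = [
    "A corpus is a collection of text data in electronic form.",
    "Corpora are an indispensable tool in NLP.",
    "Annotated corpora serve specific purposes.",
]

tokens = []
for sentence in corpus:
    # Naive tokenization: lowercase alphabetic word forms only.
    tokens.extend(re.findall(r"[a-z]+", sentence.lower()))

# The ten most frequent word forms in this tiny corpus.
print(Counter(tokens).most_common(10))
}}}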
    3054
    31 Corpora are used for investigation and development of natural language grammars. They are further helpful when developing a grammar checker, choosing entries for a dictionary or as a data source for automatic text categorization based on machine learning. Parallel corpora comprise of identical texts in various languages. They are used especially in word sense disambiguation and machine translation.
     55 Corpora are the core resource for language modeling. They are also used for research on natural language grammars, and they are helpful for developing grammar checkers, selecting dictionary entries, or serving as a data source for automatic text categorization based on machine learning. Parallel corpora contain the same texts in several languages; they are used especially in word sense disambiguation and machine translation.
    3256
    33 Nowadays the main source of corpus texts is the World Wide Web. To obtain quality data on a larger scale, pre-processing tools for filtering undesired content need to be used: notably the '''jusText''' tool for removing boilerplate, the'''onion''' tool for removing duplicate text parts, or the '''chared''' utility for detecting text encoding. Very useful is also the popular '''gensim''' framework for extracting semantic topics from documents.
     57 Nowadays, the main source of corpus texts is the World Wide Web. To obtain quality data at a larger scale, pre-processing tools for filtering undesired content need to be used: notably the '''jusText''' tool for removing boilerplate, the '''onion''' tool for removing duplicate text parts, and the '''chared''' utility for detecting text encoding. The popular '''gensim''' framework for extracting semantic topics from documents is also very useful.
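As a minimal sketch of this pre-processing step, the example below strips boilerplate from a downloaded web page with jusText; the URL and the English stop list are placeholders chosen only for illustration.

{{{#!python
# Minimal sketch of boilerplate removal with jusText; the URL and the
# English stop list are placeholders for illustration.
import requests
import justext

response = requests.get("https://nlp.fi.muni.cz/en/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        # Keep only the paragraphs classified as real content.
        print(paragraph.text)
}}}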
    3458
    3559The NLP Centre has produced a complete set of tools for creating and managing corpora, the '''Corpus Architect'''. It can store and manage corpora containing 100+ billion word tokens.