Changes between Version 9 and Version 10 of en/MainTopics
Timestamp: Feb 18, 2025, 10:10:25 AM (5 months ago)
en/MainTopics
  * [http://prirucka.ujc.cas.cz/ The Online Language Handbook]
  * [http://nlp.fi.muni.cz/cz_accent/ CZ accent][[BR]]''for adding diacritics''
- * [http://nlp.fi.muni.cz/~xpopelk/xplain/ X-Plain][[BR]]''the Activity game with a computer''
  * [http://nlp.fi.muni.cz/projekty/wwwajka/ Ajka][[BR]]''morphological analyzer''
  * [http://nlp.fi.muni.cz/projekty/wwwsynt/query.cgi Synt] and [http://nlp.fi.muni.cz/projekty/set/ SET][[BR]]''syntactic analyzers''
+ * [http://nlp.fi.muni.cz/languageservices Language Services][[BR]]''aggregate API''

- The [https://nlp.fi.muni.cz/en/ Natural Language Processing Centre] focuses on obtaining practical results in the field of information technologies and linguistics. Results of the projects are frequently published at various conferences. The NLP Centre also cooperates with similarly oriented institutes in the Czech Republic and abroad, and offers students the possibility to participate in student exchanges with partner universities abroad.
+ The [https://nlp.fi.muni.cz/en/ Natural Language Processing Centre] focuses on obtaining practical results in the field of language modeling, information technologies, and linguistics. Results of the projects are frequently published at various conferences. The NLP Centre also cooperates with similarly oriented institutes in the Czech Republic and abroad, and offers students the possibility to participate in student exchanges with partner universities abroad.

  More detailed information follows below, grouped into chapters according to their topic:
+ * [[en/MainTopics#model| Language Modeling]]
  * [[en/MainTopics#corp| Corpora]]
  * [[en/MainTopics#dict| Dictionaries]]
  …
  * [[en/MainTopics#semant| Semantics]]

+ == Language Modeling == #model
+
+ Language modeling is the prevalent approach to NLP.
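As a minimal sketch of what "predicting the next token" means, a bigram model counts which token most often follows each token. The code below is purely illustrative (all names are invented for this example), not an NLP Centre tool:

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """For each token, count how often each other token follows it."""
    model = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        model[current][nxt] += 1
    return model

def predict_next(model, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(tokens)
print(predict_next(model, "the"))  # prints "cat" ("the cat" occurs twice)
```

Real language models replace these counts with neural networks trained on billions of tokens, but the interface is the same: a distribution over next tokens given the preceding context.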
+ Thanks to huge text data, language models accurately represent natural languages and can:
+ * classify tokens or sequences of tokens,
+ * predict new tokens based on a sequence of previous words.
+ Language models are used in natural language generation (NLG), text summarization, sentiment analysis, named entity recognition, and many other NLP tasks.
+
+ === !BenCzechMark ===
+
+ The NLP Centre participates in the !BenCzechMark project, which aims to provide a universal large language model benchmark for Czech. The project is a joint work of the Brno University of Technology, the Faculty of Informatics, Mendel University, and other institutions.
+
+ Our contribution was to provide benchmark tasks for:
+ * [https://nlp.fi.muni.cz/trac/propaganda Propaganda] text annotation
+ * [https://nlp.fi.muni.cz/projekty/sqad/ SQAD] - a question answering dataset
+ * [https://www.umimeto.org/ Umime.to] assessments
+ * [https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-5548 natural language inference]
+
+ The !BenCzechMark leaderboard is available at [https://huggingface.co/spaces/CZLC/BenCzechMark].
+
+ === Slama ===
+
+ Is it possible to train a language model from scratch? Yes. See Slama, the Slavic Large Language Model: [https://nlp.fi.muni.cz/raslan/2024/paper13.pdf RASLAN 2024 paper].
+
  == Corpora == #corp
  [[Image(/trac/research/raw-attachment/wiki/en/MainTopics/corpora.png)]]

- Corpus is a collection of text data in electronic form. As a significant source of linguistic data, corpora make it possible to investigate many frequency-related phenomena in language, and nowadays they are an indispensable tool in NLP. In addition to corpora containing general texts, corpora for specific purposes are also produced, such as annotated, domain-specific, spoken or error corpora.
+ A corpus is a collection of text data in electronic form.
+ As a significant source of linguistic data, corpora make it possible to investigate many frequency-related phenomena in language, and nowadays they are an indispensable tool in NLP. In addition to corpora containing general texts, corpora for specific purposes are also produced, such as annotated, domain-specific, spoken, or error corpora.

- Corpora are used for investigation and development of natural language grammars. They are further helpful when developing a grammar checker, choosing entries for a dictionary, or as a data source for automatic text categorization based on machine learning. Parallel corpora comprise of identical texts in various languages. They are used especially in word sense disambiguation and machine translation.
+ Corpora are the core technology for language modeling. They are also used for research in natural language grammars. They are further helpful when developing a grammar checker, choosing entries for a dictionary, or as a data source for automatic text categorization based on machine learning. Parallel corpora comprise identical texts in various languages. They are used especially in word sense disambiguation and machine translation.

- Nowadays the main source of corpus texts is the World Wide Web. To obtain quality data on a larger scale, pre-processing tools for filtering undesired content need to be used: notably the '''jusText''' tool for removing boilerplate, the '''onion''' tool for removing duplicate text parts, or the '''chared''' utility for detecting text encoding. Very useful is also the popular '''gensim''' framework for extracting semantic topics from documents.
+ Nowadays, the main source of corpus texts is the World Wide Web. To obtain quality data on a larger scale, pre-processing tools for filtering undesired content need to be used: notably the '''jusText''' tool for removing boilerplate, the '''onion''' tool for removing duplicate text parts, and the '''chared''' utility for detecting text encoding.
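The idea behind duplicate removal can be illustrated by comparing sets of overlapping word n-grams ("shingles"). The sketch below shows only the general technique; it is not the actual '''onion''' implementation:

```python
def shingles(text, n=5):
    """Set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def resemblance(a, b, n=5):
    """Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)

def deduplicate(docs, n=5, threshold=0.5):
    """Keep each document unless it closely resembles one already kept."""
    kept = []
    for doc in docs:
        if all(resemblance(doc, k, n) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog yesterday",
    "a completely different sentence about corpus linguistics tools",
]
print(len(deduplicate(docs, n=3, threshold=0.5)))  # prints 2: near-duplicate dropped
```

Production tools avoid the quadratic pairwise comparison above (for example by hashing shingles), which matters at web scale.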
+ The popular '''gensim''' framework for extracting semantic topics from documents is very useful.

  The NLP Centre has produced a complete set of tools for creating and managing corpora, the '''Corpus Architect'''. It can store and manage corpora containing 100+ billion word tokens.
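The semantic-topic idea mentioned above can be illustrated with a stdlib-only TF-IDF sketch that scores the words characterizing one document against the rest of a collection. gensim's actual models (LDA, word2vec, and others) are far more sophisticated; this code is a hypothetical illustration only:

```python
import math
from collections import Counter

def top_terms(docs, doc_index, k=3):
    """Rank words of one document by TF-IDF against the whole collection."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for words in tokenized for word in set(words))
    tf = Counter(tokenized[doc_index])
    scores = {
        word: count / len(tokenized[doc_index]) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

docs = [
    "corpus tools remove boilerplate and duplicate text",
    "language models predict the next word in text",
    "morphological analysis assigns tags to each word",
]
print(top_terms(docs, 0, k=2))  # words unique to the first document score highest
```

Words shared by every document get an IDF of zero, so the top-ranked terms are the ones specific to the chosen document — a crude stand-in for its "topic".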