= A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

== Description

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).

=== Example

{{{
Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER ,/O který/O mu/O prokazoval/O věrné/O služby/O a/O kterého/O chce/O Zikmund/B-PER touto/O odměnou/O povzbuditi/O k/O ještě/O usilovnější/O službě/O ,/O vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC s/O alody/O ,/O poplužími/O ,/O obdělávanými/O i/O neobdělávanými/O poli/O ,/O platy/O ,/O službami/O ,/O robotami/O ,/O loukami/O ,/O pastvinami/O ,/O vodami/O ,/O vodními/O toky/O ,/O mlýny/O ,/O všemi/O příjmy/O a/O vším/O příslušenstvím/O ./O
}}}

More information about the dataset can be found at the [https://nlp.fi.muni.cz/trac/ahisto/wiki/NerDataset AHISTO project page].

== LINDAT handle

http://hdl.handle.net/11234/1-5024

== Acknowledgements

If you use the dataset, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: http://hdl.handle.net/11234/1-5024.

Project code: LM2018101
Project name: LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities

=== TAČR

Project code: TL03000365
Project name: Accessible historical sources. Making medieval written documents available in the form of a contextual database

== Publication info

- BANKOVIČ, Mikuláš, Vít NOVOTNÝ and Petr SOJKA. Application of Super-Resolution Models in Optical Character Recognition of Czech Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 11-18. ISBN 978-80-263-1670-1.
- NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 29-39.

If you cite the dataset, please use this citation:

{{{
@inproceedings{DBLP:conf/raslan/NovotnySVH21,
  author    = {V{\'{\i}}t Novotn{\'{y}} and Krist{\'{y}}na Seidlov{\'{a}} and Tereza Vrabcov{\'{a}} and Ales Hor{\'{a}}k},
  editor    = {Ales Hor{\'{a}}k and Pavel Rychl{\'{y}} and Adam Rambousek},
  title     = {When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts},
  booktitle = {The 15th Workshop on Recent Advances in Slavonic Natural Languages Processing, {RASLAN} 2021, Karlova Studanka, Czech Republic, December 10-12, 2021},
  pages     = {29--39},
  publisher = {Tribun {EU}},
  year      = {2021},
  url       = {http://nlp.fi.muni.cz/raslan/2021/paper10.pdf},
  timestamp = {Tue, 18 Jan 2022 17:52:53 +0100},
  biburl    = {https://dblp.org/rec/conf/raslan/NovotnySVH21.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
}}}

== License

Public Domain Dedication (CC Zero)
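== Reading the annotations

The Example section shows each token joined to its NER tag with a slash (`token/TAG`), with `B-` opening an entity, `I-` continuing it, and `O` marking tokens outside any entity. The following is a minimal sketch of how such a line might be read in Python. It assumes one whitespace-tokenized sentence per line, exactly as in the example. The actual file layout of the released dataset is not described on this page, and the function names `parse_tagged_sentence` and `extract_entities` are hypothetical helpers, not part of any released tooling.

```python
from typing import List, Tuple


def parse_tagged_sentence(line: str) -> List[Tuple[str, str]]:
    """Split a 'token/TAG token/TAG ...' line into (token, tag) pairs.

    Assumption: items are separated by whitespace and the tag follows
    the LAST slash, so a token that itself contains '/' stays intact.
    """
    pairs = []
    for item in line.split():
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, entity type) spans."""
    entities = []
    current_tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in pairs:
        if tag.startswith("B-"):
            # A B- tag always closes the previous entity and opens a new one.
            flush()
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # An O tag (or an I- tag with no matching open entity, which
            # this sketch simply skips) closes the open entity.
            flush()
            current_tokens = []
            current_type = None
    flush()
    return entities


# A shortened excerpt of the sentence from the Example section.
sample = (
    "Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER "
    ",/O vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC ./O"
)
entities = extract_entities(parse_tagged_sentence(sample))
for text, entity_type in entities:
    print(entity_type, text)
```

Splitting on the last slash rather than the first is a defensive choice for tokens that might contain a slash themselves. Whether such tokens occur in the dataset is not stated here.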