| 1 | = A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents |
| 2 | |
| 3 | == Description |
| 4 | |
| 5 | This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER). |
| 6 | |
| 7 | === Example |
| 8 | |
| 9 | {{{ |
| 10 | Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER ,/O |
| 11 | který/O mu/O prokazoval/O věrné/O služby/O a/O kterého/O chce/O Zikmund/B-PER |
| 12 | touto/O odměnou/O povzbuditi/O k/O ještě/O usilovnější/O službě/O ,/O |
| 13 | vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC s/O alody/O ,/O |
| 14 | poplužími/O ,/O obdělávanými/O i/O neobdělávanými/O poli/O ,/O platy/O ,/O |
| 15 | službami/O ,/O robotami/O ,/O loukami/O ,/O pastvinami/O ,/O vodami/O ,/O |
| 16 | vodními/O toky/O ,/O mlýny/O ,/O všemi/O příjmy/O a/O vším/O |
| 17 | příslušenstvím/O ./O |
| 18 | }}} |
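The example above appears to follow the common BIO tagging scheme: each whitespace-separated item is a `token/TAG` pair, where `B-` opens an entity, `I-` continues it, and `O` marks a token outside any entity. The following minimal sketch shows one way to read such a line and group tokens into entities. The function names `parse_tagged_sentence` and `extract_entities` are hypothetical and not part of the dataset's tooling, and the actual file layout of the released dataset may differ from this single-line assumption.

```python
def parse_tagged_sentence(line):
    """Split whitespace-separated token/TAG items into (token, tag) tuples."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash so that tokens containing "/" stay intact.
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs):
    """Group B-/I- tagged tokens into (entity text, entity type) tuples."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A new entity starts; close the previous one, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag (or a stray I- tag, which this sketch drops): close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Shortened excerpt of the example sentence above.
sample = (
    "Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER ,/O "
    "vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC"
)
print(extract_entities(parse_tagged_sentence(sample)))
# [('Král Zikmund', 'PER'), ('Petrovi z Michalovic', 'PER'), ('Předměřice nad Jizerou', 'LOC')]
```

Note that in this annotation style the preposition inside a multi-word name (`z`, `nad`) carries an `I-` tag, so the whole phrase is recovered as a single entity.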
| 19 | |
| 20 | More information about the dataset can be found at the [https://nlp.fi.muni.cz/trac/ahisto/wiki/NerDataset AHISTO project page]. |
| 21 | |
| 22 | == LINDAT handle |
| 23 | |
| 24 | http://hdl.handle.net/11234/1-5024 |
| 25 | |
| 26 | == Acknowledgements |
| 27 | |
| 28 | If you use the dataset, please cite the related publication as well as the LINDAT/CLARIAH-CZ infrastructure: http://hdl.handle.net/11234/1-5024. |
| 29 | |
| 30 | Project code: LM2018101 |
| 31 | |
| 32 | Project name: LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities |
| 33 | |
| 34 | === TAČR |
| 35 | |
| 36 | Project code: TL03000365 |
| 37 | |
| 38 | Project name: Accessible historical sources. Making medieval written documents available in the form of a contextual database |
| 39 | |
| 40 | |
| 41 | == Publication info |
| 42 | |
| 43 | - BANKOVIČ, Mikuláš, Vít NOVOTNÝ and Petr SOJKA. Application of Super-Resolution Models in Optical Character Recognition of Czech Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 11-18. ISBN 978-80-263-1670-1. |
| 44 | - NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 29-39. |
| 45 | |
| 46 | If you cite the dataset, please use this citation: |
| 47 | |
| 48 | {{{ |
| 49 | @inproceedings{DBLP:conf/raslan/NovotnySVH21, |
| 50 | author = {V{\'{\i}}t Novotn{\'{y}} and |
| 51 | Krist{\'{y}}na Seidlov{\'{a}} and |
| 52 | Tereza Vrabcov{\'{a}} and |
| 53 | Ales Hor{\'{a}}k}, |
| 54 | editor = {Ales Hor{\'{a}}k and |
| 55 | Pavel Rychl{\'{y}} and |
| 56 | Adam Rambousek}, |
| 57 | title = {When Tesseract Brings Friends: Layout Analysis, Language Identification, |
| 58 | and Super-Resolution in the Optical Character Recognition of Medieval |
| 59 | Texts}, |
| 60 | booktitle = {The 15th Workshop on Recent Advances in Slavonic Natural Languages |
| 61 | Processing, {RASLAN} 2021, Karlova Studanka, Czech Republic, December |
| 62 | 10-12, 2021}, |
| 63 | pages = {29--39}, |
| 64 | publisher = {Tribun {EU}}, |
| 65 | year = {2021}, |
| 66 | url = {http://nlp.fi.muni.cz/raslan/2021/paper10.pdf}, |
| 67 | timestamp = {Tue, 18 Jan 2022 17:52:53 +0100}, |
| 68 | biburl = {https://dblp.org/rec/conf/raslan/NovotnySVH21.bib}, |
| 69 | bibsource = {dblp computer science bibliography, https://dblp.org} |
| 70 | } |
| 71 | }}} |
| 72 | |
| 73 | == License |
| 74 | |
| 75 | Public Domain Dedication (CC Zero) |
| 76 | |