A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents
Description
This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
Example
Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER ,/O který/O mu/O prokazoval/O věrné/O služby/O a/O kterého/O chce/O Zikmund/B-PER touto/O odměnou/O povzbuditi/O k/O ještě/O usilovnější/O službě/O ,/O vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC s/O alody/O ,/O poplužími/O ,/O obdělávanými/O i/O neobdělávanými/O poli/O ,/O platy/O ,/O službami/O ,/O robotami/O ,/O loukami/O ,/O pastvinami/O ,/O vodami/O ,/O vodními/O toky/O ,/O mlýny/O ,/O všemi/O příjmy/O a/O vším/O příslušenstvím/O ./O
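The example above writes each token as `token/TAG`, where the tags appear to follow the common BIO scheme (`B-` opens an entity, `I-` continues it, `O` marks tokens outside any entity). As a minimal sketch, assuming that inline format, the following Python splits such a line into token and tag pairs and groups them into entity spans. The function names `parse_tagged_sentence` and `extract_entities` are hypothetical, the sample is a shortened version of the sentence above, and the files in the released dataset may be laid out differently.

```python
from typing import List, Optional, Tuple


def parse_tagged_sentence(line: str) -> List[Tuple[str, str]]:
    """Split a whitespace-separated 'token/TAG' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash only, so a token containing '/' stays intact.
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) spans."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity collected so far, if there is one.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens = []
        current_type = None

    for token, tag in pairs:
        if tag.startswith("B-"):
            flush()
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # 'O' tags (and stray 'I-' tags without a matching 'B-') end the span.
            flush()
    flush()
    return entities


# Shortened version of the example sentence (an assumption for illustration).
sample = (
    "Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER "
    ",/O vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC ./O"
)
entities = extract_entities(parse_tagged_sentence(sample))
for text, entity_type in entities:
    print(f"{entity_type}\t{text}")
```

Splitting on the last slash keeps punctuation tokens such as `,/O` and any token that itself contains a slash parseable without special cases.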
More information about the database can be found at the AHISTO project page.
LINDAT handle
http://hdl.handle.net/11234/1-5024
Acknowledgements
If you use the dataset, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: http://hdl.handle.net/11234/1-5024.
Project code: LM2018101
Project name: LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities
TAČR
Project code: TL03000365
Project name: Accessible historical sources. Making medieval written documents available in the form of a contextual database
Publication info
- BANKOVIČ, Mikuláš, Vít NOVOTNÝ and Petr SOJKA. Application of Super-Resolution Models in Optical Character Recognition of Czech Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 11-18. ISBN 978-80-263-1670-1.
- NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 29-39.
If you cite the dataset, please use this citation:
@inproceedings{DBLP:conf/raslan/NovotnySVH21,
  author    = {V{\'{\i}}t Novotn{\'{y}} and
               Krist{\'{y}}na Seidlov{\'{a}} and
               Tereza Vrabcov{\'{a}} and
               Ales Hor{\'{a}}k},
  editor    = {Ales Hor{\'{a}}k and
               Pavel Rychl{\'{y}} and
               Adam Rambousek},
  title     = {When Tesseract Brings Friends: Layout Analysis, Language Identification,
               and Super-Resolution in the Optical Character Recognition of Medieval Texts},
  booktitle = {The 15th Workshop on Recent Advances in Slavonic Natural Languages
               Processing, {RASLAN} 2021, Karlova Studanka, Czech Republic,
               December 10-12, 2021},
  pages     = {29--39},
  publisher = {Tribun {EU}},
  year      = {2021},
  url       = {http://nlp.fi.muni.cz/raslan/2021/paper10.pdf},
  timestamp = {Tue, 18 Jan 2022 17:52:53 +0100},
  biburl    = {https://dblp.org/rec/conf/raslan/NovotnySVH21.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
License
Public Domain Dedication (CC Zero)