
A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

Description

This is an open dataset of sentences drawn from 19th- and 20th-century letterpress reprints of documents from the Hussite era. It contains a corpus for language modeling and human annotations for named entity recognition (NER).

Example

Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER ,/O
který/O mu/O prokazoval/O věrné/O služby/O a/O kterého/O chce/O Zikmund/B-PER
touto/O odměnou/O povzbuditi/O k/O ještě/O usilovnější/O službě/O ,/O
vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC s/O alody/O ,/O
poplužími/O ,/O obdělávanými/O i/O neobdělávanými/O poli/O ,/O platy/O ,/O
službami/O ,/O robotami/O ,/O loukami/O ,/O pastvinami/O ,/O vodami/O ,/O
vodními/O toky/O ,/O mlýny/O ,/O všemi/O příjmy/O a/O vším/O
příslušenstvím/O ./O
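
The example above pairs each token with a BIO tag (B- opens an entity, I- continues it, O marks tokens outside any entity). As a minimal sketch of how such a line could be read back into entity spans, assuming the annotations are stored as whitespace-separated token/TAG items exactly as displayed (the page does not specify the on-disk file format), the helper names parse_tagged and extract_entities below are hypothetical and not part of the dataset:

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def parse_tagged(text: str) -> List[Pair]:
    """Split whitespace-separated token/TAG items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the last slash so a token that itself contains "/" stays whole.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs: List[Pair]) -> List[Pair]:
    """Group B-/I- tagged tokens into (entity text, entity type) spans."""
    entities: List[Pair] = []
    tokens: List[str] = []
    etype = ""
    for token, tag in pairs:
        if tag.startswith("I-") and tokens and tag[2:] == etype:
            tokens.append(token)  # continue the open entity
            continue
        if tokens:  # any other tag closes the open entity
            entities.append((" ".join(tokens), etype))
            tokens, etype = [], ""
        if tag.startswith("B-"):
            tokens, etype = [token], tag[2:]
        # "O" tags and stray "I-" tags without a matching "B-" are skipped.
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities


# Abridged from the example sentence above (assumed token/TAG layout).
sample = (
    "Král/B-PER Zikmund/I-PER dává/O Petrovi/B-PER z/I-PER Michalovic/I-PER ,/O "
    "vesnici/O Předměřice/B-LOC nad/I-LOC Jizerou/I-LOC s/O alody/O ./O"
)
for text, etype in extract_entities(parse_tagged(sample)):
    print(etype, text)
```

On the abridged sample, this groups "Král Zikmund" and "Petrovi z Michalovic" as PER and "Předměřice nad Jizerou" as LOC. Dropping stray I- tags is one possible design choice; a stricter reader might raise an error instead.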

More information about the database can be found on the AHISTO project page.

LINDAT handle

http://hdl.handle.net/11234/1-5024

Acknowledgements

If you use the dataset, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: http://hdl.handle.net/11234/1-5024.

Project code: LM2018101

Project name: LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities

TAČR

Project code: TL03000365

Project name: Accessible historical sources. Making medieval written documents available in the form of a contextual database

Publication info

  • BANKOVIČ, Mikuláš, Vít NOVOTNÝ and Petr SOJKA. Application of Super-Resolution Models in Optical Character Recognition of Czech Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 11-18. ISBN 978-80-263-1670-1.
  • NOVOTNÝ, Vít, Kristýna SEIDLOVÁ, Tereza VRABCOVÁ and Aleš HORÁK. When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In Horák, Rychlý, Rambousek (eds.). Recent Advances in Slavonic Natural Language Processing (RASLAN 2021). Brno: Tribun EU, 2021, pp. 29-39.

If you cite the dataset, please use this citation:

@inproceedings{DBLP:conf/raslan/NovotnySVH21,
  author       = {V{\'{\i}}t Novotn{\'{y}} and
                  Krist{\'{y}}na Seidlov{\'{a}} and
                  Tereza Vrabcov{\'{a}} and
                  Ales Hor{\'{a}}k},
  editor       = {Ales Hor{\'{a}}k and
                  Pavel Rychl{\'{y}} and
                  Adam Rambousek},
  title        = {When Tesseract Brings Friends: Layout Analysis, Language Identification,
                  and Super-Resolution in the Optical Character Recognition of Medieval
                  Texts},
  booktitle    = {The 15th Workshop on Recent Advances in Slavonic Natural Languages
                  Processing, {RASLAN} 2021, Karlova Studanka, Czech Republic, December
                  10-12, 2021},
  pages        = {29--39},
  publisher    = {Tribun {EU}},
  year         = {2021},
  url          = {http://nlp.fi.muni.cz/raslan/2021/paper10.pdf},
  timestamp    = {Tue, 18 Jan 2022 17:52:53 +0100},
  biburl       = {https://dblp.org/rec/conf/raslan/NovotnySVH21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

License

Public Domain Dedication (CC Zero)

Last modified on May 24, 2024, 11:30:41 PM