A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents
This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
You can download the 2021 dataset and the 2022 supplementary materials from the LINDAT/CLARIAH-CZ repository.
Contents
The dataset from 2021 is structured as follows:
- The archive scanned-images.zip (47.13 GB) contains 51,351 high-resolution scanned images.
- The archive ocr-texts.zip (5.09 GB) contains 51,351 OCR texts in three formats:
  - HOCR documents from the Tesseract 4 OCR engine.
  - JSON documents from the Google Vision AI OCR engine.
  - TXT documents that combine Tesseract and Google outputs to achieve maximum accuracy on different types of layout.
- The archive annotations-ocr.zip (178.62 KB) contains 120 annotations for the evaluation of OCR.
  The archive is divided into two subdirectories for the evaluation of layout analysis:
  - The subdirectory with-columns contains annotations for 17 multi-column pages.
  - The subdirectory without-columns contains annotations for 103 single-column pages.
- The archive annotations-language-identification.zip (1.1 MB) contains 122 annotations for the evaluation of language identification.
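The HOCR documents listed above are HTML-based, so their recognized words can be pulled out with the Python standard library alone. The following is a minimal sketch, assuming that words are marked with the class `ocrx_word`, as the hOCR format specifies and Tesseract emits. The class `HocrWordExtractor`, the function `hocr_words`, and the sample markup are hypothetical illustrations and are not part of the dataset.

```python
from html.parser import HTMLParser


class HocrWordExtractor(HTMLParser):
    """Collect the text of every element whose class list includes ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0    # > 0 while the parser is inside an ocrx_word element
        self._buffer = []  # text fragments of the word currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Nested markup inside a word, such as <em> or <strong>.
            self._depth += 1
        elif "ocrx_word" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                word = "".join(self._buffer).strip()
                if word:
                    self.words.append(word)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def hocr_words(hocr_markup):
    """Return the recognized words of an hOCR document in reading order."""
    parser = HocrWordExtractor()
    parser.feed(hocr_markup)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment; real documents come from ocr-texts.zip.
sample = (
    "<div class='ocr_page'><span class='ocr_line'>"
    "<span class='ocrx_word' title='bbox 10 10 50 30; x_wconf 93'>Jan</span> "
    "<span class='ocrx_word' title='bbox 60 10 110 30; x_wconf 88'><em>Hus</em></span>"
    "</span></div>"
)
print(" ".join(hocr_words(sample)))
```

The sketch discards bounding boxes and confidences stored in the `title` attributes. Keeping them would be the natural next step for layout analysis.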
The supplementary materials from 2022 are structured as follows:
- The archive ocr-texts-supplementary.zip (24.39 MB) contains 110 OCR texts for which we have both high-resolution scanned images and annotations for OCR evaluation.
  The archive is divided into a number of subdirectories with outputs of different OCR engines:
  - The subdirectory google-vision-ai-old contains JSON and TXT documents from the Google Vision AI OCR engine from 2020-10-02.
  - The subdirectory google-vision-ai contains JSON and TXT documents from the Google Vision AI OCR engine from 2022-08-11.
  - The subdirectory pero-demo contains PAGE and TXT documents from the web demo of the PERO OCR engine.
  - The subdirectory pero-github contains PAGE and TXT documents from the open-source variant of the PERO OCR engine using public pretrained models.
  - The subdirectory tesseract contains HOCR and TXT documents from the Tesseract 4 OCR engine.
  - The subdirectory tesseract-and-google-vision-ai-old contains TXT documents that combine tesseract and google-vision-ai-old documents.
  - The subdirectory tesseract-and-google-vision-ai contains TXT documents that combine tesseract and google-vision-ai documents.
  - The subdirectory tesseract-and-pero-github contains TXT documents that combine tesseract and pero-github documents.
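To check a downloaded copy of the supplementary archive against the description above, one can count its files per engine subdirectory. The following is a minimal sketch using only the Python standard library. The subdirectory names come from the list above. The function `count_by_engine`, the file names in the demonstration, and the assumption that each engine name appears as a directory component of the member paths are illustrative and are not confirmed by the dataset documentation.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Subdirectory names as listed in the description of the supplementary materials.
ENGINE_DIRS = {
    "google-vision-ai-old",
    "google-vision-ai",
    "pero-demo",
    "pero-github",
    "tesseract",
    "tesseract-and-google-vision-ai-old",
    "tesseract-and-google-vision-ai",
    "tesseract-and-pero-github",
}


def count_by_engine(archive):
    """Count archive members per (engine subdirectory, file suffix) pair.

    `archive` is a path or a binary file-like object holding a ZIP archive.
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = PurePosixPath(info.filename)
            # Assumption: the engine name is one of the directory components.
            engine = next(
                (part for part in path.parts[:-1] if part in ENGINE_DIRS), None
            )
            if engine is not None:
                counts[(engine, path.suffix.lower())] += 1
    return counts


# Demonstration on a small in-memory archive with hypothetical file names.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("tesseract/page-1.hocr", "")
    zf.writestr("tesseract/page-1.txt", "")
    zf.writestr("pero-github/page-1.txt", "")
demo.seek(0)
demo_counts = count_by_engine(demo)
for (engine, suffix), n in sorted(demo_counts.items()):
    print(engine, suffix, n)
```

For the real archive, pass the local path of ocr-texts-supplementary.zip instead of the in-memory buffer.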
Citing
If you use our dataset in your work, please cite the following articles:
Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021. pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Also available at: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
Novotný, V., Horák, A.: When Tesseract Meets PERO: Open-Source Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2022. pp. 157–160. ISSN 2336-4289. ISBN 978-80-263-1752-4. Tribun EU (2022). Also available at: https://nlp.fi.muni.cz/raslan/2022/paper12.pdf
If you use LaTeX, you can use the following BibTeX entries:
@inproceedings{novotny2021when,
  title     = {When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts},
  author    = {Vít Novotný and Kristýna Seidlová and Tereza Vrabcová and Aleš Horák},
  editor    = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
  booktitle = {Proceedings of Recent Advances in Slavonic Natural Language Processing, {RASLAN} 2021},
  publisher = {Tribun {EU}},
  pages     = {91--100},
  year      = {2021},
  issn      = {2336-4289},
  isbn      = {978-80-263-1600-8},
  url       = {https://nlp.fi.muni.cz/raslan/2021/paper10.pdf},
}

@inproceedings{novotny2022when,
  title     = {When Tesseract Meets {PERO}: Open-Source Optical Character Recognition of Medieval Texts},
  author    = {Vít Novotný and Aleš Horák},
  editor    = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
  booktitle = {Proceedings of Recent Advances in Slavonic Natural Language Processing, {RASLAN} 2022},
  publisher = {Tribun {EU}},
  pages     = {157--160},
  year      = {2022},
  issn      = {2336-4289},
  isbn      = {978-80-263-1752-4},
  url       = {https://nlp.fi.muni.cz/raslan/2022/paper12.pdf},
}
Acknowledgements
This work was funded by TAČR Éta, project number TL03000365.