Version 6 (modified by xnovot32@fi.muni.cz, 3 years ago)

--

A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents

This is an open dataset of scanned images and OCR texts from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.

Contents

The dataset is structured as follows:

  • The directory scanned-images contains 51,351 high-resolution scanned images.
  • The directory ocr-texts contains 51,351 OCR texts in three formats:
    1. HOCR documents from the Tesseract 4 OCR engine.
    2. JSON documents from the Google Vision AI OCR engine.
    3. TXT documents that combine Tesseract and Google outputs to achieve maximum accuracy on different types of layout.
  • The directory annotations-ocr contains 120 annotations for the evaluation of OCR. The directory is divided into two subdirectories for the evaluation of layout analysis:
    1. The subdirectory with-columns contains annotations for 17 multi-column pages.
    2. The subdirectory without-columns contains annotations for 103 single-column pages.
  • The directory annotations-language-identification contains 122 annotations for the evaluation of language identification.
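
As a quick sanity check after downloading, the layout described above can be compared against a local copy. The following is a minimal sketch, not part of the dataset: it assumes that each scanned image and each annotation is stored as exactly one file, which this page does not state, and the function names are hypothetical. The directory ocr-texts is left out because the page does not say how the three formats are arranged on disk.

```python
from pathlib import Path

# Counts stated on this page. The directory names come from the page;
# the one-file-per-item layout inside each directory is an ASSUMPTION.
EXPECTED_COUNTS = {
    "scanned-images": 51351,
    "annotations-ocr/with-columns": 17,
    "annotations-ocr/without-columns": 103,
    "annotations-language-identification": 122,
}


def count_files(directory: Path) -> int:
    """Count regular files below `directory`, recursively (0 if it is missing)."""
    if not directory.is_dir():
        return 0
    return sum(1 for path in directory.rglob("*") if path.is_file())


def check_layout(root: Path) -> dict[str, tuple[int, int]]:
    """Return {relative directory: (expected, found)} for every mismatch."""
    mismatches = {}
    for relative, expected in EXPECTED_COUNTS.items():
        found = count_files(root / relative)
        if found != expected:
            mismatches[relative] = (expected, found)
    return mismatches
```

Calling `check_layout(Path("path/to/dataset"))`, where the path is a placeholder for the local copy, returns an empty dictionary when all counts match. A non-empty result may only mean that the one-file-per-item assumption does not hold, not that the download is incomplete.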

Citing

If you use our dataset in your work, please cite the following article:

Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021. pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Also available at https://nlp.fi.muni.cz/raslan/2021/paper10.pdf .

If you use LaTeX, you can use the following BibTeX entry:

@inproceedings{novotny2020when,
  title = {When Tesseract Brings Friends: Layout Analysis, Language
           Identification, and Super-Resolution in the Optical Character
           Recognition of Medieval Texts},
  author = {Vít Novotný and Kristýna Seidlová and Tereza Vrabcová and
            Aleš Horák},
  editor = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
  booktitle = {Proceedings of Recent Advances in Slavonic Natural
               Language Processing, {RASLAN} 2021},
  publisher = {Tribun {EU}},
  pages = {91--100},
  year = {2021},
  issn = {2336-4289},
  isbn = {978-80-263-1600-8},
  url = {https://nlp.fi.muni.cz/raslan/2021/paper10.pdf},
}

Acknowledgements

This work was funded by TAČR Éta, project number TL03000365.