Changes between Version 11 and Version 12 of OcrDataset


Timestamp:
10 December 2021, 13:47:21 (3 years ago)
Author:
xnovot32@fi.muni.cz
Comment:

Publish at LINDAT

  • OcrDataset

    v11  v12
    1    1    = A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents =
    2         This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification. You can [https://nlp.fi.muni.cz/projekty/ahisto/dataset.zip download the dataset here].
         2    This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification. You can [http://hdl.handle.net/11234/1-4615 download the dataset in the LINDAT/CLARIAH-CZ Repository].
    3    3
    4    4    == Contents ==
    5    5    The dataset is structured as follows:
    6    6
    7          * The directory `scanned-images` contains 51,351 high-resolution scanned images.
    8          * The directory `ocr-texts` contains 51,351 OCR texts in three formats:
         7     * The archive `scanned-images.zip` contains 51,351 high-resolution scanned images.
         8     * The archive `ocr-texts.zip` contains 51,351 OCR texts in three formats:
    9    9       1. HOCR documents from the Tesseract 4 OCR engine.
    10   10      1. JSON documents from the Google Vision AI OCR engine.
    11   11      1. TXT documents that combine Tesseract and Google outputs to achieve maximum accuracy on different types of layout.
    12         * The directory `annotations-ocr` contains 120 annotations for the evaluation of OCR. The directory is divided into two subdirectories for the evaluation of layout analysis:
         12    * The archive `annotations-ocr.zip` contains 120 annotations for the evaluation of OCR. The directory is divided into two subdirectories for the evaluation of layout analysis:
    13   13      1. The subdirectory `with-columns` contains annotations for 17 multi-column pages.
    14   14      1. The subdirectory `without-columns` contains annotations for 103 single-column pages.
    15         * The directory `annotations-language-identification` contains 122 annotations for the evaluation of language identification.
         15    * The archive `annotations-language-identification.zip` contains 122 annotations for the evaluation of language identification.
    16   16
    17   17   == Citing ==
    18   18   If you use our dataset in your work, please cite the following article:
    19   19
    20          Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''        Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021''. pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
         20     Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''         Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021'' . pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
    21   21
    22   22   If you use LaTeX, you can use the following BibTeX entry: