Changes between version 17 and version 18 of OcrDataset


Timestamp:
25 November 2022, 13:05:15 (20 months ago)
Author:
xnovot32@fi.muni.cz
Comment:

--

  • OcrDataset

    v17 v18
    2   2   This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
    3   3
    4       You can [https://hdl.handle.net/11234/1-4615 download the dataset from 2021] and [https://nlp.fi.muni.cz/projects/ahisto/ocr-texts-supplementary.zip supplementary materials from 2022] in the LINDAT/CLARIAH-CZ repository.
        4   You can [https://hdl.handle.net/11234/1-4615 download the dataset from 2021] and [https://nlp.fi.muni.cz/projects/ahisto/ocr-texts-supplementary.zip supplementary materials from 2022] in the LINDAT/CLARIAH-CZ repository.
    5   5
    6   6   == Contents ==
    7   7   The dataset from 2021 is structured as follows:
    8   8
    9        * The archive `scanned-images.zip` contains 51,351 high-resolution scanned images.
    10       * The archive `ocr-texts.zip` contains 51,351 OCR texts in three formats:
        9    * The archive `scanned-images.zip` contains 51,351 high-resolution scanned images.
        10   * The archive `ocr-texts.zip` contains 51,351 OCR texts in three formats:
    11  11     1. HOCR documents from the Tesseract 4 OCR engine.
    12         1. JSON documents from the [https://cloud.google.com/vision Google Vision AI] OCR engine.
        12     1. JSON documents from the [https://cloud.google.com/vision Google Vision AI] OCR engine.
    13  13     1. TXT documents that combine Tesseract and Google outputs to achieve maximum accuracy on different types of layout.
    14       * The archive `annotations-ocr.zip` contains 120 annotations for the evaluation of OCR. The directory is divided into two subdirectories for the evaluation of layout analysis:
        14   * The archive `annotations-ocr.zip` contains 120 annotations for the evaluation of OCR. The directory is divided into two subdirectories for the evaluation of layout analysis:
    15  15     1. The subdirectory `with-columns` contains annotations for 17 multi-column pages.
    16  16     1. The subdirectory `without-columns` contains annotations for 103 single-column pages.
    17       * The archive `annotations-language-identification.zip` contains 122 annotations for the evaluation of language identification.
        17   * The archive `annotations-language-identification.zip` contains 122 annotations for the evaluation of language identification.
    18  18
    19  19  The supplementary materials from 2022 are structured as follows:
    20  20
    21       * The archive `ocr-texts-supplementary.zip` contains 110 OCR texts for which we have both high-resolution scanned images and also annotations for the evaluation of OCR.
    22         * The subdirectory `google-vision-ai-old` contains JSON and TXT documents from the Google Vision AI OCR engine from 2020-10-02.
    23         * The subdirectory `google-vision-ai` contains JSON and TXT documents from the Google Vision AI OCR engine from 2022-08-11.
    24         * The subdirectory `pero-demo` contains PAGE and TXT documents from [https://pero-ocr.fit.vutbr.cz/ the web demo of the PERO OCR engine].
    25         * The subdirectory `pero-github` contains PAGE and TXT documents from [https://github.com/DCGM/pero-ocr the open-source variant of the PERO OCR engine] using [https://www.fit.vut.cz/~ihradis/pero/pero_eu_cz_print_newspapers_2020-10-09.tar.gz public pretrained models].
    26         * The subdirectory `tesseract` contains HOCR and TXT documents from the Tesseract 4 OCR engine.
    27         * The subdirectory `tesseract-and-google-vision-ai-old` contains TXT documents that combine `tesseract` and `google-vision-ai-old` documents.
    28         * The subdirectory `tesseract-and-google-vision-ai` contains TXT documents that combine `tesseract` and `google-vision-ai` documents.
    29         * The subdirectory `tesseract-and-pero-github` contains TXT documents that combine `tesseract` and `pero-github` documents.[[BR]] 12] with pre-trained models ![3].
        21   * The archive `ocr-texts-supplementary.zip` contains 110 OCR texts for which we have both high-resolution scanned images and also annotations for the evaluation of OCR.
        22     * The subdirectory `google-vision-ai-old` contains JSON and TXT documents from the Google Vision AI OCR engine from 2020-10-02.
        23     * The subdirectory `google-vision-ai` contains JSON and TXT documents from the Google Vision AI OCR engine from 2022-08-11.
        24     * The subdirectory `pero-demo` contains PAGE and TXT documents from [https://pero-ocr.fit.vutbr.cz/ the web demo of the PERO OCR engine].
        25     * The subdirectory `pero-github` contains PAGE and TXT documents from [https://github.com/DCGM/pero-ocr the open-source variant of the PERO OCR engine] using [https://www.fit.vut.cz/~ihradis/pero/pero_eu_cz_print_newspapers_2020-10-09.tar.gz public pretrained models].
        26     * The subdirectory `tesseract` contains HOCR and TXT documents from the Tesseract 4 OCR engine.
        27     * The subdirectory `tesseract-and-google-vision-ai-old` contains TXT documents that combine `tesseract` and `google-vision-ai-old` documents.
        28     * The subdirectory `tesseract-and-google-vision-ai` contains TXT documents that combine `tesseract` and `google-vision-ai` documents.
        29     * The subdirectory `tesseract-and-pero-github` contains TXT documents that combine `tesseract` and `pero-github` documents.
    30  30
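As a rough sketch of how the subdirectory layout listed above might be checked after downloading `ocr-texts-supplementary.zip`, the following Python snippet counts archive members per subdirectory and file extension. The helper name `count_by_subdirectory`, the demo file names, and the `.xml` extension for PAGE documents are assumptions made for illustration; they do not come from the dataset description.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_by_subdirectory(archive) -> Counter:
    """Count files in a ZIP archive per (subdirectory, extension) pair."""
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            path = PurePosixPath(info.filename)
            # Assumption: the engine subdirectory is the file's direct parent.
            counts[(path.parent.name, path.suffix.lower())] += 1
    return counts


# Demo on a tiny in-memory archive; the file names are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("tesseract/page-1.hocr", "")
    zf.writestr("tesseract/page-1.txt", "")
    zf.writestr("pero-github/page-1.xml", "")
buffer.seek(0)

demo_counts = count_by_subdirectory(buffer)
for (subdir, ext), n in sorted(demo_counts.items()):
    print(subdir, ext, n)
```

To inspect the real archive, pass its local path instead of the in-memory buffer, for example `count_by_subdirectory("ocr-texts-supplementary.zip")`.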
    31  31  == Citing ==
    32  32  If you use our dataset in your work, please cite the following article:
    33  33
    34        Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''          Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021'' . pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
        34    Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''           Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021'' . pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
    35  35
    36  36  If you use LaTeX, you can use the following BibTeX entry: