
Changes between version 19 and version 20 of OcrDataset

Timestamp: 28 Nov 2022, 13:13:12 (3 years ago)
Author: xnovot32@fi.muni.cz
Comment: --

OcrDataset

-                      v19
+                      v20
 The dataset from 2021 is structured as follows:
- * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4615/scanned-images.zip?sequence=7&isAllowed=y scanned-images.zip] (47.13 GB) contains 51,351 high-resolution scanned images.
- * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4615/ocr-texts.zip?sequence=5&isAllowed=y ocr-texts.zip] (5.09 GB) contains 51,351 OCR texts in three formats:
+ * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4615/scanned-images.zip?sequence=7&isAllowed=y scanned-images.zip] (47.13 GB) contains 51,351 high-resolution scanned images.
+ * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4615/ocr-texts.zip?sequence=5&isAllowed=y ocr-texts.zip] (5.09 GB) contains 51,351 OCR texts in three formats:
 . HOCR documents from the Tesseract 4 OCR engine.
 . JSON documents from the [https://cloud.google.com/vision Google Vision AI] OCR engine.
 . TXT documents that combine Tesseract and Google outputs to achieve maximum accuracy on different types of layout.
- * The archive [https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-4615#file_file_7686 annotations-ocr.zip] (178.62 KB) contains 120 annotations for the evaluation of OCR.[[BR]]The archive is divided into two subdirectories for the evaluation of layout analysis:
+ * The archive [https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-4615#file_file_7686 annotations-ocr.zip] (178.62 KB) contains 120 annotations for the evaluation of OCR.[[BR]]The archive is divided into two subdirectories for the evaluation of layout analysis:
 . The subdirectory `with-columns` contains annotations for 17 multi-column pages.
 . The subdirectory `without-columns` contains annotations for 103 single-column pages.
- * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4615/annotations-language-identification.zip?sequence=3&isAllowed=y annotations-language-identification.zip] (1.1 MB) contains 122 annotations for the evaluation of language identification.
+ * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4615/annotations-language-identification.zip?sequence=3&isAllowed=y annotations-language-identification.zip] (1.1 MB) contains 122 annotations for the evaluation of language identification.
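The HOCR documents listed above are HTML files in which Tesseract wraps each recognized word in an element of class `ocrx_word`. As a minimal sketch of how such a document might be reduced to a word list, the following uses only the Python standard library. The names `HocrWordExtractor` and `hocr_words` are hypothetical helpers introduced for illustration, and the sample markup is made up rather than taken from the dataset; the layout of files inside the archives is not assumed here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; ignored for depth tracking.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HocrWordExtractor(HTMLParser):
    """Collect the text of every element whose class includes `ocrx_word`."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []    # finished words, in document order
        self._depth = 0    # > 0 while inside an ocrx_word element
        self._buffer = []  # text fragments of the current word

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            # Nested markup (e.g. <em>) inside a word element.
            self._depth += 1
        elif "ocrx_word" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            word = "".join(self._buffer).strip()
            if word:
                self.words.append(word)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def hocr_words(hocr_text):
    """Return the recognized words of one hOCR document as a list of strings."""
    parser = HocrWordExtractor()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Made-up sample, not taken from the dataset.
sample = (
    "<div class='ocr_page'><span class='ocr_line'>"
    "<span class='ocrx_word' title='bbox 10 10 50 30; x_wconf 93'>Anno</span> "
    "<span class='ocrx_word' title='bbox 60 10 120 30; x_wconf 88'><em>domini</em></span>"
    "</span></div>"
)
print(hocr_words(sample))
```

The same function could be applied to the contents of each HOCR file after extracting the archive; bounding boxes and confidences in the `title` attribute are ignored in this sketch.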
 The supplementary materials from 2022 are structured as follows:
- * The archive [https://nlp.fi.muni.cz/projects/ahisto/ocr-texts-supplementary.zip ocr-texts-supplementary.zip] (24.39 MB) contains 110 OCR texts for which we have both high-resolution scanned images and also annotations for the evaluation of OCR.[[BR]]The archive is divided into a number of subdirectories with outputs of different OCR engines:
+ * The archive [https://nlp.fi.muni.cz/projects/ahisto/ocr-texts-supplementary.zip ocr-texts-supplementary.zip] (24.39 MB) contains 110 OCR texts for which we have both high-resolution scanned images and annotations for OCR evaluation.[[BR]]The archive is divided into a number of subdirectories with outputs of different OCR engines:
    * The subdirectory `google-vision-ai-old` contains JSON and TXT documents from the Google Vision AI OCR engine from 2020-10-02.
    * The subdirectory `google-vision-ai` contains JSON and TXT documents from the Google Vision AI OCR engine from 2022-08-11.
 …
 If you use our dataset in your work, please cite the following article:
-  Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021''. pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
+  Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021''. pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
 If you use LaTeX, you can use the following BibTeX entry: