Changes between version 25 and version 26 of OcrDataset

Timestamp:: 30 November 2022, 15:39:41 (3 years ago)
Author:: xnovot32@fi.muni.cz
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

OcrDataset

-                      v25
+                      v26
 This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
-You can [https://hdl.handle.net/11234/1-4615 download the dataset from 2021] and [https://nlp.fi.muni.cz/projects/ahisto/ocr-texts-supplementary.zip supplementary materials from 2022] in the LINDAT/CLARIAH-CZ repository.
+You can [https://hdl.handle.net/11234/1-4615 download the dataset from 2021] and [http://hdl.handle.net/11234/1-4935 supplementary materials from 2022] in the LINDAT/CLARIAH-CZ repository.
 == Contents ==
 …
  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4615/annotations-language-identification.zip?sequence=3&isAllowed=y annotations-language-identification.zip] (1.1 MB) contains 122 annotations for the evaluation of language identification.
-[https://nlp.fi.muni.cz/projects/ahisto/ocr-texts-supplementary.zip The supplementary materials from 2022] are structured as follows:
+[http://hdl.handle.net/11234/1-4935 The supplementary materials from 2022] are structured as follows:
  * The archive [https://nlp.fi.muni.cz/projects/ahisto/ocr-texts-supplementary.zip ocr-texts-supplementary.zip] (23.26 MB) contains 110 OCR texts for which we have both high-resolution scanned images and annotations for OCR evaluation.[[BR]]The archive is divided into a number of subdirectories with outputs of different OCR engines:
 …
 If you use our dataset in your work, please cite the following articles:
-  Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''                    Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021''      . pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
+  Novotný, V., Seidlová, K., Vrabcová, T., Horák, A.: When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''                     Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2021''       . pp. 91–100. ISSN 2336-4289. ISBN 978-80-263-1600-8. Tribun EU (2021). Available also from WWW: https://nlp.fi.muni.cz/raslan/2021/paper10.pdf
-  Novotný, V., Horák, A.: When Tesseract Meets PERO: Open-Source Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''      Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2022''      . pp. 157–160. ISSN 2336-4289. ISBN 978-80-263-1752-4. Tribun EU (2022). Available also from WWW: https://nlp.fi.muni.cz/raslan/2022/paper12.pdf
+  Novotný, V., Horák, A.: When Tesseract Meets PERO: Open-Source Optical Character Recognition of Medieval Texts. In: Horák, A., Rychlý, P., Rambousek, A. (eds.) ''       Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2022''       . pp. 157–160. ISSN 2336-4289. ISBN 978-80-263-1752-4. Tribun EU (2022). Available also from WWW: https://nlp.fi.muni.cz/raslan/2022/paper12.pdf
 If you use LaTeX, you can use the following BibTeX entries: