Bulky
Description
Bulky is a collection of 9109 Czech sentences in which interlingual homographs cause tagging problems. We observed that interlingual homographs, e.g. Czech-English homographs such as step, drop, barely, car, and copy, are often tagged incorrectly in Czech corpora. This subcorpus can serve as a test set for improved taggers.
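To illustrate what an interlingual homograph looks like in practice, the following minimal sketch flags tokens of a Czech sentence that also occur in an English word list. It is not the method used to build Bulky. The function name `find_homograph_candidates`, the tiny word list (only the five examples named above), and the sample sentence are assumptions made for illustration.

```python
import re

# Illustrative word list only: the five examples named in the description.
# A real check would load a full English lexicon instead (assumption).
ENGLISH_FORMS = {"step", "drop", "barely", "car", "copy"}


def find_homograph_candidates(sentence, english_forms=ENGLISH_FORMS):
    """Return the tokens whose lowercase form is also an English word form."""
    # \w+ matches Unicode letters, so Czech diacritics stay inside tokens.
    tokens = re.findall(r"\w+", sentence)
    return [t for t in tokens if t.lower() in english_forms]


# Hypothetical sample sentence: "The Russian tsar visited the steppe."
print(find_homograph_candidates("Ruský car navštívil step."))
# → ['car', 'step']
```

Tokens flagged this way are only candidates: whether a tagger actually mislabels them has to be checked against the tagger's output.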
More information about the corpus can be found in PELIKÁNOVÁ, Zuzana and Zuzana NEVĚŘILOVÁ. Corpus Annotation Pipeline for Non-standard Texts. In P. Sojka, A. Horák, I. Kopeček, K. Pala (eds.). Text, Speech, and Dialogue, 21st International Conference, TSD 2018. Switzerland: Springer International Publishing, 2018, pp. 304-312. ISBN 978-3-030-00794-2. Available at: https://dx.doi.org/10.1007/978-3-030-00794-2_32.
LINDAT handle
http://hdl.handle.net/11234/1-2822
Acknowledgements
This software was developed within the projects LC536 and 2C06009 and is owned by Masaryk University, Faculty of Informatics, NLP Centre.
If you use the system, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: http://hdl.handle.net/11234/1-2822
Publication info
https://www.muni.cz/vyzkum/publikace/1471077
@InProceedings{10.1007/978-3-030-00794-2_32,
  author="Pelikánová, Zuzana and Nevěřilová, Zuzana",
  editor="Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel",
  title="Corpus Annotation Pipeline for Non-standard Texts",
  booktitle="Text, Speech, and Dialogue",
  year="2018",
  publisher="Springer International Publishing",
  pages="295--303",
  isbn="978-3-030-00794-2"
}
Licence
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)