= Bulky

== Description

Bulky is a list of 9109 Czech sentences in which interlingual homographs cause tagging problems. We observed that interlingual homographs, e.g., the Czech-English homographs ''step'', ''drop'', ''barely'', ''car'', and ''copy'', are often tagged incorrectly in Czech corpora. This subcorpus can serve as a test set for improved taggers.
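One way such a test could look is sketched below: tag accuracy is measured only on the homograph tokens. This is a minimal sketch under stated assumptions, not the authors' evaluation procedure. The data layout (sentences as lists of token/gold-tag pairs), the simplified tag labels, and the names `homograph_accuracy` and `naive_tagger` are hypothetical; the actual file format of Bulky is not described here.

{{{
#!python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical layout: a sentence is a list of (token, gold_tag) pairs.
Sentence = List[Tuple[str, str]]

# Example homographs taken from the description above.
HOMOGRAPHS: Set[str] = {"step", "drop", "barely", "car", "copy"}


def homograph_accuracy(
    sentences: Iterable[Sentence],
    tagger: Callable[[List[str]], List[str]],
    homographs: Set[str] = HOMOGRAPHS,
) -> float:
    """Share of homograph tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for sentence in sentences:
        tokens = [token for token, _ in sentence]
        predicted = tagger(tokens)  # one predicted tag per token
        for (token, gold), pred in zip(sentence, predicted):
            if token.lower() in homographs:
                total += 1
                if pred == gold:
                    correct += 1
    # No homograph tokens at all: report 0.0 instead of dividing by zero.
    return correct / total if total else 0.0


def naive_tagger(tokens: List[str]) -> List[str]:
    """Stand-in tagger (an assumption) that labels every token as NOUN."""
    return ["NOUN" for _ in tokens]


# Toy data with simplified, hypothetical tags.
sample = [
    [("Tančí", "VERB"), ("step", "NOUN")],
    [("Car", "NOUN"), ("vládl", "VERB")],
]
print(homograph_accuracy(sample, naive_tagger))
}}}

Restricting the score to homograph tokens keeps the many unambiguous words in each sentence from masking errors on the words the subcorpus was built to probe.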

More information about the corpus can be found in ''PELIKÁNOVÁ, Zuzana and Zuzana NEVĚŘILOVÁ. Corpus Annotation Pipeline for Non-standard Texts. In P. Sojka, A. Horák, I. Kopeček, K. Pala (eds.). Text, Speech, and Dialogue, 21st International Conference, TSD 2018. Switzerland: Springer International Publishing, 2018, pp. 304-312. ISBN 978-3-030-00794-2. Available from: https://dx.doi.org/10.1007/978-3-030-00794-2_32''.

== LINDAT handle

http://hdl.handle.net/11234/1-2822

== Acknowledgements

If you use the corpus, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: http://hdl.handle.net/11234/1-2822

== Publication info

https://www.muni.cz/vyzkum/publikace/1471077

{{{
@InProceedings{10.1007/978-3-030-00794-2_32,
   author="Pelik{\'a}nov{\'a}, Zuzana and Nev{\v{e}}{\v{r}}ilov{\'a}, Zuzana",
   editor="Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel",
   title="Corpus Annotation Pipeline for Non-standard Texts",
   booktitle="Text, Speech, and Dialogue",
   year="2018",
   publisher="Springer International Publishing",
   pages="295--303",
   isbn="978-3-030-00794-2",
   doi="10.1007/978-3-030-00794-2_32"
}
}}}

== License

Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)