Bulky

Description

Bulky is a list of 9109 Czech sentences in which interlingual homographs cause tagging problems. We observed that interlingual homographs, for example Czech-English homographs such as step, drop, barely, car, and copy, are often tagged incorrectly in Czech corpora. This subcorpus can serve as a test set for enhanced taggers.
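To illustrate what counts as an interlingual homograph here, the following minimal Python sketch flags tokens in a Czech sentence that coincide with English word forms. It is only an illustration under stated assumptions: the word set contains just the five examples named above, the function name find_homographs and the sample sentence are hypothetical, and the sketch does not reflect how the corpus was actually built.

```python
import re

# Only the five homographs named in the description; an illustrative subset.
HOMOGRAPHS = {"step", "drop", "barely", "car", "copy"}


def find_homographs(sentence, homographs=HOMOGRAPHS):
    """Return (token_index, token) pairs for tokens found in `homographs`.

    Tokenization is a naive split on word characters and matching is
    case-insensitive; a real pipeline would use a proper Czech tokenizer.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    return [(i, tok) for i, tok in enumerate(tokens) if tok in homographs]


# Hypothetical example sentence, not taken from the corpus.
# In Czech, "copy" here is the plural of "cop" (braid).
print(find_homographs("Dívka měla dlouhé copy."))
```

Sentences flagged this way are candidates for checking whether a tagger assigns the Czech reading to the homograph rather than treating it as a foreign word.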

More information about the corpus can be found in PELIKÁNOVÁ, Zuzana and Zuzana NEVĚŘILOVÁ. Corpus Annotation Pipeline for Non-standard Texts. In P. Sojka, A. Horák, I. Kopeček, K. Pala (eds.). Text, Speech, and Dialogue, 21st International Conference, TSD 2018. Switzerland: Springer International Publishing, 2018, pp. 304-312. ISBN 978-3-030-00794-2. Available from: https://dx.doi.org/10.1007/978-3-030-00794-2_32.

LINDAT handle

http://hdl.handle.net/11234/1-2822

Acknowledgements

If you use the corpus, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: http://hdl.handle.net/11234/1-2822

Publication info

https://www.muni.cz/vyzkum/publikace/1471077

@InProceedings{10.1007/978-3-030-00794-2_32,
   author="Pelikánová, Zuzana and Nevěřilová, Zuzana",
   editor="Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel",
   title="Corpus Annotation Pipeline for Non-standard Texts",
   booktitle="Text, Speech, and Dialogue",
   year="2018",
   publisher="Springer International Publishing",
   pages="295--303",
   isbn="978-3-030-00794-2"
}

License

Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Last modified on May 20, 2024, 6:13:17 PM