Version 2 (modified by xpopelk, 2 months ago)

Bulky

Description

Bulky is a list of 9109 Czech sentences in which interlingual homographs cause tagging problems. We observed that interlingual homographs, e.g., Czech-English homographs such as step, drop, barely, car, and copy, are often tagged incorrectly in Czech corpora. This subcorpus can serve as a test set for enhanced taggers.
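To illustrate what an interlingual homograph is, the following minimal sketch flags tokens in a Czech sentence that also appear in an English word list. It is not the authors' selection method: the function name `homograph_candidates` and the five-word `ENGLISH_WORDS` set (taken from the examples above) are hypothetical, and a real check would need a full English lexicon.

```python
import re

# Hypothetical, tiny English word list for illustration only;
# a realistic check would use a complete English lexicon.
ENGLISH_WORDS = {"step", "drop", "barely", "car", "copy"}


def homograph_candidates(sentence, english_words=ENGLISH_WORDS):
    """Return the tokens of a sentence that also occur in the English word list."""
    # \w+ matches Unicode letters in Python 3, so Czech diacritics are kept.
    tokens = re.findall(r"\w+", sentence.lower())
    return [token for token in tokens if token in english_words]


# "Car vydal nový zákon." means "The tsar issued a new law."
print(homograph_candidates("Car vydal nový zákon."))
```

Here "car" is a regular Czech noun (tsar), yet a tagger exposed to English material may treat it as a foreign word, which is the kind of error the corpus is meant to expose.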

More information about the corpus can be found in: PELIKÁNOVÁ, Zuzana and Zuzana NEVĚŘILOVÁ. Corpus Annotation Pipeline for Non-standard Texts. In P. Sojka, A. Horák, I. Kopeček, K. Pala (eds.). Text, Speech, and Dialogue, 21st International Conference, TSD 2018. Switzerland: Springer International Publishing, 2018, pp. 304-312. ISBN 978-3-030-00794-2. Available from: https://dx.doi.org/10.1007/978-3-030-00794-2_32.

LINDAT handle

http://hdl.handle.net/11234/1-2822

Acknowledgements

This software was developed within the projects LC536 and 2C06009 and is owned by Masaryk University, Faculty of Informatics, NLP Centre.

If you use the corpus, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: http://hdl.handle.net/11234/1-2822

Publication info

https://www.muni.cz/vyzkum/publikace/1471077

@InProceedings{10.1007/978-3-030-00794-2_32,
   author="Pelikánová, Zuzana and Nevěřilová, Zuzana",
   editor="Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel",
   title="Corpus Annotation Pipeline for Non-standard Texts",
   booktitle="Text, Speech, and Dialogue",
   year="2018",
   publisher="Springer International Publishing",
   pages="295--303",
   isbn="978-3-030-00794-2"
}

Licence

Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)