= A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents =

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for named entity recognition (NER).

You can [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip download the dataset] from the LINDAT/CLARIAH-CZ repository.

== Contents ==

The [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip dataset] (1.7 GB) is structured as follows:

 * 8 files named `dataset_mlm_`(cross page boundaries?)`_`(only relevant pages?)`.txt`.[[BR]]These files contain sentences for unsupervised training of language models.
 * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated (for source `regests`), machine-generated using information retrieval (for sources `fuzzy-regex` and `manatee`), or both.
 * 3 files named `dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx`, `.sentences.txt`, and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated.
 * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`_automatically_tagged.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are machine-generated using language models.

TODO: Describe filename variables.[[BR]]TODO: Describe TXT and DOCX formats.
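
Until the TXT format is documented, the following is only a minimal sketch of how a `.sentences.txt` / `.ner_tags.txt` pair might be read in Python. It assumes, without confirmation from this page, that both files are UTF-8 encoded, that each line holds one sentence, that tokens and tags are separated by whitespace, and that line ''n'' of the tags file has exactly one tag per token of line ''n'' of the sentences file. The function name `read_ner_pairs` is hypothetical and not part of the dataset.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_ner_pairs(basename: str, root: Path = Path(".")) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (token, tag) pairs per sentence.

    ASSUMPTIONS (not confirmed by the dataset description):
    one sentence per line, whitespace-separated tokens and tags,
    and line-by-line alignment between the two files.
    """
    sentences_path = root / f"{basename}.sentences.txt"
    tags_path = root / f"{basename}.ner_tags.txt"

    with sentences_path.open(encoding="utf-8") as sentences, \
            tags_path.open(encoding="utf-8") as tags:
        # zip() stops at the shorter file; extra trailing lines are ignored.
        for line_number, (sentence, tag_line) in enumerate(zip(sentences, tags), start=1):
            tokens = sentence.split()
            labels = tag_line.split()
            if len(tokens) != len(labels):
                # Fail loudly instead of silently misaligning tokens and tags.
                raise ValueError(
                    f"line {line_number}: {len(tokens)} tokens but {len(labels)} tags"
                )
            yield list(zip(tokens, labels))
```

For example, `list(read_ner_pairs("dataset_ner_manatee_non-crossing_only-relevant_testing_401-500"))` would load the human-annotated file pair listed above, provided the unpacked files are in the current directory. If the length check raises an error, the alignment assumption does not hold for that file and the sketch needs to be adapted.
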
== Citing ==

If you use our dataset in your work, please cite the following article:

TODO

If you use LaTeX, you can use the following BibTeX entry:

{{{
TODO
}}}

== Acknowledgements ==

This work was funded by TAČR Éta, [https://starfos.tacr.cz/en/project/TL03000365 project number TL03000365].