A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents
This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for named entity recognition (NER).
You can download the dataset from the LINDAT/CLARIAH-CZ repository.
Contents
The dataset (1.7 GB) is structured as follows:
- 8 files named dataset_mlm_(cross page boundaries?)_(only relevant pages?).txt.
  These files contain sentences for unsupervised training of language models.
- 17 pairs of files named dataset_ner_(source)_(cross page boundaries?)_(only relevant pages?)_(split).sentences.txt and .ner_tags.txt.
  These files contain sentences and NER tags for supervised training of language models.
  The NER tags are human-annotated (for source regests), machine-generated using information retrieval (for sources fuzzy-regex and manatee), or both.
- 3 files named dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx, .sentences.txt, and .ner_tags.txt.
  These files contain sentences and NER tags for supervised training of language models.
  The NER tags are human-annotated.
- 17 pairs of files named dataset_ner_(source)_(cross page boundaries?)_(only relevant pages?)_(split)_automatically_tagged.sentences.txt and .ner_tags.txt.
  These files contain sentences and NER tags for supervised training of language models.
  The NER tags are machine-generated using language models.
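The four groups of files listed above can be told apart by filename alone. The following is a minimal sketch in Python, assuming that the downloaded archive has been unpacked into a single flat directory (an assumption, not something stated here). The function name inventory and the group labels are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import Dict, List


def inventory(dataset_dir: Path) -> Dict[str, List[Path]]:
    """Group dataset files by the naming scheme described in the list above.

    ASSUMPTION: all files sit directly in `dataset_dir` (flat layout).
    The group labels used as dictionary keys are hypothetical.
    """
    groups: Dict[str, List[Path]] = {
        "mlm": [],                       # dataset_mlm_*.txt
        "ner": [],                       # dataset_ner_*.sentences.txt / .ner_tags.txt
        "ner_tagged": [],                # dataset_ner_*_tagged.{docx,sentences.txt,ner_tags.txt}
        "ner_automatically_tagged": [],  # dataset_ner_*_automatically_tagged.*
    }
    for path in sorted(dataset_dir.iterdir()):
        name = path.name
        if name.startswith("dataset_mlm_") and name.endswith(".txt"):
            groups["mlm"].append(path)
        elif name.startswith("dataset_ner_"):
            # Check the longer marker first: "_automatically_tagged."
            # also contains "_tagged.".
            if "_automatically_tagged." in name:
                groups["ner_automatically_tagged"].append(path)
            elif "_tagged." in name:
                groups["ner_tagged"].append(path)
            else:
                groups["ner"].append(path)
    return groups
```

If the flat-layout assumption holds, the group sizes should match the counts given above: 8 files, 17 pairs (34 files), 3 files, and 17 pairs (34 files).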
TODO: Describe filename variables.
TODO: Describe TXT and DOCX formats.
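Since the TXT format is not yet described here, the following Python sketch for reading a .sentences.txt and .ner_tags.txt pair rests on assumptions that should be checked against the actual files: that both files are UTF-8 encoded, that line i of one file corresponds to line i of the other, and that tokens and tags are separated by whitespace. The function names tags_path_for and read_ner_pairs are hypothetical.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, List, Tuple

SENTENCES_SUFFIX = ".sentences.txt"
TAGS_SUFFIX = ".ner_tags.txt"


def tags_path_for(sentences_path: Path) -> Path:
    """Derive the .ner_tags.txt path that pairs with a .sentences.txt path."""
    name = sentences_path.name
    if not name.endswith(SENTENCES_SUFFIX):
        raise ValueError(f"not a sentences file: {name}")
    return sentences_path.with_name(name[: -len(SENTENCES_SUFFIX)] + TAGS_SUFFIX)


def read_ner_pairs(sentences_path: Path) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) for each sentence in a sentences/tags file pair.

    ASSUMPTIONS (not confirmed by the dataset description): UTF-8 text,
    one sentence per line, line-aligned files, whitespace-separated
    tokens and tags.
    """
    tags_path = tags_path_for(sentences_path)
    with open(sentences_path, encoding="utf-8") as sentences, \
            open(tags_path, encoding="utf-8") as tags:
        lines = zip_longest(sentences, tags)
        for number, (sentence, tag_line) in enumerate(lines, start=1):
            # zip_longest pads the shorter file with None, which exposes
            # a mismatch in the number of lines instead of hiding it.
            if sentence is None or tag_line is None:
                raise ValueError(f"files differ in length at line {number}")
            tokens, labels = sentence.split(), tag_line.split()
            if len(tokens) != len(labels):
                raise ValueError(f"token/tag count mismatch at line {number}")
            yield tokens, labels
```

The explicit length checks are deliberate: if any of the assumptions above is wrong, the reader fails on the first offending line rather than silently producing misaligned training data.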
Citing
If you use our dataset in your work, please cite the following article:
TODO
If you use LaTeX, you can use the following BibTeX entry:
TODO
Acknowledgements
This work was funded by TAČR Éta, project number TL03000365.