Version 4 (modified by xnovot32@fi.muni.cz, 3 years ago)

A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for named entity recognition (NER).

You can download the dataset from the LINDAT/CLARIAH-CZ repository.

Contents

The dataset (1.7 GB) is structured as follows:

  • 8 files named dataset_mlm_(cross page boundaries?)_(only relevant pages?).txt.
    These files contain sentences for unsupervised training of language models.
  • 17 pairs of files named dataset_ner_(source)_(cross page boundaries?)_(only relevant pages?)_(split).sentences.txt and .ner_tags.txt.
    These files contain sentences and NER tags for supervised training of language models.
    The NER tags are human-annotated (for the source regests), machine-generated using information retrieval (for the sources fuzzy-regex and manatee), or both.
  • 3 files named dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx, .sentences.txt, and .ner_tags.txt.
    These files contain sentences and NER tags for supervised training of language models.
    The NER tags are human-annotated.
  • 17 pairs of files named dataset_ner_(source)_(cross page boundaries?)_(only relevant pages?)_(split)_automatically_tagged.sentences.txt and .ner_tags.txt.
    These files contain sentences and NER tags for supervised training of language models.
    The NER tags are machine-generated using language models.
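The paired .sentences.txt and .ner_tags.txt files could be read together along the following lines. This is only a minimal sketch: the TXT format is not documented here yet, so the code assumes that line i of the sentences file corresponds to line i of the tags file and that both lines are whitespace-separated with one tag per token. The function name read_ner_pairs is hypothetical and not part of the dataset.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, List, Tuple


def read_ner_pairs(
    sentences_path: Path, tags_path: Path
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) for each pair of aligned lines.

    ASSUMPTION (not confirmed by the dataset description): both files are
    line-aligned, tokens and tags are separated by whitespace, and every
    token has exactly one tag.
    """
    with open(sentences_path, encoding="utf-8") as sentences_file, \
            open(tags_path, encoding="utf-8") as tags_file:
        aligned_lines = zip_longest(sentences_file, tags_file)
        for line_number, (sentence, tags) in enumerate(aligned_lines, start=1):
            # zip_longest pads the shorter file with None, which exposes
            # files that differ in their number of lines.
            if sentence is None or tags is None:
                raise ValueError(
                    f"line {line_number}: files differ in number of lines"
                )
            tokens = sentence.split()
            labels = tags.split()
            if len(tokens) != len(labels):
                raise ValueError(
                    f"line {line_number}: {len(tokens)} tokens "
                    f"but {len(labels)} tags"
                )
            yield tokens, labels
```

Pass the paths of one matching pair of files to the function. If the assumed layout does not match the actual files, the length checks raise a ValueError instead of silently producing misaligned training data.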

TODO: Describe filename variables.
TODO: Describe TXT and DOCX formats.

Citing

If you use our dataset in your work, please cite the following article:

TODO

If you use LaTeX, you can use the following BibTeX entry:

TODO

Acknowledgements

This work was funded by TAČR Éta, project number TL03000365.