A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for named entity recognition (NER).

You can download the dataset from the LINDAT/CLARIAH-CZ repository.

Contents

The dataset is stored in the archive ner-dataset.zip (1.7 GB) with the following contents:

  • 8 files named dataset_mlm_*.txt that contain sentences for unsupervised training and validation of language models.
    We used the following three variables to produce the different files:
    1. The sentences are extracted from book OCR texts and may therefore span several pages.
      However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
      We either allow the sentences to cross page boundaries (all) or not (non-crossing).
    2. The sentences come from all book pages (all) or just those considered relevant by human annotators (only-relevant).
    3. We split the sentences roughly into 90% for training (training) and 10% for validation (validation).
  • 16 tuples of files named dataset_ner_*.sentences.txt and .ner_tags.txt, in one case also accompanied by a .docx file.
    These files contain sentences and NER tags for supervised training, validation, and testing of language models; a loading sketch follows this list.
    The .docx files are authored by human annotators and contain extra details [1] missing from the .sentences.txt and .ner_tags.txt files.
    Here are the five variables that we used to produce the different files:
    1. The sentences may originate from book OCR texts, where we located them using information retrieval techniques (fuzzy-regex or manatee).
      The sentences may also originate from regests (regests) or from both books and regests (fuzzy-regex+regests and fuzzy-regex+manatee).
    2. When sentences originate from book OCR texts, they may span several pages of a book.
      However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
      We either allow the sentences to cross page boundaries (all) or not (non-crossing).
    3. When sentences originate from book OCR texts, they may come from book pages of different relevance.
      We either use sentences from all book pages (all) or just those considered relevant by human annotators (only-relevant).
    4. When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.
      Therefore, we also provide NER tags completed by language models (automatically_tagged) and human annotators (tagged).
    5. We split the sentences roughly into 80% for training (training), 10% for validation (validation), and 10% for testing (testing).
      For repeated testing, we subdivide the testing split (testing_001-400 and testing_401-500).
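
Below is a minimal Python sketch of how the plain-text files could be loaded. It assumes that the .sentences.txt files hold one sentence per line and that the i-th line of a .ner_tags.txt file holds space-separated tags for the i-th sentence; the file names are illustrative, so check the archive for the exact names and layout.

    from pathlib import Path

    # Unsupervised language-modeling split (file name illustrative),
    # assuming one sentence per line.
    mlm_sentences = Path('dataset_mlm_training.txt').read_text(
        encoding='utf-8').splitlines()

    # Supervised NER split (file names illustrative): pair each sentence
    # with its line of NER tags, assuming the two files are line-aligned
    # and tags are space-separated.
    sentences = Path('dataset_ner_training.sentences.txt').read_text(
        encoding='utf-8').splitlines()
    tags = Path('dataset_ner_training.ner_tags.txt').read_text(
        encoding='utf-8').splitlines()
    assert len(sentences) == len(tags)
    ner_dataset = [
        (sentence, tag_line.split())
        for sentence, tag_line in zip(sentences, tags)
    ]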

[1] The extra details include nested entities such as locations in person names (e.g. “Blažek z Kralup”) and people in location names (e.g. “Kostel sv. Martina”).

Use the search.TaggedSentence.load() function from the ahisto_named_entity_search software tool to load the .docx files together with the extra details.
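
For example (a minimal sketch: the import path and the load() signature below are assumptions based on the names above, and the file name is illustrative):

    # Assumptions: the tool installs as a Python package exposing a
    # search module with TaggedSentence, and load() takes a file path.
    # The file name is illustrative.
    from ahisto_named_entity_search import search

    tagged_sentences = search.TaggedSentence.load('dataset_ner_regests.tagged.docx')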

Citing

If you use our dataset in your work, please cite the following article:

TODO

If you use LaTeX, you can use the following BibTeX entry:

TODO

Acknowledgements

This work was funded by TAČR Éta, project number TL03000365.