Changes between Version 15 and Version 16 of NerDataset


Timestamp:
30 November 2022, 16:36:39 (3 years ago)
Author:
xnovot32@fi.muni.cz
Comment:

--

  • NerDataset

    v15 v16
      1   1  = A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents =
      2      This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for named entity recognition (NER).
          2  This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era.[[BR]]The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
      3   3
      4      You can [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip download the dataset] in the LINDAT/CLARIAH-CZ repository.
          4  You can [https://hdl.handle.net/11234/1-4936 download the dataset] in the LINDAT/CLARIAH-CZ repository.
      5   5
      6   6  == Contents ==
      7      The dataset is stored in archive [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip ner-dataset.zip] (1.7 GB) with following contents:
          7  The dataset is structured as follows:
      8   8
      9       * 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
          9   * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
     10  10     1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
     11  11     1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
     12  12     1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
     13       * 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and contain extra details^1^ missing from files `.sentences.txt` and `.ner_tags.txt`.[[BR]]Here are the five variables that we used to produce the different files:
         13   * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip] (978.29 MB) contains 16 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the five variables that we used to produce the different files:
     14  14     1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
     15  15     1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
     
     18  18     1. We split the sentences roughly into 80% for training (`training`), 10% for validation (`validation`), and 10% for testing (`testing`).[[BR]]For repeated testing, we subdivide the testing split (`testing_001-400` and `testing_401-500`).
     19  19
     20      ^1^The extra details include nested entities such as locations in person names (e.g. “Blažek z Kralup”) and people in location names (e.g. “Kostel sv. Martina”).
         20  ^1 ^The `.docx` files were authored by human annotators and contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`.^[[BR]]^The extra details include nested entities such as locations in person names (e.g. “Blažek z __Kralup__”) and people in location names (e.g. “Kostel __sv. Martina__”).
     21  21
     22  22  == Citing ==
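
Note on the Contents section in the diff above: the language-modeling corpus is described as 8 files produced from three variables, each with two values. The short Python sketch below enumerates those combinations to show where the count of 8 comes from. The constant and function names are illustrative assumptions, and the printed labels are hypothetical; they are not the actual file names used in the archive.

```python
from itertools import product

# Values copied from the three variables listed for the language-modeling corpus.
PAGE_BOUNDARIES = ("all", "non-crossing")   # may sentences cross page boundaries?
RELEVANCE = ("all", "only-relevant")        # all pages, or only annotator-relevant ones
SPLITS = ("training", "validation")         # roughly 90% / 10%


def corpus_variants():
    """Return every (page boundaries, relevance, split) combination."""
    return list(product(PAGE_BOUNDARIES, RELEVANCE, SPLITS))


variants = corpus_variants()
for crossing, relevance, split in variants:
    # Hypothetical label format, NOT the dataset's real file naming scheme.
    print(f"{crossing} / {relevance} / {split}")

# 2 * 2 * 2 combinations, matching the 8 files stated on the page.
print(len(variants))
```

The same approach does not carry over directly to the named-entity-recognition archive, because only some of its five variables are visible in this diff.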