Změny mezi verzí 16 a verzí 17 u NerDataset


Ignorovat:
Časová značka:
30. 11. 2022 16:37:42 (před 20 měsíci)
Autor:
xnovot32@fi.muni.cz
Komentář:

--

Vysvětlivky:

Nezměněno
Přidáno
Odstraněno
Změněno
  • NerDataset

    v16 v17  
    22This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era.[[BR]]The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
    33
    4 You can [https://hdl.handle.net/11234/1-4936 download the dataset] in the LINDAT/CLARIAH-CZ repository.
     4You can [https://hdl.handle.net/11234/1-4936 download the dataset] in the LINDAT/CLARIAH-CZ repository.
    55
    66== Contents ==
    77The dataset is structured as follows:
    88
    9  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
     9 * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
    1010   1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
    1111   1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
    1212   1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
    13  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip] (978.29 MB) contains 16 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the five variables that we used to produce the different files:
     13 * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip] (978.29 MB) contains 16 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the five variables that we used to produce the different files:
    1414   1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
    1515   1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
     
    1818   1. We split the sentences roughly into 80% for training (`training`), 10% for validation (`validation`), and 10% for testing (`testing`).[[BR]]For repeated testing, we subdivide the testing split (`testing_001-400` and `testing_401-500`).
    1919
    20 ^1 ^The `.docx` files were authored by human annotators and contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`.^[[BR]]^The extra details include nested entities such as locations in person names (e.g. “Blažek z __Kralup__”) and people in location names (e.g. “Kostel __sv. Martina__”).
     20''^1 ^The `.docx` files were authored by human annotators and contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`. The extra details include nested entities such as locations in person names (e.g. “Blažek z __Kralup__”) and people in location names (e.g. “Kostel __sv. Martina__”).''
    2121
    2222== Citing ==