Changes between Version 26 and Version 27 of NerDataset


Timestamp: 6 January 2023, 0:15:18 (19 months ago)
Author: xnovot32@fi.muni.cz
Comment: --

  • NerDataset

    v26 v27
      2   2   This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era.[[BR]]The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
      3   3
      4     - You can download the dataset from the LINDAT/CLARIAH-CZ repository:
      5     -
      6     -  * [https://hdl.handle.net/11234/1-4936 The language modeling corpus and the “small” NER annotations that we used to train intermediate language models]
      7     -  * The “large” NER annotations that we produced with our language models
          4 + You can [http://hdl.handle.net/11234/1-5024 download the dataset] from the LINDAT/CLARIAH-CZ repository.
      8   5
      9   6   == Contents ==
     10   7   The dataset is structured as follows:
     11   8
     12     -  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
          9 +  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
     13  10      1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
     14  11      1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
     15  12      1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
     16  13
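The three two-valued variables above account for the 8 files in the archive (2 × 2 × 2). A minimal sketch of that enumeration follows. The naming pattern `dataset_mlm_<crossing>_<relevance>_<split>` is an assumption inferred from the single file name visible in Table 1 (`dataset_mlm_non-crossing_only-relevant_validation`), and the helper `corpus_file_stems` is hypothetical, not part of the dataset.

```python
from itertools import product

# Values of the three variables described in the list above.
CROSSING = ("all", "non-crossing")    # may sentences cross page boundaries?
RELEVANCE = ("all", "only-relevant")  # all pages, or only annotator-approved ones
SPLIT = ("training", "validation")    # roughly 90% / 10%


def corpus_file_stems():
    """Return the eight expected file stems.

    ASSUMPTION: the pattern is inferred from one file name in Table 1;
    check it against the actual archive contents before relying on it.
    """
    return [
        f"dataset_mlm_{crossing}_{relevance}_{split}"
        for crossing, relevance, split in product(CROSSING, RELEVANCE, SPLIT)
    ]


stems = corpus_file_stems()
for stem in stems:
    print(stem)
```

Keeping the variables as separate tuples makes it easy to select a subset, for example only the `non-crossing` files, by filtering on one component.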
     17     - '''Table 1:''' Dataset statistics from the archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip], ordered by file size.
         14 + '''Table 1:''' Dataset statistics from the archive language-modeling-corpus.zip, ordered by file size.
     18  15
     19  16   || ||= file size =||= # sentences =||= # tokens =||= # types =||
    …
     27  24   ||=dataset_mlm_non-crossing_only-relevant_validation =|| 549.4 kB|| 2,489|| 81,293|| 22,090||
     28  25
     29     -  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip] (978.29 MB) contains 82 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the five variables that we used to produce the different files:
         26 +  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-small.zip?sequence=2&isAllowed=y named-entity-recognition-annotations-small.zip] (978.29 MB) contains 82 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models. We used them to produce our intermediate language models.[[BR]]These are the “small” sentences and NER tags that we used for the supervised training, validation, and testing of intermediate language models.[[BR]]Here are the five variables that we used to produce the different files:
     30  27      1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
     31  28      1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
    …
     36  33   ''^1 ^The `.docx` files were authored by human annotators and contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`. The extra details include nested entities such as locations in person names (e.g. “Blažek z __Kralup__”) and people in location names (e.g. “Kostel __sv. Martina__”).''
     37  34
     38     - '''Table 2:''' Dataset statistics from the archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip], ordered by the number of B-* tags.
         35 + '''Table 2:''' Dataset statistics from the archive named-entity-recognition-annotations-small.zip, ordered by the number of B-* tags.
     39  36
     40  37   || ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||
    …
    122 119   ||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 =|| 38.5 kB|| 100|| 4,507|| 110|| 55|| 55|| 2,449||
    123 120
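The columns of Table 2 can be recomputed from a pair of `*.sentences.txt` and `.ner_tags.txt` files. The sketch below shows one way to do that under several assumptions that the page does not confirm: each line holds one sentence, tokens and tags are separated by whitespace, tags are aligned one-to-one with tokens, inside tags are written `I-PER` and `I-LOC`, and "types" means distinct tokens. The function name `ner_statistics` is hypothetical.

```python
from collections import Counter


def ner_statistics(sentences, tag_lines):
    """Compute Table 2 style counts from parallel sentence and tag lines.

    ASSUMPTIONS: one sentence per line, whitespace-separated tokens,
    one tag per token, and "types" meaning distinct tokens.
    """
    token_count = 0
    types = set()
    tag_counts = Counter()
    sentence_count = 0
    for sentence, tag_line in zip(sentences, tag_lines):
        tokens = sentence.split()
        tags = tag_line.split()
        if len(tokens) != len(tags):
            raise ValueError(f"token/tag mismatch in sentence: {sentence!r}")
        sentence_count += 1
        token_count += len(tokens)
        types.update(tokens)
        tag_counts.update(tags)
    return {
        "sentences": sentence_count,
        "tokens": token_count,
        # Every tag that opens an entity, regardless of entity class.
        "B-*": sum(n for tag, n in tag_counts.items() if tag.startswith("B-")),
        "B-PER": tag_counts["B-PER"],
        "B-LOC": tag_counts["B-LOC"],
        "types": len(types),
    }


# Toy input built from the two entity examples in the footnote above.
stats = ner_statistics(
    ["Blažek z Kralup", "Kostel sv. Martina"],
    ["B-PER I-PER I-PER", "B-LOC I-LOC I-LOC"],
)
print(stats)
```

To run it on real files, pass two open file objects (for example the `.sentences.txt` and `.ner_tags.txt` members of one tuple) in place of the lists; the function only iterates over lines.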
    124     -  * The archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip] (1.31 GB) contains 16 tuples of files named `*.sentences.txt` and `.ner_tags.txt`. These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the four variables that we used to produce the different files:
        121 +  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-large.zip?sequence=3&isAllowed=y named-entity-recognition-annotations-large.zip] (1.31 GB) contains 16 tuples of files named `*.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models. We produced them with our language models.[[BR]]Here are the four variables that we used to produce the different files:
    125 122      1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
    126 123      1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
    …
    128 125      1. We use an ensemble of a baseline model and weak fourth-generation NER models (`004`) or the final seventh-generation NER model (`007`).
    129 126
    130     - '''Table 3:''' Dataset statistics from the archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip], ordered by the number of B-* tags.
        127 + '''Table 3:''' Dataset statistics from the archive named-entity-recognition-annotations-large.zip, ordered by the number of B-* tags.
    131 128
    132 129   || ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||