Changes between version 20 and version 21 of NerDataset


Timestamp:
12 December 2022, 13:54:14 (19 months ago)
Author:
xnovot32@fi.muni.cz
Comment:

--

  • NerDataset

    v20 v21  
    24 24 ||=dataset_mlm_non-crossing_only-relevant_validation =|| 549.4 kB|| 2489|| 81293|| 22090||
    25 25
    26  * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip] (978.29 MB) contains 41 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the five variables that we used to produce the different files:
     26 * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip] (978.29 MB) contains 82 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the five variables that we used to produce the different files:
    27 27   1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
    28 28   1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
     
    33 33 ''^1 ^The `.docx` files were authored by human annotators and contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`. The extra details include nested entities such as locations in person names (e.g. “Blažek z __Kralup__”) and people in location names (e.g. “Kostel __sv. Martina__”).''
    34 34
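As a rough illustration of how per-file statistics such as those in Table 2 could be recomputed from a pair of `*.sentences.txt` and `.ner_tags.txt` files, here is a minimal Python sketch. It rests on several assumptions that this page does not confirm: each line holds one sentence, tokens and tags are separated by whitespace, every token has exactly one tag, and "# types" means the number of distinct tokens. The function name `dataset_statistics` is hypothetical, and the two sample sentences are made up around the entity examples from the footnote.

```python
from collections import Counter
from itertools import zip_longest


def dataset_statistics(sentence_lines, tag_lines):
    """Count sentences, tokens, B-* tags, and distinct tokens in parallel lines.

    Assumed layout (not confirmed by the wiki page): line i of the tag input
    holds one whitespace-separated tag per token on line i of the sentences.
    """
    stats = Counter()
    types = set()
    pairs = zip_longest(sentence_lines, tag_lines)
    for line_no, (sentence, tags) in enumerate(pairs, start=1):
        if sentence is None or tags is None:
            raise ValueError("the two inputs have different numbers of lines")
        tokens, labels = sentence.split(), tags.split()
        if len(tokens) != len(labels):
            raise ValueError(
                f"line {line_no}: {len(tokens)} tokens but {len(labels)} tags"
            )
        stats["sentences"] += 1
        stats["tokens"] += len(tokens)
        types.update(tokens)
        for label in labels:
            if label.startswith("B-"):
                stats["B-*"] += 1   # all entity beginnings
                stats[label] += 1   # per-class count, e.g. B-PER or B-LOC
    # "types" as distinct tokens is an assumed reading of the table column.
    stats["types"] = len(types)
    return stats


# Made-up sample data; real input would be the two opened text files.
sample_sentences = ["Blažek z Kralup přišel .", "Kostel sv. Martina stojí ."]
sample_tags = ["B-PER I-PER I-PER O O", "B-LOC I-LOC I-LOC O O"]
sample_stats = dataset_statistics(sample_sentences, sample_tags)
```

Because the function only iterates over lines, two open file objects can be passed in place of the sample lists. The length check is there to surface misaligned files early rather than silently producing skewed counts.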
    35 '''Table 2:''' Dataset statistics from the archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip], ordered by the number of B-* tags.
     35'''Table 2:''' Dataset statistics from the archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip], ordered by the number of B-* tags.
    36 36
    37 37 || ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||