Changes between version 19 and version 20 of NerDataset


Timestamp:
12 Dec 2022, 13:52:43 (19 months ago)
Author:
xnovot32@fi.muni.cz
Comment:

--

  • NerDataset

    v19 v20  
    1111   1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
    1212   1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
     13
     14'''Table 1:''' Dataset statistics from the archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip], in descending order of file size.
     15
     16|| ||= file size =||= # sentences =||= # tokens =||= # types =||
     17||=dataset_mlm_all_all_training =|| 630.7 MB|| 3228077|| 96556612|| 6198957||
     18||=dataset_mlm_non-crossing_all_training =|| 524.1 MB|| 3009931|| 80220907|| 5362515||
     19||=dataset_mlm_all_all_validation =|| 81.8 MB|| 402184|| 12374044|| 1273737||
     20||=dataset_mlm_non-crossing_all_validation =|| 67.3 MB|| 372885|| 10157799|| 1105583||
     21||=dataset_mlm_all_only-relevant_training =|| 8.1 MB|| 47958|| 1286573|| 181845||
     22||=dataset_mlm_non-crossing_only-relevant_training =|| 6.7 MB|| 44278|| 1074734|| 157354||
     23||=dataset_mlm_all_only-relevant_validation =|| 736.7 kB|| 2791|| 108364|| 26986||
     24||=dataset_mlm_non-crossing_only-relevant_validation =|| 549.4 kB|| 2489|| 81293|| 22090||
     25
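The columns of Table 1 can be recomputed from a corpus file with a few lines of Python. The following is a minimal sketch, not the script used to produce the table: it assumes one sentence per line and whitespace-separated tokens, and it treats a type as a distinct token string. None of these assumptions is confirmed by this page, and the names `CorpusStats` and `corpus_statistics` are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class CorpusStats(NamedTuple):
    sentences: int
    tokens: int
    types: int


def corpus_statistics(lines: Iterable[str]) -> CorpusStats:
    """Count sentences, tokens, and types in an iterable of text lines."""
    num_sentences = 0
    vocabulary = Counter()
    for line in lines:
        tokens = line.split()  # assumption: tokens are whitespace-separated
        if not tokens:
            continue  # assumption: empty lines are not sentences
        num_sentences += 1
        vocabulary.update(tokens)
    # Tokens are all occurrences; types are the distinct token strings.
    return CorpusStats(num_sentences, sum(vocabulary.values()), len(vocabulary))


if __name__ == "__main__":
    sample = ["Blažek z Kralup", "Kostel sv. Martina"]
    print(corpus_statistics(sample))
    # CorpusStats(sentences=2, tokens=6, types=6)
```

For a real file, the function can be fed an open file handle, since iterating over a text file yields its lines.
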
    1326 * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip] (978.29 MB) contains 41 tuples of files named `*.sentences.txt`, `*.ner_tags.txt`, and in one case also `*.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]Here are the five variables that we used to produce the different files:
    1427   1. The sentences may be extracted from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or from both books and regests (`fuzzy-regex+regests` and `manatee+regests`).
     
    1932
    2033''^1^ The `.docx` files were authored by human annotators and contain extra details missing from the `*.sentences.txt` and `*.ner_tags.txt` files. The extra details include nested entities such as locations in person names (e.g. “Blažek z __Kralup__”) and people in location names (e.g. “Kostel __sv. Martina__”).''
     34
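The file names listed in Table 2 appear to encode these variables as underscore-separated fields. The sketch below splits such a name into its parts with a regular expression inferred only from the names in Table 2, so it is an assumption rather than an official schema. The group names (`source`, `crossing`, `relevance`, `split`, `suffix`) and the function `parse_ner_name` are hypothetical labels chosen for illustration.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the names in Table 2 (an assumption, not an official schema).
# Longer source names come first so that "fuzzy-regex+regests" is not cut short.
NER_NAME = re.compile(
    r"^dataset_ner_"
    r"(?P<source>fuzzy-regex\+regests|manatee\+regests|fuzzy-regex|manatee|regests)"
    r"(?:_(?P<crossing>all|non-crossing)_(?P<relevance>all|only-relevant))?"
    r"_(?P<split>training|validation|testing)"
    r"(?:_(?P<suffix>.+))?$"
)


def parse_ner_name(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the fields encoded in a NER dataset name, or None if it does not match."""
    match = NER_NAME.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_ner_name("dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged"))
```

Names that start with `dataset_ner_regests_` carry no `crossing` or `relevance` field in Table 2, which is why that part of the pattern is optional and yields `None` for them.
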
     35'''Table 2:''' Dataset statistics from the archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4936/named-entity-recognition-annotations.zip?sequence=2&isAllowed=y named-entity-recognition-annotations.zip], in descending order of the number of B-* tags.
     36
     37|| ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||
     38||=dataset_ner_fuzzy-regex_all_all_training_automatically_tagged =|| 230.4 MB|| 407395|| 24585832|| 2669582|| 1403789|| 1265793|| 2420836||
     39||=dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged =|| 231.6 MB|| 411715|| 24735069|| 2640803|| 1378804|| 1261999|| 2427135||
     40||=dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged =|| 164.4 MB|| 353301|| 17387149|| 2065805|| 1100245|| 965560|| 1850210||
     41||=dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged =|| 162.9 MB|| 348981|| 17237912|| 2049537|| 1089768|| 959769|| 1843163||
     42||=dataset_ner_manatee+regests_all_all_training_automatically_tagged =|| 95.4 MB|| 158759|| 10155332|| 1175031|| 563912|| 611119|| 1267107||
     43||=dataset_ner_manatee_all_all_training_automatically_tagged =|| 93.8 MB|| 154439|| 10006095|| 1158763|| 553435|| 605328|| 1258983||
     44||=dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged =|| 64.5 MB|| 134909|| 6795014|| 870613|| 423345|| 447268|| 932654||
     45||=dataset_ner_manatee_non-crossing_all_training_automatically_tagged =|| 63.0 MB|| 130589|| 6645777|| 854345|| 412868|| 441477|| 923554||
     46||=dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged =|| 58.3 MB|| 81651|| 6211198|| 685020|| 356017|| 329003|| 910379||
     47||=dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged =|| 58.1 MB|| 81149|| 6193356|| 682993|| 354671|| 328322|| 908885||
     48||=dataset_ner_fuzzy-regex+regests_all_all_training =|| 218.0 MB|| 411715|| 24735069|| 606807|| 290530|| 316277|| 2427135||
     49||=dataset_ner_fuzzy-regex_all_all_training =|| 217.7 MB|| 407395|| 24585832|| 592822|| 281497|| 311325|| 2420836||
     50||=dataset_ner_fuzzy-regex+regests_non-crossing_all_training =|| 153.8 MB|| 353301|| 17387149|| 494302|| 238381|| 255921|| 1850210||
     51||=dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged =|| 37.9 MB|| 67971|| 3989670|| 487724|| 259777|| 227947|| 651387||
     52||=dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged =|| 37.7 MB|| 67469|| 3971828|| 485697|| 258431|| 227266|| 649698||
     53||=dataset_ner_fuzzy-regex_non-crossing_all_training =|| 153.1 MB|| 348981|| 17237912|| 480318|| 229349|| 250969|| 1843163||
     54||=dataset_ner_manatee+regests_all_all_validation_automatically_tagged =|| 21.0 MB|| 28727|| 2249037|| 261612|| 120358|| 141254|| 427057||
     55||=dataset_ner_manatee_all_all_validation_automatically_tagged =|| 20.8 MB|| 28225|| 2231195|| 259585|| 119012|| 140573|| 425088||
     56||=dataset_ner_manatee+regests_all_all_training =|| 88.9 MB|| 158759|| 10155332|| 214566|| 79924|| 134642|| 1267107||
     57||=dataset_ner_manatee_all_all_training =|| 87.9 MB|| 154439|| 10006095|| 200582|| 70892|| 129690|| 1258983||
     58||=dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged =|| 12.8 MB|| 23643|| 1348859|| 176809|| 83699|| 93110|| 293119||
     59||=dataset_ner_manatee+regests_non-crossing_all_training =|| 59.8 MB|| 134909|| 6795014|| 174902|| 65897|| 109005|| 932654||
     60||=dataset_ner_manatee_non-crossing_all_validation_automatically_tagged =|| 12.6 MB|| 23141|| 1331017|| 174782|| 82353|| 92429|| 290894||
     61||=dataset_ner_manatee_non-crossing_all_training =|| 58.6 MB|| 130589|| 6645777|| 160918|| 56865|| 104053|| 923554||
     62||=dataset_ner_fuzzy-regex+regests_all_all_validation =|| 54.2 MB|| 81651|| 6211198|| 92485|| 46038|| 46447|| 910379||
     63||=dataset_ner_fuzzy-regex_all_all_testing =|| 54.2 MB|| 80929|| 6167375|| 90747|| 45176|| 45571|| 908276||
     64||=dataset_ner_fuzzy-regex_all_all_validation =|| 54.4 MB|| 81149|| 6193356|| 90719|| 44878|| 45841|| 908885||
     65||=dataset_ner_fuzzy-regex+regests_non-crossing_all_validation =|| 35.0 MB|| 67971|| 3989670|| 75207|| 37496|| 37711|| 651387||
     66||=dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged =|| 6.6 MB|| 14942|| 694242|| 73757|| 41838|| 31919|| 119272||
     67||=dataset_ner_fuzzy-regex_non-crossing_all_testing =|| 34.8 MB|| 67208|| 3938611|| 73476|| 36506|| 36970|| 644220||
     68||=dataset_ner_fuzzy-regex_non-crossing_all_validation =|| 35.1 MB|| 67469|| 3971828|| 73441|| 36336|| 37105|| 649698||
     69||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged =|| 5.3 MB|| 13456|| 548928|| 61522|| 35007|| 26515|| 99275||
     70||=dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged =|| 5.1 MB|| 10622|| 545005|| 57489|| 31361|| 26128|| 98843||
     71||=dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged =|| 4.6 MB|| 11813|| 490147|| 51653|| 28315|| 23338|| 88535||
     72||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged =|| 3.7 MB|| 9136|| 399691|| 45254|| 24530|| 20724|| 77963||
     73||=dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged =|| 3.8 MB|| 10813|| 401164|| 44213|| 24435|| 19778|| 74376||
     74||=dataset_ner_manatee_all_only-relevant_training_automatically_tagged =|| 3.1 MB|| 7493|| 340910|| 34247|| 17193|| 17054|| 66659||
     75||=dataset_ner_manatee+regests_all_all_validation =|| 19.5 MB|| 28727|| 2249037|| 32546|| 12999|| 19547|| 427057||
     76||=dataset_ner_manatee_all_all_testing =|| 19.9 MB|| 29516|| 2279822|| 32234|| 12555|| 19679|| 437414||
     77||=dataset_ner_manatee_all_all_validation =|| 19.4 MB|| 28225|| 2231195|| 30780|| 11839|| 18941|| 425088||
     78||=dataset_ner_fuzzy-regex+regests_all_only-relevant_training =|| 6.3 MB|| 14942|| 694242|| 30455|| 19214|| 11241|| 119272||
     79||=dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged =|| 2.3 MB|| 6493|| 251927|| 27945|| 13958|| 13987|| 51600||
     80||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training =|| 5.0 MB|| 13456|| 548928|| 27324|| 17257|| 10067|| 99275||
     81||=dataset_ner_manatee+regests_non-crossing_all_validation =|| 11.8 MB|| 23643|| 1348859|| 26287|| 10498|| 15789|| 293119||
     82||=dataset_ner_manatee_non-crossing_all_testing =|| 12.2 MB|| 24420|| 1384547|| 25937|| 10068|| 15869|| 300862||
     83||=dataset_ner_manatee_non-crossing_all_validation =|| 11.7 MB|| 23141|| 1331017|| 24521|| 9338|| 15183|| 290894||
     84||=dataset_ner_manatee+regests_all_only-relevant_training =|| 4.4 MB|| 11813|| 490147|| 24212|| 13626|| 10586|| 88535||
     85||=dataset_ner_manatee+regests_non-crossing_only-relevant_training =|| 3.7 MB|| 10813|| 401164|| 22583|| 12909|| 9674|| 74376||
     86||=dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged =|| 1.5 MB|| 2776|| 158548|| 16901|| 9936|| 6965|| 44018||
     87||=dataset_ner_fuzzy-regex_all_only-relevant_training =|| 4.8 MB|| 10622|| 545005|| 16471|| 10182|| 6289|| 98843||
     88||=dataset_ner_regests_training_automatically_tagged =|| 1.5 MB|| 4320|| 149237|| 16268|| 10477|| 5791|| 29166||
     89||=dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged =|| 1.3 MB|| 2274|| 140706|| 14874|| 8590|| 6284|| 39612||
     90||=dataset_ner_regests_training =|| 1.5 MB|| 4320|| 149237|| 13984|| 9032|| 4952|| 29166||
     91||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_training =|| 3.5 MB|| 9136|| 399691|| 13340|| 8225|| 5115|| 77963||
     92||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged =|| 1.1 MB|| 2420|| 110376|| 12902|| 7592|| 5310|| 33352||
     93||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged =|| 885.1 kB|| 1918|| 92534|| 10875|| 6246|| 4629|| 28676||
     94||=dataset_ner_manatee_all_only-relevant_training =|| 2.9 MB|| 7493|| 340910|| 10228|| 4594|| 5634|| 66659||
     95||=dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged =|| 913.3 kB|| 1972|| 97069|| 10180|| 5592|| 4588|| 28324||
     96||=dataset_ner_manatee_non-crossing_only-relevant_training =|| 2.2 MB|| 6493|| 251927|| 8599|| 3877|| 4722|| 51600||
     97||=dataset_ner_manatee_all_only-relevant_validation_automatically_tagged =|| 730.1 kB|| 1470|| 79227|| 8153|| 4246|| 3907|| 23569||
     98||=dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged =|| 683.4 kB|| 1751|| 71948|| 8136|| 4501|| 3635|| 22133||
     99||=dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged =|| 500.3 kB|| 1249|| 54106|| 6109|| 3155|| 2954|| 17138||
     100||=dataset_ner_fuzzy-regex+regests_all_only-relevant_validation =|| 1.4 MB|| 2776|| 158548|| 4421|| 2817|| 1604|| 44018||
     101||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation =|| 998.7 kB|| 2420|| 110376|| 3938|| 2519|| 1419|| 33352||
     102||=dataset_ner_manatee+regests_all_only-relevant_validation =|| 862.5 kB|| 1972|| 97069|| 3347|| 1887|| 1460|| 28324||
     103||=dataset_ner_manatee+regests_non-crossing_only-relevant_validation =|| 646.8 kB|| 1751|| 71948|| 3094|| 1774|| 1320|| 22133||
     104||=dataset_ner_fuzzy-regex_all_only-relevant_testing =|| 1.3 MB|| 2405|| 144684|| 2780|| 1784|| 996|| 39977||
     105||=dataset_ner_fuzzy-regex_all_only-relevant_validation =|| 1.2 MB|| 2274|| 140706|| 2655|| 1657|| 998|| 39612||
     106||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing =|| 867.0 kB|| 2034|| 98659|| 2292|| 1455|| 837|| 29874||
     107||=dataset_ner_regests_testing =|| 261.7 kB|| 799|| 26148|| 2182|| 1422|| 760|| 8978||
     108||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation =|| 818.4 kB|| 1918|| 92534|| 2172|| 1359|| 813|| 28676||
     109||=dataset_ner_regests_validation_automatically_tagged =|| 183.1 kB|| 502|| 17842|| 2027|| 1346|| 681|| 6445||
     110||=dataset_ner_regests_validation =|| 181.7 kB|| 502|| 17842|| 1766|| 1160|| 606|| 6445||
     111||=dataset_ner_manatee_all_only-relevant_validation =|| 681.8 kB|| 1470|| 79227|| 1581|| 727|| 854|| 23569||
     112||=dataset_ner_manatee_all_only-relevant_testing =|| 678.8 kB|| 1420|| 78751|| 1529|| 695|| 834|| 23949||
     113||=dataset_ner_manatee_non-crossing_only-relevant_validation =|| 465.9 kB|| 1249|| 54106|| 1328|| 614|| 714|| 17138||
     114||=dataset_ner_manatee_non-crossing_only-relevant_testing =|| 469.1 kB|| 1208|| 54391|| 1283|| 587|| 696|| 17713||
     115||=dataset_ner_regests_testing_001-400 =|| 129.8 kB|| 400|| 12811|| 1164|| 789|| 375|| 5121||
     116||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged =|| 41.6 kB|| 100|| 4507|| 530|| 287|| 243|| 2449||
     117||=dataset_ner_manatee_non-crossing_only-relevant_testing_001-400 =|| 169.0 kB|| 400|| 19554|| 439|| 201|| 238|| 7928||
     118||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 =|| 38.5 kB|| 100|| 4507|| 110|| 55|| 55|| 2449||
     119||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged =|| 41.0 kB|| 100|| 4507|| 0|| 0|| 0|| 2449||
    21120
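The tag columns of Table 2 count tags that begin with `B-`. As a rough illustration, the sketch below counts them from the lines of a `*.ner_tags.txt` file. It assumes one sentence per line with whitespace-separated tags such as `B-PER`, `I-PER`, `B-LOC`, `I-LOC`, and `O`; that file layout is an assumption, and the function name `count_entity_starts` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_entity_starts(tag_lines: Iterable[str]) -> Counter:
    """Count B-* tags overall and per class in an iterable of tag lines."""
    counts = Counter()
    for line in tag_lines:
        for tag in line.split():  # assumption: tags are whitespace-separated
            if tag.startswith("B-"):
                counts["B-*"] += 1  # every entity-initial tag
                counts[tag] += 1    # per-class count, e.g. "B-PER"
    return counts


if __name__ == "__main__":
    sample = ["B-PER I-PER I-PER O", "O B-LOC I-LOC", "B-PER O"]
    counts = count_entity_starts(sample)
    print(counts["B-*"], counts["B-PER"], counts["B-LOC"])
    # 3 2 1
```

In Table 2 the two per-class columns add up to the B-* column (for example, 1403789 + 1265793 = 2669582 in the first row), so comparing `counts["B-*"]` with `counts["B-PER"] + counts["B-LOC"]` is a cheap sanity check on a parsed file.
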
    22121== Citing ==