Changes between version 8 and version 9 of NerDataset


Timestamp:
2022-11-28 16:39:31 (20 months ago)
Author:
xnovot32@fi.muni.cz
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • NerDataset

    v8  v9
    7   7   The dataset is stored in archive [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip ner-dataset.zip] (1.7 GB) with following structure:
    8   8
    9        * 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training of language models.[[BR]]We used the following three variables to produce the different files:
        9    * 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
    10  10     1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
    11  11     1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).