Changes between Version 3 and Version 4 of NerDataset


Timestamp:
2022-11-28 14:00:56 (20 months ago)
Author:
xnovot32@fi.muni.cz
Comment:

--

  • NerDataset

    v3 v4
     2  2  This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for named entity recognition (NER).
     3  3
     4     You can [wiki:TODO download the dataset] in the LINDAT/CLARIAH-CZ repository.
        4  You can [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip download the dataset] in the LINDAT/CLARIAH-CZ repository.
     5  5
     6  6  == Contents ==
        7  The [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip dataset] (1.7 GB) is structured as follows:
        8
        9   * 8 files named `dataset_mlm_`(cross page boundaries?)`_`(only relevant pages?)`.txt`.[[BR]]These files contain sentences for unsupervised training of language models.
       10   * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated (for source `regests`), machine-generated using information retrieval (for sources `fuzzy-regex` and `manatee`), or both.
       11   * 3 files named `dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx`, `.sentences.txt`, and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated.
       12   * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`_automatically_tagged.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are machine-generated using language models.
       13
       14  TODO: Describe filename variables.[[BR]]TODO: Describe TXT and DOCX formats.
       15
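The naming scheme listed under Contents can be split into its variables with a regular expression. The following is only a minimal sketch, not part of the dataset: the function names `parse_ner_filename` and `pair_ner_files` are hypothetical, the split value `training` in the usage note is made up for illustration, and the pattern assumes that the source, page-boundary, and relevance variables never contain an underscore (the split may, as in `testing_401-500`). The page does not yet document the filename variables, so these assumptions may need adjusting.

```python
import re
from collections import defaultdict

# Assumption: only the (split) variable may contain underscores.
NER_PATTERN = re.compile(
    r"^dataset_ner_(?P<source>[^_]+)_(?P<crossing>[^_]+)_(?P<relevant>[^_]+)"
    r"_(?P<split>.+?)(?P<auto>_automatically_tagged)?"
    r"\.(?P<kind>sentences|ner_tags)\.txt$"
)
MLM_PATTERN = re.compile(
    r"^dataset_mlm_(?P<crossing>[^_]+)_(?P<relevant>[^_]+)\.txt$"
)


def parse_ner_filename(name):
    """Return the filename variables as a dict, or None if the name does not match."""
    match = NER_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # True when the tags were machine-generated using language models.
    fields["auto"] = fields["auto"] is not None
    return fields


def pair_ner_files(names):
    """Group `.sentences.txt` and `.ner_tags.txt` files that share a stem."""
    pairs = defaultdict(dict)
    for name in names:
        fields = parse_ner_filename(name)
        if fields is None:
            continue  # not a NER file (e.g. an MLM or DOCX file)
        stem = name[: -len("." + fields["kind"] + ".txt")]
        pairs[stem][fields["kind"]] = name
    # Keep only complete pairs.
    return {stem: files for stem, files in pairs.items() if len(files) == 2}
```

For example, `parse_ner_filename("dataset_ner_regests_non-crossing_only-relevant_training_automatically_tagged.sentences.txt")` would yield the source `regests`, the hypothetical split `training`, and `auto` set to `True`. Because the split is matched lazily, the `_automatically_tagged` suffix is separated from it whenever it is present.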
     7 16  == Citing ==
     8 17  If you use our dataset in your work, please cite the following article: