Changes between version 3 and version 4 of NerDataset

Timestamp: 28 Nov 2022 14:00:56 (3 years ago)
Author: xnovot32@fi.muni.cz
Comment: --

NerDataset

-                      v3
+                      v4
 This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for named entity recognition (NER).
-You can [wiki:TODO download the dataset] in the LINDAT/CLARIAH-CZ repository.
+You can [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip download the dataset] in the LINDAT/CLARIAH-CZ repository.
 == Contents ==
+The [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip dataset] (1.7 GB) is structured as follows:
+ * 8 files named `dataset_mlm_`(cross page boundaries?)`_`(only relevant pages?)`.txt`.[[BR]]These files contain sentences for unsupervised training of language models.
+ * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated (for source `regests`), machine-generated using information retrieval (for sources `fuzzy-regex` and `manatee`), or both.
+ * 3 files named `dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx`, `.sentences.txt`, and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated.
+ * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`_automatically_tagged.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are machine-generated using language models.
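The naming scheme in the list above can be split into its fields with a regular expression. The following is only a sketch: the pattern, the function `parse_ner_filename`, and the field names `crossing`, `relevance`, `tagging`, and `kind` are hypothetical and inferred from the listed file names, since the filename variables are not documented yet. It assumes that the source and the two boolean-like fields contain no underscores.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern inferred from the file names listed above.
# Assumption: source, crossing, and relevance never contain "_".
NER_FILENAME = re.compile(
    r"^dataset_ner_"
    r"(?P<source>[^_]+)_"
    r"(?P<crossing>[^_]+)_"
    r"(?P<relevance>[^_]+)_"
    r"(?P<split>.+?)"
    r"(?P<tagging>_automatically_tagged|_tagged)?"
    r"\.(?P<kind>sentences|ner_tags)\.txt$"
)


def parse_ner_filename(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the filename fields, or None if the name is not a NER file."""
    match = NER_FILENAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # Drop the leading underscore of the optional suffix; None if absent.
    fields["tagging"] = (fields["tagging"] or "").lstrip("_") or None
    return fields


if __name__ == "__main__":
    example = (
        "dataset_ner_manatee_non-crossing_only-relevant_"
        "testing_401-500_tagged.sentences.txt"
    )
    print(parse_ner_filename(example))
```

Files that do not follow the `dataset_ner_` scheme, such as the `dataset_mlm_` files, yield `None`, so the function can also serve as a filter when listing the unpacked archive.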
+TODO: Describe filename variables.[[BR]]TODO: Describe TXT and DOCX formats.
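Until the TXT format is described, a pair of `.sentences.txt` and `.ner_tags.txt` files might be read along the following lines. This sketch rests on an unconfirmed assumption: each line holds one sentence as whitespace-separated tokens, and the same line of the tags file holds one tag per token. The function `read_ner_pairs` is hypothetical and not part of the dataset.

```python
from itertools import zip_longest
from typing import Iterator, List, Tuple


def read_ner_pairs(
    sentences_path: str, tags_path: str
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) per line of a sentences/tags file pair.

    Assumption (not confirmed by the dataset description): both files are
    line-aligned and use whitespace-separated tokens and tags.
    """
    with open(sentences_path, encoding="utf-8") as sentences_file, \
            open(tags_path, encoding="utf-8") as tags_file:
        pairs = zip_longest(sentences_file, tags_file)
        for line_number, (sentence, tags) in enumerate(pairs, start=1):
            if sentence is None or tags is None:
                raise ValueError(
                    f"files differ in length at line {line_number}"
                )
            tokens = sentence.split()
            labels = tags.split()
            if len(tokens) != len(labels):
                raise ValueError(
                    f"line {line_number}: {len(tokens)} tokens "
                    f"but {len(labels)} tags"
                )
            yield tokens, labels
```

Raising on a length mismatch, instead of silently truncating, makes it obvious when the alignment assumption does not hold for a given file pair.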
 == Citing ==
 If you use our dataset in your work, please cite the following article: