Changes between Version 4 and Version 5 of NerDataset

Timestamp: 28 Nov 2022, 16:34:08 (3 years ago)
Author: xnovot32@fi.muni.cz
Comment: --

NerDataset

-                      v4
+                      v5
 == Contents ==
 The [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip dataset] (1.7 GB) is structured as follows:
+The dataset is stored in the archive [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip ner-dataset.zip] (1.7 GB) with the following structure:
+ * 8 files named `dataset_mlm_`(cross page boundaries?)`_`(only relevant pages?)`.txt`.[[BR]]These files contain sentences for unsupervised training of language models.
+ * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated (for source `regests`), machine-generated using information retrieval (for sources `fuzzy-regex` and `manatee`), or both.
+ * 3 files named `dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx`, `.sentences.txt`, and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated.
+ * 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`_automatically_tagged.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are machine-generated using language models.
+TODO: Describe filename variables.[[BR]]TODO: Describe TXT and DOCX formats.
+ * 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training of language models.[[BR]]We used the following three variables to produce the different files:
+   1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences in the file to cross page boundaries (`all`) or not (`non-crossing`).
+   2. The sentences come from all book pages (`all`) or just those considered relevant by expert annotators (`only-relevant`).
+   3. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
+ * 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in two cases also `.docx`.[[BR]]These contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and may contain extra details that are missing from the `.sentences.txt` and `.ner_tags.txt` files.[[BR]]Here are the five variables that we used to produce the different files:
+   1. The sentences may be extracted from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`).[[BR]]Furthermore, the sentences may originate from both book OCR texts and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
+   2. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences in the file to cross page boundaries (`all`) or not (`non-crossing`).
+   3. When sentences originate from book OCR texts, they may come from book pages of different relevance.[[BR]]We either use sentences from all book pages (`all`) or just those considered relevant by expert annotators (`only-relevant`).
+   4. When sentences and NER tags are extracted from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.[[BR]]Therefore, we also provide NER tags completed by language models (`automatically_tagged`) and human annotators (`tagged`).
+   5. We split the sentences roughly into 80% for training (`training`), 10% for validation (`validation`), and 10% for testing (`testing`).[[BR]]For repeated testing, we subdivide the testing split (`testing_001-400` and `testing_401-500`).
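The five filename variables described above can be recovered from a `dataset_ner_*` filename with a regular expression. The following is a minimal sketch, not part of the dataset tooling: the function name `parse_ner_filename` and the group names are hypothetical, and the field order (source, page crossing, relevance, split, optional tagging suffix) is an assumption based on the filename patterns quoted on this page. Files that omit a field would need a looser pattern.

```python
import re
from typing import Dict, Optional

# Assumed field order: source, page crossing, relevance, split, optional tagging suffix.
NER_FILENAME = re.compile(
    r"dataset_ner_"
    r"(?P<source>fuzzy-regex\+regests|fuzzy-regex\+manatee|fuzzy-regex|manatee|regests)_"
    r"(?P<crossing>all|non-crossing)_"
    r"(?P<relevance>all|only-relevant)_"
    r"(?P<split>training|validation|testing(?:_\d{3}-\d{3})?)"
    r"(?:_(?P<tagging>automatically_tagged|tagged))?"
    r"\.(?P<extension>sentences\.txt|ner_tags\.txt|docx)"
)


def parse_ner_filename(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the filename variables as a dict, or None if the name does not fit.

    The "tagging" value is None when the file has no tagging suffix.
    """
    match = NER_FILENAME.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Filename taken from the page text above.
    example = "dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx"
    print(parse_ner_filename(example))
```

Listing the source alternatives explicitly keeps `fuzzy-regex+manatee` from being misread as two separate fields, since both hyphens and plus signs occur inside field values.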
 == Citing ==