The [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip dataset] (1.7 GB) is structured as follows:

* 8 files named `dataset_mlm_`(cross page boundaries?)`_`(only relevant pages?)`.txt`.[[BR]]These files contain sentences for unsupervised training of language models.
* 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated (for source `regests`), machine-generated using information retrieval (for sources `fuzzy-regex` and `manatee`), or both.
* 3 files named `dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx`, `.sentences.txt`, and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are human-annotated.
* 17 pairs of files named `dataset_ner_`(source)`_`(cross page boundaries?)`_`(only relevant pages?)`_`(split)`_automatically_tagged.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training of language models.[[BR]]The NER tags are machine-generated using language models.

TODO: Describe filename variables.[[BR]]TODO: Describe TXT and DOCX formats.

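The naming scheme of the `dataset_ner_` files listed above can be taken apart with a regular expression. The following Python sketch is only an illustration, not part of the dataset tooling: the names `NerFileName` and `parse_ner_filename` are hypothetical, and the pattern assumes that the source, page-boundary, and relevance fields never contain an underscore (which holds for the values that appear above, such as `fuzzy-regex`, `non-crossing`, and `only-relevant`). Everything between the third field and the optional `_automatically_tagged` suffix is treated as the split.

```python
import re
from typing import NamedTuple, Optional


class NerFileName(NamedTuple):
    """Fields recovered from a dataset_ner_* file name (hypothetical helper type)."""
    source: str                 # e.g. "regests", "fuzzy-regex", "manatee"
    crossing: str               # the (cross page boundaries?) field
    relevance: str              # the (only relevant pages?) field
    split: str                  # everything up to the optional suffix
    automatically_tagged: bool  # True if the name ends in _automatically_tagged
    kind: str                   # "sentences" or "ner_tags"


# Assumption: the first three fields contain no underscores.
_NER_NAME = re.compile(
    r"^dataset_ner_"
    r"(?P<source>[^_]+)_"
    r"(?P<crossing>[^_]+)_"
    r"(?P<relevance>[^_]+)_"
    r"(?P<split>.+?)"
    r"(?P<auto>_automatically_tagged)?"
    r"\.(?P<kind>sentences|ner_tags)\.txt$"
)


def parse_ner_filename(name: str) -> Optional[NerFileName]:
    """Return the parsed fields, or None if the name does not fit the scheme."""
    match = _NER_NAME.match(name)
    if match is None:
        return None
    return NerFileName(
        source=match.group("source"),
        crossing=match.group("crossing"),
        relevance=match.group("relevance"),
        split=match.group("split"),
        automatically_tagged=match.group("auto") is not None,
        kind=match.group("kind"),
    )


# File name taken from the list above (the human-annotated testing file).
print(parse_ner_filename(
    "dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.sentences.txt"
))
```

Files that do not follow the scheme, such as the `dataset_mlm_` files or the `.docx` file, yield `None`, so the function can also serve as a filter when listing the unpacked archive.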