# File management

```
ner/
├── data_manipulation/
│   ├── creation_gazetteers.py    # functions used for creating and preprocessing gazetteers
│   ├── dataset_functions.py      # functions used for dataset loading and manipulation
│   ├── expand_with_declension.py # multiprocessed expansion of the train dataset with declension
│   └── process_ruin_data.ipynb   # helper notebook for processing ruin data
├── declension/
│   └── declension.py             # code copied from https://nlp.fi.muni.cz/projekty/declension/index.py
├── extended_embeddings/
│   ├── extended_embeddings_data_collator.py # data collator adjusted for extended embeddings
│   ├── extended_embeddings_model.py         # RoBERTa model adjusted for extended embeddings
│   └── extended_embeddings_tokenizer.py     # token classification tokenizer adjusted for extended embeddings
├── .gitignore
├── create_gazetteers.py          # creates gazetteers based on a config
├── eval_script.py                # script for evaluating a trained model
├── evaluate_models.ipynb         # evaluation of existing solutions
├── evaluation_functions.py       # functions used for model evaluation and gazetteer matching
├── graphs.ipynb                  # helper notebook for creating graphs
├── hyperparameter_search.py      # code to find the best hyperparameters
├── ner_model.py                  # custom model
├── README.md                     # this file
├── requirements.txt
├── test_gpt.py                   # used for testing GPT-3.5 Turbo via the API (token not provided)
└── train_script.py               # script for training the model
```

# Run

### Installation

To install the required dependencies, run:

```bash
pip install -r requirements.txt
```

### Train model

```bash
python train_script.py [OPTIONS]
```

```
options:
  -h, --help            show this help message and exit

Generated additional train data arguments:
  --expand_train_data   Whether to expand the training data
  --train_gazetteers_path TRAIN_GAZETTEERS_PATH
                        Path to gazetteers for training data expansion
  --gazetteers_counter GAZETTEERS_COUNTER
                        Number of gazetteer duplicates
  --apply_delusion      Whether to apply declension expansion
  --tagger_path TAGGER_PATH
                        Path to tagger (needed only if apply_delusion is True)

Extended embeddings arguments:
  --apply_extended_embeddings
                        Apply extended embeddings in the model
  --extended_embeddings_gazetteers_path EXTENDED_EMBEDDINGS_GAZETTEERS_PATH
                        Path to gazetteers for matching
  --method_for_gazetteers_matching {single,multi}
                        Method for gazetteer matching
  --apply_lemmatizing   Whether to apply lemmatizing

Dataset arguments:
  --path_to_tokenized_dataset PATH_TO_TOKENIZED_DATASET
                        Path to a tokenized dataset (.hf)
  --cnec_dataset_dir_path CNEC_DATASET_DIR_PATH
                        Path to the CNEC dataset
  --contain_only_label_sentences
                        Whether to use only sentences that contain labels
  --division_to_BI_tags
                        Whether to divide tags into BI format

Training arguments:
  --model_name MODEL_NAME
                        Tokenizer
  --learning_rate LEARNING_RATE
                        Learning rate for training
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of training epochs
  --weight_decay WEIGHT_DECAY
                        Weight decay for optimization
  --eval_steps EVAL_STEPS
                        Number of steps between evaluations
  --save_steps SAVE_STEPS
                        Number of steps between saving the model
  --per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE
                        Training batch size per device
  --per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE
                        Evaluation batch size per device
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of steps to accumulate gradients before updating
  --save_total_limit SAVE_TOTAL_LIMIT
                        Maximum number of saved models
  --metric_for_best_model METRIC_FOR_BEST_MODEL
                        Metric used to compare model performance
  --output_dir OUTPUT_DIR
                        Output directory
```

Example:

```bash
python train_script.py \
    --apply_extended_embeddings \
    --extended_embeddings_gazetteers_path "/home/xstromp/dp/data/gazetteers/single_lemTrue_czTrue_skTrue.json" \
    --method_for_gazetteers_matching "single" \
    --apply_lemmatizing \
    --cnec_dataset_dir_path "/nlp/projekty/gazetteer_ner/cnec2.0/data/xml" \
    --division_to_BI_tags \
    --model_name "ufal/robeczech-base" \
    --learning_rate 3e-5 \
    --num_train_epochs 10 \
    --weight_decay 0.01 \
    --eval_steps 50 \
    --save_steps 50 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --save_total_limit 3 \
    --metric_for_best_model "eval_avg_f1" \
    --output_dir "/nlp/projekty/gazetteer_ner/models/model_single_lemTrue_czTrue_skTrue"
```
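The `--method_for_gazetteers_matching` option selects between single-token and multi-token matching of input tokens against the gazetteer lists. As a rough illustration of the `single` method, the sketch below flags each token that appears in a gazetteer; the JSON layout (entity label mapped to a list of entries) and the function names are assumptions for illustration, not the repository's actual API.

```python
import json

def load_gazetteers(path):
    # Assumed layout: {"PER": ["Jan", "Novák", ...], "LOC": ["Brno", ...], ...}
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # Lowercase entries once so lookups are case-insensitive.
    return {label: {e.lower() for e in entries} for label, entries in raw.items()}

def match_tokens_single(tokens, gazetteers):
    """Hypothetical single-token matching: each token is looked up on its
    own, ignoring multi-word gazetteer entries."""
    return [
        [label for label, entries in gazetteers.items() if token.lower() in entries]
        for token in tokens
    ]

# Usage sketch:
# gaz = load_gazetteers("single_lemTrue_czTrue_skTrue.json")
# match_tokens_single(["Jan", "Novák", "navštívil", "Brno"], gaz)
# -> e.g. [["PER"], ["PER"], [], ["LOC"]]
```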
### Eval model

```bash
python eval_script.py [OPTIONS]
```

```
Evaluate model options:
  -h, --help            show this help message and exit
  --path_to_model PATH_TO_MODEL
                        Path to the saved model that you want to evaluate
  --model_name MODEL_NAME
                        Tokenizer
  --apply_extended_embeddings
                        Use gazetteer info in the model
  --extended_embeddings_gazetteers_path EXTENDED_EMBEDDINGS_GAZETTEERS_PATH
                        Path to gazetteers for matching
  --method_for_gazetteers_matching {single,multi}
                        Method for gazetteer matching
  --apply_lemmatizing   Whether to apply lemmatizing

Dataset arguments:
  --path_to_tokenized_dataset PATH_TO_TOKENIZED_DATASET
                        Path to a tokenized dataset (.hf)
  --cnec_dataset_dir_path CNEC_DATASET_DIR_PATH
                        Path to the CNEC dataset
  --contain_only_label_sentences
                        Whether to use only sentences that contain labels
  --division_to_BI_tags
                        Whether to divide tags into BI format
```

Example:

```bash
python eval_script.py \
    --path_to_model "/nlp/projekty/gazetteer_ner/models/now_single_lemTrue_czTrue_skTrue" \
    --model_name "ufal/robeczech-base" \
    --apply_extended_embeddings \
    --extended_embeddings_gazetteers_path "/home/xstromp/dp/data/gazetteers/single_lemTrue_czTrue_skTrue.json" \
    --method_for_gazetteers_matching "single" \
    --apply_lemmatizing \
    --cnec_dataset_dir_path "/nlp/projekty/gazetteer_ner/cnec2.0/data/xml" \
    --division_to_BI_tags
```

# Data sources

### CNEC

```bash
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0023-1B22-8{/Czech_Named_Entity_Corpus_2.0.zip}
```

### WikiANN

```bash
curl --remote-name https://s3.amazonaws.com/datasets.huggingface.co/wikiann/1.1.0/panx_dataset.zip
```

### Tagger

```bash
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4794{/czech-morfflex2.0-pdtc1.0-220710.zip}
```
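Once the data is downloaded and a model trained, the checkpoint in `--output_dir` can also be loaded directly for quick inference. A minimal sketch, assuming the model was trained *without* extended embeddings (checkpoints trained with `--apply_extended_embeddings` need the custom classes from `extended_embeddings/`); the checkpoint path is a placeholder:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

checkpoint = "/path/to/output_dir"  # placeholder: your trained model directory
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

# Merge subword predictions back into word-level entity spans.
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Jan Novák navštívil Brno."))
```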