# File management

```
ner/
├── data_manipulation/
│   ├── creation_gazetteers.py    # functions used for creating and preprocessing gazetteers
│   ├── dataset_functions.py      # functions used for dataset loading and manipulation
│   ├── expand_with_declension.py # multiprocessed expansion of the train dataset with declension
│   └── process_ruin_data.ipynb   # helper notebook for processing ruin data
├── declension/
│   └── declension.py             # code copied from https://nlp.fi.muni.cz/projekty/declension/index.py
├── extended_embeddings/
│   ├── extended_embeddings_data_collator.py # data collator adjusted for extended embeddings
│   ├── extended_embeddings_model.py         # RoBERTa model adjusted for extended embeddings
│   └── extended_embeddings_tokenizer.py     # token classification tokenizer adjusted for extended embeddings
├── .gitignore
├── create_gazetteers.py          # creates gazetteers based on a config
├── eval_script.py                # script for evaluating a trained model
├── evaluate_models.ipynb         # evaluation of existing solutions
├── evaluation_functions.py       # functions used for model evaluation and gazetteer matching
├── graphs.ipynb                  # helper notebook for creating graphs
├── hyperparameter_search.py      # code to find the best hyperparameters
├── ner_model.py                  # custom model
├── README.md                     # this file
├── requirements.txt
├── test_gpt.py                   # used for testing GPT-3.5 Turbo via the API (token not provided)
└── train_script.py               # script for training the model
```

# Run

### Installation

To install the required dependencies, run:

```bash
pip install -r requirements.txt
```

### Train model

```bash
python train_script.py [OPTIONS]
```

```
options:
  -h, --help            show this help message and exit

Generated additional train data arguments:
  --expand_train_data   Whether to expand the training data
  --train_gazetteers_path TRAIN_GAZETTEERS_PATH
                        Path to gazetteers for training data expansion
  --gazetteers_counter GAZETTEERS_COUNTER
                        Number of gazetteer duplicates
  --apply_delusion      Whether to apply declension expansion
  --tagger_path TAGGER_PATH
                        Path to tagger (needed only if apply_delusion is True)

Extended embeddings arguments:
  --apply_extended_embeddings
                        Apply extended embeddings in the model
  --extended_embeddings_gazetteers_path EXTENDED_EMBEDDINGS_GAZETTEERS_PATH
                        Path to gazetteers for matching
  --method_for_gazetteers_matching {single,multi}
                        Method for gazetteer matching
  --apply_lemmatizing   Whether to apply lemmatizing

Dataset arguments:
  --path_to_tokenized_dataset PATH_TO_TOKENIZED_DATASET
                        Path to a tokenized dataset (.hf)
  --cnec_dataset_dir_path CNEC_DATASET_DIR_PATH
                        Path to the CNEC dataset
  --contain_only_label_sentences
                        Whether to use only sentences that contain labels
  --division_to_BI_tags
                        Whether to divide tags into BI format

Training arguments:
  --model_name MODEL_NAME
                        Tokenizer
  --learning_rate LEARNING_RATE
                        Learning rate for training
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of training epochs
  --weight_decay WEIGHT_DECAY
                        Weight decay for optimization
  --eval_steps EVAL_STEPS
                        Number of steps between evaluations
  --save_steps SAVE_STEPS
                        Number of steps between saving the model
  --per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE
                        Training batch size per device
  --per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE
                        Evaluation batch size per device
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of steps to accumulate gradients before updating
  --save_total_limit SAVE_TOTAL_LIMIT
                        Maximum number of saved models
  --metric_for_best_model METRIC_FOR_BEST_MODEL
                        Metric used to compare model performance
  --output_dir OUTPUT_DIR
                        Output directory
```

Example:

```bash
python train_script.py \
    --apply_extended_embeddings \
    --extended_embeddings_gazetteers_path "/home/xstromp/dp/data/gazetteers/single_lemTrue_czTrue_skTrue.json" \
    --method_for_gazetteers_matching "single" \
    --apply_lemmatizing \
    --cnec_dataset_dir_path "/nlp/projekty/gazetteer_ner/cnec2.0/data/xml" \
    --division_to_BI_tags \
    --model_name "ufal/robeczech-base" \
    --learning_rate 3e-5 \
    --num_train_epochs 10 \
    --weight_decay 0.01 \
    --eval_steps 50 \
    --save_steps 50 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --save_total_limit 3 \
    --metric_for_best_model "eval_avg_f1" \
    --output_dir "/nlp/projekty/gazetteer_ner/models/model_single_lemTrue_czTrue_skTrue"
```
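The `--method_for_gazetteers_matching` option selects between single-token and multi-token matching of input tokens against the gazetteer lists. As a rough illustration of the `single` method, the sketch below flags each token that appears in a gazetteer; the JSON layout (entity label mapped to a list of entries) and the function names are assumptions for illustration, not the repository's actual API.

```python
import json

def load_gazetteers(path):
    # Assumed layout: {"PER": ["Jan", "Novák", ...], "LOC": ["Brno", ...], ...}
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # Lowercase entries once so lookups are case-insensitive.
    return {label: {e.lower() for e in entries} for label, entries in raw.items()}

def match_tokens_single(tokens, gazetteers):
    """Hypothetical single-token matching: each token is looked up on its
    own, ignoring multi-word gazetteer entries."""
    return [
        [label for label, entries in gazetteers.items() if token.lower() in entries]
        for token in tokens
    ]

# Usage sketch:
# gaz = load_gazetteers("single_lemTrue_czTrue_skTrue.json")
# match_tokens_single(["Jan", "Novák", "navštívil", "Brno"], gaz)
# -> e.g. [["PER"], ["PER"], [], ["LOC"]]
```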
### Eval model

```bash
python eval_script.py [OPTIONS]
```

```
Evaluate model options:
  -h, --help            show this help message and exit
  --path_to_model PATH_TO_MODEL
                        Path to the saved model that you want to evaluate
  --model_name MODEL_NAME
                        Tokenizer
  --apply_extended_embeddings
                        Use gazetteer info in the model
  --extended_embeddings_gazetteers_path EXTENDED_EMBEDDINGS_GAZETTEERS_PATH
                        Path to gazetteers for matching
  --method_for_gazetteers_matching {single,multi}
                        Method for gazetteer matching
  --apply_lemmatizing   Whether to apply lemmatizing

Dataset arguments:
  --path_to_tokenized_dataset PATH_TO_TOKENIZED_DATASET
                        Path to a tokenized dataset (.hf)
  --cnec_dataset_dir_path CNEC_DATASET_DIR_PATH
                        Path to the CNEC dataset
  --contain_only_label_sentences
                        Whether to use only sentences that contain labels
  --division_to_BI_tags
                        Whether to divide tags into BI format
```

Example:

```bash
python eval_script.py \
    --path_to_model "/nlp/projekty/gazetteer_ner/models/now_single_lemTrue_czTrue_skTrue" \
    --model_name "ufal/robeczech-base" \
    --apply_extended_embeddings \
    --extended_embeddings_gazetteers_path "/home/xstromp/dp/data/gazetteers/single_lemTrue_czTrue_skTrue.json" \
    --method_for_gazetteers_matching "single" \
    --apply_lemmatizing \
    --cnec_dataset_dir_path "/nlp/projekty/gazetteer_ner/cnec2.0/data/xml" \
    --division_to_BI_tags
```

# Data sources

### CNEC

```bash
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0023-1B22-8{/Czech_Named_Entity_Corpus_2.0.zip}
```

### WikiANN

```bash
curl --remote-name https://s3.amazonaws.com/datasets.huggingface.co/wikiann/1.1.0/panx_dataset.zip
```

### Tagger

```bash
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4794{/czech-morfflex2.0-pdtc1.0-220710.zip}
```
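Once the data is downloaded and a model trained, the checkpoint in `--output_dir` can also be loaded directly for quick inference. A minimal sketch, assuming the model was trained *without* extended embeddings (checkpoints trained with `--apply_extended_embeddings` need the custom classes from `extended_embeddings/`); the checkpoint path is a placeholder:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

checkpoint = "/path/to/output_dir"  # placeholder: your trained model directory
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

# Merge subword predictions back into word-level entity spans.
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Jan Novák navštívil Brno."))
```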