#Training NER model from BERT + WikiAnn

In this colab, we will use the WikiAnn corpora. WikiAnn is annotated from Wikipedia pages and their categories. Check the WikiAnn paper at https://aclanthology.org/P17-1178.pdf

WikiAnn was used for transfer learning of NER from well-resourced languages to under-resourced languages. Check the code repository of that work at https://github.com/afshinrahimi/mmner

The WikiAnn corpora are described at https://huggingface.co/datasets/wikiann

For training, we will fine-tune a pretrained BERT model on the downstream NER task.

Both the model and the dataset are hosted on huggingface, so we will use the huggingface modules `datasets` and `tokenizers` for training, `seqeval` for evaluation, and `transformers` for prediction.


In [None]:
!pip install datasets
!pip install tokenizers
!pip install transformers
!pip install seqeval
!pip install evaluate

Make sure we are using the GPU. If the GPU is not set up, go to `Runtime`/`Change runtime type` and select `GPU`.

In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

Here, we load the WikiAnn corpus. We can use the huggingface `datasets` module. Check https://huggingface.co/datasets/wikiann for available languages and data sizes.

In [None]:
from datasets import load_dataset

dataset = load_dataset("wikiann", "sk")

Loading the WikiAnn dataset gives us a `DatasetDict`. The underlying data is available under `DatasetDict.data`; however, we will mostly work with the dictionary itself.

In [None]:
type(dataset)

In [None]:
dataset

In [None]:
type(dataset['train'])

In [None]:
dataset["train"].features

In [None]:
label_names = dataset["train"].features["ner_tags"].feature.names
label_names

**TASK 1**: Display some examples in your language to get familiar with the WikiAnn data. Write down some examples and your observations.

In [None]:
example_no = 405
dataset.data['train']['tokens'][example_no]

In [None]:
dataset.data['train']['ner_tags'][example_no]

In [None]:
dataset['train'][example_no]
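
When inspecting examples, it can help to see each token next to its tag name instead of the numeric tag ID. The following is a minimal sketch on a made-up row; the label order in `assumed_label_names` and the sentence in `example` are assumptions for illustration, so in the notebook use `label_names` and a real row from `dataset['train']`.

```python
# Assumed label order; in the notebook, use `label_names` instead.
assumed_label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Hypothetical row in the shape of one WikiAnn example.
example = {
    "tokens": ["Liptovský", "Mikuláš", "leží", "na", "Slovensku", "."],
    "ner_tags": [5, 6, 0, 0, 5, 0],
}

def pair_tokens_with_tags(row, names):
    """Return (token, tag name) pairs for one row."""
    return [(tok, names[tag]) for tok, tag in zip(row["tokens"], row["ner_tags"])]

pairs = pair_tokens_with_tags(example, assumed_label_names)
for token, tag in pairs:
    print(f"{token:<12}{tag}")
```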

Next, we have to use *the same tokenizer* as the pretrained model. Different tokenizers can split sentences in different ways, but we need the data to be split exactly as it was when the model was pretrained.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")


**TASK 2**: Check tokenization on a few sentences in your language. Write down your observations.

In [None]:
text = "JA som tu! Bývám v Liptovskom Mikuláši."
tokenized = tokenizer(text)
tokenized

In [None]:
tokenizer.tokenize(text)
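
BERT tokenizers split words with WordPiece, which greedily takes the longest vocabulary entry that matches at the current position and marks word-internal pieces with `##`. The sketch below illustrates the idea on a tiny vocabulary; `toy_vocab` is a hypothetical assumption and is not the real vocabulary of `bert-base-multilingual-cased`, so the actual splits in the notebook may differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only.
toy_vocab = {"Lip", "##tov", "##sk", "##om"}
print(wordpiece("Liptovskom", toy_vocab))  # ['Lip', '##tov', '##sk', '##om']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```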

The tokens are converted to token IDs; later, the data collator converts these to tensors.

We can see that the tokens are often smaller units than words. However, we have NER tags for words. The next function propagates the class (the NER tag) of each word to all of its subwords.

The code is copied from https://www.freecodecamp.org/news/getting-started-with-ner-models-using-huggingface/

In [None]:
def tokenize_adjust_labels(all_samples_per_split):
  # Tokenize the pre-split words; truncation is requested explicitly so that max_length is enforced.
  tokenized_samples = tokenizer(all_samples_per_split["tokens"], is_split_into_words=True, truncation=True, max_length=50)

  # tokenized_samples is not a datasets object, so it cannot be passed to the Trainer API directly.
  # The function is therefore applied through map, which adds the new keys
  # (input_ids, attention_mask, labels after adjustment, ...)
  # to each of the train, test, and validation splits.
  total_adjusted_labels = []
  for k in range(len(tokenized_samples["input_ids"])):
    prev_wid = -1
    word_ids_list = tokenized_samples.word_ids(batch_index=k)
    existing_label_ids = all_samples_per_split["ner_tags"][k]
    i = -1  # index of the current word in existing_label_ids
    adjusted_label_ids = []

    for wid in word_ids_list:
      if wid is None:
        # Special tokens ([CLS], [SEP]) get -100, which the loss ignores.
        adjusted_label_ids.append(-100)
      elif wid != prev_wid:
        # First subword of a new word: move on to the next word label.
        i = i + 1
        adjusted_label_ids.append(existing_label_ids[i])
        prev_wid = wid
      else:
        # Further subwords of the same word repeat the word's label.
        adjusted_label_ids.append(existing_label_ids[i])

    total_adjusted_labels.append(adjusted_label_ids)
  tokenized_samples["labels"] = total_adjusted_labels
  return tokenized_samples

tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True)
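
The label adjustment can be checked on a small example without running the tokenizer. The sketch below is a simplified variant that looks up each label directly by word index, which gives the same result as the counter used above; the `word_ids` list and the label IDs are hypothetical values chosen for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give every subword the label of its word; special tokens get ignore_index."""
    return [ignore_index if wid is None else word_labels[wid] for wid in word_ids]

# Hypothetical word IDs for a two-word input, each word split into three subwords,
# with one special token at the start and one at the end.
word_ids = [None, 0, 0, 0, 1, 1, 1, None]
word_labels = [5, 6]  # hypothetical tag IDs, one per word

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 5, 5, 5, 6, 6, 6, -100]
```

Note that this scheme repeats a B- tag on every subword of the first word of an entity, exactly as the function above does.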

The method above adds `input_ids`, `token_type_ids`, and `attention_mask` from the tokenizer, together with the adjusted `labels`, to the dataset. These will be used later by the trainer. In contrast, `ner_tags`, `langs`, `tokens`, and `spans` won't be used.

In [None]:
example_no = 1
dataset['train'][example_no].keys(), tokenized_dataset['train'][example_no].keys()

For training, we need the samples in a batch to be the same length. Simple padding can be done by the tokenizer itself. Example code:

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)

The data collator assembles the input samples into batches, applying processing such as padding. By default, `DataCollatorForTokenClassification` pads each batch to the length of its longest sample and pads the labels with -100. Depending on your data and available memory, it may be useful to pad to a fixed length instead by passing `padding="max_length"` together with `max_length` to the `DataCollatorForTokenClassification`. See the documentation for more details: https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForTokenClassification

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
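
To make the effect of padding on token classification data concrete, here is a minimal pure-Python sketch of padding a batch to its longest sample. It is not the collator's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and the pad token ID of 0 is an assumption (in the notebook, check `tokenizer.pad_token_id`).

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask, and labels to the longest sample in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n = len(f["input_ids"])
        missing = longest - n
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * missing)
        # Real tokens are marked with 1, padding with 0.
        batch["attention_mask"].append([1] * n + [0] * missing)
        # Padded positions get -100 so that they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * missing)
    return batch

# Hypothetical tokenized samples of different lengths.
features = [
    {"input_ids": [101, 11, 12, 13, 102], "labels": [-100, 5, 5, 6, -100]},
    {"input_ids": [101, 21, 102], "labels": [-100, 1, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])       # [101, 21, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(batch["labels"][1])          # [-100, 1, -100, -100, -100]
```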

Next, we load the BERT model. For languages other than English, the multilingual model is the most suitable. For scripts that distinguish upper case and lower case, the cased model is more suitable. We load the pretrained weights so that further training does not require much data. Even though the base model has no NER classification layer (a new token classification head is initialized on top of it), the NER classification task still benefits from the pretrained weights.

In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

In [None]:
model = AutoModelForTokenClassification.from_pretrained("bert-base-multilingual-cased",
                                                        num_labels=len(label_names))

The next step is to set the metric. Since NER is a task on sequences, we use `seqeval` for evaluation. `seqeval` is able to work with different IOB schemes. By default, it evaluates leniently: an entity that starts with an I-tag instead of a B-tag is still counted as a true positive (e.g. York is predicted as I-LOC but should be B-LOC, and it is still considered correct). More information is available here: https://huggingface.co/spaces/evaluate-metric/seqeval

The method below skips the positions labeled -100 (special tokens and padding), and it flattens the results to make them more readable. The code is copied from https://www.freecodecamp.org/news/getting-started-with-ner-models-using-huggingface/

In [None]:
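import numpy as np
try:
    # Newer setups load metrics through the separate `evaluate` package.
    import evaluate
    metric = evaluate.load("seqeval")
except ImportError:
    # Older `datasets` releases still provide load_metric.
    from datasets import load_metric
    metric = load_metric("seqeval")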
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    flattened_results = {
        "overall_precision": results["overall_precision"],
        "overall_recall": results["overall_recall"],
        "overall_f1": results["overall_f1"],
        "overall_accuracy": results["overall_accuracy"],
    }
    for k in results.keys():
      if(k not in flattened_results.keys()):
        flattened_results[k+"_f1"]=results[k]["f1"]

    return flattened_results
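
The lenient, entity-level scoring described above can be illustrated with a small pure-Python sketch. This is not `seqeval`'s implementation; `extract_entities` and `entity_f1` are hypothetical helpers written under the assumption that an I-tag which does not continue an entity of the same type opens a new entity.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a tag sequence, reading I-tags leniently."""
    entities = []
    start, current = None, None
    # A trailing "O" closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            entities.append((current, start, i - 1))
            start, current = None, None
        if prefix in ("B", "I") and not continues:
            start, current = i, etype
    return entities

def entity_f1(true_tags, pred_tags):
    """F1 over exact span matches between gold and predicted entities."""
    gold = set(extract_entities(true_tags))
    pred = set(extract_entities(pred_tags))
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["O", "B-LOC", "I-LOC", "O", "B-PER"]
pred = ["O", "I-LOC", "I-LOC", "O", "O"]
print(extract_entities(gold))  # [('LOC', 1, 2), ('PER', 4, 4)]
print(extract_entities(pred))  # [('LOC', 1, 2)]
print(entity_f1(gold, pred))
```

In this example, the predicted LOC entity starts with I-LOC, yet it yields the same span as the gold entity and is counted as correct; only the missed PER entity lowers recall.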

##Training
Run the training with the parameters below.
**TASK 3**: Observe the training step results. Does the model improve? We train for only one epoch. What would you expect if the number of epochs increased?

In [None]:
training_args = TrainingArguments(
    output_dir="./fine_tune_bert_output",
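    evaluation_strategy="steps",  # evaluate every eval_steps (defaults to logging_steps); newer transformers releases call this eval_strategy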
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1000,
)
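
To get a feeling for how often results will be reported, the number of optimization steps can be estimated from the dataset size and the batch size. The sketch below assumes a single device and no gradient accumulation; `num_train_examples` is a hypothetical value, so in the notebook use `len(tokenized_dataset["train"])` instead.

```python
import math

num_train_examples = 20_000        # hypothetical size of the training split
per_device_train_batch_size = 2    # as in the training arguments
num_train_epochs = 1
logging_steps = 1000

# One optimization step consumes one batch.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
# Number of times the loss is logged during training.
num_reports = total_steps // logging_steps

print(steps_per_epoch, total_steps, num_reports)  # 10000 10000 10
```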

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

In [None]:
out_dir = './bert_ner'
trainer.save_model(out_dir)

##Predictions
Load the fine-tuned model into the pipeline and run prediction on sentences in your language.

**TASK 4**: Write down some observations. Where does the model perform well? How does it deal with rare or out-of-vocabulary (OOV) words?

In [None]:
import transformers

In [None]:
token_classifier = transformers.pipeline(
    "token-classification", model="./bert_ner", aggregation_strategy="first"
)

In [None]:
label_names

In [None]:
text = "Ja som Katka z Blavy. Bola som na Slovensku, no teraz som v Brne."
token_classifier(text)
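
Since the model was loaded with only `num_labels`, the pipeline may report generic group names such as `LABEL_5` rather than the tag names, which is why `label_names` is displayed above. The following sketch maps such names back to tags; `rename_groups` is a hypothetical helper, and the label order and the predictions (including the scores) are made-up assumptions for illustration.

```python
# Assumed label order; in the notebook, use `label_names` instead.
assumed_label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Hypothetical pipeline output in the shape returned with an aggregation strategy.
hypothetical_output = [
    {"entity_group": "LABEL_1", "score": 0.98, "word": "Katka", "start": 7, "end": 12},
    {"entity_group": "LABEL_5", "score": 0.95, "word": "Brne", "start": 60, "end": 64},
]

def rename_groups(predictions, names):
    """Replace generic LABEL_<n> group names with the corresponding tag names."""
    renamed = []
    for pred in predictions:
        group = pred["entity_group"]
        if group.startswith("LABEL_"):
            group = names[int(group.split("_")[1])]
        renamed.append({**pred, "entity_group": group})
    return renamed

readable = rename_groups(hypothetical_output, assumed_label_names)
for pred in readable:
    print(pred["word"], pred["entity_group"])
```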