private/NlpInPracticeCourse/NamedEntityRecognition: Train_your_own_BERT_NER.ipynb

File Train_your_own_BERT_NER.ipynb, 17.9 KB (added by Zuzana Nevěřilová, 7 months ago)
1{
2  "nbformat": 4,
3  "nbformat_minor": 0,
4  "metadata": {
5    "colab": {
6      "provenance": []
7    },
8    "kernelspec": {
9      "name": "python3",
10      "display_name": "Python 3"
11    },
12    "language_info": {
13      "name": "python"
14    },
15    "accelerator": "GPU",
16    "gpuClass": "standard"
17  },
18  "cells": [
19    {
20      "cell_type": "markdown",
21      "source": [
22        "# Training a NER model from BERT + WikiAnn\n",
23        "\n",
24        "In this colab, we will use the WikiAnn corpora. WikiAnn is annotated from Wikipedia pages and their categories. See the WikiAnn paper at https://aclanthology.org/P17-1178.pdf\n",
25        "\n",
26        "WikiAnn was used for transfer learning of NER from well-resourced languages to under-resourced languages. See the paper's repository at https://github.com/afshinrahimi/mmner\n",
27        "\n",
28        "The WikiAnn corpora are described at https://huggingface.co/datasets/wikiann\n",
29        "\n",
30        "For training, we will fine-tune a pretrained BERT model on this downstream task.\n",
31        "\n",
32        "Both the model and the dataset are hosted on Hugging Face, so we will use the `datasets` and `tokenizers` modules for training, `seqeval` for evaluation, and `transformers` for prediction.\n"
33      ],
34      "metadata": {
35        "id": "pmBLtRRTWEWm"
36      }
37    },
38    {
39      "cell_type": "code",
40      "execution_count": null,
41      "metadata": {
42        "id": "llooKX2-WCRm"
43      },
44      "outputs": [],
45      "source": [
46        "!pip install datasets\n",
47        "!pip install tokenizers\n",
48        "!pip install transformers\n",
49        "!pip install seqeval"
50      ]
51    },
52    {
53      "cell_type": "markdown",
54      "source": [
55        "Make sure we are using the GPU. If the GPU is not set up, go to `Runtime`/`Change runtime type` and select `GPU`."
56      ],
57      "metadata": {
58        "id": "_53cGbGEE-TA"
59      }
60    },
61    {
62      "cell_type": "code",
63      "source": [
64        "import tensorflow as tf\n",
65        "tf.test.gpu_device_name()"
66      ],
67      "metadata": {
68        "id": "7_EUzq-boboJ"
69      },
70      "execution_count": null,
71      "outputs": []
72    },
73    {
74      "cell_type": "markdown",
75      "source": [
76        "Here, we load the WikiAnn corpus using the Hugging Face `datasets` module. Check https://huggingface.co/datasets/wikiann for available languages and data sizes."
77      ],
78      "metadata": {
79        "id": "kXd9oSPEFNVa"
80      }
81    },
82    {
83      "cell_type": "code",
84      "source": [
85        "from datasets import load_dataset\n",
86        "\n",
87        "dataset = load_dataset(\"wikiann\", \"sk\")"
88      ],
89      "metadata": {
90        "id": "Vw-CwOigYSPE"
91      },
92      "execution_count": null,
93      "outputs": []
94    },
95    {
96      "cell_type": "markdown",
97      "source": [
98        "By loading the WikiAnn dataset, we obtain a `DatasetDict`. The underlying data is available under `DatasetDict.data`; however, we will mostly work with the dictionary itself."
99      ],
100      "metadata": {
101        "id": "xN2cYTFwFmtA"
102      }
103    },
104    {
105      "cell_type": "code",
106      "source": [
107        "type(dataset)"
108      ],
109      "metadata": {
110        "id": "9974i-vSY6YT"
111      },
112      "execution_count": null,
113      "outputs": []
114    },
115    {
116      "cell_type": "code",
117      "source": [
118        "dataset"
119      ],
120      "metadata": {
121        "id": "MVnPJl2TYa7i"
122      },
123      "execution_count": null,
124      "outputs": []
125    },
126    {
127      "cell_type": "code",
128      "source": [
129        "type(dataset['train'])"
130      ],
131      "metadata": {
132        "id": "6877QCS5g9Ps"
133      },
134      "execution_count": null,
135      "outputs": []
136    },
137    {
138      "cell_type": "code",
139      "source": [
140        "dataset[\"train\"].features"
141      ],
142      "metadata": {
143        "id": "6jBiyLRHYeax"
144      },
145      "execution_count": null,
146      "outputs": []
147    },
148    {
149      "cell_type": "code",
150      "source": [
151        "label_names = dataset[\"train\"].features[\"ner_tags\"].feature.names\n",
152        "label_names"
153      ],
154      "metadata": {
155        "id": "OwrH1k1OYaD1"
156      },
157      "execution_count": null,
158      "outputs": []
159    },
160    {
161      "cell_type": "markdown",
162      "source": [
163        "**TASK 1**: Display some examples in your language to get familiar with the WikiAnn data. Write down some examples and your observations."
164      ],
165      "metadata": {
166        "id": "52f-Gm5bF41i"
167      }
168    },
169    {
170      "cell_type": "code",
171      "source": [
172        "example_no = 405\n",
173        "dataset.data['train']['tokens'][example_no]"
174      ],
175      "metadata": {
176        "id": "DdBWVGwGZLXi"
177      },
178      "execution_count": null,
179      "outputs": []
180    },
181    {
182      "cell_type": "code",
183      "source": [
184        "dataset.data['train']['ner_tags'][example_no]"
185      ],
186      "metadata": {
187        "id": "30ycCdcwfyDz"
188      },
189      "execution_count": null,
190      "outputs": []
191    },
192    {
193      "cell_type": "code",
194      "source": [
195        "dataset['train'][example_no]"
196      ],
197      "metadata": {
198        "id": "LDVEk2Kwkhjm"
199      },
200      "execution_count": null,
201      "outputs": []
202    },
203    {
204      "cell_type": "markdown",
205      "source": [
206        "Next, we have to use *the same tokenizer* as the pretrained model. Different tokenizers can split sentences in different ways, but we need the data to be split exactly the way it was split when the model was pretrained."
207      ],
208      "metadata": {
209        "id": "_5Vz3KeVGlJt"
210      }
211    },
212    {
213      "cell_type": "code",
214      "source": [
215        "from transformers import AutoTokenizer\n",
216        "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-multilingual-cased\")\n"
217      ],
218      "metadata": {
219        "id": "FD_U6dMtZh5e"
220      },
221      "execution_count": null,
222      "outputs": []
223    },
224    {
225      "cell_type": "markdown",
226      "source": [
227        "**TASK 2**: Check the tokenization on a few sentences in your language. Write down your observations."
228      ],
229      "metadata": {
230        "id": "xyHF_msoG82n"
231      }
232    },
233    {
234      "cell_type": "code",
235      "source": [
236        "text = \"JA som tu! Bývám v Liptovskom Mikuláši.\"\n",
237        "tokenized = tokenizer(text)\n",
238        "tokenized"
239      ],
240      "metadata": {
241        "id": "V2EEYO6SZyH_"
242      },
243      "execution_count": null,
244      "outputs": []
245    },
246    {
247      "cell_type": "code",
248      "source": [
249        "tokenizer.tokenize(text)"
250      ],
251      "metadata": {
252        "id": "TA_hOGIAyYkY"
253      },
254      "execution_count": null,
255      "outputs": []
256    },
257    {
258      "cell_type": "markdown",
259      "source": [
260        "The tokens are converted to token IDs, and these are converted to tensors.\n",
261        "\n",
262        "We can see that the tokens are often smaller units than words. However, we have NER tags only for whole words. The next function copies each word's NER tag to all subwords of that word and assigns -100 to special tokens so that they are ignored during training.\n",
263        "\n",
264        "The code is copied from https://www.freecodecamp.org/news/getting-started-with-ner-models-using-huggingface/"
265      ],
266      "metadata": {
267        "id": "baqEwuB1HJvP"
268      }
269    },
270    {
271      "cell_type": "code",
272      "source": [
273        "def tokenize_adjust_labels(all_samples_per_split):\n",
274        "  tokenized_samples = tokenizer.batch_encode_plus(all_samples_per_split[\"tokens\"], is_split_into_words=True, truncation=True, max_length=50)\n",
275        "\n",
276        "  #tokenized_samples is not a datasets object so this alone won't work with Trainer API, hence map is used\n",
277        "  #so the new keys [input_ids, labels (after adjustment)]\n",
278        "  #can be added to the datasets dict for each train test validation split\n",
279        "  total_adjusted_labels = []\n",
280        "  print(len(tokenized_samples[\"input_ids\"]))\n",
281        "  for k in range(0, len(tokenized_samples[\"input_ids\"])):\n",
282        "    prev_wid = -1\n",
283        "    word_ids_list = tokenized_samples.word_ids(batch_index=k)\n",
284        "    existing_label_ids = all_samples_per_split[\"ner_tags\"][k]\n",
285        "    # word_ids() maps each subword to the index of its source word (None for special tokens)\n",
286        "    adjusted_label_ids = []\n",
287        "\n",
288        "    for wid in word_ids_list:\n",
289        "      if(wid is None):\n",
290        "        adjusted_label_ids.append(-100)\n",
291        "      elif(wid!=prev_wid):\n",
292        "        # first subword of a word: index the tag by word id, so a word that yields no subwords cannot shift the labels\n",
293        "        adjusted_label_ids.append(existing_label_ids[wid])\n",
294        "        prev_wid = wid\n",
295        "      else:\n",
296        "        # continuation subword: repeat the tag of its word\n",
297        "        adjusted_label_ids.append(existing_label_ids[wid])\n",
298        "\n",
299        "    total_adjusted_labels.append(adjusted_label_ids)\n",
300        "  tokenized_samples[\"labels\"] = total_adjusted_labels\n",
301        "  return tokenized_samples\n",
302        "\n",
303        "tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True)"
304      ],
305      "metadata": {
306        "id": "RfnzuDNOh_5y"
307      },
308      "execution_count": null,
309      "outputs": []
310    },
311    {
312      "cell_type": "markdown",
313      "source": [
314        "The function above adds `input_ids`, `token_type_ids`, and `attention_mask` from the tokenizer, together with the adjusted `labels`, to the dataset. These will be used later by the trainer. In contrast, `ner_tags`, `langs`, `tokens`, and `spans` won't be used."
315      ],
316      "metadata": {
317        "id": "Br3CIPkbIVtJ"
318      }
319    },
320    {
321      "cell_type": "code",
322      "source": [
323        "example_no = 1\n",
324        "dataset['train'][example_no].keys(), tokenized_dataset['train'][example_no].keys()"
325      ],
326      "metadata": {
327        "id": "_8U7Z6KliXqb"
328      },
329      "execution_count": null,
330      "outputs": []
331    },
332    {
333      "cell_type": "markdown",
334      "source": [
335        "For training, the samples in a batch need to have the same length. Simple padding can be done by the tokenizer itself. Example code:"
336      ],
337      "metadata": {
338        "id": "mKexuW4kH1EN"
339      }
340    },
341    {
342      "cell_type": "code",
343      "source": [
344        "batch_sentences = [\n",
345        "    \"But what about second breakfast?\",\n",
346        "    \"Don't think he knows about second breakfast, Pip.\",\n",
347        "    \"What about elevensies?\",\n",
348        "]\n",
349        "encoded_input = tokenizer(batch_sentences, padding=True)\n",
350        "print(encoded_input)"
351      ],
352      "metadata": {
353        "id": "SEmASRFzIA8U"
354      },
355      "execution_count": null,
356      "outputs": []
357    },
358    {
359      "cell_type": "markdown",
360      "source": [
361        "A data collator assembles the input samples into batches, optionally applying processing such as padding. By default, `DataCollatorForTokenClassification` pads the inputs and their labels to the longest sample in the batch. Depending on your data and available memory, it may be useful to set its `padding` and `max_length` parameters. See the documentation for more details: https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForTokenClassification"
362      ],
363      "metadata": {
364        "id": "E7FEvqXRDl92"
365      }
366    },
367    {
368      "cell_type": "code",
369      "source": [
370        "from transformers import DataCollatorForTokenClassification\n",
371        "\n",
372        "data_collator = DataCollatorForTokenClassification(tokenizer)"
373      ],
374      "metadata": {
375        "id": "r5CqBtRfizti"
376      },
377      "execution_count": null,
378      "outputs": []
379    },
380    {
381      "cell_type": "markdown",
382      "source": [
383        "Next, we load the BERT model. For languages other than English, the multilingual model is the most suitable. For scripts that distinguish upper case and lower case, the cased model is more suitable. We load the pretrained weights so that fine-tuning does not require much data. The base model has no NER classification layer, so a new token-classification layer is added on top and trained from scratch, but the task still benefits from the pretrained weights below it."
384      ],
385      "metadata": {
386        "id": "TB-BAykhOTEw"
387      }
388    },
389    {
390      "cell_type": "code",
391      "source": [
392        "from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer"
393      ],
394      "metadata": {
395        "id": "-4C3DCpYCADW"
396      },
397      "execution_count": null,
398      "outputs": []
399    },
400    {
401      "cell_type": "code",
402      "source": [
403        "model = AutoModelForTokenClassification.from_pretrained(\"bert-base-multilingual-cased\",\n",
404        "                                                        num_labels=len(label_names))"
405      ],
406      "metadata": {
407        "id": "_9Ej6579C7kd"
408      },
409      "execution_count": null,
410      "outputs": []
411    },
412    {
413      "cell_type": "markdown",
414      "source": [
415        "The next step is to set the metric. Since NER is a sequence labeling task, we use `seqeval` for evaluation. `seqeval` can work with different IOB schemes. By default, it is lenient about confusing B- and I- tags at the start of an entity (e.g. if York is predicted as I-LOC where B-LOC is expected, the prediction is still considered correct). More information is available here: https://huggingface.co/spaces/evaluate-metric/seqeval\n",
416        "\n",
417        "The function below skips the positions labeled -100 (special tokens) and flattens the results to make them more readable. The code is copied from https://www.freecodecamp.org/news/getting-started-with-ner-models-using-huggingface/"
418      ],
419      "metadata": {
420        "id": "MS5lkLvHJAq8"
421      }
422    },
423    {
424      "cell_type": "code",
425      "source": [
426        "import numpy as np\n",
427        "!pip install -q evaluate  # datasets.load_metric was removed in datasets 3.0; evaluate provides the same metric\n",
428        "import evaluate; metric = evaluate.load(\"seqeval\")\n",
429        "def compute_metrics(p):\n",
430        "    predictions, labels = p\n",
431        "    predictions = np.argmax(predictions, axis=2)\n",
432        "\n",
433        "    # Remove ignored index (special tokens)\n",
434        "    true_predictions = [\n",
435        "        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]\n",
436        "        for prediction, label in zip(predictions, labels)\n",
437        "    ]\n",
438        "    true_labels = [\n",
439        "        [label_names[l] for (p, l) in zip(prediction, label) if l != -100]\n",
440        "        for prediction, label in zip(predictions, labels)\n",
441        "    ]\n",
442        "\n",
443        "    results = metric.compute(predictions=true_predictions, references=true_labels)\n",
444        "    flattened_results = {\n",
445        "        \"overall_precision\": results[\"overall_precision\"],\n",
446        "        \"overall_recall\": results[\"overall_recall\"],\n",
447        "        \"overall_f1\": results[\"overall_f1\"],\n",
448        "        \"overall_accuracy\": results[\"overall_accuracy\"],\n",
449        "    }\n",
450        "    for k in results.keys():\n",
451        "      if(k not in flattened_results.keys()):\n",
452        "        flattened_results[k+\"_f1\"]=results[k][\"f1\"]\n",
453        "\n",
454        "    return flattened_results"
455      ],
456      "metadata": {
457        "id": "BzdRelLblvM0"
458      },
459      "execution_count": null,
460      "outputs": []
461    },
462    {
463      "cell_type": "markdown",
464      "source": [
465        "## Training\n",
466        "Run the training with the parameters below.\n",
467        "**TASK 3**: Observe the training step results. Does the model improve? We train for only one epoch. What would you expect if the number of epochs increased?"
468      ],
469      "metadata": {
470        "id": "t3gcntinPolE"
471      }
472    },
473    {
474      "cell_type": "code",
475      "source": [
476        "training_args = TrainingArguments(\n",
477        "    output_dir=\"./fine_tune_bert_output\",\n",
478        "    evaluation_strategy=\"steps\",\n",
479        "    learning_rate=2e-5,\n",
480        "    per_device_train_batch_size=2,\n",
481        "    per_device_eval_batch_size=2,\n",
482        "    num_train_epochs=1,\n",
483        "    weight_decay=0.01,\n",
484        "    logging_steps=1000,\n",
485        ")"
486      ],
487      "metadata": {
488        "id": "e-YvBlyxgUDN"
489      },
490      "execution_count": null,
491      "outputs": []
492    },
493    {
494      "cell_type": "code",
495      "source": [
496        "trainer = Trainer(\n",
497        "    model=model,\n",
498        "    args=training_args,\n",
499        "    train_dataset=tokenized_dataset[\"train\"],\n",
500        "    eval_dataset=tokenized_dataset[\"validation\"],\n",
501        "    compute_metrics=compute_metrics,\n",
502        "    data_collator=data_collator,\n",
503        "    tokenizer=tokenizer,\n",
504        ")"
505      ],
506      "metadata": {
507        "id": "_laBeyBngXSJ"
508      },
509      "execution_count": null,
510      "outputs": []
511    },
512    {
513      "cell_type": "code",
514      "source": [
515        "trainer.train()"
516      ],
517      "metadata": {
518        "id": "9QJK2WeFhMK_"
519      },
520      "execution_count": null,
521      "outputs": []
522    },
523    {
524      "cell_type": "code",
525      "source": [
526        "out_dir = './bert_ner'\n",
527        "trainer.save_model(out_dir)"
528      ],
529      "metadata": {
530        "id": "E-1VaguisuyM"
531      },
532      "execution_count": null,
533      "outputs": []
534    },
535    {
536      "cell_type": "markdown",
537      "source": [
538        "## Predictions\n",
539        "Load the fine-tuned model into the pipeline and run prediction on sentences in your language.\n",
540        "\n",
541        "**TASK 4**: Write down some observations. Where does the model perform well? How does it deal with rare or out-of-vocabulary (OOV) words?"
542      ],
543      "metadata": {
544        "id": "P56tSzwaPTId"
545      }
546    },
547    {
548      "cell_type": "code",
549      "source": [
550        "import transformers"
551      ],
552      "metadata": {
553        "id": "aV8XGEAp0moz"
554      },
555      "execution_count": null,
556      "outputs": []
557    },
558    {
559      "cell_type": "code",
560      "source": [
561        "token_classifier = transformers.pipeline(\n",
562        "    \"token-classification\", model=\"./bert_ner\", aggregation_strategy=\"first\"\n",
563        ")"
564      ],
565      "metadata": {
566        "id": "YsFgiPdC1k4Q"
567      },
568      "execution_count": null,
569      "outputs": []
570    },
571    {
572      "cell_type": "code",
573      "source": [
574        "label_names"
575      ],
576      "metadata": {
577        "id": "Vv6Qu29rJ3ZK"
578      },
579      "execution_count": null,
580      "outputs": []
581    },
582    {
583      "cell_type": "code",
584      "source": [
585        "text = \"Ja som Katka z Blavy. Bola som na Slovensku, no teraz som v Brne.\"\n",
586        "token_classifier(text)"
587      ],
588      "metadata": {
589        "id": "ff-ormyL2ggR"
590      },
591      "execution_count": null,
592      "outputs": []
593    }
594  ]
595}