
File IA161_Hypernym_Extraction.ipynb, 7.6 KB (added by Zuzana Nevěřilová, 9 months ago)
1{
2  "nbformat": 4,
3  "nbformat_minor": 0,
4  "metadata": {
5    "colab": {
6      "provenance": []
7    },
8    "kernelspec": {
9      "name": "python3",
10      "display_name": "Python 3"
11    },
12    "language_info": {
13      "name": "python"
14    }
15  },
16  "cells": [
17    {
18      "cell_type": "markdown",
19      "source": [
20        "# Hypernym Extraction\n",
21        "Hypernym extraction is crucial in lexicography (the hypernym is a good starting point for a dictionary definition) and in ontology engineering (the hypernym-hyponym, or subclass-superclass, relation is the backbone of ontologies).\n",
22        "\n",
23        "## Get the data\n",
24        "\n",
25        "Download the input data using wget.\n",
26        "Check the nature of the data by listing its first n lines (head shows n=10 by default)."
27      ],
28      "metadata": {
29        "id": "LCYhjKC9TosE"
30      }
31    },
32    {
33      "cell_type": "code",
34      "source": [
35        "!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/RelationExtraction/input.txt"
36      ],
37      "metadata": {
38        "id": "_zIYBky4TzlW"
39      },
40      "execution_count": null,
41      "outputs": []
42    },
43    {
44      "cell_type": "code",
45      "source": [
46        "!head input.txt"
47      ],
48      "metadata": {
49        "id": "thqNaB3zEcq6"
50      },
51      "execution_count": null,
52      "outputs": []
53    },
54    {
55      "cell_type": "markdown",
56      "source": [
57        "## Tools\n",
58        "\n",
59        "### Tokenizer and PoS tagger\n",
60        "\n",
61        "We use the standard NLTK tokenizer to split the texts into words, and then the NLTK part-of-speech tagger to obtain PoS information."
62      ],
63      "metadata": {
64        "id": "7U_pqaF6UOe_"
65      }
66    },
67    {
68      "cell_type": "code",
69      "source": [
70        "import nltk\n",
71        "from nltk.tokenize import word_tokenize\n",
72        "\n",
73        "nltk.download('punkt')  # newer NLTK releases may also require 'punkt_tab'\n",
74        "nltk.download('averaged_perceptron_tagger')  # and 'averaged_perceptron_tagger_eng'"
75      ],
76      "metadata": {
77        "id": "sHUnUjjZUoCt"
78      },
79      "execution_count": null,
80      "outputs": []
81    },
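    {
      "cell_type": "markdown",
      "source": [
        "As a quick sanity check, we can tokenize and tag one sample definition (the definition of *chair* used later in this notebook) to see what the tagger output looks like."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sanity check: tokenize and PoS-tag a sample definition\n",
        "sample = \"a piece of furniture for one person to sit on\"\n",
        "print(nltk.pos_tag(word_tokenize(sample)))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },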
82    {
83      "cell_type": "markdown",
84      "source": [
85        "## Extract hypernyms\n",
86        "\n",
87        "Our script reads the lines of input.txt (one *word|definition* pair per line) and calls the *find_hyper* function to detect the hypernym in the definition. The detection is quite naive: it takes the first word tagged as a noun (NN) that is preceded by a determiner, adjective, or noun and is not followed by another noun. You should extend and modify the *find_hyper* function to produce better results."
88      ],
89      "metadata": {
90        "id": "Bcfjj8zjUt99"
91      }
92    },
93    {
94      "cell_type": "code",
95      "source": [
96        "def find_hyper(word, text):\n",
97        "    tokenized = word_tokenize(text)\n",
98        "    pos_tagged = nltk.pos_tag(tokenized)\n",
99        "    # take the first noun that heads a noun phrase: preceded by DT/JJ/NNP/NN and not followed by another NN\n",
100        "    nouns = [pos_tagged[i] for i in range(len(pos_tagged)) if pos_tagged[i][1] == 'NN' and (i>0 and pos_tagged[i-1][1] in ['JJ', 'DT', 'NNP','NN']) and (i<len(pos_tagged)-1 and pos_tagged[i+1][1]!='NN')]\n",
101        "    if len(nouns) > 0:\n",
102        "        return (word, nouns[0][0])\n",
103        "    return (word, u'')\n",
104        "\n",
105        "\n",
106        "if __name__ == \"__main__\":\n",
107        "    input_data = 'input.txt'\n",
108        "    for line in open(input_data, 'r'):\n",
109        "        if line.strip() != '':\n",
110        "            line_split = line.strip().split('|')\n",
111        "            hyper = find_hyper(line_split[0], line_split[1])\n",
112        "            print(\"%s = %s\" % hyper)\n"
113      ],
114      "metadata": {
115        "id": "teuFO7p_VE0y"
116      },
117      "execution_count": null,
118      "outputs": []
119    },
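    {
      "cell_type": "markdown",
      "source": [
        "One possible direction for the improvement (a sketch, not the reference solution): instead of picking a single NN tag, chunk the definition into simple noun phrases and return the head (last noun) of the first noun phrase. The chunk grammar below is an illustrative assumption; you may want a richer one."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from nltk import RegexpParser\n",
        "\n",
        "def find_hyper_np(word, text):\n",
        "    # Sketch: chunk the definition into simple noun phrases (optional\n",
        "    # determiner, any adjectives, one or more nouns) and return the\n",
        "    # head (last noun) of the first NP as the hypernym candidate.\n",
        "    grammar = 'NP: {<DT>?<JJ.*>*<NN.*>+}'\n",
        "    chunker = RegexpParser(grammar)\n",
        "    pos_tagged = nltk.pos_tag(word_tokenize(text))\n",
        "    tree = chunker.parse(pos_tagged)\n",
        "    for subtree in tree.subtrees():\n",
        "        if subtree.label() == 'NP':\n",
        "            nouns = [tok for tok, tag in subtree.leaves() if tag.startswith('NN')]\n",
        "            if nouns:\n",
        "                return (word, nouns[-1])\n",
        "    return (word, u'')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },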
120    {
121      "cell_type": "code",
122      "source": [
123        "for line in open(input_data, 'r'):\n",
124        "  if line.strip() != '':\n",
125        "    line_split = line.strip().split('|')\n",
126        "    sentence = line_split[1]\n",
127        "    print(sentence)\n",
128        "    tokenized = word_tokenize(sentence)\n",
129        "    print(\"tokenized\")\n",
130        "    print(tokenized)\n",
131        "    pos_tagged = nltk.pos_tag(tokenized)\n",
132        "    print(\"POS-tagged\")\n",
133        "    print(pos_tagged)\n",
134        "    tags = [y for x,y in pos_tagged]\n",
135        "    print('-'*30)"
136      ],
137      "metadata": {
138        "id": "l06TvsEFxzBk"
139      },
140      "execution_count": null,
141      "outputs": []
142    },
143    {
144      "cell_type": "markdown",
145      "source": [
146        "## Evaluate\n",
147        "For evaluation, select a straightforward metric such as accuracy. Calculate and report the performance of the naive approach, then re-calculate and report it for the improved `find_hyper` function. For evaluation, use the `gold_en.txt` file."
148      ],
149      "metadata": {
150        "id": "jcKqLv7mY0X-"
151      }
152    },
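    {
      "cell_type": "markdown",
      "source": [
        "The cell below is one possible shape for the accuracy computation. It assumes that `gold_en.txt` is available locally and pairs each word with its correct hypernym in the same *word|hypernym* format as the input; adjust the parsing if the actual file differs."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Accuracy of find_hyper against the gold standard.\n",
        "# Assumption: gold_en.txt contains one word|hypernym pair per line.\n",
        "gold = {}\n",
        "for line in open('gold_en.txt', 'r'):\n",
        "    if line.strip():\n",
        "        w, h = line.strip().split('|')\n",
        "        gold[w] = h\n",
        "\n",
        "correct = total = 0\n",
        "for line in open('input.txt', 'r'):\n",
        "    if line.strip():\n",
        "        w, definition = line.strip().split('|')\n",
        "        _, predicted = find_hyper(w, definition)\n",
        "        if w in gold:\n",
        "            total += 1\n",
        "            correct += int(predicted == gold[w])\n",
        "\n",
        "if total:\n",
        "    print('accuracy = %.3f' % (correct / total))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },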
153    {
154      "cell_type": "markdown",
155      "source": [
156        "# Generative Models"
157      ],
158      "metadata": {
159        "id": "V9o6gHFyJ8Z_"
160      }
161    },
162    {
163      "cell_type": "code",
164      "source": [
165        "!pip install openai"
166      ],
167      "metadata": {
168        "id": "dyVOSQZVKIjg"
169      },
170      "execution_count": null,
171      "outputs": []
172    },
173    {
174      "cell_type": "code",
175      "source": [
176        "API_KEY = \"\"  # paste your OpenAI API key here; never commit or share it"
177      ],
178      "metadata": {
179        "id": "kaPQR8PKJ-qh"
180      },
181      "execution_count": null,
182      "outputs": []
183    },
184    {
185      "cell_type": "code",
186      "source": [
187        "import openai\n",
188        "openai.__version__"
189      ],
190      "metadata": {
191        "id": "kzhHwHdJKGYa"
192      },
193      "execution_count": null,
194      "outputs": []
195    },
196    {
197      "cell_type": "code",
198      "source": [
199        "\n",
200        "# Specify the model\n",
201        "model = \"gpt-4o-mini\"\n",
202        "#model = \"gpt-3.5-turbo-1106\"\n",
203        "\n",
204        "# Define the prompt for the completion\n",
205        "prompt = \"\"\"\n",
206        "I give you definitions of words in the format: word|definition. Extract the nearest hypernym of the word from the definition.\n",
207        "{}\n",
208        "\"\"\"\n",
209        "\n",
210        "data = \"\"\"\n",
211        "chair|a piece of furniture for one person to sit on, with a back, legs, and sometimes two arms\n",
212        "table|a piece of furniture that consists of a flat surface held above the floor, usually by legs\n",
213        "rose|a flower that has a sweet smell and thorns (=sharp pieces) on its stem\n",
214        "herb|a plant used for adding flavour to food or as a medicine\n",
215        "dog|an animal kept as a pet, for guarding buildings, or for hunting\n",
216        "tiger|a large Asian wild animal that has yellowish fur with black lines and is a member of the cat family\n",
217        "hippopotamus|a large African animal with a wide head and mouth and thick grey skin\n",
218        "sunflower|a very tall plant that has large yellow flowers with a round brown centre\n",
219        "horse|a large animal that people ride\n",
220        "human|a person\n",
221        "\"\"\"\n",
222        "\n",
223        "import os\n",
224        "from openai import OpenAI\n",
225        "\n",
226        "client = OpenAI(\n",
227        "    # This is the default and can be omitted\n",
228        "    api_key=API_KEY\n",
229        ")\n",
230        "\n",
231        "chat_completion = client.chat.completions.create(\n",
232        "    messages=[\n",
233        "        {\n",
234        "            \"role\": \"user\",\n",
235        "            \"content\": prompt.format(data),\n",
236        "        }\n",
237        "    ],\n",
238        "    model=model,\n",
239        ")"
240      ],
241      "metadata": {
242        "id": "QURXL17EKLr-"
243      },
244      "execution_count": null,
245      "outputs": []
246    },
247    {
248      "cell_type": "code",
249      "source": [
250        "print(chat_completion.choices[0].message.content)"
251      ],
252      "metadata": {
253        "id": "YNogJwXXKtkh"
254      },
255      "execution_count": null,
256      "outputs": []
257    }
258  ]
259}