{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Hypernym Extraction\n",
        "Hypernym extraction is crucial in lexicography (a good dictionary definition typically starts with the hypernym) and in ontology engineering (the hypernym-hyponym, or subclass-superclass, relation is the backbone of ontologies).\n",
        "\n",
        "## Get the data\n",
        "\n",
        "Download the input data using wget.\n",
        "Check the nature of the data by listing the first n lines (by default n=10)."
      ],
      "metadata": {
        "id": "LCYhjKC9TosE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/RelationExtraction/input.txt"
      ],
      "metadata": {
        "id": "_zIYBky4TzlW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!head input.txt"
      ],
      "metadata": {
        "id": "thqNaB3zEcq6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Tools\n",
        "\n",
        "### Tokenizer and PoS tagger\n",
        "\n",
        "We use the standard NLTK tokenizer to split the texts into words, and then the NLTK part-of-speech tagger to obtain PoS information."
      ],
      "metadata": {
        "id": "7U_pqaF6UOe_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk\n",
        "from nltk.tokenize import word_tokenize\n",
        "\n",
        "nltk.download('punkt')\n",
        "nltk.download('averaged_perceptron_tagger')\n",
        "# newer NLTK releases ship these resources under different names;\n",
        "# downloading both variants keeps the notebook working across versions\n",
        "nltk.download('punkt_tab')\n",
        "nltk.download('averaged_perceptron_tagger_eng')"
      ],
      "metadata": {
        "id": "sHUnUjjZUoCt"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Extract hypernyms\n",
        "\n",
        "Our script reads the lines from input.txt (in the form: word|definition) and calls the *find_hyper* function to detect the hypernym in the definition. The detection is quite naive: it takes the first word tagged as a noun (NN) that is preceded by a determiner, adjective, or noun and not followed by another noun. You should extend and modify the *find_hyper* function to produce better results."
      ],
      "metadata": {
        "id": "Bcfjj8zjUt99"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def find_hyper(word, text):\n",
        "    tokenized = word_tokenize(text)\n",
        "    pos_tagged = nltk.pos_tag(tokenized)\n",
        "    # take the first noun (NN) preceded by a determiner, adjective, or noun\n",
        "    # and not itself followed by another noun\n",
        "    nouns = [\n",
        "        (token, tag)\n",
        "        for i, (token, tag) in enumerate(pos_tagged)\n",
        "        if tag == 'NN'\n",
        "        and i > 0 and pos_tagged[i - 1][1] in ('JJ', 'DT', 'NNP', 'NN')\n",
        "        and i < len(pos_tagged) - 1 and pos_tagged[i + 1][1] != 'NN'\n",
        "    ]\n",
        "    if nouns:\n",
        "        return (word, nouns[0][0])\n",
        "    return (word, '')\n",
        "\n",
        "\n",
        "if __name__ == \"__main__\":\n",
        "    input_data = 'input.txt'\n",
        "    with open(input_data, 'r') as f:\n",
        "        for line in f:\n",
        "            if line.strip() != '':\n",
        "                line_split = line.strip().split('|')\n",
        "                hyper = find_hyper(line_split[0], line_split[1])\n",
        "                print(\"%s = %s\" % hyper)\n"
      ],
      "metadata": {
        "id": "teuFO7p_VE0y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "with open(input_data, 'r') as f:\n",
        "    for line in f:\n",
        "        if line.strip() != '':\n",
        "            line_split = line.strip().split('|')\n",
        "            sentence = line_split[1]\n",
        "            print(sentence)\n",
        "            tokenized = word_tokenize(sentence)\n",
        "            print(\"tokenized\")\n",
        "            print(tokenized)\n",
        "            pos_tagged = nltk.pos_tag(tokenized)\n",
        "            print(\"POS-tagged\")\n",
        "            print(pos_tagged)\n",
        "            tags = [y for x, y in pos_tagged]\n",
        "            print('-' * 30)"
      ],
      "metadata": {
        "id": "l06TvsEFxzBk"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Evaluate\n",
        "For evaluation, select a straightforward metric such as accuracy. Calculate and report the performance of the naive approach, then re-calculate and report the performance of the improved `find_hyper` function. For evaluation, use `gold_en.txt`."
      ],
      "metadata": {
        "id": "jcKqLv7mY0X-"
      }
    },
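    {
      "cell_type": "markdown",
      "source": [
        "Below is a minimal evaluation sketch. It assumes `gold_en.txt` holds one `word|hypernym` pair per line (the actual format may differ) and reuses `find_hyper` and `input.txt` from the cells above; accuracy is the fraction of words whose extracted hypernym matches the gold answer."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# evaluation sketch; assumes gold_en.txt uses the word|hypernym format\n",
        "gold = {}\n",
        "with open('gold_en.txt', 'r') as f:\n",
        "    for line in f:\n",
        "        if line.strip() != '':\n",
        "            word, hyper = line.strip().split('|')[:2]\n",
        "            gold[word] = hyper\n",
        "\n",
        "correct = 0\n",
        "total = 0\n",
        "with open('input.txt', 'r') as f:\n",
        "    for line in f:\n",
        "        if line.strip() != '':\n",
        "            word, definition = line.strip().split('|')[:2]\n",
        "            if word in gold:\n",
        "                total += 1\n",
        "                if find_hyper(word, definition)[1] == gold[word]:\n",
        "                    correct += 1\n",
        "\n",
        "if total > 0:\n",
        "    print('accuracy: %.3f (%d/%d)' % (correct / total, correct, total))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },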
    {
      "cell_type": "markdown",
      "source": [
        "# Generative Models"
      ],
      "metadata": {
        "id": "V9o6gHFyJ8Z_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install openai"
      ],
      "metadata": {
        "id": "dyVOSQZVKIjg"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "API_KEY = \"\"  # your key"
      ],
      "metadata": {
        "id": "kaPQR8PKJ-qh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import openai\n",
        "openai.__version__"
      ],
      "metadata": {
        "id": "kzhHwHdJKGYa"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Specify the model\n",
        "model = \"gpt-4o-mini\"\n",
        "# model = \"gpt-3.5-turbo-1106\"\n",
        "\n",
        "# Define the prompt for the completion\n",
        "prompt = \"\"\"\n",
        "I give you definitions of words in the format: word|definition. Extract the nearest hypernym of the word from the definition.\n",
        "{}\n",
        "\"\"\"\n",
        "\n",
        "data = \"\"\"\n",
        "chair|a piece of furniture for one person to sit on, with a back, legs, and sometimes two arms\n",
        "table|a piece of furniture that consists of a flat surface held above the floor, usually by legs\n",
        "rose|a flower that has a sweet smell and thorns (=sharp pieces) on its stem\n",
        "herb|a plant used for adding flavour to food or as a medicine\n",
        "dog|an animal kept as a pet, for guarding buildings, or for hunting\n",
        "tiger|a large Asian wild animal that has yellowish fur with black lines and is a member of the cat family\n",
        "hippopotamus|a large African animal with a wide head and mouth and thick grey skin\n",
        "sunflower|a very tall plant that has large yellow flowers with a round brown centre\n",
        "horse|a large animal that people ride\n",
        "human|a person\n",
        "\"\"\"\n",
        "\n",
        "from openai import OpenAI\n",
        "\n",
        "client = OpenAI(\n",
        "    # if no key is passed, the client reads the OPENAI_API_KEY environment variable\n",
        "    api_key=API_KEY\n",
        ")\n",
        "\n",
        "chat_completion = client.chat.completions.create(\n",
        "    messages=[\n",
        "        {\n",
        "            \"role\": \"user\",\n",
        "            \"content\": prompt.format(data),\n",
        "        }\n",
        "    ],\n",
        "    model=model,\n",
        ")"
      ],
      "metadata": {
        "id": "QURXL17EKLr-"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "print(chat_completion.choices[0].message.content)"
      ],
      "metadata": {
        "id": "YNogJwXXKtkh"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}