
File IA161_Hypernym_Extraction.ipynb, 7.6 KB (added by Zuzana Nevěřilová, 9 months ago)
1{
2  "nbformat": 4,
3  "nbformat_minor": 0,
4  "metadata": {
5    "colab": {
6      "provenance": []
7    },
8    "kernelspec": {
9      "name": "python3",
10      "display_name": "Python 3"
11    },
12    "language_info": {
13      "name": "python"
14    }
15  },
16  "cells": [
17    {
18      "cell_type": "markdown",
19      "source": [
20        "# Hypernym Extraction\n",
21        "Hypernym extraction is crucial in lexicography (the hypernym is a good starting point for a dictionary definition) and in ontology engineering (the hypernym-hyponym, or subclass-superclass, relation is the backbone of ontologies).\n",
22        "\n",
23        "## Get the data\n",
24        "\n",
25        "Download the input data using wget.\n",
26        "Check the nature of the data by listing its first n lines (head shows n=10 by default)."
27      ],
28      "metadata": {
29        "id": "LCYhjKC9TosE"
30      }
31    },
32    {
33      "cell_type": "code",
34      "source": [
35        "!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/RelationExtraction/input.txt"
36      ],
37      "metadata": {
38        "id": "_zIYBky4TzlW"
39      },
40      "execution_count": null,
41      "outputs": []
42    },
43    {
44      "cell_type": "code",
45      "source": [
46        "!head input.txt"
47      ],
48      "metadata": {
49        "id": "thqNaB3zEcq6"
50      },
51      "execution_count": null,
52      "outputs": []
53    },
54    {
55      "cell_type": "markdown",
56      "source": [
57        "## Tools\n",
58        "\n",
59        "### Tokenizer and PoS tagger\n",
60        "\n",
61        "We use the standard NLTK tokenizer to split the texts into words, and then the NLTK part-of-speech tagger to obtain PoS information."
62      ],
63      "metadata": {
64        "id": "7U_pqaF6UOe_"
65      }
66    },
67    {
68      "cell_type": "code",
69      "source": [
70        "import nltk\n",
71        "from nltk.tokenize import word_tokenize\n",
72        "\n",
73        "nltk.download('punkt')  # newer NLTK releases may also require 'punkt_tab'\n",
74        "nltk.download('averaged_perceptron_tagger')  # and 'averaged_perceptron_tagger_eng'"
75      ],
76      "metadata": {
77        "id": "sHUnUjjZUoCt"
78      },
79      "execution_count": null,
80      "outputs": []
81    },
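    {
      "cell_type": "markdown",
      "source": [
        "As a quick sanity check, we can tokenize and tag one sample definition (the definition of *chair* used later in this notebook) to see what the tagger output looks like."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sanity check: tokenize and PoS-tag a sample definition\n",
        "sample = \"a piece of furniture for one person to sit on\"\n",
        "print(nltk.pos_tag(word_tokenize(sample)))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },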
82    {
83      "cell_type": "markdown",
84      "source": [
85        "## Extract hypernyms\n",
86        "\n",
87        "Our script reads the lines of input.txt (one *word|definition* pair per line) and calls the *find_hyper* function to detect the hypernym in the definition. The detection is quite naive: it takes the first word tagged as a noun (NN) that is preceded by a determiner, adjective, or noun and is not followed by another noun. You should extend and modify the *find_hyper* function to produce better results."
88      ],
89      "metadata": {
90        "id": "Bcfjj8zjUt99"
91      }
92    },
93    {
94      "cell_type": "code",
95      "source": [
96        "def find_hyper(word, text):\n",
97        "    tokenized = word_tokenize(text)\n",
98        "    pos_tagged = nltk.pos_tag(tokenized)\n",
99        "    # take the first noun that heads a noun phrase: preceded by DT/JJ/NNP/NN and not followed by another NN\n",
100        "    nouns = [pos_tagged[i] for i in range(len(pos_tagged)) if pos_tagged[i][1] == 'NN' and (i>0 and pos_tagged[i-1][1] in ['JJ', 'DT', 'NNP','NN']) and (i<len(pos_tagged)-1 and pos_tagged[i+1][1]!='NN')]\n",
101        "    if len(nouns) > 0:\n",
102        "        return (word, nouns[0][0])\n",
103        "    return (word, u'')\n",
104        "\n",
105        "\n",
106        "if __name__ == \"__main__\":\n",
107        "    input_data = 'input.txt'\n",
108        "    for line in open(input_data, 'r'):\n",
109        "        if line.strip() != '':\n",
110        "            line_split = line.strip().split('|')\n",
111        "            hyper = find_hyper(line_split[0], line_split[1])\n",
112        "            print(\"%s = %s\" % hyper)\n"
113      ],
114      "metadata": {
115        "id": "teuFO7p_VE0y"
116      },
117      "execution_count": null,
118      "outputs": []
119    },
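    {
      "cell_type": "markdown",
      "source": [
        "One possible direction for the improvement (a sketch, not the reference solution): instead of picking a single NN tag, chunk the definition into simple noun phrases and return the head (last noun) of the first noun phrase. The chunk grammar below is an illustrative assumption; you may want a richer one."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from nltk import RegexpParser\n",
        "\n",
        "def find_hyper_np(word, text):\n",
        "    # Sketch: chunk the definition into simple noun phrases (optional\n",
        "    # determiner, any adjectives, one or more nouns) and return the\n",
        "    # head (last noun) of the first NP as the hypernym candidate.\n",
        "    grammar = 'NP: {<DT>?<JJ.*>*<NN.*>+}'\n",
        "    chunker = RegexpParser(grammar)\n",
        "    pos_tagged = nltk.pos_tag(word_tokenize(text))\n",
        "    tree = chunker.parse(pos_tagged)\n",
        "    for subtree in tree.subtrees():\n",
        "        if subtree.label() == 'NP':\n",
        "            nouns = [tok for tok, tag in subtree.leaves() if tag.startswith('NN')]\n",
        "            if nouns:\n",
        "                return (word, nouns[-1])\n",
        "    return (word, u'')"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },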
120    {
121      "cell_type": "code",
122      "source": [
123        "for line in open(input_data, 'r'):\n",
124        "  if line.strip() != '':\n",
125        "    line_split = line.strip().split('|')\n",
126        "    sentence = line_split[1]\n",
127        "    print(sentence)\n",
128        "    tokenized = word_tokenize(sentence)\n",
129        "    print(\"tokenized\")\n",
130        "    print(tokenized)\n",
131        "    pos_tagged = nltk.pos_tag(tokenized)\n",
132        "    print(\"POS-tagged\")\n",
133        "    print(pos_tagged)\n",
134        "    tags = [y for x,y in pos_tagged]\n",
135        "    print('-'*30)"
136      ],
137      "metadata": {
138        "id": "l06TvsEFxzBk"
139      },
140      "execution_count": null,
141      "outputs": []
142    },
143    {
144      "cell_type": "markdown",
145      "source": [
146        "## Evaluate\n",
147        "For evaluation, select a straightforward metric such as accuracy. Calculate and report the performance of the naive approach, then re-calculate and report it for the improved `find_hyper` function. For evaluation, use the `gold_en.txt` file."
148      ],
149      "metadata": {
150        "id": "jcKqLv7mY0X-"
151      }
152    },
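    {
      "cell_type": "markdown",
      "source": [
        "The cell below is one possible shape for the accuracy computation. It assumes that `gold_en.txt` is available locally and pairs each word with its correct hypernym in the same *word|hypernym* format as the input; adjust the parsing if the actual file differs."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Accuracy of find_hyper against the gold standard.\n",
        "# Assumption: gold_en.txt contains one word|hypernym pair per line.\n",
        "gold = {}\n",
        "for line in open('gold_en.txt', 'r'):\n",
        "    if line.strip():\n",
        "        w, h = line.strip().split('|')\n",
        "        gold[w] = h\n",
        "\n",
        "correct = total = 0\n",
        "for line in open('input.txt', 'r'):\n",
        "    if line.strip():\n",
        "        w, definition = line.strip().split('|')\n",
        "        _, predicted = find_hyper(w, definition)\n",
        "        if w in gold:\n",
        "            total += 1\n",
        "            correct += int(predicted == gold[w])\n",
        "\n",
        "if total:\n",
        "    print('accuracy = %.3f' % (correct / total))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },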
153    {
154      "cell_type": "markdown",
155      "source": [
156        "# Generative Models"
157      ],
158      "metadata": {
159        "id": "V9o6gHFyJ8Z_"
160      }
161    },
162    {
163      "cell_type": "code",
164      "source": [
165        "!pip install openai"
166      ],
167      "metadata": {
168        "id": "dyVOSQZVKIjg"
169      },
170      "execution_count": null,
171      "outputs": []
172    },
173    {
174      "cell_type": "code",
175      "source": [
176        "API_KEY = \"\"  # paste your OpenAI API key here; never commit or share it"
177      ],
178      "metadata": {
179        "id": "kaPQR8PKJ-qh"
180      },
181      "execution_count": null,
182      "outputs": []
183    },
184    {
185      "cell_type": "code",
186      "source": [
187        "import openai\n",
188        "openai.__version__"
189      ],
190      "metadata": {
191        "id": "kzhHwHdJKGYa"
192      },
193      "execution_count": null,
194      "outputs": []
195    },
196    {
197      "cell_type": "code",
198      "source": [
199        "\n",
200        "# Specify the model\n",
201        "model = \"gpt-4o-mini\"\n",
202        "#model = \"gpt-3.5-turbo-1106\"\n",
203        "\n",
204        "# Define the prompt for the completion\n",
205        "prompt = \"\"\"\n",
206        "I give you definitions of words in the format: word|definition. Extract the nearest hypernym of the word from the definition.\n",
207        "{}\n",
208        "\"\"\"\n",
209        "\n",
210        "data = \"\"\"\n",
211        "chair|a piece of furniture for one person to sit on, with a back, legs, and sometimes two arms\n",
212        "table|a piece of furniture that consists of a flat surface held above the floor, usually by legs\n",
213        "rose|a flower that has a sweet smell and thorns (=sharp pieces) on its stem\n",
214        "herb|a plant used for adding flavour to food or as a medicine\n",
215        "dog|an animal kept as a pet, for guarding buildings, or for hunting\n",
216        "tiger|a large Asian wild animal that has yellowish fur with black lines and is a member of the cat family\n",
217        "hippopotamus|a large African animal with a wide head and mouth and thick grey skin\n",
218        "sunflower|a very tall plant that has large yellow flowers with a round brown centre\n",
219        "horse|a large animal that people ride\n",
220        "human|a person\n",
221        "\"\"\"\n",
222        "\n",
223        "import os\n",
224        "from openai import OpenAI\n",
225        "\n",
226        "client = OpenAI(\n",
227        "    # This is the default and can be omitted\n",
228        "    api_key=API_KEY\n",
229        ")\n",
230        "\n",
231        "chat_completion = client.chat.completions.create(\n",
232        "    messages=[\n",
233        "        {\n",
234        "            \"role\": \"user\",\n",
235        "            \"content\": prompt.format(data),\n",
236        "        }\n",
237        "    ],\n",
238        "    model=model,\n",
239        ")"
240      ],
241      "metadata": {
242        "id": "QURXL17EKLr-"
243      },
244      "execution_count": null,
245      "outputs": []
246    },
247    {
248      "cell_type": "code",
249      "source": [
250        "print(chat_completion.choices[0].message.content)"
251      ],
252      "metadata": {
253        "id": "YNogJwXXKtkh"
254      },
255      "execution_count": null,
256      "outputs": []
257    }
258  ]
259}