{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Hypernym Extraction\n",
        "Hypernym extraction is crucial in lexicography (a dictionary definition typically starts with the hypernym) and in ontology engineering (the hypernym-hyponym, or subclass-superclass, relation is the backbone of ontologies).\n",
        "\n",
        "## Get the data\n",
        "\n",
        "Download the input data using wget.\n",
        "Check the nature of the data by listing the first n lines (by default n = 10)."
      ],
      "metadata": {
        "id": "LCYhjKC9TosE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/RelationExtraction/input.txt"
      ],
      "metadata": {
        "id": "_zIYBky4TzlW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!head input.txt"
      ],
      "metadata": {
        "id": "thqNaB3zEcq6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Tools\n",
        "\n",
        "### Tokenizer and PoS tagger\n",
        "\n",
        "We use the standard NLTK tokenizer to split the texts into words, and then the NLTK part-of-speech tagger to obtain PoS information."
      ],
      "metadata": {
        "id": "7U_pqaF6UOe_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk\n",
        "from nltk.tokenize import word_tokenize\n",
        "\n",
        "nltk.download('punkt')\n",
        "nltk.download('averaged_perceptron_tagger')"
      ],
      "metadata": {
        "id": "sHUnUjjZUoCt"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Extract hypernyms\n",
        "\n",
        "Our script reads the lines from input.txt (in the form word|definition) and calls the *find_hyper* function to detect a hypernym in the definition. The detection is quite naive: it takes the first word tagged as a noun (NN). You should extend and modify the *find_hyper* function to produce better results."
      ],
      "metadata": {
        "id": "Bcfjj8zjUt99"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def find_hyper(word, text):\n",
        "    \"\"\"Return (word, hypernym), where the hypernym is the first noun in text.\"\"\"\n",
        "    tokenized = word_tokenize(text)\n",
        "    pos_tagged = nltk.pos_tag(tokenized)\n",
        "    # find the first word tagged as a noun\n",
        "    nouns = [tup for tup in pos_tagged if tup[1] == 'NN']\n",
        "    if nouns:\n",
        "        return (word, nouns[0][0])\n",
        "    return (word, '')\n",
        "\n",
        "\n",
        "if __name__ == \"__main__\":\n",
        "    input_data = 'input.txt'\n",
        "    for line in open(input_data, 'r'):\n",
        "        if line.strip() != '':\n",
        "            # split only on the first '|' so definitions may contain '|'\n",
        "            line_split = line.strip().split('|', 1)\n",
        "            hyper = find_hyper(line_split[0], line_split[1])\n",
        "            print(\"%s = %s\" % hyper)\n"
      ],
      "metadata": {
        "id": "teuFO7p_VE0y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Evaluate\n",
        "For evaluation, select a straightforward metric such as accuracy. Calculate and publish the performance of the naive approach, then re-calculate and publish the performance of the improved `find_hyper` function. For evaluation, use the `gold_en.txt` file."
      ],
      "metadata": {
        "id": "jcKqLv7mY0X-"
      }
    }
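,
    {
      "cell_type": "markdown",
      "source": [
        "A possible accuracy computation is sketched below. This is only a sketch: it assumes `gold_en.txt` holds one `word|hypernym` pair per line (the same `|`-separated format as `input.txt`); adjust the parsing if the gold file differs."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "def evaluate(predictions, gold_file='gold_en.txt'):\n",
        "    # read the gold standard: one word|hypernym pair per line (assumed format)\n",
        "    gold = {}\n",
        "    for line in open(gold_file, 'r'):\n",
        "        if line.strip():\n",
        "            word, hyper = line.strip().split('|', 1)\n",
        "            gold[word] = hyper\n",
        "    # accuracy: fraction of words whose predicted hypernym matches the gold one\n",
        "    correct = sum(1 for word, hyper in predictions if gold.get(word) == hyper)\n",
        "    return correct / len(predictions) if predictions else 0.0\n",
        "\n",
        "predictions = []\n",
        "for line in open('input.txt', 'r'):\n",
        "    if line.strip():\n",
        "        word, definition = line.strip().split('|', 1)\n",
        "        predictions.append(find_hyper(word, definition))\n",
        "print('Accuracy: %.2f%%' % (100 * evaluate(predictions)))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }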
  ]
}