
File IA161_Hypernym_Extraction.ipynb, 3.8 KB (added by Zuzana Nevěřilová, 7 months ago)
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Hypernym Extraction\n",
        "Hypernym extraction is crucial in lexicography (a good start of a dictionary definition is the hypernym) and in ontology engineering (hypernym-hyponym, or subclass-superclass, is the backbone relation of ontologies).\n",
        "\n",
        "## Get the data\n",
        "\n",
        "Download the input data using wget.\n",
        "Check the nature of the data by listing the first n lines (head prints the first n=10 lines by default)."
      ],
      "metadata": {
        "id": "LCYhjKC9TosE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/RelationExtraction/input.txt"
      ],
      "metadata": {
        "id": "_zIYBky4TzlW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!head input.txt"
      ],
      "metadata": {
        "id": "thqNaB3zEcq6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Tools\n",
        "\n",
        "### Tokenizer and PoS tagger\n",
        "\n",
        "We use the standard NLTK tokenizer to split the texts into words, and then the NLTK part-of-speech tagger to obtain PoS information."
      ],
      "metadata": {
        "id": "7U_pqaF6UOe_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk\n",
        "from nltk.tokenize import word_tokenize\n",
        "\n",
        "nltk.download('punkt')\n",
        "nltk.download('averaged_perceptron_tagger')"
      ],
      "metadata": {
        "id": "sHUnUjjZUoCt"
      },
      "execution_count": null,
      "outputs": []
    },
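    {
      "cell_type": "markdown",
      "source": [
        "A quick sanity check of the two tools on a sample definition (the sentence below is illustrative, not taken from input.txt): tokenize it, tag it, and inspect the resulting (token, tag) pairs."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# illustrative sample definition, not from input.txt\n",
        "sample = 'a domesticated carnivorous mammal kept as a pet'\n",
        "tokens = word_tokenize(sample)\n",
        "print(nltk.pos_tag(tokens))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },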
    {
      "cell_type": "markdown",
      "source": [
        "## Extract hypernyms\n",
        "\n",
        "Our script reads the lines of input.txt (in the form word|definition) and calls the *find_hyper* function to detect the hypernym in each definition. The detection is quite naive: it takes the first word tagged as a noun (NN). You should extend and modify the *find_hyper* function to produce better results."
      ],
      "metadata": {
        "id": "Bcfjj8zjUt99"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def find_hyper(word, text):\n",
        "    \"\"\"Return (word, hypernym), where the hypernym is the first noun in text.\"\"\"\n",
        "    tokenized = word_tokenize(text)\n",
        "    pos_tagged = nltk.pos_tag(tokenized)\n",
        "    # naive heuristic: take the first token tagged as a singular noun (NN)\n",
        "    nouns = [tup for tup in pos_tagged if tup[1] == 'NN']\n",
        "    if len(nouns) > 0:\n",
        "        return (word, nouns[0][0])\n",
        "    return (word, '')\n",
        "\n",
        "\n",
        "if __name__ == \"__main__\":\n",
        "    input_data = 'input.txt'\n",
        "    for line in open(input_data, 'r'):\n",
        "        line = line.strip()\n",
        "        # skip empty or malformed lines (expected form: word|definition)\n",
        "        if line != '' and '|' in line:\n",
        "            word, definition = line.split('|', 1)\n",
        "            hyper = find_hyper(word, definition)\n",
        "            print(\"%s = %s\" % hyper)\n"
      ],
      "metadata": {
        "id": "teuFO7p_VE0y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Evaluate\n",
        "For evaluation, select a straightforward metric such as accuracy. Calculate and publish the performance of the naive approach, then re-calculate and publish the performance of the improved `find_hyper` function. Use `gold_en.txt` as the gold standard."
      ],
      "metadata": {
        "id": "jcKqLv7mY0X-"
      }
    }
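    ,
    {
      "cell_type": "markdown",
      "source": [
        "A minimal evaluation sketch follows. It rests on an assumption not stated above: that gold_en.txt uses the same pipe-separated format as input.txt, i.e. one word|hypernym pair per line; adjust the parsing if the file differs. Accuracy is computed as the fraction of words whose extracted hypernym matches the gold one."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# assumption: gold_en.txt holds one 'word|hypernym' pair per line\n",
        "def load_gold(path):\n",
        "    gold = {}\n",
        "    for line in open(path, 'r'):\n",
        "        line = line.strip()\n",
        "        if line != '' and '|' in line:\n",
        "            word, hyper = line.split('|', 1)\n",
        "            gold[word] = hyper\n",
        "    return gold\n",
        "\n",
        "def accuracy(predicted, gold):\n",
        "    # fraction of (word, hypernym) pairs that match the gold standard\n",
        "    correct = sum(1 for word, hyper in predicted if gold.get(word) == hyper)\n",
        "    return correct / len(predicted) if predicted else 0.0\n",
        "\n",
        "gold = load_gold('gold_en.txt')\n",
        "predicted = []\n",
        "for line in open('input.txt', 'r'):\n",
        "    line = line.strip()\n",
        "    if line != '' and '|' in line:\n",
        "        word, definition = line.split('|', 1)\n",
        "        predicted.append(find_hyper(word, definition))\n",
        "print('accuracy: %.3f' % accuracy(predicted, gold))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    }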
  ]
}