# Hypernym Extraction
Hypernym extraction is crucial in lexicography (a good dictionary definition typically starts with the hypernym) and in ontology engineering (the hypernym-hyponym, or superclass-subclass, relation is the backbone of ontologies).

## Get the data

Download the input data using wget.
Check the nature of the data by listing its first n lines with head (by default n=10).

In [None]:
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/NlpInPracticeCourse/RelationExtraction/input.txt

In [None]:
!head input.txt

## Tools

### Tokenizer and PoS tagger

We use the standard NLTK tokenizer to split the texts into words, and then the NLTK part-of-speech tagger to get PoS information.

In [None]:
import nltk
from nltk.tokenize import word_tokenize

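# models for the tokenizer and the PoS tagger
# (newer NLTK releases may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')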

## Extract hypernyms

Our script reads the lines from input.txt (in the form word|definition) and calls the *find_hyper* function to detect the hypernym in the definition. The detection is quite naive: it takes the first word tagged as a noun (NN), so plural nouns (NNS) and proper nouns (NNP) are ignored. You should extend and modify *find_hyper* to produce better results.

In [None]:
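def find_hyper(word, text):
    """Return (word, hypernym); the hypernym is the first token tagged NN, or '' if there is none."""
    tokenized = word_tokenize(text)
    pos_tagged = nltk.pos_tag(tokenized)
    # naive heuristic: keep only tokens tagged NN and take the first one
    nouns = [token for token, tag in pos_tagged if tag == 'NN']
    if nouns:
        return (word, nouns[0])
    return (word, '')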


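if __name__ == "__main__":
    input_data = 'input.txt'
    # the with-statement closes the file once all lines are processed
    with open(input_data, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                # each line has the form word|definition; split on the first '|' only
                word, definition = line.split('|', 1)
                hyper = find_hyper(word, definition)
                print("%s = %s" % hyper)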
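One possible direction for improving on the first-NN rule is sketched below. It is only an illustration, not a reference solution: `pick_hypernym`, `NOUN_TAGS`, and `CLASSIFIER_NOUNS` are hypothetical names, and the list of classifier nouns is an assumption you would want to tune on the data. The function works on the list of `(token, tag)` pairs returned by `nltk.pos_tag`, accepts plural and proper nouns as well, and skips nouns such as "kind" or "type" when they are followed by "of".

```python
# Hypothetical helper, not part of the original notebook.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# Assumed list of "empty" nouns that usually only introduce the real hypernym.
CLASSIFIER_NOUNS = {"kind", "type", "sort", "form", "variety", "species"}


def pick_hypernym(pos_tagged):
    """Return the first noun that is not a classifier noun followed by 'of'."""
    for i, (token, tag) in enumerate(pos_tagged):
        if tag not in NOUN_TAGS:
            continue
        # look one token ahead (empty string at the end of the definition)
        next_token = pos_tagged[i + 1][0].lower() if i + 1 < len(pos_tagged) else ""
        if token.lower() in CLASSIFIER_NOUNS and next_token == "of":
            continue  # skip "kind of", "type of", ...
        return token
    return ""


# Tags written by hand for illustration, not actual tagger output.
tagged = [("a", "DT"), ("kind", "NN"), ("of", "IN"), ("small", "JJ"), ("dogs", "NNS")]
print(pick_hypernym(tagged))  # dogs
```

Inside `find_hyper`, the list comprehension over `pos_tagged` could be replaced by a call to such a helper. Whether this actually helps depends on the definitions in input.txt, so compare both variants in the evaluation.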


## Evaluate
For evaluation, select a straightforward metric such as accuracy. Calculate and publish the performance of the naive approach, then recalculate and publish the performance of your improved `find_hyper` function. Use `gold_en.txt` as the gold standard.
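
Accuracy here is the share of words for which the predicted hypernym equals the gold one. The sketch below assumes both the predictions and the gold data have already been loaded into dictionaries mapping a word to its hypernym; the layout of `gold_en.txt` is not described here, so loading it is left out. The name `accuracy`, the case-insensitive comparison, and the toy data are assumptions made for illustration.

```python
def accuracy(predicted, gold):
    """Fraction of gold words whose predicted hypernym matches (case-insensitive)."""
    if not gold:
        return 0.0
    correct = sum(
        1
        for word, hyper in gold.items()
        # a word with no prediction counts as wrong
        if predicted.get(word, "").lower() == hyper.lower()
    )
    return correct / len(gold)


# Toy data for illustration only, not taken from gold_en.txt.
gold = {"dog": "animal", "oak": "tree", "hammer": "tool", "rose": "flower"}
predicted = {"dog": "animal", "oak": "tree", "hammer": "handle", "rose": "Flower"}
print(accuracy(predicted, gold))  # 0.75
```

Dividing by the number of gold entries rather than the number of predictions means that missing predictions lower the score instead of being silently ignored.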