Extracting structured information from text
IA161 NLP in Practice Course, Course Guarantor: Aleš Horák
Prepared by: Zuzana Nevěřilová
State of the Art
Information extraction (IE) is a technology based on analyzing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in information retrieval (IR) applications such as Internet search engines like Google.
References
- Aydar, Mehmet, Ozge Bozal, and Furkan Ozbay. "Neural relation extraction: a survey." arXiv e-prints (2020).
- Li, Qing, et al. "A comprehensive exploration of semantic relation extraction via pre-trained CNNs." Knowledge-Based Systems (2020): 105488.
- Xu, D., Chen, W., Peng, W., Zhang, C., Xu, T., Zhao, X., Wu, X., Zheng, Y., Wang, Y., and Chen, E. "Large language models for generative information extraction: A survey." arXiv (2024).
Practical Session
The task uses a Python notebook run in a web browser in the Google Colaboratory environment, with access to the MU G-Suite disk.
To run the code in a local environment instead, you need Python 3 and the NLTK module.
The tagset of the NLTK POS tagger is based on the Penn Treebank; you can check the meaning of the POS tags. Find more about the NLTK tagger in the NLTK Book, chapter 5.
- Create <YOUR_FILE>, a text file named ia161-UCO.txt where UCO is your university ID.
- Access the Python notebook in the Google Colab environment and make your own copy. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!
- The Colab notebook reads the file input.txt (each line is word|definition) and outputs a hypernym for each word.
- The default approach is naive: the first noun in the definition is taken as the hypernym.
- Using the gold standard, evaluate the naive approach.
- Improve the find_hyper() function to provide better results. Evaluate the new version.
- Copy the updated function find_hyper() and the output into <YOUR_FILE>. Please don't submit the whole notebook.
- Optionally, compare the output with a generative model output.
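The naive baseline described in the steps above can be sketched roughly as follows. This is only an illustration, not the notebook's actual find_hyper(): the helper names parse_line and first_noun are hypothetical, and the example uses hand-written Penn Treebank tags so that it runs without NLTK data. In the notebook, the (token, tag) pairs would presumably come from nltk.pos_tag(nltk.word_tokenize(definition)).

```python
def parse_line(line):
    """Split one 'word|definition' input line into its two fields."""
    word, definition = line.rstrip("\n").split("|", 1)
    return word.strip(), definition.strip()


def first_noun(tagged_tokens):
    """Return the first noun in a list of (token, tag) pairs, or None.

    Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with "NN".
    """
    for token, tag in tagged_tokens:
        if tag.startswith("NN"):
            return token.lower()
    return None


# Hypothetical input line; the tags below are written by hand (assumption),
# standing in for the output of a POS tagger.
word, definition = parse_line("dog|a domesticated animal kept as a pet\n")
tagged = [("a", "DT"), ("domesticated", "JJ"), ("animal", "NN"),
          ("kept", "VBN"), ("as", "IN"), ("a", "DT"), ("pet", "NN")]

print(word, "->", first_noun(tagged))  # dog -> animal
```

The heuristic fails on definitions whose first noun is not the head of the defining phrase (for example "a kind of ..." or "a piece of ..."), which is one place an improved version could start.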
Gold standard to evaluate your result: gold_en.txt
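One simple way to evaluate against the gold standard is exact-match accuracy, sketched below. The format of gold_en.txt is not specified here, so the sketch assumes it has already been loaded into a dictionary mapping each word to a single gold hypernym; the function name accuracy and the toy data are hypothetical.

```python
def accuracy(predicted, gold):
    """Fraction of gold words whose predicted hypernym matches exactly.

    Both arguments map word -> hypernym (assumed format). Words missing
    from `predicted` count as wrong.
    """
    if not gold:
        return 0.0
    correct = sum(1 for word, hyper in gold.items()
                  if predicted.get(word) == hyper)
    return correct / len(gold)


# Toy data for illustration only, not taken from gold_en.txt.
gold = {"dog": "animal", "oak": "tree", "violin": "instrument"}
predicted = {"dog": "animal", "oak": "tree", "violin": "string"}

print(f"accuracy: {accuracy(predicted, gold):.2f}")  # accuracy: 0.67
```

Running the same function on the output of the naive approach and on the improved find_hyper() gives two directly comparable numbers.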