Extracting structured information from text
IA161 Advanced NLP Course, Course Guarantee: Aleš Horák
Prepared by: Vojtěch Kovář
State of the Art
Information extraction (IE) is a technology based on analyzing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in information retrieval (IR) applications such as Internet search engines like Google.
References
- Cunningham, Hamish. An Introduction to Information Extraction. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
- Chang, Chia-Hui, et al. A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering 18.10 (2006).
- Banko, Michele, et al. Open information extraction for the web. IJCAI. Vol. 7. 2007.
- Fader, Anthony, Soderland, Stephen and Etzioni, Oren. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.
Practical Session
You are given a few short excerpts from Czech Wikipedia as plain text. They were processed by automatic sentence detection, tokenization (the unitok tool), morphological analysis and tagging (the desamb tool), and syntactic analysis (the SET tool, with the --long-phrases option). The result is provided as an attachment (wiki.phrases).
Write a short Python program that extracts simple "who was who" information from the parsed file. The output should look like the attached sample (wiki.output).
You may modify or draw inspiration from the attached demo script (demo.py).
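To make the task more concrete, here is a minimal sketch of the extraction idea. It is not the demo script and does not read the actual SET output. It assumes, purely for illustration, that each sentence has already been reduced to a list of (label, text) chunks, with the hypothetical labels "np" for noun phrases and "vp" for verb phrases. The function name, the chunk representation, and the list of copula forms are all assumptions. The pattern used is the simplest one possible: a noun phrase, a form of the Czech verb "být" (to be), and another noun phrase.

```python
from typing import List, Tuple

# Hypothetical chunk representation: (label, text). The real wiki.phrases
# file uses SET's own output format and must be parsed separately.
Chunk = Tuple[str, str]

# A few forms of the Czech copula "být"; this list is an illustrative
# assumption and is far from complete.
COPULA_FORMS = {"byl", "byla", "bylo", "je"}


def extract_who_was_who(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Return (who, what) pairs for every 'NP copula NP' sequence."""
    facts = []
    # Slide a window of three consecutive chunks over the sentence.
    for (l1, t1), (l2, t2), (l3, t3) in zip(chunks, chunks[1:], chunks[2:]):
        if l1 == "np" and l2 == "vp" and l3 == "np" and t2.lower() in COPULA_FORMS:
            facts.append((t1, t3))
    return facts


if __name__ == "__main__":
    sentence = [("np", "Karel Čapek"), ("vp", "byl"), ("np", "český spisovatel")]
    for who, what in extract_who_was_who(sentence):
        print(f"{who}\t{what}")
```

A real solution would first need to read the phrases from the parsed file and would probably have to handle longer or nested phrases and other sentence patterns. The sliding window above covers only the most direct case.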
Attachments (9)
- wiki.phrases (25.1 KB), added 8 years ago
- wiki.output (883 bytes), added 8 years ago
- demo.py (1.3 KB), added 8 years ago
- text1.txt (1.4 KB), added 6 years ago
- text2.txt (355 bytes), added 6 years ago
- text3.txt (2.1 KB), added 6 years ago
- annie.png (8.9 KB), added 6 years ago
- gold_en.txt (411 bytes), added 10 days ago
- IA161_Hypernym_Extraction.ipynb (3.8 KB), added 2 days ago