Extracting structured information from text
IA161 Advanced NLP Course, Course Guarantee: Aleš Horák
Prepared by: Zuzana Nevěřilová
State of the Art
Information extraction (IE) is a technology based on analyzing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in information retrieval (IR) applications such as Internet search engines like Google.
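As a rough illustration of "text in, fixed-format data out", the following minimal sketch turns free text into a list of records. The regular expression, the field names (person, title, org) and the sample sentence are all hypothetical and chosen only for this example; real IE systems such as GATE rely on linguistic annotations rather than a single surface pattern.

```python
import re

# Hypothetical surface pattern: "<Person Name>, <job title> of <Organization>".
# This is an illustration only; it is not how GATE or ANNIE work internally.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "
    r"(?P<title>[a-z]+(?: [a-z]+)*) of "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract_records(text):
    """Return one dict (person, title, org) per pattern match in the text."""
    return [match.groupdict() for match in PATTERN.finditer(text)]


sample = (
    "Jane Smith, chief executive of Acme Corp, met "
    "John Doe, director of Example Labs."
)
records = extract_records(sample)
for record in records:
    print(record)
```

Each record has the same keys, so the output could be written straight to a spreadsheet or database table, which is the kind of downstream use described above.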
References
- Cunningham, Hamish. An Introduction to Information Extraction. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
- Chang, Chia-Hui, et al. A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering 18.10 (2006).
- Banko, Michele, et al. Open information extraction for the web. IJCAI. Vol. 7. 2007.
- Fader, Anthony, Soderland, Stephen and Etzioni, Oren. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.
- Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
Practical Session
We will extract information from news articles using GATE.
- Create <YOUR_FILE>, a text file named ia161-UCO-08.txt, where UCO is your university ID.
- Download and install GATE (Java 8 is necessary) from https://gate.ac.uk/download/
java -jar gate-<VERSION>-installer.jar
- Run GATE
GATE_Developer_<VERSION>/bin/gate.sh
- Load ANNIE (with defaults), read about its components
- Create document(s):
- Create corpus:
  - right click on Language Resources/New/GATE Corpus in the left menu
  - drag and drop the documents to put them into the corpus
- Run ANNIE: click on Applications/Annie in the left menu, select Corpus
- Observe the annotated results: click on a document, then Annotation Sets and/or Annotation List.
So far, GATE has done little more than Stanford NER did in lecture 04. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type Lookup.
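To make the idea behind Lookup annotations concrete, here is a plain-Python sketch of a gazetteer: a list of known phrases is matched against the text and each hit is recorded with its character offsets and a majorType feature. The Annotation class, the GAZETTEER entries and the lookup function are assumptions made for this illustration and are not GATE's API; in particular, this sketch matches raw substrings, whereas a real gazetteer works with token boundaries.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A simplified stand-off annotation: a type, character offsets, features."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


# Hypothetical gazetteer list: phrase -> majorType.
GAZETTEER = {
    "director": "jobtitle",
    "professor": "jobtitle",
    "chief executive": "jobtitle",
}


def lookup(text, gazetteer=GAZETTEER):
    """Return Lookup annotations for every gazetteer phrase found in the text."""
    lowered = text.lower()  # case-insensitive matching, a simplification
    found = []
    for entry, major_type in gazetteer.items():
        start = lowered.find(entry)
        while start != -1:
            found.append(
                Annotation("Lookup", start, start + len(entry),
                           {"majorType": major_type})
            )
            start = lowered.find(entry, start + 1)
    return sorted(found, key=lambda a: a.start)


sample = "The director met Professor Brown."
lookups = lookup(sample)
for ann in lookups:
    print(ann.type, ann.start, ann.end, sample[ann.start:ann.end], ann.features)
```

Because the annotations only store offsets, the original text stays untouched, and later processing steps can build on the Lookup annotations without re-reading the gazetteer.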
Next, we add rules for extracting job titles and the names of the people who hold them. The rules are defined in the grammars jobtitle.jape and jobtitleperson.jape.
- Right click Processing Resources/New/JAPE Transducer in the left menu
- Download the grammar(s).
- Click on grammarURL and choose the grammar file jobtitle.jape
- Click on Applications/Annie in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources)
- Run ANNIE again: click on Applications/Annie in the left menu
- Observe the annotated results: click on a document, then Annotation Sets and/or Annotation List. If applicable, you can see the new annotation JobTitle.
- Observe the grammars jobtitle.jape and jobtitleperson.jape
Add a new transducer with the grammar jobtitleperson.jape and observe the results.
Optionally, you can add further documents and observe how well the jobtitleperson.jape grammar generalizes to them.
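The kind of rule such a grammar can express may be easier to judge with a small analogy in Python. The sketch below assumes, purely for illustration, that the rule pairs a JobTitle annotation with a Person annotation that follows it directly (only whitespace in between); the actual contents of jobtitleperson.jape may differ. The function names, the tuple layout (type, start, end) and the sample text are all hypothetical.

```python
def span(text, phrase, ann_type):
    """Build a (type, start, end) annotation for the first occurrence of phrase."""
    start = text.index(phrase)
    return (ann_type, start, start + len(phrase))


def pair_titles_with_persons(text, annotations):
    """Pair each JobTitle with a Person that starts right after it.

    Assumed rule (not taken from the course grammar): the two annotations
    must be separated by whitespace only.
    """
    titles = [a for a in annotations if a[0] == "JobTitle"]
    persons = [a for a in annotations if a[0] == "Person"]
    pairs = []
    for _, t_start, t_end in titles:
        for _, p_start, p_end in persons:
            gap = text[t_end:p_start]
            if p_start >= t_end and gap.strip() == "":
                pairs.append({"title": text[t_start:t_end],
                              "person": text[p_start:p_end]})
    return pairs


sample = "Chairman John Smith praised the director. Mary Jones left."
annotations = [
    span(sample, "Chairman", "JobTitle"),
    span(sample, "John Smith", "Person"),
    span(sample, "director", "JobTitle"),
    span(sample, "Mary Jones", "Person"),
]
pairs = pair_titles_with_persons(sample, annotations)
print(pairs)
```

Note that "director" is not paired with "Mary Jones" because a full stop separates them. A strict adjacency rule like this is precise but misses constructions such as "John Smith, the chairman", which is one aspect worth checking when judging how well the grammar generalizes.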
Write your observations to <YOUR_FILE>. In particular, comment on how well the Gazetteer and NE Transducer perform, and describe how well the grammar works. Note that no coreference resolution is used (optionally, you can try one).
Attachments (10)
- wiki.txt (1.8 KB)
- wiki.phrases (25.1 KB)
- wiki.output (883 bytes)
- demo.py (1.3 KB)
- jobtitle.jape (322 bytes)
- jobtitleperson.jape (649 bytes)
- text1.txt (1.4 KB)
- text2.txt (355 bytes)
- text3.txt (2.1 KB)
- annie.png (8.9 KB)