Version 12 (modified by Zuzana Nevěřilová, 4 years ago)


Extracting structured information from text

IA161 Advanced NLP Course, Course Guarantee: Aleš Horák

Prepared by: Vojtěch Kovář

State of the Art

Information extraction (IE) is a technology based on analyzing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in information retrieval (IR) applications such as Internet search engines like Google.
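To make the "text in, fixed-format data out" idea concrete, here is a minimal sketch in Python. It is not how GATE works internally. The title list, the regular expression, the function name `extract_records` and the sample sentence are all assumptions made up for illustration, and the pattern only handles plain ASCII names.

```python
import re

# Illustrative title list (an assumption, not a real gazetteer).
TITLES = ["Chief Executive", "President", "Prime Minister", "Director"]

# Hypothetical pattern: a known title followed by two or more capitalised words.
PATTERN = re.compile(
    r"\b(?P<title>" + "|".join(re.escape(t) for t in TITLES) + r")\s+"
    r"(?P<person>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"
)


def extract_records(text):
    """Turn free text into a list of fixed-format records (dicts)."""
    return [
        {"job_title": m.group("title"), "person": m.group("person")}
        for m in PATTERN.finditer(text)
    ]


# Made-up sample sentence.
sample = "Yesterday Prime Minister Jane Novak met Director Paul Smith in Brno."
records = extract_records(sample)
for record in records:
    print(record)
```

Records of this shape can be stored in a database or spreadsheet, or used for indexing, which is the kind of output described above.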


  1. Cunningham, Hamish. An Introduction to Information Extraction. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
  2. Chang, Chia-Hui, et al. A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering 18.10 (2006).
  3. Banko, Michele, et al. Open information extraction for the web. IJCAI. Vol. 7. 2007.
  4. Fader, Anthony, Soderland, Stephen and Etzioni, Oren. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.
  5. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

Practical Session

We will extract information from news articles using GATE.

  1. Create <YOUR_FILE>, a text file named ia161-UCO-08.txt where UCO is your university ID.
  2. Download and install GATE (Java 8 is necessary) from
  3. Run GATE
  4. Load ANNIE (with defaults)
  5. Create language resources:
    • right click on Language Resources/New/GATE Document in the left menu
    • change markupAware to false
    • change sourceUrl to stringContent and paste some news text
    • you can find three sample texts here:
  6. Create corpus:
    • right click on Language Resources/New/GATE Corpus in the left menu
    • drag and drop the documents into the corpus
  7. Run ANNIE: Click on Applications/Annie in the left menu, select Corpus
  8. Observe the annotated results: click on a document, then on Annotation Sets and/or Annotation List.

So far, GATE has done little more than Stanford NER did in lecture 04. Note, however, that all tokens are annotated and POS-tagged.
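The annotations you see in the Annotation List are stand-off: each one points to a character span in the document and carries a type and a set of features. The following Python sketch models that idea only. The `Annotation` class, the helper functions, the sample sentence and the hand-written POS tags are illustrative assumptions, and the feature names `string` and `category` are meant to resemble what ANNIE displays for tokens rather than to reproduce its output exactly.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A stand-off annotation: character offsets plus a type and features."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


# Made-up sentence; offsets and POS tags below are written by hand.
text = "Director Paul Smith resigned."

annotations = [
    Annotation(0, 8, "Token", {"string": "Director", "category": "NNP"}),
    Annotation(9, 13, "Token", {"string": "Paul", "category": "NNP"}),
    Annotation(14, 19, "Token", {"string": "Smith", "category": "NNP"}),
    Annotation(20, 28, "Token", {"string": "resigned", "category": "VBD"}),
    Annotation(28, 29, "Token", {"string": ".", "category": "."}),
    Annotation(9, 19, "Person"),
]


def covered_text(annotation):
    """Return the text span an annotation points to."""
    return text[annotation.start:annotation.end]


def tokens_within(outer):
    """Return the Token annotations lying inside another annotation."""
    return [
        a for a in annotations
        if a.type == "Token" and a.start >= outer.start and a.end <= outer.end
    ]


person = next(a for a in annotations if a.type == "Person")
for token in tokens_within(person):
    print(covered_text(token), token.features["category"])
```

Because tokens and entities share the same offsets, a rule can combine them, for example by asking which tokens fall inside a Person span.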

We add rules for extracting job titles and the respective person names:

  1. Right click Processing Resources/New/JAPE Transducer in the left menu
  2. Click on grammarURL and choose grammar jobtitle.jape
  3. Click on Applications/Annie in the left menu and add the JAPE Transducer to the ANNIE pipeline
  4. Run ANNIE again: Click on Applications/Annie in the left menu, select Corpus
  5. Observe the annotated results: click on a document, then on Annotation Sets and/or Annotation List. If applicable, you can see the new annotation JobTitle.
  6. Examine the grammars jobtitle.jape and jobtitleperson.jape

Add the new grammar jobtitleperson.jape and observe the results.
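The contents of jobtitleperson.jape are not reproduced here. As a rough guess at the kind of pattern such a grammar could express, the Python sketch below pairs each JobTitle annotation with a Person annotation that follows it with only whitespace in between. The function names, the dictionary layout and the sample sentence are hypothetical, and real JAPE rules match over GATE annotations rather than over Python dictionaries.

```python
def span(text, ann_type, surface):
    """Build a toy annotation by locating `surface` in `text` (first match)."""
    start = text.index(surface)
    return {"type": ann_type, "start": start, "end": start + len(surface)}


def pair_titles_with_persons(text, annotations):
    """Pair each JobTitle with a Person separated from it only by whitespace."""
    titles = [a for a in annotations if a["type"] == "JobTitle"]
    persons = [a for a in annotations if a["type"] == "Person"]
    pairs = []
    for title in sorted(titles, key=lambda a: a["start"]):
        for person in persons:
            if person["start"] < title["end"]:
                continue  # the person does not come after the title
            between = text[title["end"]:person["start"]]
            if between.strip() == "":
                pairs.append((
                    text[title["start"]:title["end"]],
                    text[person["start"]:person["end"]],
                ))
                break
    return pairs


# Made-up sentence; the annotations stand in for what earlier pipeline steps produce.
text = "Director Paul Smith met President Anna Lee, and later Eva Black spoke."
annotations = [
    span(text, "JobTitle", "Director"),
    span(text, "Person", "Paul Smith"),
    span(text, "JobTitle", "President"),
    span(text, "Person", "Anna Lee"),
    span(text, "Person", "Eva Black"),
]

pairs = pair_titles_with_persons(text, annotations)
print(pairs)
```

A rule this strict misses constructions such as "Paul Smith, the director of ...", which is worth keeping in mind when judging how well a grammar generalises to new documents.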

Optionally, you can add further documents and observe how universal the jobtitleperson.jape grammar is.

Write your observations to <YOUR_FILE>.

You may modify or draw inspiration from this demo script.
