Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2018/InformationExtraction

Sep 12, 2019, 11:11:55 AM (18 months ago)
Ales Horak

copied from private/AdvancedNlpCourse/InformationExtraction


= Extracting structured information from text =

[[|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová
== State of the Art ==

Information extraction (IE) is a technology based on analyzing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be displayed directly to users, stored in a database or spreadsheet for later analysis, or used for indexing in information retrieval (IR) applications such as Internet search engines like Google.
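As a rough illustration of the "text in, fixed-format data out" idea, the following Python sketch turns two made-up news sentences into structured records. The `Appointment` record, the regular expression and the sample sentences are assumptions chosen for this example only. Real IE systems such as GATE rely on much richer linguistic analysis than a single pattern.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Appointment:
    """Fixed-format output record (hypothetical schema for this sketch)."""
    person: str
    role: str
    organization: str


# One or more capitalized words, e.g. "Jane Smith" or "Acme Holdings".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical surface pattern: "<Person> was appointed <role> of <Organization>".
PATTERN = re.compile(
    rf"(?P<person>{NAME}) was appointed (?P<role>[a-z ]+?) of (?P<organization>{NAME})"
)


def extract_appointments(text: str) -> List[Appointment]:
    """Return one record per match of the pattern in the text."""
    return [Appointment(**m.groupdict()) for m in PATTERN.finditer(text)]


text = (
    "Jane Smith was appointed chief executive of Acme Holdings on Monday. "
    "Analysts said John Brown was appointed chairman of Globex last year."
)
records = extract_appointments(text)
for record in records:
    print(record)
```

The resulting records could be written to a database or spreadsheet as they are. A sentence phrased differently (for example in the passive voice) would be missed, which is why practical systems combine many rules or learned models.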
=== References ===

 1. Cunningham, Hamish. An Introduction to Information Extraction. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
 1. Chang, Chia-Hui, et al. A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering 18.10 (2006).
 1. Banko, Michele, et al. Open Information Extraction from the Web. IJCAI. Vol. 7. 2007.
 1. Fader, Anthony, Soderland, Stephen and Etzioni, Oren. Identifying Relations for Open Information Extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.
 1. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
== Practical Session ==

We will extract information from news articles using GATE.

 1. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}} where '''UCO''' is your university ID.
 1. Download the GATE installer (Java 8 is required) and install it by running:
 {{{
java -jar gate-<VERSION>-installer.jar
 }}}
 1. Run GATE.
 1. Load ANNIE (with defaults) and read about its components. [[br]]
 [[Image(annie.png)]]
 1. Create document(s):
   * right click on `Language Resources/New/GATE Document` in the left menu
   * change {{{markupAware}}} to {{{false}}}
   * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text
   * repeat these steps for each additional document
   * you can find three sample texts here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt]
 1. Create a corpus:
   * right click on `Language Resources/New/GATE Corpus` in the left menu
   * drag and drop the documents into the corpus
 1. Run ANNIE: click on `Applications/Annie` in the left menu and select your corpus under `Corpus`.
 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`.
So far, GATE has not done much more than Stanford NER did in lecture 04. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type `Lookup`.
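In GATE, annotations are stand-off: each one records a start and end offset into the text, a type, and a map of features. `Lookup` annotations are produced by the gazetteer when a listed phrase occurs in the text. The following Python sketch imitates that idea on a toy scale. It is not GATE's API: the `Annotation` class, the three-entry gazetteer and the sentence are hypothetical and serve only to show what such annotations look like.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Stand-off annotation: character offsets, a type, and a feature map."""
    start: int
    end: int
    type: str
    features: Dict[str, str] = field(default_factory=dict)


# Hypothetical mini-gazetteer mapping a phrase to its majorType feature.
GAZETTEER = {
    "chief executive": "jobtitle",
    "chairman": "jobtitle",
    "London": "location",
}


def lookup(text: str, gazetteer: Dict[str, str]) -> List[Annotation]:
    """Create a Lookup annotation for every whole-word gazetteer match."""
    annotations = []
    for phrase, major_type in gazetteer.items():
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            annotations.append(
                Annotation(m.start(), m.end(), "Lookup", {"majorType": major_type})
            )
    # Sort by position so the list reads like GATE's Annotation List view.
    return sorted(annotations, key=lambda a: a.start)


text = "The chairman met the chief executive in London."
lookups = lookup(text, GAZETTEER)
for a in lookups:
    print(a.start, a.end, a.type, a.features, repr(text[a.start:a.end]))
```

Because the annotations only point into the text, later processing steps can add further annotation types over the same spans without modifying the document itself.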
We add rules for extracting ''job titles'' and the respective ''person names''. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape].
 1. Download the grammar(s).
 1. Right click `Processing Resources/New/JAPE Transducer` in the left menu.
 1. Click on {{{grammarURL}}} and choose the grammar file {{{jobtitle.jape}}}.
 1. Click on `Applications/Annie` in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources).
 1. Run ANNIE again: click on `Applications/Annie` in the left menu.
 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`. If applicable, you should see the new annotation type `JobTitle`.
 1. Inspect the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}.
Add a new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results.
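The grammar files themselves are not reproduced here. The following sketch only illustrates the general idea of a rule that builds a new annotation from existing ones: a job title directly followed by a person name. It is written in Python rather than JAPE, and the `Ann` type, the helper `span`, the output type `JobTitlePerson` and the adjacency condition are all assumptions made for this example, not the contents of {{{jobtitleperson.jape}}}.

```python
from typing import List, NamedTuple


class Ann(NamedTuple):
    """Minimal stand-off annotation: offsets and a type."""
    start: int
    end: int
    type: str


def span(text: str, phrase: str, type_: str) -> Ann:
    """Annotate the first occurrence of phrase (stands in for earlier pipeline steps)."""
    start = text.index(phrase)
    return Ann(start, start + len(phrase), type_)


def job_title_person(text: str, anns: List[Ann]) -> List[Ann]:
    """Join a JobTitle with a Person that follows it after whitespace only."""
    titles = [a for a in anns if a.type == "JobTitle"]
    persons = [a for a in anns if a.type == "Person"]
    result = []
    for t in titles:
        for p in persons:
            if p.start >= t.end and text[t.end:p.start].strip() == "":
                result.append(Ann(t.start, p.end, "JobTitlePerson"))
    return result


text = ("Chairman John Brown praised analyst Mary Jones, "
        "but the president of Globex, Ann Lee, declined.")
anns = [
    span(text, "Chairman", "JobTitle"), span(text, "John Brown", "Person"),
    span(text, "analyst", "JobTitle"), span(text, "Mary Jones", "Person"),
    span(text, "president", "JobTitle"), span(text, "Ann Lee", "Person"),
]
matches = job_title_person(text, anns)
covered = [text[a.start:a.end] for a in matches]
print(covered)
```

The third pair is not matched because other words stand between the title and the name. Cases like this are worth looking for when judging how far a hand-written rule generalizes.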
Optionally, you can add further documents and observe how well the {{{jobtitleperson.jape}}} grammar generalizes.
Write your observations to {{{<YOUR_FILE>}}}. In particular, comment on how well the Gazetteer and NE Transducer perform and describe how well the grammar works. Note that no coreference resolution is used (optionally, you can try one).
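One possible way to make such observations more concrete, offered here only as a suggestion, is to mark the correct annotations in a short text by hand and compare them with what the pipeline produced. The sketch below computes precision and recall over exact matches of (start, end, type) triples. The function name and both annotation sets are invented for illustration and are not results from GATE.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, annotation type)


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted annotations against gold ones."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example data: what a person marked versus what a pipeline returned.
gold = {(0, 8, "JobTitle"), (9, 19, "Person"), (36, 46, "Person")}
predicted = {(0, 8, "JobTitle"), (9, 19, "Person"),
             (50, 56, "Person"), (60, 66, "Organization")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: an annotation that covers the right name but with slightly different boundaries counts as an error, which is worth mentioning if it affects your numbers.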