= Extracting structured information from text =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák

Prepared by: Zuzana Nevěřilová

== State of the Art ==

Information extraction (IE) is a technology based on analyzing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be displayed directly to users, stored in a database or spreadsheet for later analysis, or used for indexing in information retrieval (IR) applications such as Internet search engines like Google.

=== References ===

 1. Cunningham, Hamish. [https://gate.ac.uk/sale/ell2/ie/ An Introduction to Information Extraction]. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
 1. Chang, Chia-Hui, et al. [https://www.researchgate.net/profile/Khaled_Shaalan/publication/200110627_A_Survey_of_Web_Information_Extraction_Systems/links/0912f50abd8c6b314d000000.pdf A Survey of Web Information Extraction Systems]. IEEE Transactions on Knowledge and Data Engineering 18.10 (2006).
 1. Banko, Michele, et al. [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-429.pdf Open information extraction for the web]. IJCAI. Vol. 7. 2007.
 1. Fader, Anthony, Soderland, Stephen and Etzioni, Oren. [http://dl.acm.org/citation.cfm?id=2145596 Identifying relations for open information extraction]. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.
 1. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

== Practical Session ==

We will extract information from news articles using GATE.

 1. Create {{{}}}, a text file named {{{ia161-UCO-08.txt}}} where '''UCO''' is your university ID.
 1. Download and install GATE (Java 8 is required) from https://gate.ac.uk/download/
 1. Run GATE.
 1. Load ANNIE (with defaults).
 1. Create language resources:
   * right-click Language !Resources/New/GATE Document in the left menu
   * change {{{markupAware}}} to {{{false}}}
   * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text
   * three sample texts are available here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt]
 1. Create a corpus:
   * right-click Language !Resources/New/GATE Corpus in the left menu
   * drag and drop the documents into the corpus
 1. Run ANNIE: click !Applications/Annie in the left menu and select the corpus.
 1. Observe the annotated results: click on a document, then on Annotation Sets and/or Annotation List.

So far, GATE has done little more than Stanford NER did in lecture 04. Note, however, that all tokens are annotated and POS-tagged.

Next, we add rules for extracting job titles and the respective person names. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape].

 1. Right-click Processing !Resources/New/JAPE Transducer in the left menu.
 1. Click on {{{grammarURL}}} and choose the grammar {{{jobtitle.jape}}}.
 1. Click !Applications/Annie in the left menu and add the JAPE Transducer to the ANNIE pipeline.
 1. Run ANNIE again: click !Applications/Annie in the left menu and select the corpus.
 1. Observe the annotated results: click on a document, then on Annotation Sets and/or Annotation List. If applicable, you will see the new annotation JobTitle.
 1. Examine the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}.

Add a new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results.

Optionally, you can add further documents and observe how universal the {{{jobtitleperson.jape}}} grammar is. Write your observations to {{{}}}.
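
For intuition about what a rule that pairs a job title with a person name does, the following is a rough Python analogue. It is not JAPE and does not reproduce the attached grammars: the small title list, the capitalised-word pattern standing in for ANNIE's Person annotation, and the function name {{{extract_title_person}}} are all assumptions made for illustration.

```python
import re

# Hypothetical mini-gazetteer of job titles (an assumption for illustration;
# the real lookup in GATE is done by ANNIE's gazetteer and the JAPE grammar).
JOB_TITLES = ["prime minister", "president", "chief executive",
              "spokesman", "minister", "director"]

# Longer titles first, so that "prime minister" wins over "minister".
_title_alternatives = "|".join(
    re.escape(t) for t in sorted(JOB_TITLES, key=len, reverse=True)
)

# Title is matched case-insensitively; the scoped (?i:...) flag keeps the
# person pattern below case-sensitive.
TITLE = r"\b(?i:(?P<title>" + _title_alternatives + r"))"

# Two or more capitalised ASCII words stand in for a Person annotation.
PERSON = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"

# Rule: a job title immediately followed by a person name.
TITLE_PERSON = re.compile(TITLE + r" " + PERSON)


def extract_title_person(text):
    """Return a list of {job_title, person} records found in the text."""
    return [
        {"job_title": m.group("title"), "person": m.group("person")}
        for m in TITLE_PERSON.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Prime Minister Jane Smith met President John Doe in Paris."
    for record in extract_title_person(sample):
        print(record)
```

A sketch like this only covers the pattern "title followed by name". Comparing it with what {{{jobtitleperson.jape}}} actually matches on further documents is one way to judge how universal the grammar is.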