= Extracting structured information from text =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák

Prepared by: Zuzana Nevěřilová


== State of the Art ==

Information extraction (IE) is a technology based on
analyzing natural language in order to extract snippets
of information. The process takes texts (and sometimes
speech) as input and produces fixed-format, unambiguous
data as output. This data may be displayed directly to
users, stored in a database or spreadsheet for later
analysis, or used for indexing in information retrieval (IR)
applications such as Internet search engines like Google.

=== References ===

 1. Cunningham, Hamish. [https://gate.ac.uk/sale/ell2/ie/ An Introduction to Information Extraction]. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
 1. Chang, Chia-Hui, et al. [https://www.researchgate.net/profile/Khaled_Shaalan/publication/200110627_A_Survey_of_Web_Information_Extraction_Systems/links/0912f50abd8c6b314d000000.pdf A Survey of Web Information Extraction Systems]. Knowledge and Data Engineering, IEEE Transactions on 18.10 (2006).
 1. Banko, Michele, et al. [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-429.pdf Open information extraction for the web]. IJCAI. Vol. 7. 2007.
 1. Fader, Anthony, Soderland, Stephen and Etzioni, Oren. [http://dl.acm.org/citation.cfm?id=2145596 Identifying relations for open information extraction]. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.
 1. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

== Practical Session ==

We will extract information from news articles using GATE.

 1. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}}, where '''UCO''' is your university ID.
 1. Download GATE from https://gate.ac.uk/download/ and install it (Java 8 is required). Use either the MS installer or the Java installer. The Java installer can be started as an application or from the command line:
{{{
java -jar gate-<VERSION>-installer.jar
}}}
 1. Run GATE:
{{{
GATE_Developer_<VERSION>/bin/gate.sh
}}}
 1. Load ANNIE (with defaults) and read about its components [[br]]
 [[Image(annie.png)]]
 1. Create document(s):
   * right click on `Language Resources/New/GATE Document` in the left menu
   * change {{{markupAware}}} to {{{false}}}
   * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text
   * repeat these steps for each additional document
   * you can find three sample texts here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt]
 1. Create a corpus:
   * right click on `Language Resources/New/GATE Corpus` in the left menu
   * drag and drop the documents into the corpus
 1. Run ANNIE: click on `Applications/Annie` in the left menu and select the corpus you created under `Corpus`
 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`.

So far, GATE has done little more than Stanford NER does. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type `Lookup`.

We will now add rules for extracting ''job titles'' and the corresponding ''person names''. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape].

 1. Download the grammar(s).
 1. Right click `Processing Resources/New/JAPE Transducer` in the left menu
 1. Click on {{{grammarURL}}} and choose the grammar file {{{jobtitle.jape}}}
 1. Click on `Applications/Annie` in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources)
 1. Run ANNIE again: click on `Applications/Annie` in the left menu
 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`. If the grammar matched anything, you will see the new annotation type `JobTitle`.
 1. Read through the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}

Add a new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results.

Optionally, you can add further documents and observe how universal the {{{jobtitleperson.jape}}} grammar is.

Write your observations to {{{<YOUR_FILE>}}}. In particular, comment on how well the Gazetteer and the NE Transducer perform, and describe how well the grammar works. Note that no coreference resolution is used (optionally, you can try adding one).
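
To make the idea behind the exercise more concrete, the following toy Python sketch imitates the two stages used above: a gazetteer lookup that marks job titles, followed by a pattern rule that pairs a title with the name directly after it. It is not GATE code and does not reproduce the contents of {{{jobtitle.jape}}} or {{{jobtitleperson.jape}}}. The title list, the regular expressions and the function name {{{extract_job_title_persons}}} are all assumptions made for illustration only.

```python
import re

# Hypothetical mini-gazetteer; the lists shipped with GATE are far larger.
JOB_TITLES = ["chief executive", "prime minister", "president", "spokesman"]

# Stage 1 (gazetteer-like lookup): match any listed title, longest first.
_TITLE_RE = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, JOB_TITLES), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# Crude stand-in for a Person annotation: two or more capitalised words.
# Only ASCII capitals start a word here, which is a deliberate simplification.
_PERSON_RE = re.compile(r"\s+((?:[A-Z][\w'-]+)(?:\s+[A-Z][\w'-]+)+)")


def extract_job_title_persons(text):
    """Return (job title, person name) pairs where a title directly precedes a name."""
    pairs = []
    for title in _TITLE_RE.finditer(text):
        # Stage 2 (rule): the name must start right after the title.
        person = _PERSON_RE.match(text, title.end())
        if person:
            pairs.append((title.group(1).lower(), person.group(1)))
    return pairs
```

For the sentence "Yesterday, Prime Minister Jane Smith met the spokesman of the union." the function returns one pair, {{{('prime minister', 'Jane Smith')}}}. The second title, "spokesman", is found by the lookup but produces no pair because no name follows it. A later "she" or "the minister" would not be linked to Jane Smith either, which is the kind of gap that coreference resolution addresses.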