Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2020/InformationExtraction


Timestamp:
Aug 31, 2021, 2:12:04 PM (3 years ago)
Author:
Ales Horak
Comment:

copied from private/AdvancedNlpCourse/InformationExtraction

= Extracting structured information from text =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová

== State of the Art ==

Information extraction (IE) is a technology based on
analyzing natural language in order to extract snippets
of information. The process takes text (and sometimes
speech) as input and produces fixed-format, unambiguous
data as output. This data may be displayed directly to
users, stored in a database or spreadsheet for later
analysis, or used for indexing in information retrieval
(IR) applications such as Internet search engines.

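As a rough illustration of "text in, fixed-format data out", the following minimal sketch turns two invented news sentences into records with fixed fields. The sentences, the field names and the single hand-written pattern are all assumptions made for this example; real IE systems combine many such components.

```python
import re

# Invented sample text; the names and dates are hypothetical.
TEXT = (
    "Acme Corp. appointed Jane Doe as chief executive on 12 March 2019. "
    "Later, Globex appointed John Smith as finance director on 3 May 2020."
)

# One hand-written pattern for "<org> appointed <person> as <role> on <date>".
APPOINTMENT = re.compile(
    r"(?P<org>[A-Z]\w+(?: [A-Z]\w+\.?)*) appointed "
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) as "
    r"(?P<role>[a-z ]+?) on "
    r"(?P<date>\d{1,2} \w+ \d{4})"
)

def extract_appointments(text):
    """Return one dict with fixed keys per matched appointment."""
    return [m.groupdict() for m in APPOINTMENT.finditer(text)]

records = extract_appointments(TEXT)
for record in records:
    print(record)
```

Each record has the same keys, so the output could be written to a database table or a spreadsheet without further interpretation.
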
=== References ===

 1. Cunningham, Hamish. [https://gate.ac.uk/sale/ell2/ie/ An Introduction to Information Extraction]. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
 1. Chang, Chia-Hui, et al. [https://www.researchgate.net/profile/Khaled_Shaalan/publication/200110627_A_Survey_of_Web_Information_Extraction_Systems/links/0912f50abd8c6b314d000000.pdf A Survey of Web Information Extraction Systems]. IEEE Transactions on Knowledge and Data Engineering 18.10 (2006).
 1. Banko, Michele, et al. [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-429.pdf Open information extraction for the web]. IJCAI. Vol. 7. 2007.
 1. Fader, Anthony, Soderland, Stephen and Etzioni, Oren. [http://dl.acm.org/citation.cfm?id=2145596 Identifying relations for open information extraction]. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 2011.
 1. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

== Practical Session ==

We will extract information from news articles using GATE.

 1. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}}, where '''UCO''' is your university ID.
 1. Download GATE from https://gate.ac.uk/download/ and install it (Java 8 is required). Use either the MS installer or the Java installer, which you can run as an application or from the command line:
 {{{
java -jar gate-<VERSION>-installer.jar
}}}
 1. Run GATE:
 {{{
GATE_Developer_<VERSION>/bin/gate.sh
}}}
 1. Load ANNIE (with defaults) and read about its components. [[br]]
 [[Image(annie.png)]]
 1. Create document(s):
   * right-click on `Language Resources/New/GATE Document` in the left menu
   * change {{{markupAware}}} to {{{false}}}
   * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text
   * repeat these steps for each additional document
   * you can find three sample texts here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt]
 1. Create a corpus:
   * right-click on `Language Resources/New/GATE Corpus` in the left menu
   * drag and drop the documents to put them into the corpus
 1. Run ANNIE: click on `Applications/Annie` in the left menu and select your corpus under `Corpus`.
 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`.

So far, GATE has not done much more than Stanford NER. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type `Lookup`.

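To make the idea of token and `Lookup` annotations more concrete, here is a minimal sketch of an offset-based annotation model with a toy gazetteer. It is not GATE's implementation: the `Annotation` class, the `GAZETTEER` entries and the sample sentence are assumptions made for this example, and only the `majorType` feature name is borrowed from what GATE shows on `Lookup` annotations.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A typed span over the text, with optional features (toy model)."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)

# Hypothetical gazetteer: phrase -> majorType.
GAZETTEER = {
    "chief executive": "jobtitle",
    "president": "jobtitle",
    "London": "location",
}

def tokenize(text):
    """Annotate every word and punctuation mark as a Token."""
    return [
        Annotation("Token", m.start(), m.end(), {"string": m.group()})
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]

def lookup(text):
    """Annotate every gazetteer phrase found in the text as a Lookup."""
    found = []
    for phrase, major_type in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
            found.append(
                Annotation("Lookup", m.start(), m.end(), {"majorType": major_type})
            )
    return sorted(found, key=lambda a: a.start)

text = "The president met the chief executive in London."
tokens = tokenize(text)
lookups = lookup(text)
for a in lookups:
    print(a.type, a.features["majorType"], repr(text[a.start:a.end]))
```

Because annotations only point into the text by offsets, several layers (tokens, lookups, named entities) can coexist over the same document, which is what later grammar rules build on.
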
We now add rules for extracting ''job titles'' and the respective ''person names''. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape].

 1. Right-click on `Processing Resources/New/JAPE Transducer` in the left menu.
 1. Download the grammar(s).
 1. Click on {{{grammarURL}}} and choose the grammar file {{{jobtitle.jape}}}.
 1. Click on `Applications/Annie` in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources).
 1. Run ANNIE again: click on `Applications/Annie` in the left menu.
 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`. If the grammar matched anything, you will see a new annotation type, `JobTitle`.
 1. Inspect the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}.

Add a new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results.

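The attached grammars are not reproduced on this page, so the following is only a sketch of the general kind of rule such a grammar might encode: pair a `JobTitle` annotation with a `Person` annotation that directly follows it. The `Ann` type, the helper names and the sample sentence are assumptions made for this example, not the contents of {{{jobtitleperson.jape}}}.

```python
from typing import NamedTuple

class Ann(NamedTuple):
    """A typed span over the text (toy stand-in for an annotation)."""
    type: str
    start: int
    end: int

def span(text, ann_type, phrase):
    """Build an annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Ann(ann_type, start, start + len(phrase))

def pair_titles_with_persons(text, annotations):
    """Return (job title, person) pairs where a Person annotation follows
    a JobTitle annotation with only whitespace in between."""
    titles = [a for a in annotations if a.type == "JobTitle"]
    persons = [a for a in annotations if a.type == "Person"]
    pairs = []
    for t in titles:
        for p in persons:
            if p.start >= t.end and text[t.end:p.start].strip() == "":
                pairs.append((text[t.start:t.end], text[p.start:p.end]))
    return pairs

text = "Chief executive Jane Doe met John Smith, the finance director."
annotations = [
    span(text, "JobTitle", "Chief executive"),
    span(text, "Person", "Jane Doe"),
    span(text, "Person", "John Smith"),
    span(text, "JobTitle", "finance director"),
]
pairs = pair_titles_with_persons(text, annotations)
print(pairs)
```

Note that this single pattern misses "John Smith, the finance director", where the title follows the name. This is the kind of limitation to look for when judging how well a grammar covers new documents.
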
Optionally, you can add further documents and observe how well the {{{jobtitleperson.jape}}} grammar generalizes to them.

Write your observations to {{{<YOUR_FILE>}}}. In particular, comment on how well the Gazetteer and NE Transducer perform, and describe how well the grammar works. Note that no coreference resolution is used (optionally, you can try adding it).
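
One possible way to back up such observations with numbers is to label a few spans by hand and compare them with what the pipeline produced. The sketch below assumes strict exact-match scoring and uses invented spans; neither the gold labels nor the predicted ones come from the sample texts.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (type, start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical hand-labelled spans for one document.
gold = {
    ("JobTitle", 0, 15),
    ("Person", 16, 24),
    ("Person", 29, 39),
    ("JobTitle", 45, 61),
}
# Hypothetical system output: one Person missed, one JobTitle with a wrong boundary.
predicted = {
    ("JobTitle", 0, 15),
    ("Person", 16, 24),
    ("JobTitle", 41, 61),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching penalizes boundary errors as full misses; a more lenient overlap-based comparison is an equally reasonable choice, as long as the report states which one was used.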