1. get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8)
1. open the NE hierarchy:
{{{
evince cnec2.0/doc/ne-type-hierarchy.pdf
}}}

1. the data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest`, and the final evaluation data is called `etest`.
1. convert the train data to the Stanford NER format:
{{{
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml \
  > named_ent_train.tsv
}}}

Note that we removed documents that did not contain NEs. You can experiment with this option later.
1. download the Stanford NE recognizer http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
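For reference, the format produced by the conversion is a plain two-column TSV: one `token<TAB>label` pair per line, with an empty line between sentences. A minimal sketch of writing that format (the sentence below is a toy example, not taken from CNEC):

{{{#!python
# Write a toy sentence in the two-column Stanford NER TSV format:
# one "token<TAB>label" pair per line; an empty line ends a sentence.
sentence = [
    ("Václav", "PERSON"), ("Havel", "PERSON"),
    ("navštívil", "O"), ("Prahu", "LOCATION"), (".", "O"),
]

with open("toy_train.tsv", "w", encoding="utf-8") as f:
    for token, label in sentence:
        f.write(f"{token}\t{label}\n")
    f.write("\n")  # sentence boundary
}}}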
1. train the model using the default settings (`cnec.prop`). Note that `convert_cnec_stanford.py` recognizes only PERSON, LOCATION, and ORGANIZATION; you can extend the markup conversion later:
{{{
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -prop cnec.prop
}}}
1. convert the test data to the Stanford NER format:
{{{
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_dtest.xml \
  > named_ent_dtest.tsv
}}}
1. evaluate the model on `dtest`:
{{{
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz \
  -testFile named_ent_dtest.tsv
}}}

You should see results like:
{{{
CRFClassifier tagged 19993 words in 900 documents at 2388.94 words per second.
 Entity  P       R       F1      TP      FP      FN
 LOC     0.7064  0.7586  0.7316  308     128     98
 ORG     0.6943  0.5576  0.6185  184     81      146
 OTHER   0.6224  0.6498  0.6358  590     358     318
 PER     0.7727  0.8236  0.7974  425     125     91
 Totals  0.6853  0.6977  0.6914  1507    692     653
}}}
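A sketch of what `cnec.prop` might contain — the feature flags follow the example properties file from the Stanford NER documentation, and the file names match the steps in this exercise:

{{{
trainFile = named_ent_train.tsv
serializeTo = cnec-3class-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
}}}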
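The P, R, and F1 columns follow the standard definitions and can be recomputed from the TP/FP/FN counts; checking the Totals row:

{{{#!python
# Recompute precision, recall, and F1 from the Totals row above.
tp, fp, fn = 1507, 692, 653

precision = tp / (tp + fp)  # fraction of predicted entities that are correct
recall = tp / (tp + fn)     # fraction of gold entities that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # → 0.6853 0.6977 0.6914
}}}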
In the output, the first column contains the input tokens, the second column the correct (gold) answers, and the last column the model's answers. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity entirely, detected its boundaries incorrectly, or classified an entity incorrectly.
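A rough way to get such an estimate is to compare the gold and predicted labels token by token (a real analysis would compare whole entity spans; the pairs below are toy data, not actual model output):

{{{#!python
# Token-level approximation of the error types, from (gold, predicted)
# label pairs. Toy data — in practice, read the columns of the output.
pairs = [
    ("PERSON", "PERSON"),    # correct
    ("LOCATION", "O"),       # entity missed
    ("O", "ORGANIZATION"),   # spurious entity (detection/boundary error)
    ("PERSON", "LOCATION"),  # wrong entity class
]

missed = sum(1 for g, p in pairs if g != "O" and p == "O")
spurious = sum(1 for g, p in pairs if g == "O" and p != "O")
confused = sum(1 for g, p in pairs if g != "O" and p != "O" and g != p)

print(missed, spurious, confused)  # → 1 1 1
}}}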
10. evaluate the model on `dtest` using only NEs that are not present in the training data. First, you need to keep only those documents that do not contain NEs from the training data. Use the script `get_uknown.py`, then run the NER:
{{{
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz \
  -testFile named_ent_dtest_unknown.tsv
}}}

Copy the result to `<YOUR_FILE>`.
11. test on your own input:
{{{
java -mx600m -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt
}}}

Copy the result to `<YOUR_FILE>`.

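The idea behind the filtering can be sketched as follows (`unseen_only` is a hypothetical illustration, not the actual `get_uknown.py`): collect the entity strings from the training data, then keep only the test documents in which none of them occurs.

{{{#!python
# Hypothetical sketch of the filtering idea (not the actual script):
# keep only test documents whose entity strings never occur in training.
def unseen_only(test_docs, train_entities):
    """test_docs: list of (doc_id, set of entity strings)."""
    return [doc_id for doc_id, ents in test_docs
            if not ents & train_entities]

train_entities = {"Praha", "Václav Havel"}
test_docs = [("d1", {"Brno"}), ("d2", {"Praha", "Karel Čapek"})]
print(unseen_only(test_docs, train_entities))  # → ['d1']
}}}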
12. (optional) try to improve the results; suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try a different word shape (something following the pattern `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1`) or other word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
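These options go into the properties file; for instance (the flag names are from the Stanford NER feature documentation, and `cz-gazetteer.txt` is a placeholder file name):

{{{
useKnownLCWords = false
useGazettes = true
gazette = cz-gazetteer.txt
cleanGazette = true
wordShape = dan2useLC
}}}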
13. (optional) evaluate the model on `dtest`; run the final evaluation on `etest`.

1. Open Google Colab at [[https://colab.research.google.com/drive/1mnz-P30CLxrxQ0yyqpcLwVJgi7e59shi?usp=sharing]]
1. Follow the instructions in the notebook. There are three obligatory tasks. Write down your answers to `<YOUR_FILE>`.