[http://nlp.fi.muni.cz/~xsuchom2/anlp-05-LanguageResourcesFromWeb.pdf]
== Practical Session task ==
=== Plagiarists vs. plagiarism detectors ===
 1. Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words ≤ document length ≤ 500 words. 20 % ≤ plagiarism content ≤ 75 % (100 % if done well).
 1. Select a detection algorithm and implement it in Python. Your own script must detect at least 1 of your plagiarisms and must fail to detect at least 1.
 1. Input format: POS-tagged vertical consisting of 10 structures `doc` with attributes `author`, `id`, `class`, `source`. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source (for a plagiarism) or the document's own id (for an original). For simplicity, a plagiarism cannot have more than one source here.
 1. Output format: one plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure.
 1. Your script will be evaluated using data made by others.
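One possible starting point for the detection step is sketched below. It is not the technique used by the frame script; it swaps in plain Jaccard similarity over lemma sets, assumes Python 3, and the names `jaccard`, `detect_source`, `candidate_ids` and the threshold of 0.5 are all hypothetical. The candidate sources are passed in explicitly; restricting them to documents 1 and 2 here is an assumption made only to keep the illustration small.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, compared as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_source(doc_id, lemmas_by_id, candidate_ids, threshold=0.5):
    """Return the id of the most similar candidate document.

    Falls back to doc_id itself when no candidate reaches the threshold,
    which this sketch treats as "no source found".
    """
    best_id, best_score = doc_id, 0.0
    for cand in candidate_ids:
        if cand == doc_id:
            continue
        score = jaccard(lemmas_by_id[doc_id], lemmas_by_id[cand])
        if score > best_score:
            best_id, best_score = cand, score
    return best_id if best_score >= threshold else doc_id


# Lemma column (second column) of the five documents in the input example.
lemmas_by_id = {
    "1": ["dnes", "být", "pěkný", "den", "!"],
    "2": ["dnes", "být", "mnoho", "pěkný", "den", "!"],
    "3": ["dnes", "být", "ale", "pěkný", "den", "!"],
    "4": ["dnes", "být", "ale", "mnoho", "pěkný", "den", "!"],
    "5": ["xx", "xx", "xx", "xx", "xx", "xx"],
}

for doc_id in ("3", "4", "5"):
    detected = detect_source(doc_id, lemmas_by_id, candidate_ids=("1", "2"))
    print(doc_id, detected, sep="\t")
```

Set-based similarity ignores word order and repetition, so it is easy to defeat by reordering or paraphrasing; word n-grams or a different threshold are natural things to try when extending it.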

=== Text processing pipelines ===
 * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
 * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}

=== Input data example (2 originals + 3 plagiarisms only) ===
{{{
<doc author="Já První" id="1" class="original" source="1">
<s>
Dnes dnes k6eAd1
je být k5eAaImIp3nS
pěkný pěkný k2eAgInSc4d1 pěkný
den den k1gInSc4 den
<g/>
! ! k?
</s>
</doc>
<doc author="Já První" id="2" class="original" source="2">
<s>
Dnes dnes k6eAd1
je být k5eAaImIp3nS
moc mnoho k6
pěkný pěkný k2eAgInSc4d1 pěkný
den den k1gInSc4 den
<g/>
! ! k?
</s>
</doc>
<doc author="Já První" id="3" class="plagiarism" source="1">
<s>
Dnes dnes k6eAd1
je být k5eAaImIp3nS
ale ale k9
pěkný pěkný k2eAgInSc4d1 pěkný
den den k1gInSc4 den
<g/>
! ! k?
</s>
</doc>
<doc author="Já První" id="4" class="plagiarism" source="2">
<s>
Dnes dnes k6eAd1
je být k5eAaImIp3nS
ale ale k9
moc mnoho k6
pěkný pěkný k2eAgInSc4d1 pěkný
den den k1gInSc4 den
<g/>
! ! k?
</s>
</doc>
<doc author="Já První" id="5" class="plagiarism" source="1">
<s>
xx xx x
xx xx x
xx xx x
xx xx x
xx xx x
<g/>
xx xx x
</s>
</doc>
}}}
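A minimal sketch of reading this input in Python 3 follows. It assumes the token columns are tab-separated in the order word, lemma, tag, and that any further columns can be ignored; `parse_vertical` and the three regular expressions are hypothetical names, not part of the frame script.

```python
import re

DOC_OPEN = re.compile(r"<doc\s+([^>]*)>")
ATTR = re.compile(r'(\w+)="([^"]*)"')
STRUCT = re.compile(r"</?\w+[^>]*>$")  # structural lines such as <s>, </s>, <g/>


def parse_vertical(lines):
    """Return a list of dicts, one per <doc>.

    Each dict holds the doc attributes (author, id, class, source) plus
    "tokens", a list of (word, lemma, tag) tuples.
    """
    docs = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        opened = DOC_OPEN.match(line)
        if opened:
            current = dict(ATTR.findall(opened.group(1)))
            current["tokens"] = []
            docs.append(current)
        elif line.strip() == "</doc>":
            current = None
        elif current is not None and line and not STRUCT.match(line):
            cols = line.split("\t")
            if len(cols) >= 3:  # keep word, lemma, tag; drop extra columns
                current["tokens"].append((cols[0], cols[1], cols[2]))
    return docs


sample = (
    '<doc author="Já První" id="1" class="original" source="1">\n'
    "<s>\n"
    "Dnes\tdnes\tk6eAd1\n"
    "je\tbýt\tk5eAaImIp3nS\n"
    "<g/>\n"
    "!\t!\tk?\n"
    "</s>\n"
    "</doc>\n"
)
docs = parse_vertical(sample.splitlines())
print(docs[0]["id"], [lemma for _, lemma, _ in docs[0]["tokens"]])
```

In a real script the lines would come from `sys.stdin` rather than from a string, since the pipelines above write to standard output.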

=== Output example ===
{{{
Doc set by Já První
3 1 1
4 2 2
5 5 1
Set precision: 0.67, recall: 1.00, F1: 0.80
}}}
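The report layout and the F1 value can be sketched as follows (Python 3). This page does not spell out how precision and recall are counted, so the sketch takes both as given and only shows the harmonic mean and the formatting; `f1_score` and `format_report` are hypothetical names, and the inputs 2/3 and 1.0 are chosen to match the example line above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def format_report(author, rows, precision, recall):
    """Build the report: header, one TAB-separated row per plagiarism,
    then the evaluation line."""
    lines = ["Doc set by %s" % author]
    for doc_id, detected_source, real_source in rows:
        lines.append("\t".join((doc_id, detected_source, real_source)))
    lines.append(
        "Set precision: %.2f, recall: %.2f, F1: %.2f"
        % (precision, recall, f1_score(precision, recall))
    )
    return "\n".join(lines)


rows = [("3", "1", "1"), ("4", "2", "2"), ("5", "5", "1")]
print(format_report("Já První", rows, precision=2 / 3, recall=1.0))
```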

=== Frame script to extend (implements a simple plagiarism detection technique) ===
[raw-attachment:plagiarism_simple.py]

Usage (Czech or English input):
{{{
cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
}}}