| 1 | = Building Language Resources from the Web = |
| 2 | Web crawling, boilerplate removal, de-duplication and plagiarism detection. |
| 3 | |
| 4 | [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák |
| 5 | |
| 6 | Prepared by: Vít Suchomel |
| 7 | |
| 8 | == State of the Art == |
| 9 | |
| 10 | === References === |
| 11 | 1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008. |
| 12 | 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011. |
| 13 | 1. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010. |
| 14 | 1. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, pages 1613–0073. |
| 15 | 1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse] |
| 16 | {{{#!comment |
| 17 | 1. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014. |
| 18 | 1. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL. |
| 19 | }}} |
| 20 | |
| 21 | == Slides == |
| 22 | [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf] |
| 23 | |
| 24 | == Practical Session task == |
| 25 | === Plagiarists vs. plagiarism detectors === |
| 26 | 1. Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words ≤ document length ≤ 500 words. 20% ≤ plagiarised content ≤ 75% (100% if done well). |
| 27 | 1. Select a detection algorithm and implement it in Python. Your own script must detect at least 1 of your plagiarisms and must fail to detect at least 1 of them. |
| 28 | 1. Input format: POS-tagged vertical text consisting of 10 `doc` structures with attributes `author`, `id`, `class`, `source`. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original). For simplicity, a plagiarism cannot have more than one source here. |
| 29 | 1. Output format: one plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure. |
| 30 | 1. Your script will be evaluated using data made by others. |
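One possible detection algorithm is shingling with the Jaccard coefficient, as described for near-duplicate detection in reference 1. The following is only a minimal sketch under stated assumptions, not the method of the attached script: the function names `shingles`, `jaccard` and `most_similar_original` are hypothetical, documents are assumed to be already reduced to lists of lemmas, and the lemma lists are taken from the sample documents on this page.

```python
def shingles(tokens, n=2):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_original(suspect, originals, n=2):
    """Return (id, score) of the original most similar to the suspect.

    `originals` maps a document id to its list of lemmas and is
    assumed to be non-empty.
    """
    suspect_shingles = shingles(suspect, n)
    scores = {doc_id: jaccard(suspect_shingles, shingles(tokens, n))
              for doc_id, tokens in originals.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Lemmas of sample documents 1 and 2 (originals) and 3 and 4 (plagiarisms).
originals = {
    "1": ["dnes", "být", "pěkný", "den", "!"],
    "2": ["dnes", "být", "mnoho", "pěkný", "den", "!"],
}
suspect_3 = ["dnes", "být", "ale", "pěkný", "den", "!"]
suspect_4 = ["dnes", "být", "ale", "mnoho", "pěkný", "den", "!"]

print(most_similar_original(suspect_3, originals))  # ('1', 0.5)
print(most_similar_original(suspect_4, originals)[0])  # 2
```

Comparing lemmas rather than word forms makes the measure less sensitive to inflection. A real detector would probably also need a similarity threshold below which no source is reported; choosing it is part of the task.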
| 31 | |
| 32 | === Text processing pipelines === |
| 33 | * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}} |
| 34 | * Czech [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface] |
| 35 | * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}} |
| 36 | |
| 37 | === Input data example (2 + 3 documents only) === |
| 38 | {{{ |
| 39 | <doc author="Já První" id="1" class="original" source="1"> |
| 40 | <s> |
| 41 | Dnes dnes k6eAd1 |
| 42 | je být k5eAaImIp3nS |
| 43 | pěkný pěkný k2eAgInSc4d1 pěkný |
| 44 | den den k1gInSc4 den |
| 45 | <g/> |
| 46 | ! ! k? |
| 47 | </s> |
| 48 | </doc> |
| 49 | <doc author="Já První" id="2" class="original" source="2"> |
| 50 | <s> |
| 51 | Dnes dnes k6eAd1 |
| 52 | je být k5eAaImIp3nS |
| 53 | moc mnoho k6 |
| 54 | pěkný pěkný k2eAgInSc4d1 pěkný |
| 55 | den den k1gInSc4 den |
| 56 | <g/> |
| 57 | ! ! k? |
| 58 | </s> |
| 59 | </doc> |
| 60 | <doc author="Já První" id="3" class="plagiarism" source="1"> |
| 61 | <s> |
| 62 | Dnes dnes k6eAd1 |
| 63 | je být k5eAaImIp3nS |
| 64 | ale ale k9 |
| 65 | pěkný pěkný k2eAgInSc4d1 pěkný |
| 66 | den den k1gInSc4 den |
| 67 | <g/> |
| 68 | ! ! k? |
| 69 | </s> |
| 70 | </doc> |
| 71 | <doc author="Já První" id="4" class="plagiarism" source="2"> |
| 72 | <s> |
| 73 | Dnes dnes k6eAd1 |
| 74 | je být k5eAaImIp3nS |
| 75 | ale ale k9 |
| 76 | moc mnoho k6 |
| 77 | pěkný pěkný k2eAgInSc4d1 pěkný |
| 78 | den den k1gInSc4 den |
| 79 | <g/> |
| 80 | ! ! k? |
| 81 | </s> |
| 82 | </doc> |
| 83 | <doc author="Já První" id="5" class="plagiarism" source="1"> |
| 84 | <s> |
| 85 | xx xx x |
| 86 | xx xx x |
| 87 | xx xx x |
| 88 | xx xx x |
| 89 | xx xx x |
| 90 | <g/> |
| 91 | xx xx x |
| 92 | </s> |
| 93 | </doc> |
| 94 | }}} |
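Input in this format can be read with a few lines of Python. The sketch below is an illustration only and does not reproduce the attached script: `parse_vertical` is a hypothetical helper, columns are assumed to be separated by tabs (with a fallback to any whitespace), and only the first three columns (word, lemma, tag) are kept.

```python
import re

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')   # attribute="value" pairs
TAG_RE = re.compile(r'^</?\w+[^>]*>$')     # structure tags: <s>, </s>, <g/>


def parse_vertical(lines):
    """Parse vertical text into a list of documents.

    Each document is a dict holding the <doc> attributes plus a
    "tokens" list of (word, lemma, tag) triples.
    """
    docs, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<doc"):
            current = dict(ATTR_RE.findall(line))
            current["tokens"] = []
        elif line == "</doc>":
            if current is not None:
                docs.append(current)
            current = None
        elif TAG_RE.match(line):
            continue  # other structure tags carry no tokens
        elif current is not None:
            fields = line.split("\t") if "\t" in line else line.split()
            if len(fields) >= 3:
                current["tokens"].append(tuple(fields[:3]))
    return docs


sample = '''<doc author="Já První" id="3" class="plagiarism" source="1">
<s>
Dnes\tdnes\tk6eAd1
je\tbýt\tk5eAaImIp3nS
<g/>
!\t!\tk?
</s>
</doc>'''.splitlines()

docs = parse_vertical(sample)
print(docs[0]["id"], docs[0]["class"], docs[0]["source"])
print([lemma for _word, lemma, _tag in docs[0]["tokens"]])
```

In a script used in the pipelines above, the same function could be called on `sys.stdin` instead of the inline sample.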
| 95 | |
| 96 | === Output example === |
| 97 | {{{ |
| 98 | Doc set by Já První |
| 99 | 3 1 1 |
| 100 | 4 2 2 |
| 101 | 5 5 1 |
| 102 | Set precision: 0.67, recall: 1.00, F1: 0.80 |
| 103 | }}} |
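The F1 measure in the evaluation line is the harmonic mean of precision and recall. The sketch below checks the figures of the example under one assumption: precision is taken to be the share of output lines whose detected source equals the real source (2 of 3 lines above). How the evaluation script computes recall is not specified on this page, so the recall value is simply copied from the example. The function names are hypothetical.

```python
def precision_of(lines):
    """Share of output lines whose detected source equals the real source."""
    rows = [line.split() for line in lines]
    correct = sum(1 for _doc_id, detected, real in rows if detected == real)
    return correct / len(rows) if rows else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# The three result lines of the example: id, detected source, real source.
output_lines = ["3\t1\t1", "4\t2\t2", "5\t5\t1"]
p = precision_of(output_lines)
r = 1.00  # recall copied from the example, not computed here
print(f"precision: {p:.2f}, recall: {r:.2f}, F1: {f1(p, r):.2f}")
```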
| 104 | |
| 105 | === Skeleton script implementing a simple plagiarism detection technique (to be extended) === |
| 106 | [attachment:plagiarism_simple.py] |
| 107 | |
| 108 | Usage (Czech or English input): |
| 109 | {{{ |
| 110 | cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py |
| 111 | cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py |
| 112 | }}} |