= Building Language Resources from the Web =

Web crawling, boilerplate removal, de-duplication and plagiarism detection.

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course guarantor: Aleš Horák

Prepared by: Vít Suchomel

== State of the Art ==

=== References ===

 1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
 1. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
 1. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
 1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]

{{{#!comment
 1. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
 1. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL.
}}}

== Slides ==

[https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf]

== Practical Session task ==

=== Plagiarists vs. plagiarism detectors ===

 1. Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words ≤ document length ≤ 500 words. 20% ≤ plagiarised content ≤ 75% (100% if done well).
 1. Select a detection algorithm and implement it in Python. Your own script must detect at least one of your own plagiarisms and must miss at least one.
 1. 
    Input format: POS-tagged vertical text consisting of 10 structures `doc` with attributes `author`, `id`, `class`, `source`. The pair (`author`, `id`) is unique. `class` is "original" or "plagiarism". `source` is the id of the source document (for a plagiarism) or the document's own id (for an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
 1. Output format: one plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure.
 1. Your script will be evaluated using data made by others.

=== Text processing pipelines ===

 * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
 * Czech [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface]
 * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}

=== Input data example (2 + 3 documents only) ===

{{{
Dnes dnes k6eAd1 je být k5eAaImIp3nS pěkný pěkný k2eAgInSc4d1 pěkný den den k1gInSc4 den ! ! k?
Dnes dnes k6eAd1 je být k5eAaImIp3nS moc mnoho k6 pěkný pěkný k2eAgInSc4d1 pěkný den den k1gInSc4 den ! ! k?
Dnes dnes k6eAd1 je být k5eAaImIp3nS ale ale k9 pěkný pěkný k2eAgInSc4d1 pěkný den den k1gInSc4 den ! ! k?
Dnes dnes k6eAd1 je být k5eAaImIp3nS ale ale k9 moc mnoho k6 pěkný pěkný k2eAgInSc4d1 pěkný den den k1gInSc4 den ! ! k?
xx xx x xx xx x xx xx x xx xx x xx xx x xx xx x
}}}

=== Output example ===

{{{
Doc set by Já První
3	1	1
4	2	2
5	5	1
Set precision: 0.67, recall: 1.00, F1: 0.80
}}}

=== Frame script implementing a simple plagiarism detection technique to extend ===

[attachment:plagiarism_simple.py]

Usage (Czech or English input):

{{{
cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
}}}
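
The input and output formats described in the task can be exercised with a minimal sketch such as the one below. It is not the attached frame script and makes several assumptions that should be checked against real data: documents are delimited by `<doc ...>` and `</doc>` lines with double-quoted attributes, the lemma sits in the second tab-separated column, sources are only sought among `original` documents of the same author, and similarity is the containment of lemma bigrams with an arbitrary threshold of 0.5. The names `parse_vertical`, `shingles`, `containment`, `detect` and the `SAMPLE` data are illustrative only. The evaluation line is left out.

```python
import re

# Assumed tag syntax: <doc author="..." id="..." class="..." source="...">
DOC_OPEN = re.compile(r'<doc\s+([^>]*)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield (attributes, lemmas) for each doc structure in a vertical text."""
    attrs, lemmas = None, []
    for line in lines:
        line = line.rstrip('\n')
        opening = DOC_OPEN.match(line)
        if opening:
            attrs, lemmas = dict(ATTR.findall(opening.group(1))), []
        elif line.startswith('</doc'):
            if attrs is not None:
                yield attrs, lemmas
            attrs = None
        elif attrs is not None and line and not line.startswith('<'):
            columns = line.split('\t')
            # Assumption: column 2 is the lemma; fall back to the word form.
            lemmas.append(columns[1] if len(columns) > 1 else columns[0])


def shingles(lemmas, n=2):
    """Return the set of lemma n-grams of one document."""
    return {tuple(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}


def containment(suspect, source):
    """Share of the suspect's n-grams that also occur in the source."""
    return len(suspect & source) / len(suspect) if suspect else 0.0


def detect(docs, n=2, threshold=0.5):
    """Return (id, detected source id, real source id) per plagiarism doc."""
    grams = {(a['author'], a['id']): shingles(lem, n) for a, lem in docs}
    results = []
    for attrs, _ in docs:
        if attrs['class'] != 'plagiarism':
            continue
        own = grams[(attrs['author'], attrs['id'])]
        best_id, best_score = attrs['id'], 0.0
        for other, _ in docs:
            if other['author'] != attrs['author'] or other['class'] != 'original':
                continue
            score = containment(own, grams[(other['author'], other['id'])])
            if score > best_score:
                best_id, best_score = other['id'], score
        # Below the threshold the document is reported with its own id.
        detected = best_id if best_score >= threshold else attrs['id']
        results.append((attrs['id'], detected, attrs['source']))
    return results


# Hypothetical three-document sample; the doc attributes are made up.
SAMPLE = '''<doc author="A" id="1" class="original" source="1">
Dnes\tdnes\tk6eAd1
je\tbýt\tk5eAaImIp3nS
pěkný\tpěkný\tk2eAgInSc4d1
den\tden\tk1gInSc4
!\t!\tk?
</doc>
<doc author="A" id="2" class="plagiarism" source="1">
Dnes\tdnes\tk6eAd1
je\tbýt\tk5eAaImIp3nS
ale\tale\tk9
pěkný\tpěkný\tk2eAgInSc4d1
den\tden\tk1gInSc4
!\t!\tk?
</doc>
<doc author="A" id="3" class="plagiarism" source="1">
xx\txx\tx
xx\txx\tx
xx\txx\tx
</doc>
'''

docs = list(parse_vertical(SAMPLE.splitlines()))
for row in detect(docs):
    print('\t'.join(row))
```

On this sample, document 2 shares 3 of its 5 lemma bigrams with document 1 and is attributed to it, while document 3 shares none and is reported with its own id. To use the sketch in the pipelines above, `SAMPLE.splitlines()` could be replaced by `sys.stdin` (after `import sys`). Lemma bigrams are a deliberately weak choice: a plagiarism that reorders or paraphrases enough text will slip under the threshold.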