= Building Language Resources from the Web =

Web crawling, boilerplate removal, de-duplication and plagiarism detection.

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Vít Suchomel

== State of the Art ==

=== References ===

 1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
 1. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
 1. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
 1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]

== Slides including Practical Session task ==

[http://nlp.fi.muni.cz/~xsuchom2/anlp-05-LanguageResourcesFromWeb.pdf]