Version 7 (modified by xsuchom2, 5 years ago)


Building Language Resources from the Web

A new topic proposal:

Duplicates on the Web – deduplication and plagiarism detection

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Vít Suchomel

State of the Art


  1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
  2. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
  3. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
  4. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
  5. 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse

Slides including Practical Session task
