Building Language Resources from the Web

Web crawling, boilerplate removal, de-duplication and plagiarism detection.

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Vít Suchomel

State of the Art


  1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
  2. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
  3. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." COLING, 2010.
  4. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
  5. 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse

