Version 3 (modified by Ales Horak, 8 years ago)

Removed TODO

Building Language Resources from the Web

A new topic proposal:

Duplicates on the Web – deduplication and plagiarism detection

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Vít Suchomel

State of the Art


Approximately three current papers (preferably from the best NLP conferences and journals, e.g. the ACL Anthology) that will be used as sources for the one-hour lecture:

  1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
  2. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
  3. HaCohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." COLING, 2010.
  4. TODO: another plagiarism detection paper

Practical Session

A concrete description of the work assignment for the second one-hour part of the lecture. The work consists of practical implementations of algorithms related to the current topic (probably not the state-of-the-art algorithms mentioned in the first part), applied to real data. Students can test the algorithms, evaluate them, and possibly try short adaptations for various subtasks.

Students may also be required to produce results from their work and hand them in as proof of completing the tasks.


  • A set of documents and plagiarized texts (TODO)
  • A frame (skeleton) script in Python for plagiarism detection (TODO)
  • A description of several basic methods for plagiarism detection evaluated by HaCohen-Kerner et al. (TODO)
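
To give an idea of what a basic detection method might look like, here is a minimal sketch based on word n-gram shingling and Jaccard similarity, the near-duplicate technique covered in Chapter 19 of Manning et al. It is not necessarily one of the methods evaluated by HaCohen-Kerner et al. The function names, the shingle size `n=3`, and the threshold `0.5` are illustrative assumptions, not part of the course materials.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_plagiarism(suspicious, source, n=3, threshold=0.5):
    """Flag a document whose shingle overlap with the source reaches the threshold.

    The threshold is an arbitrary illustrative value and would need tuning.
    """
    return jaccard(shingles(suspicious, n), shingles(source, n)) >= threshold
```

A pair of texts that differ in a single word shares most of its shingles and is flagged, while unrelated texts share none. Real data would likely need tokenization and normalization beyond `lower().split()`.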

The task:

  • TODO: instructions
  • The student will choose a plagiarism detection method and implement it as a function in the frame script.
  • Evaluation: precision, recall, F1 (the calculation will be part of the frame script).
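
The evaluation step can be sketched as follows. The actual frame script is not yet available, so the function name `evaluate` and the representation of results as sets of document identifiers are assumptions made for illustration only.

```python
def evaluate(predicted, gold):
    """Compute precision, recall and F1 for plagiarism detection.

    predicted: set of document ids the detector flagged as plagiarized
    gold:      set of document ids that really are plagiarized
    (Both representations are assumed here, not taken from the frame script.)
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    flagged = {"d1", "d2", "d3"}
    truth = {"d2", "d3", "d4", "d5"}
    p, r, f = evaluate(flagged, truth)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this toy example two of the three flagged documents are correct and two of the four plagiarized documents are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.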
