Building Language Resources from the Web
A new topic proposal:
Duplicates on the Web – deduplication and plagiarism detection
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Vít Suchomel
State of the Art
References
- Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
- Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
- Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
- 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse
Slides including Practical Session task
[nlp.fi.muni.cz/~xsuchom2/anlp-05-LanguageResourcesFromWeb.pdf]
Attachments (3)
- training_data.vert (491.9 KB) - added 4 years ago.
- plagiarism_simple.py (6.7 KB) - added 5 months ago.
- anlp-05-LanguageResourcesFromWeb.pdf (6.2 MB) - added 5 months ago.