Changes between Version 1 and Version 2 of private/NlpInPracticeCourse/LanguageResourcesFromWeb


Timestamp:
Jun 5, 2015, 2:37:29 PM (9 years ago)
Author:
xsuchom2
Comment:

draft

  • private/NlpInPracticeCourse/LanguageResourcesFromWeb

    v1 v2  
    11= Building Language Resources from the Web =
     2
     3A new topic proposal:
     4= Duplicities on the Web – deduplication and plagiarism detection =
    25
    36[[https://is.muni.cz/auth/predmet/fi/ia161|IA161 Advanced NLP Course]], Course Guarantor: Aleš Horák
     
    1619Approximately three current papers (preferably from top NLP conferences or journals, e.g. [[https://www.aclweb.org/anthology/|ACL Anthology]]) that will serve as sources for the one-hour lecture:
    1720
    18  1. paper1
    19  1. paper2
    20  1. paper3
     21 1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
     22 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
     23 1. HaCohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
     24 1. !TODO another plagiarism detection paper
    2125
    2226== Practical Session ==
     
    2529
    2630Students may also be required to produce some results of their work and hand them in as proof that they completed the tasks.
     31
     32Resources:
     33- A set of original documents and their plagiarized versions !TODO
     34- A frame script in Python for plagiarism detection !TODO
     35- A description of several basic methods for plagiarism detection evaluated by HaCohen-Kerner et al. !TODO
     36
     37The task:
     38- !TODO instructions
     39- The student will choose a method for plagiarism detection and implement it as a function in the frame script.
     40- Evaluation: precision, recall, F1 (the calculation will be a part of the frame script).
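
The task above could look roughly like the following minimal sketch. The frame script itself is still marked !TODO, so every name here (`word_ngrams`, `detect_plagiarism`, `evaluate`, the parameters `n` and `threshold`) is hypothetical. The detection function uses a simple word n-gram containment baseline chosen only for illustration; it is not necessarily one of the methods evaluated by HaCohen-Kerner et al. The evaluation part computes precision, recall and F1 from boolean predictions and gold labels.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def detect_plagiarism(suspicious, source, n=3, threshold=0.5):
    """Hypothetical detection function a student might plug into the script.

    Flags the pair when the share of the suspicious document's n-grams that
    also occur in the source (containment) reaches the threshold.
    The default values of n and threshold are assumptions, not tuned settings.
    """
    susp = word_ngrams(suspicious, n)
    if not susp:
        # Text shorter than n words: nothing to compare.
        return False
    containment = len(susp & word_ngrams(source, n)) / len(susp)
    return containment >= threshold


def evaluate(predicted, gold):
    """Compute precision, recall and F1 over parallel lists of booleans."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

A student would typically call `detect_plagiarism` on every (suspicious, source) pair in the data set, collect the boolean results in a list, and pass that list together with the gold labels to `evaluate`.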
     41
     42