Changes between Version 34 and Version 35 of private/NlpInPracticeCourse/LanguageResourcesFromWeb
- Timestamp: Apr 10, 2022, 9:09:22 PM
private/NlpInPracticeCourse/LanguageResourcesFromWeb
v34 v35
  2 2 Web crawling, boilerplate removal, de-duplication and plagiarism detection.
  3 3
- 4   [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLPCourse]], Course Guarantee: Aleš Horák
+   4 [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák
  5 5
  6 6
  7 7 Prepared by: Vít Suchomel[[BR]]
- 8   [http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf Slides]
+   8 [http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/anlp-05-LanguageResourcesFromWeb.pdf Slides]
  9 9
  10 10
  … …
  44 44
  45 45 Or: Select a detection algorithm and implement it in Python. //The right homework if you want to learn something.//
- 46    * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] -- usage: {{{python plagiarism_simple.py < training_data.vert}}}
+    46 * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] -- usage: {{{python plagiarism_simple.py < training_data.vert}}}
  47 47 * A bag of words + cosine similarity of word vectors approach is implemented in this script. //(For the sake of simplicity: A plagiarism cannot have more than one source here.)//
  48 48 * You can modify the script to
  … …
  50 50 * or implement another lexical/syntactic detection approach, e.g. n-grams of words or Levenshtein distance,
  51 51 * or implement another semantic detection approach, e.g. the similarity of {{{word2vec}}} vectors.
- 52    * Input format: A 3-column vertical, see above. [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/training_data.vert training_data.vert]
+    52 * Input format: A 3-column vertical, see above. [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/training_data.vert training_data.vert]
  53 53 * Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
  54 54 * Your script will be evaluated using data made by others.
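The bag of words + cosine similarity approach and the evaluation line named in the task can be illustrated with a minimal sketch. This is not the course's {{{plagiarism_simple.py}}}: the function names, the similarity threshold of 0.5 and the in-memory input (dictionaries mapping a document id to a list of tokens) are all assumptions made for illustration, and parsing of the 3-column vertical format is left out. As in the task, each suspicious document is assigned at most one source.

```python
import math
from collections import Counter


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity of two bag-of-words Counters (0.0 if either is empty)."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_sources(originals, suspicious, threshold=0.5):
    """Map each suspicious id to its most similar original id, or None.

    `originals` and `suspicious` are hypothetical inputs: dicts of
    document id -> list of tokens. `threshold` is an assumed cut-off;
    a document below it for every original is reported as not plagiarised.
    """
    original_bags = {doc_id: Counter(tokens) for doc_id, tokens in originals.items()}
    detected = {}
    for susp_id, tokens in suspicious.items():
        bag = Counter(tokens)
        best_id, best_sim = None, threshold
        for orig_id, orig_bag in original_bags.items():
            sim = cosine_similarity(bag, orig_bag)
            if sim >= best_sim:
                best_id, best_sim = orig_id, sim
        detected[susp_id] = best_id
    return detected


def evaluate(detected, real):
    """Return (precision, recall, F1) of detected sources against real ones."""
    true_pos = sum(
        1 for doc_id, src in detected.items()
        if src is not None and real.get(doc_id) == src
    )
    n_detected = sum(1 for src in detected.values() if src is not None)
    n_real = sum(1 for src in real.values() if src is not None)
    precision = true_pos / n_detected if n_detected else 0.0
    recall = true_pos / n_real if n_real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    originals = {"o1": "the cat sat on the mat".split(),
                 "o2": "web crawling needs boilerplate removal".split()}
    suspicious = {"s1": "the cat sat on a mat".split(),
                  "s2": "completely unrelated words here".split()}
    real = {"s1": "o1", "s2": None}

    detected = detect_sources(originals, suspicious)
    for doc_id, src in detected.items():
        if src is not None:
            # One plagiarism per line: id TAB detected source id TAB real source id
            print(f"{doc_id}\t{src}\t{real[doc_id]}")
    print("precision=%.2f recall=%.2f F1=%.2f" % evaluate(detected, real))
```

Swapping {{{cosine_similarity}}} for an n-gram overlap or an edit-distance score would leave {{{detect_sources}}} and {{{evaluate}}} unchanged, which is one way to compare the alternative approaches listed in the task.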