Changes between Version 40 and Version 41 of private/NlpInPracticeCourse/LanguageResourcesFromWeb
Timestamp: Oct 17, 2023, 9:38:36 AM
private/NlpInPracticeCourse/LanguageResourcesFromWeb
v40  v41
 43   43      * For each plagiarism:
 44   44        * describe plagiarism technique(s) used
 45        -    * which detection methods might be able to reveal it --give reasons
 46        -    * which detection methods might not be able to reveal it --give reasons
      45   +    * which detection methods might be able to reveal it – give reasons
      46   +    * which detection methods might not be able to reveal it – give reasons
 47   47      * **Submit a text file containing 10 documents according to the requirements + 1 text file describing techniques used and your estimate of which detection techniques may or may not work.**
 48   48
 49   49     Or: Select a detection algorithm and implement it in Python. //The right homework if you want to learn something.//
 50        -  * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] --usage: {{{python plagiarism_simple.py < training_data.vert}}}
      50   +  * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] – usage: {{{python plagiarism_simple.py < training_data.vert}}}
 51   51      * A bag of words + cosine similarity of word vectors approach is implemented in this script. //(For the sake of simplicity: A plagiarism cannot have more sources here.)//
 52   52      * You can modify the script to
  …    …
 54   54        * or implement another lexical/syntactic-based detection approach, e.g. n-grams of words or Levenshtein distance
 55   55        * or implement another semantic-based detection approach, e.g. the similarity of {{{word2vec}}} vectors
 56        -    * or do it another way, be creative --describe how it works in comments in the code.
      56   +    * or do it another way, be creative – describe how it works in comments in the code.
 57   57      * Input format: A 3-column vertical, see above. [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/training_data.vert training_data.vert]
 58   58      * Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
 59   59      * Your script will be evaluated using data made by others.
 60        -  * Describe which plagiarism detection technique(s) were implemented --put it in a comment in the beginning of your script.
      60   +  * Describe which plagiarism detection technique(s) were implemented – put it in a comment at the beginning of your script.
 61   61      * **Submit the modified script (or your own script) with a short description in a comment.** (The training set output of the script is not required.)
 62   62
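The n-gram variant and the evaluation line mentioned in the task could be sketched roughly as follows. This is only an illustration and not the contents of plagiarism_simple.py: the function names (`word_ngrams`, `cosine`, `detect`, `evaluate`), the bigram size, and the `threshold` value are all assumptions, and reading the 3-column vertical is left out, so documents are assumed to be already loaded as lists of tokens keyed by document id. As in the basic script, each suspicious document gets at most one source.

```python
from collections import Counter
from math import sqrt


def word_ngrams(tokens, n=2):
    """Return a bag (Counter) of word n-grams built from a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(count * b[key] for key, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def detect(suspicious, originals, n=2, threshold=0.2):
    """Map each suspicious id to its most similar original id, or None.

    threshold is a hypothetical cut-off; it would need tuning on training data.
    """
    original_bags = {oid: word_ngrams(toks, n) for oid, toks in originals.items()}
    result = {}
    for sid, toks in suspicious.items():
        bag = word_ngrams(toks, n)
        best_id, best_sim = None, 0.0
        for oid, obag in original_bags.items():
            sim = cosine(bag, obag)
            if sim > best_sim:
                best_id, best_sim = oid, sim
        result[sid] = best_id if best_sim >= threshold else None
    return result


def evaluate(detected, real):
    """Return (precision, recall, F1) for id -> source id mappings.

    A value of None means "no source", i.e. the document is not a plagiarism.
    """
    true_pos = sum(1 for doc, src in detected.items()
                   if src is not None and real.get(doc) == src)
    n_detected = sum(1 for src in detected.values() if src is not None)
    n_real = sum(1 for src in real.values() if src is not None)
    precision = true_pos / n_detected if n_detected else 0.0
    recall = true_pos / n_real if n_real else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data standing in for documents read from the vertical file.
    originals = {"o1": "the cat sat on the mat".split(),
                 "o2": "dogs chase cars in the street".split()}
    suspicious = {"s1": "yesterday the cat sat on the mat".split(),
                  "s2": "completely unrelated words here".split()}
    real = {"s1": "o1", "s2": None}

    detected = detect(suspicious, originals)
    for doc_id, source_id in detected.items():
        if source_id is not None:
            # id TAB detected source id TAB real source id
            print(doc_id, source_id, real.get(doc_id), sep="\t")
    print("precision: %.2f, recall: %.2f, F1: %.2f" % evaluate(detected, real))
```

Word bigrams keep some word-order information that a plain bag of words loses, so shuffled sentences score lower, while paraphrases with synonyms would still need a semantic approach such as word vectors.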