Changes between Version 40 and Version 41 of private/NlpInPracticeCourse/LanguageResourcesFromWeb


Timestamp:
Oct 17, 2023, 9:38:36 AM (7 months ago)
Author:
xsuchom2
Comment:

--

  • private/NlpInPracticeCourse/LanguageResourcesFromWeb

    v40 v41  
    4343  * For each plagiarism:
    4444    * describe plagiarism technique(s) used
    45     * which detection methods might be able to reveal it -- give reasons
    46     * which detection methods might not be able to reveal it -- give reasons
     45    * which detection methods might be able to reveal it – give reasons
     46    * which detection methods might not be able to reveal it – give reasons
    4747  * **Submit a text file containing 10 documents according to the requirements + 1 text file describing techniques used and your estimation which detection techniques may or may not work.**
    4848
    4949Or: Select a detection algorithm and implement it in Python. //The right homework if you want to learn something.//
    50   * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] -- usage: {{{python plagiarism_simple.py < training_data.vert}}}
     50  * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] – usage: {{{python plagiarism_simple.py < training_data.vert}}}
    5151    * A bag of words + cosine similarity of word vectors approach is implemented in this script. //(For the sake of simplicity: A plagiarism cannot have more sources here.)//
    5252    * You can modify the script to
     
    5454      * or implement another lexical/syntactic detection approach, e.g. word n-grams or Levenshtein distance
    5555      * or implement another semantic detection approach, e.g. the similarity of {{{word2vec}}} vectors
    56       * or do it another way, be creative -- describe how it works in comments in the code.
     56      * or do it another way, be creative – describe how it works in comments in the code.
    5757  * Input format: A 3-column vertical, see above. [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/training_data.vert training_data.vert]
    5858  * Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
    5959  * Your script will be evaluated using data made by others.
    60   * Describe which plagiarism detection technique(s) were implemented -- put it in a comment in the beginning of your script.
     60  * Describe which plagiarism detection technique(s) were implemented – put it in a comment in the beginning of your script.
    6161  * **Submit the modified script (or your own script) with a short description in a comment.** (The training set output of the script is not required.)
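
The bag of words + cosine similarity approach and the output format listed above could look roughly like the following. This is only a sketch and not the contents of plagiarism_simple.py: the function names, the 0.5 threshold, the `-` placeholder for "no source" and the toy documents are all assumptions made for illustration, and reading the 3-column vertical is left out.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity of two bag-of-words count vectors."""
    vec_a, vec_b = Counter(words_a), Counter(words_b)
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_source(suspect_words, originals, threshold=0.5):
    """Id of the most similar original, or None if nothing reaches the threshold.

    At most one source is returned, matching the single-source simplification.
    The threshold value is an assumption and would need tuning on training data.
    """
    best_id, best_score = None, 0.0
    for doc_id, words in originals.items():
        score = cosine_similarity(suspect_words, words)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id if best_score >= threshold else None


def evaluate(results):
    """Precision, recall and F1 over (id, detected source, real source) triples."""
    correct = sum(1 for _, det, real in results if det is not None and det == real)
    detected = sum(1 for _, det, _ in results if det is not None)
    actual = sum(1 for _, _, real in results if real is not None)
    precision = correct / detected if detected else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (made up): originals and suspicious documents with their real source.
originals = {
    "o1": "the cat sat on the mat".split(),
    "o2": "stock markets fell sharply today".split(),
}
suspects = {
    "s1": ("the cat sat on a mat".split(), "o1"),
    "s2": ("markets fell sharply this morning".split(), "o2"),
    "s3": ("an unrelated sentence about cooking".split(), None),
    "s4": ("shares dropped steeply on tuesday".split(), "o2"),  # paraphrase
}

results = [
    (doc_id, detect_source(words, originals), real)
    for doc_id, (words, real) in suspects.items()
]
for doc_id, detected, real in results:
    print(f"{doc_id}\t{detected or '-'}\t{real or '-'}")
print("precision: %.2f, recall: %.2f, F1: %.2f" % evaluate(results))
```

On this toy data the paraphrased document `s4` shares almost no words with its source and is missed, which lowers recall while precision stays at 1.0. That is the kind of case a semantic approach would aim to catch.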
    6262
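
One of the lexical alternatives mentioned in the list, n-grams of words, can be sketched as a Jaccard overlap of word trigram sets. The function names and the choice of n = 3 are assumptions made for illustration; unlike a bag of words, this measure is sensitive to word order.

```python
def word_ngrams(words, n=3):
    """Set of word n-grams (as tuples) occurring in a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(words_a, words_b, n=3):
    """Jaccard overlap of the word n-gram sets of two documents."""
    grams_a, grams_b = word_ngrams(words_a, n), word_ngrams(words_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Same words in a different order: a bag-of-words comparison sees identical
# documents, while the trigram overlap drops because phrases were moved.
doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the lazy dog jumps over the quick brown fox".split()
print(ngram_overlap(doc_a, doc_b))  # 4 shared trigrams out of 10 distinct -> 0.4
```

A score like this could replace or complement the cosine similarity when choosing the most likely source, for example to detect plagiarisms made by reordering sentences or clauses.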