Changes between Version 34 and Version 35 of private/NlpInPracticeCourse/LanguageResourcesFromWeb


Timestamp:
Apr 10, 2022, 9:09:22 PM (8 months ago)
Author:
Ales Horak
Comment:

edited by hales in edit_page_in_vim.py

  • private/NlpInPracticeCourse/LanguageResourcesFromWeb

    v34 v35  
    22Web crawling, boilerplate removal, de-duplication and plagiarism detection.
    33
    4 [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák
     4[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák
    55
    66
    77Prepared by: Vít Suchomel[[BR]]
    8 [http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf Slides]
     8[http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/anlp-05-LanguageResourcesFromWeb.pdf Slides]
    99
    1010
     
    4444
    4545Or: Select a detection algorithm and implement it in Python. //The right homework if you want to learn something.//
    46   * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] -- usage: {{{python plagiarism_simple.py < training_data.vert}}}
     46  * A basic detection script to extend: [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/plagiarism_simple.py plagiarism_simple.py] -- usage: {{{python plagiarism_simple.py < training_data.vert}}}
    4747    * This script implements a bag-of-words approach using the cosine similarity of word vectors. //(For the sake of simplicity: a plagiarism cannot have more than one source here.)//
    4848    * You can modify the script to
     
    5050      * or implement another lexical/syntactic detection approach, e.g. n-grams of words or Levenshtein distance,
    5151      * or implement another semantic detection approach, e.g. the similarity of {{{word2vec}}} vectors.
    52   * Input format: A 3-column vertical, see above. [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/LanguageResourcesFromWeb/training_data.vert training_data.vert]
     52  * Input format: A 3-column vertical, see above. [https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/NlpInPracticeCourse/LanguageResourcesFromWeb/training_data.vert training_data.vert]
    5353  * Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
    5454  * Your script will be evaluated using data made by others.
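
The bag-of-words + cosine similarity approach and the evaluation line mentioned above can be sketched roughly as follows. This is only an illustrative sketch, not the contents of plagiarism_simple.py: the function names, the `threshold` parameter and its default value, and the exact way precision and recall are counted are all assumptions made for this example. Parsing of the 3-column vertical input is left out; documents are assumed to be already available as lists of word tokens.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def detect_source(suspicious, originals, threshold=0.5):
    """Return the id of the most similar original document, or None.

    `originals` maps document id -> list of tokens. The threshold is a
    hypothetical value that would have to be tuned on training data.
    At most one source is reported per suspicious document.
    """
    best_id, best_sim = None, 0.0
    for doc_id, tokens in originals.items():
        sim = cosine_similarity(suspicious, tokens)
        if sim > best_sim:
            best_id, best_sim = doc_id, sim
    return best_id if best_sim >= threshold else None


def evaluate(detected, real):
    """Compute (precision, recall, F1) under an assumed counting scheme.

    `detected` and `real` map document id -> source id, or None when the
    document is (believed to be) original. A detection counts as correct
    only if the reported source id equals the real source id.
    """
    correct = sum(1 for doc_id, src in detected.items()
                  if src is not None and src == real.get(doc_id))
    n_detected = sum(1 for src in detected.values() if src is not None)
    n_real = sum(1 for src in real.values() if src is not None)
    precision = correct / n_detected if n_detected else 0.0
    recall = correct / n_real if n_real else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A possible use: call `detect_source` for every suspicious document, print `id TAB detected source id TAB real source id` for each, and finish with the values returned by `evaluate`. Replacing `cosine_similarity` with an n-gram overlap or an edit-distance measure would keep the rest of the sketch unchanged.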