Changes between Version 19 and Version 20 of private/AdvancedNlpCourse/LanguageResourcesFromWeb


Timestamp:
Oct 23, 2017, 5:11:46 PM (3 years ago)
Author:
xsuchom2
Comment:

--

  • private/AdvancedNlpCourse/LanguageResourcesFromWeb

    v19 v20  
    2323== Practical Session task ==
    2424=== Plagiarists vs. plagiarism detectors ===
    25 See [http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf the slides].
    2625
    27 === Text processing pipelines ===
    28  * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
    29  * Czech [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface]
    30  * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}
     26Either: Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. //(For the sake of simplicity, each plagiarism has exactly one source here.)// //The minimal homework.//
     27  * 100 words <= document length <= 500 words
     28  * 20 % <= plagiarism content <= 90 %
     29  * File format: A POS-tagged vertical consisting of {{{doc}}} structures with attributes {{{author}}}, {{{id}}}, {{{class}}}, {{{source}}}. The pair (author, id) is unique. Start with id = 1. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original).
     30  * POS tagged text: 3 columns: word, lemma (the base form of the word), POS/morphological tag.
     31  * Text processing pipelines for converting a text file to a 3-column vertical:
     32    * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}} or a [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface] (short documents only)
     33    * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}
     34  * For each plagiarism:
     35    * describe the plagiarism technique(s) used
     36    * which detection methods might be able to reveal it -- give reasons
     37    * which detection methods might not be able to reveal it -- give reasons
     38  * Submit a text file containing 10 documents that meet the requirements + 1 text file describing the techniques used and your estimate of which detection techniques may or may not work.
    3139
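Before submitting, the requirements listed above can be checked mechanically. The following is only a minimal sketch, not part of the course materials: the function names {{{read_docs}}} and {{{check_docs}}} are hypothetical, every non-markup line is counted as one word (so punctuation tokens are counted too, which is an assumption), and the plagiarism content percentage is not checked because the task does not say how it is measured.

```python
import re

# Matches attribute="value" pairs inside a <doc ...> tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
REQUIRED = ("author", "id", "class", "source")


def read_docs(lines):
    """Parse vertical lines into dicts holding doc attributes and a token count."""
    docs, current = [], None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<doc"):
            current = {"attrs": dict(ATTR_RE.findall(line)), "tokens": 0}
        elif line.startswith("</doc>"):
            if current is not None:
                docs.append(current)
            current = None
        elif current is not None and line and not line.startswith("<"):
            # Assumption: one token per non-markup line, punctuation included.
            current["tokens"] += 1
    return docs


def check_docs(docs, min_len=100, max_len=500):
    """Return a list of problems found; an empty list means none were found."""
    problems, seen = [], set()
    for doc in docs:
        attrs = doc["attrs"]
        label = "doc %s/%s" % (attrs.get("author"), attrs.get("id"))
        missing = [name for name in REQUIRED if name not in attrs]
        if missing:
            problems.append("%s: missing attributes: %s" % (label, ", ".join(missing)))
            continue
        key = (attrs["author"], attrs["id"])
        if key in seen:
            problems.append("%s: the pair (author, id) is not unique" % label)
        seen.add(key)
        if attrs["class"] not in ("original", "plagiarism"):
            problems.append("%s: unknown class %r" % (label, attrs["class"]))
        elif attrs["class"] == "original" and attrs["source"] != attrs["id"]:
            problems.append("%s: an original must have source equal to its id" % label)
        elif attrs["class"] == "plagiarism" and attrs["source"] == attrs["id"]:
            problems.append("%s: a plagiarism must point to another document" % label)
        if not min_len <= doc["tokens"] <= max_len:
            problems.append("%s: length %d is outside %d..%d"
                            % (label, doc["tokens"], min_len, max_len))
    return problems
```

A possible use is {{{check_docs(read_docs(open("my_documents.vert", encoding="utf-8")))}}}, where the file name is a placeholder.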
    32 === Input data example (2 + 3 documents only) ===
     40Or: Select a detection algorithm and implement it in Python. //The right homework if you want to learn something.//
     41  * A basic detection script to extend: [raw-attachment:plagiarism_simple.py] -- usage: {{{python plagiarism_simple.py < training_data.vert}}}
     42    * This script implements a bag-of-words approach with cosine similarity of word vectors. //(For the sake of simplicity, each plagiarism has exactly one source here.)//
     43    * You can modify the script to
     44      * use other input attributes than the word or a combination of attributes, e.g. the lemma or the morphological tag,
     45      * or implement another lexical or syntactic detection approach, e.g. word n-grams or Levenshtein distance,
     46      * or implement another semantic detection approach, e.g. the similarity of {{{word2vec}}} vectors.
     47  * Input format: see above. [raw-attachment:training_data.vert]
     48  * Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
     49  * Your script will be evaluated using data made by others.
     50  * Describe which plagiarism detection technique(s) you implemented in a comment at the beginning of your script.
     51
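The bag-of-words idea with cosine similarity mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the contents of {{{plagiarism_simple.py}}}: the input is assumed to be a list of dicts with hypothetical keys {{{id}}}, {{{class}}}, {{{source}}} and {{{words}}}, only documents of class "plagiarism" are searched for a source, candidates are the originals, and the {{{threshold}}} value is arbitrary. When no original is similar enough, the document's own id is reported.

```python
import math
from collections import Counter


def cosine(bag_a, bag_b):
    """Cosine similarity of two bags of words given as Counter objects."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_sources(docs, threshold=0.5):
    """Return (id, detected source id, real source id) for every plagiarism.

    Assumption: each doc is a dict with keys "id", "class", "source", "words".
    """
    originals = [d for d in docs if d["class"] == "original"]
    results = []
    for doc in docs:
        if doc["class"] != "plagiarism":
            continue
        bag = Counter(doc["words"])
        # Fall back to the document's own id if nothing exceeds the threshold.
        best_id, best_sim = doc["id"], threshold
        for orig in originals:
            sim = cosine(bag, Counter(orig["words"]))
            if sim > best_sim:
                best_id, best_sim = orig["id"], sim
        results.append((doc["id"], best_id, doc["source"]))
    return results


def format_results(results):
    """Format results as lines: id TAB detected source id TAB real source id."""
    return ["%s\t%s\t%s" % row for row in results]
```

Passing lemmas or tags instead of word forms in {{{words}}} changes the input attribute without touching the similarity code.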
     52=== Examples of a source document and a plagiarism document ===
    3353{{{
    3454<doc author="Já První" id="1" class="original" source="1">
    3555<s>
    3656Dnes    dnes    k6eAd1
    37 je  být k5eAaImIp3nS
     57je      být    k5eAaImIp3nS
    3858pěkný   pěkný   k2eAgInSc4d1    pěkný
    39 den den k1gInSc4    den
     59den     den     k1gInSc4        den
    4060<g/>
    41 !   !   k?
     61!       !       k?
    4262</s>
    4363</doc>
    44 <doc author="Já První" id="2" class="original" source="1">
     64<doc author="Já První" id="2" class="plagiarism" source="1">
    4565<s>
    4666Dnes    dnes    k6eAd1
    47 je  být k5eAaImIp3nS
    48 moc mnoho   k6
     67je      být    k5eAaImIp3nS
     68ale     ale     k9
    4969pěkný   pěkný   k2eAgInSc4d1    pěkný
    50 den den k1gInSc4    den
     70den     den     k1gInSc4        den
    5171<g/>
    52 !   !   k?
    53 </s>
    54 </doc>
    55 <doc author="Já První" id="3" class="plagiarism" source="1">
    56 <s>
    57 Dnes    dnes    k6eAd1
    58 je  být k5eAaImIp3nS
    59 ale ale k9
    60 pěkný   pěkný   k2eAgInSc4d1    pěkný
    61 den den k1gInSc4    den
    62 <g/>
    63 !   !   k?
    64 </s>
    65 </doc>
    66 <doc author="Já První" id="4" class="plagiarism" source="2">
    67 <s>
    68 Dnes    dnes    k6eAd1
    69 je  být k5eAaImIp3nS
    70 ale ale k9
    71 moc mnoho   k6
    72 pěkný   pěkný   k2eAgInSc4d1    pěkný
    73 den den k1gInSc4    den
    74 <g/>
    75 !   !   k?
    76 </s>
    77 </doc>
    78 <doc author="Já První" id="5" class="plagiarism" source="1">
    79 <s>
    80 xx  xx  x
    81 xx  xx  x
    82 xx  xx  x
    83 xx  xx  x
    84 xx  xx  x
    85 <g/>
    86 xx  xx  x
     72!       !       k?
    8773</s>
    8874</doc>
    8975}}}
    9076
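In the v20 example, the plagiarism differs from its source by a single inserted word ({{{ale}}}). Word-level Levenshtein distance and word n-gram overlap, two of the lexical approaches named in the task, both register such a change. The sketch below is illustrative only; the helper names are hypothetical, and bigrams with Jaccard overlap are just one possible choice.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def ngrams(tokens, n=2):
    """Set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(set_a, set_b):
    """Jaccard overlap of two sets; 0.0 if both are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Word forms (first column) of the two example documents.
source = ["Dnes", "je", "pěkný", "den", "!"]
plagiarism = ["Dnes", "je", "ale", "pěkný", "den", "!"]

print(levenshtein(source, plagiarism))              # 1
print(jaccard(ngrams(source), ngrams(plagiarism)))  # 0.5
```

A distance normalized by document length, or an overlap threshold, would still have to be chosen before either measure can decide whether a document is a plagiarism.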
    91 === Output example ===
    92 {{{
    93 Doc set by Já První
    94 3       1       1
    95 4       2       2
    96 5       5       1
    97 Set precision: 0.67, recall: 1.00, F1: 0.80
    98 }}}
    99 
    100 === Frame script implementing a simple plagiarism detection technique to extend ===
    101 [raw-attachment:plagiarism_simple.py]
    102 [raw-attachment:training_data.vert]
    103 
    104 Usage: {{{./plagiarism_simple.py < training_data.vert}}}