LanguageResourcesFromWeb

Timestamp: Oct 23, 2017, 5:18:23 PM (8 years ago)
Author: xsuchom2
Comment: --

private/NlpInPracticeCourse/LanguageResourcesFromWeb

-                      v21
+                      v22
   * 20 % <= plagiarism content <= 90 %
   * File format: a POS-tagged vertical consisting of {{{doc}}} structures with the attributes {{{author}}}, {{{id}}}, {{{class}}}, {{{source}}}. The pair (author, id) is unique; ids start at 1. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original).
-  * POS tagged text: 3 columns: word, lemma (the base form of the word), POS/morphological tag.
+  * A POS tagged vertical: 3 TAB separated columns: word, lemma (the base form of the word), POS/morphological tag.
   * Text processing pipelines for converting a text file to a 3-column vertical:
     * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}} or a [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface] (short documents only)
 …
       * or implement another lexical/syntactic detection approach, e.g. word n-grams or Levenshtein distance,
       * or implement another semantic detection approach, e.g. the similarity of {{{word2vec}}} vectors.
-  * Input format: see above. [raw-attachment:training_data.vert]
+  * Input format: A 3-column vertical, see above. [raw-attachment:training_data.vert]
   * Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
   * Your script will be evaluated using data made by others.
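
The input format, the word n-gram approach and the output format described above can be sketched in Python as follows. This is only a minimal illustration under several assumptions that the page does not state: the {{{doc}}} structures are written as `<doc author="..." id="..." class="..." source="...">` ... `</doc>`, similarity is the Jaccard overlap of lemma trigrams, documents are compared only with other documents by the same author, and the threshold of 0.5 is arbitrary. All function names and the `SAMPLE` data are hypothetical. The sketch does not decide which of two similar documents is the source, so an original can be flagged as well.

```python
import re
from collections import namedtuple

Doc = namedtuple("Doc", "author id cls source lemmas")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')  # assumed attribute syntax: name="value"


def parse_vertical(text):
    """Read <doc ...> structures and collect the lemma column (2nd of 3)."""
    docs, attrs, lemmas = [], None, []
    for line in text.splitlines():
        if line.startswith("<doc"):
            attrs, lemmas = dict(ATTR_RE.findall(line)), []
        elif line.startswith("</doc") and attrs is not None:
            docs.append(Doc(attrs["author"], attrs["id"], attrs["class"],
                            attrs["source"], lemmas))
            attrs = None
        elif attrs is not None and "\t" in line:  # token line: word, lemma, tag
            lemmas.append(line.split("\t")[1].lower())
    return docs


def ngrams(tokens, n=3):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    return len(a & b) / len(a | b) if a and b else 0.0


def detect(docs, n=3, threshold=0.5):
    """Map (author, id) of each flagged document to its most similar document id."""
    grams = {(d.author, d.id): ngrams(d.lemmas, n) for d in docs}
    detected = {}
    for d in docs:
        best_id, best_sim = None, 0.0
        for other in docs:
            if other.author != d.author or other.id == d.id:
                continue  # assumption: sources come from the same author's set
            sim = jaccard(grams[d.author, d.id], grams[other.author, other.id])
            if sim > best_sim:
                best_id, best_sim = other.id, sim
        if best_id is not None and best_sim >= threshold:
            detected[d.author, d.id] = best_id
    return detected


def evaluate(docs, detected):
    """Precision, recall and F1; a detection is correct if its source id matches."""
    real = {(d.author, d.id): d.source for d in docs if d.cls == "plagiarism"}
    correct = sum(1 for key, src in detected.items() if real.get(key) == src)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(real) if real else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def report(docs, detected):
    """One line per detection (id, detected source, real source), then the evaluation line."""
    by_key = {(d.author, d.id): d for d in docs}
    lines = ["%s\t%s\t%s" % (key[1], src, by_key[key].source)
             for key, src in sorted(detected.items())]
    lines.append("%.3f\t%.3f\t%.3f" % evaluate(docs, detected))
    return "\n".join(lines)


def _doc(doc_id, cls, source, words):
    """Build a toy document; the tag column is a placeholder, not a real tagset."""
    head = '<doc author="a" id="%s" class="%s" source="%s">' % (doc_id, cls, source)
    body = ["%s\t%s\tX" % (w, w) for w in words.split()]
    return "\n".join([head] + body + ["</doc>"])


SAMPLE = "\n".join([
    _doc("1", "original", "1", "the cat sit on the mat today"),
    _doc("2", "plagiarism", "1", "the cat sit on the mat yesterday"),
    _doc("3", "original", "3", "dog bark loudly at night in town"),
])

if __name__ == "__main__":
    docs = parse_vertical(SAMPLE)
    print(report(docs, detect(docs)))
```

On the toy data, documents 1 and 2 share four of six distinct trigrams, so both are flagged and the evaluation line reads 0.500, 1.000, 0.667: the original (document 1) counts as a false detection. A real submission would need some way to tell the source from the copy, and the threshold would have to be tuned on the training data.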