Stylometry

Timestamp: Aug 31, 2021, 2:11:05 PM
Author: Ales Horak
Comment: copied from private/AdvancedNlpCourse/Stylometry

en/AdvancedNlpCourse2020/Stylometry

                       v1
+= Stylometry =
+[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák
+Prepared by: Honza Rygl, Aleš Horák
+== State of the Art ==
+The analysis of an author's characteristic
+writing style and vocabulary has been used to determine the authorship of documents and to uncover author traits such as age or gender,
+using both manual linguistic approaches and automatic algorithmic methods.
+The most common approach to stylometry problems
+is to combine stylistic analysis with machine learning techniques:
+. specific style markers are extracted,
+. a classification procedure is applied to the extracted markers.
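The two steps above can be sketched in a few lines of Python. This is only an illustrative toy and not the method used by any tool in this course: the three markers, the nearest-centroid classifier, and the function names `extract_markers`, `train_centroids` and `classify` are all assumptions made for the example.

```python
import math
from collections import defaultdict


def extract_markers(text):
    """Step 1: turn a text into a small vector of style markers."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    return [
        sum(len(w) for w in words) / n_words,        # average word length
        sum(ch.isupper() for ch in text) / n_chars,  # share of uppercase characters
        sum(ch in "!?" for ch in text) / n_chars,    # share of '!' and '?'
    ]


def train_centroids(labelled_texts):
    """Step 2a: average the marker vectors of each class."""
    grouped = defaultdict(list)
    for text, label in labelled_texts:
        grouped[label].append(extract_markers(text))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(text, centroids):
    """Step 2b: pick the class whose centroid is closest to the text."""
    vector = extract_markers(text)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Toy training data, invented for this example only.
centroids = train_centroids([
    ("HELLO!!! WOW!!!", "loud"),
    ("quiet words here", "calm"),
])
```

Real systems use many more markers, scale them to comparable ranges, and rely on stronger classifiers; the sketch only shows how the two steps fit together.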
+=== References ===
+. Bevendorff, Janek, et al. (2020), Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection. International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham. [https://pan.webis.de/downloads/publications/papers/bevendorff_2020.pdf pdf]
+. Stamatatos, E. (2009), A Survey of Modern Authorship Attribution Methods, Journal of the American Society for Information Science and Technology, 60(3), 538-556. [[http://www.clips.ua.ac.be/~walter/educational/material/Stamatatos_survey2009.pdf | pdf]]
+. Kestemont, M. (2014), Function Words in Authorship Attribution: From Black Magic to Theory? Proceedings of the 3rd Workshop on Computational Linguistics for Literature, EACL 2014, 59–66. [[http://aclweb.org/anthology/W14-0908 | pdf]]
+. Daelemans, W. (2013). Explanation in computational stylometry. In International conference on intelligent text processing and computational linguistics (pp. 451-462). Springer, Berlin, Heidelberg. [https://www.clips.uantwerpen.be/sites/default/files/daelemans2013.pdf pdf]
+== Practical Session ==
+Students will work with the ''Style & Identity Recognition'' (SIR) tool. They will test this tool on prepared data.
+The goal will be to implement a small function to extract style markers from a text.
+. Log in to the `asteria04.fi.muni.cz` server:
+{{{
+ssh asteria04.fi.muni.cz
+}}}
+. Download the [[htdocs:bigdata/stylometry-assignment.zip|ZIP with Python packages of the assignment]]:
+{{{
+wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/stylometry-assignment.zip
+}}}
+. Unzip the downloaded file
+{{{
+unzip stylometry-assignment.zip
+}}}
+. Go to the unzipped folder
+{{{
+cd sir-assignment
+}}}
+. Run the prepared program, which analyses texts from online dating services and predicts the author's gender (masculine/feminine) from text style features
+{{{
+./run.sh
+}}}
+`run.sh` can have two optional parameters:
+{{{
+./run.sh  [number_of_testing_cycles]  [show_first_N_erroneously_predicted_documents]
+}}}
+The default values, i.e. running `./run.sh` without parameters, are `10` cycles and no printed documents (`./run.sh 10 0`). More testing cycles, e.g. `./run.sh 100`, may give better results (but not necessarily).
+Example with document output (second parameter `>0`):
+{{{
+[xrygl@asteria04:~/temp/sir-assignment]$ ./run.sh 10 1
+pos: 5
+expected: on
+predicted: ona
+text: Ahoj, (nejen) pro výlety do víru podivnězimního velkoměsta, či divočiny
+      venkova, hledá se partnerka přiměřených rozměrů, tvarů a úrovně. Slečny
+      veselé povahy preferovány; ona je to nejspíš nutnost :-)
+morphology: [
+. <s>           <s>
+. Ahoj          N.N.I.S.1.-.-.-.-.-.A.-.-.-.-
+. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. (             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. nejen         T.T.-.-.-.-.-.-.-.-.-.-.-.-.-
+. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. pro           R.R.-.-.4.-.-.-.-.-.-.-.-.-.-
+. výlety        N.N.I.P.4.-.-.-.-.-.A.-.-.-.-
+. do            R.R.-.-.2.-.-.-.-.-.-.-.-.-.-
+. víru          N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
+. podivnězimního  A.A.N.S.2.-.-.-.-.1.A.-.-.-.-
+. velkoměsta    N.N.N.S.2.-.-.-.-.-.A.-.-.-.-
+. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. či            J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
+. divočiny      N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
+. venkova       N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
+. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. hledá         V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
+. se            P.7.-.X.4.-.-.-.-.-.-.-.-.-.-
+. partnerka     N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
+. přiměřených   A.A.I.P.2.-.-.-.-.1.A.-.-.-.-
+. rozměrů       N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
+. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. tvarů         N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
+. a             J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
+. úrovně        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
+. .             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. <s>           <s>
+. Slečny        N.N.F.P.1.-.-.-.-.-.A.-.-.-.-
+. veselé        A.A.N.S.1.-.-.-.-.1.A.-.-.-.-
+. povahy        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
+. preferovány   V.s.T.P.-.-.-.X.X.-.A.P.-.-.-
+. ;             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. ona           P.P.F.S.1.-.-.3.-.-.-.-.-.-.-
+. je            V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
+. to            P.D.N.S.1.-.-.-.-.-.-.-.-.-.-
+. nejspíš       D.g.-.-.-.-.-.-.-.3.A.-.-.-.-
+. nutnost       N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
+. :             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. -             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
+]
+Acc: 73.8 +- 2.1% (baseline 50.0%, 10 iterations)
+}}}
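The morphology listing above pairs each token with a dot-separated positional tag. A minimal sketch of turning such tags into POS n-gram counts follows. It assumes that the first position of the tag is the coarse part of speech (consistent with the sample, where nouns start with `N`, verbs with `V` and punctuation with `Z`) and that the tokens are available as `(word, tag)` pairs; how the tool actually passes morphology to `assignment.py` is not shown here, and the names `coarse_pos` and `pos_ngrams` are hypothetical.

```python
from collections import Counter


def coarse_pos(tag):
    """Return the first position of a dot-separated positional tag."""
    return tag.split(".")[0]


def pos_ngrams(tagged_tokens, n=2):
    """Count n-grams of coarse POS symbols over (word, tag) pairs.

    Sentence markers such as ("<s>", "<s>") are kept as their own symbol,
    so the n-grams can also capture how sentences begin.
    """
    symbols = [coarse_pos(tag) for _, tag in tagged_tokens]
    return Counter(
        tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)
    )


# A few (word, tag) pairs taken from the sample output above (not contiguous).
sample = [
    ("<s>", "<s>"),
    ("Ahoj", "N.N.I.S.1.-.-.-.-.-.A.-.-.-.-"),
    (",", "Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    ("hledá", "V.B.-.S.-.-.-.3.P.-.A.A.-.-.-"),
    ("se", "P.7.-.X.4.-.-.-.-.-.-.-.-.-.-"),
    ("partnerka", "N.N.F.S.1.-.-.-.-.-.A.-.-.-.-"),
]
bigrams = pos_ngrams(sample, n=2)
```

Relative frequencies of such n-grams (counts divided by the number of n-grams in the document) are usually more comparable across documents of different lengths than raw counts.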
+You can print more details in `http_server/basic_task.py` after the `# print explanation(s)` comment.
+=== Task ===
+Examine the files in the `stylometry_features` folder.
+Modify the `assignment.py` file to increase the accuracy of the methods (use the `vi` or `nano` command for editing).
+You can create additional classes in this file, changing their names and class names, to improve the accuracy score. Don't forget to add new classes to the `assignments` list in this file.
+Suggestions for your inspiration include:
+ * diacritics usage (yes/no), a regular expression will be needed
+ * sentence endings (number of sentences, or typical endings)
+ * repetitions of words in sentences/in the text
+ * usage of uppercase letters
+ * length of sentences/text
+ * POS tags (n-grams)
+ * word n-grams
+ * character n-grams
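Some of the suggestions above can be sketched as plain Python functions. This is a starting point under stated assumptions, not the interface expected by `assignment.py`: the function names are hypothetical, the sentence splitter is deliberately naive, and the results still have to be wrapped in whatever class structure the files in `stylometry_features` use.

```python
import re
from collections import Counter

# Czech letters carrying diacritics (lowercase and uppercase).
DIACRITICS = re.compile(r"[áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ]")
# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def uses_diacritics(text):
    """Diacritics usage (yes/no)."""
    return DIACRITICS.search(text) is not None


def uppercase_ratio(text):
    """Share of uppercase letters among all letters in the text."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def sentence_lengths(text):
    """Length of each sentence, measured in whitespace-separated words."""
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    return [len(s.split()) for s in sentences]


def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

For example, `sentence_lengths("Ahoj. Jak se máš? Dobře!")` gives one length per sentence, from which the number of sentences and the average sentence length follow directly.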
+Each modification can be tested by running `./run.sh` again.
+The first call of `run.sh` can be slower, because documents are morphologically analysed during the first run.
+Write the resulting `Acc:` line into the top comment of the `assignment.py` file. [[br]]
+Submit your `assignment.py` file.