# Sentiment Analysis Notebook

Inspired by https://www.kaggle.com/gabrielaltay/word-vectors-from-pmi-matrix and https://www.kaggle.com/rosado/sentiment-analysis-text-mining.

The notebook can handle two different datasets: Cestina 2.0 and Urban Dictionary. Select the one whose language you know better. Later, you can try the other one.

The aim of this notebook is to demonstrate sentiment analysis, particularly on new words from the crowd-sourced websites http://cestina20.cz or https://www.urbandictionary.com/. We scraped the data from Cestina 2.0 into a CSV file that is part of this project: http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/cestina20.csv. For Urban Dictionary, we downloaded the CSV from Kaggle: https://www.kaggle.com/athontz/urban-dictionary-terms#urban_dictionary.csv

First, we try to recognize the sentiment of dictionary entries using Liu's Opinion Lexicon (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon). For Czech, we automatically translated the lexicon entries using Google Translate. The Opinion Lexicon and its translation are part of this project and can be found at http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/positive-words-en.txt, http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/negative-words-en.txt, http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/positive-words-cs.txt, and http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/negative-words-cs.txt.

**OPTIONAL TASK**: If you are not familiar with Cestina 2.0/Urban Dictionary, go to the website and browse some dictionary entries to see examples of the data.


# Get the data

Download the data using `wget`.

In [None]:
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/cestina20.csv
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/urban_dictionary.csv
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/positive-words-en.txt
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/negative-words-en.txt
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/positive-words-cs.txt
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/negative-words-cs.txt

## Tools
### Tokenizer
We use the standard NLTK tokenizer to split the texts into words. Splitting on spaces is not enough, since we want e.g. "word" and "word," to yield the same token. Tokenization is not strongly language dependent, so the standard NLTK tokenizer is sufficient for both English and Czech. To process languages that do not separate words with spaces (CJK, i.e. Chinese, Japanese, Korean), this part would have to be modified.
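
To see why whitespace splitting falls short, the sketch below compares `str.split` with a simple regular-expression tokenizer. The regex is only an illustrative stand-in for NLTK's `word_tokenize`, which handles many more cases; `simple_tokenize` is a hypothetical helper name introduced for this example.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: runs of word characters and single punctuation
    # characters become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

text = "A word, another word."

# Whitespace splitting keeps the punctuation glued to the words.
print(text.split())
# The regex tokenizer separates punctuation, so both occurrences give "word".
print(simple_tokenize(text))
```

With whitespace splitting, "word," and "word." would count as two different vocabulary items, which hurts on a small dataset.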

### Stopwords
For training the word vectors, we use stoplists of the most common English/Czech words. This helps especially when the data is small, as in our case.

In [None]:
!wget https://raw.githubusercontent.com/stopwords-iso/stopwords-cs/master/stopwords-cs.txt
!wget https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt
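
As a rough illustration of what the stoplist is used for, the sketch below filters a tokenized sentence. The tiny inline stoplist `toy_stopwords` is an assumption made for this example; the notebook loads the real lists from the downloaded files.

```python
# Toy stoplist (assumption for this example only).
toy_stopwords = {"the", "over", "a", "."}

tokens = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", "."]

# Lowercase first so that "The" matches the stoplist entry "the".
content_tokens = [tok.lower() for tok in tokens if tok.lower() not in toy_stopwords]
print(content_tokens)
```

Only the content-bearing tokens remain, so frequent function words do not dominate the co-occurrence counts computed later.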

### Python packages
#### Pandas
Pandas is a data science standard that makes it easy to work with large tabular data. The Pandas `DataFrame` is the object we use in this project.

#### SciKit Learn
SciKit Learn (`sklearn`) is a standard machine learning package for Python. We use its `cosine_similarity` function.

#### Numpy
Together with `sklearn`, NumPy is a Python machine learning standard. It provides straightforward matrix computation, so we can avoid nested for-loops.

In [None]:
# it depends on installation but probably this is not necessary
# in case it does not work, try pip instead of pip3
# DO NOT RUN in Colab, this is only useful if you download the notebook and use on your computer
!pip3 install --user nltk
# the PyPI package is named scikit-learn; it is imported as sklearn
!pip3 install --user scikit-learn

In [None]:
from collections import Counter
import itertools

import os
import nltk
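nltk.download('punkt')
# newer NLTK releases look for 'punkt_tab' instead; on older releases this call should only print an error
nltk.download('punkt_tab')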
from nltk.tokenize import word_tokenize

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
LANG_SETTINGS = {"en": {"filename":"urban_dictionary.csv", "sep":",", "explanation":"definition"},
                 "cs": {"filename":"cestina20.csv", "sep":";", "explanation":"explanation"}
                }
selected_language = 'cs'
settings = LANG_SETTINGS[selected_language]

## Data
We prepared the data in advance using web crawling and parsing of the HTML pages. Look at a sample of the CSV.

In [None]:
df = pd.read_csv(settings['filename'], sep=settings['sep'], keep_default_na=False)
df.head()

### Explanations
We try to recognize the sentiment of a dictionary entry from its explanation. Let's first look at an example, then convert the explanations into token sequences and remove stopwords.

In [None]:
df[settings['explanation']][1000]

In [None]:
explanations = df[settings['explanation']].tolist()
# remove stopwords
stopwords_set = {'-','.',':',';','"',',', '!', '(', ')', '``', "'", "''", "„", "”", "...", "apod", "viz", 'např', "například", "příklad"}
with open('stopwords-{}.txt'.format(selected_language), encoding='utf-8') as f:
    # a real set gives fast membership tests in the filtering step below
    stopwords_set.update(w.strip() for w in f)
#print(stopwords_set)
explanations = [
    [tok.lower() for tok in word_tokenize(explanation.replace('&#8230;', '...').replace('&#8230', '...')) if tok.lower() not in stopwords_set] for explanation in explanations
]
# show results
explanations[1000]

## Recognize sentiment using the Opinion Lexicon

**TASK 1**: Look in the original opinion lexicon, look in the translated version. Comment on what you see in both resources.

The easiest way is to go through all tokens in the explanation and sum their sentiment according to the Opinion Lexicon. Some opinion lexicons distinguish *strong* and *weak* opinions. We score words in a similar way, but we record only the polarity of the sentiment, not its intensity: -2 for negative words, 2 for positive words, 0 for words present in both files.


In [None]:
with open('positive-words-{}.txt'.format(selected_language),'rb') as f:
    positive_words = [w.strip().lower() for w in f.read().decode('utf-8','ignore').split('\n')]
with open('negative-words-{}.txt'.format(selected_language),'rb') as f:
    negative_words = [w.strip().lower() for w in f.read().decode('utf-8','ignore').split('\n')]
"No. of positive words:", len(positive_words), "No. of negative words:", len(negative_words)

In [None]:
# we add a small epsilon to the positive score in order to distinguish sentences with no recognized sentiment from sentences with balanced positive and negative sentiment
score_word_dict = {k:2.00001 for k in positive_words}
score_word_dict.update({k:0 for k in negative_words if k in positive_words})
score_word_dict.update({k:-2 for k in negative_words if k not in positive_words})

In [None]:
def get_sentiment_sequence(tokens, score_word_dict):
    sentiment = 0
    for token in tokens:
        sentiment += score_word_dict.get(token, 0)
    return sentiment
df[settings['explanation']][1], get_sentiment_sequence(explanations[1], score_word_dict)
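
The effect of the small epsilon on positive scores can be seen on a toy lexicon. This is a minimal sketch: the words in `toy_scores` and the helper `toy_sentiment` are made up for illustration and only mirror the scoring convention used above.

```python
# Toy lexicon (hypothetical words, same scoring convention as the notebook).
toy_scores = {"good": 2.00001, "nice": 2.00001, "bad": -2, "ugly": -2}

def toy_sentiment(tokens, scores):
    # Sum the lexicon score of every token; unknown tokens contribute 0.
    return sum(scores.get(tok, 0) for tok in tokens)

neutral = toy_sentiment(["table", "chair"], toy_scores)  # no lexicon words at all
mixed = toy_sentiment(["good", "bad"], toy_scores)       # one positive, one negative
positive = toy_sentiment(["good", "nice"], toy_scores)   # two positive words

print(neutral, mixed, positive)
```

The sentence without any lexicon words scores exactly 0, while the mixed sentence keeps a tiny positive remainder, so the two cases can be told apart later.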

**OPTIONAL TASK**: Extend the above method, so it is able to explain the sentiment score. For example, the method can return a tuple (`score`, `list of sentiment positive/negative words`).

Calculate the sentiment for all explanations and add it as a column to the DataFrame.

In [None]:
scores=[]
for d in explanations:
    score = get_sentiment_sequence(d, score_word_dict)
    scores.append(score)
df['feeling_score_lexicon'] = scores
df.head()

In [None]:
df.sort_values(by=['feeling_score_lexicon'], ascending=False).head()

## Result 1

**TASK 2**: Add statistics about the dataset. How many dictionary entries have explanation? How many explanations have sentiment recognized?


## Word Vectors
The main problems of the naive solution are:
* low recall because of the rather low quality of the lexicon (due to the automatic translation),
* low recall because the lexicon contains only one form of each word, whereas Czech has many different forms of a single word, e.g. tlustý, tlustým, tlustých, tlustá, tlustého, tlustou,
* no context awareness.
 
    
We try to improve the sentiment recognition using word vectors. The main idea is *distributional semantics*: the observation that similar words appear in similar contexts. There are many methods for calculating word vectors, but all of them take into account not only a token but also the tokens surrounding it (the context). Most techniques use a fixed window; in our case, the window is (-2, +2). For example, for the sentence "The quick brown fox jumped over the lazy dog.", a sliding window yields the following sequences (with stopwords removed):

\[quick, brown, fox, jumped, lazy\]<br>
\[brown, fox, jumped, lazy, dog\]

Distributional semantics assumes that *fox* is similar to other words that appear around the words *quick*, *brown*, *jumped*, and *lazy*.
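
The sliding window can be sketched in a few lines. The helper name `context_windows` is hypothetical; it only illustrates the (-2, +2) window on the example sentence after stopword removal.

```python
def context_windows(tokens, back_window=2, front_window=2):
    """Yield (center, context) pairs for a sliding window over the tokens."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - back_window):i]     # up to 2 tokens before
        right = tokens[i + 1:i + 1 + front_window]   # up to 2 tokens after
        yield center, left + right

# The example sentence with stopwords and punctuation already removed.
tokens = ["quick", "brown", "fox", "jumped", "lazy", "dog"]
for center, context in context_windows(tokens):
    print(center, context)
```

Tokens near the start or end of a sequence simply get a shorter context, which is also how the skipgram counting further below treats them.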

### Token index
We convert the tokens in the explanations (without stopwords) into an index. This is handy because all further computation works with numbers only, and the index maps those numbers back to the respective tokens.

In [None]:
tok2indx = dict()
unigram_counts = Counter()
for ii, explanation in enumerate(explanations):
    for token in explanation:
        unigram_counts[token] += 1
        if token not in tok2indx:
            tok2indx[token] = len(tok2indx)
indx2tok = {indx:tok for tok,indx in tok2indx.items()}
print('done')
print('vocabulary size: {}'.format(len(unigram_counts)))
print('most common: {}'.format(unigram_counts.most_common(10)))

### Skipgrams
We calculate the frequencies of word tuples appearing in the same sliding window. You can see the most frequent tuples.

In [None]:
back_window = 2
front_window = 2
skipgram_counts = Counter()
for i, explanation in enumerate(explanations):
    for ifw, fw in enumerate(explanation):
        icw_min = max(0, ifw - back_window)
        icw_max = min(len(explanation) - 1, ifw + front_window)
        icws = [ii for ii in range(icw_min, icw_max + 1) if ii != ifw]
        for icw in icws:
            skipgram = (explanation[ifw], explanation[icw])
            skipgram_counts[skipgram] += 1    
        
print('done')
print('number of skipgrams: {}'.format(len(skipgram_counts)))
print('most common: {}'.format(skipgram_counts.most_common(10)))

## Token matrix
We store the skipgram frequencies in a (symmetric) matrix.

**OPTIONAL TASK**: Why is the matrix symmetric?

In [None]:
row_indxs = []
col_indxs = []
dat_values = []
ii = 0
for (tok1, tok2), sg_count in skipgram_counts.items():
    ii += 1
    if ii % 1000000 == 0:
        print(f'finished {ii/len(skipgram_counts):.2%} of skipgrams')
    tok1_indx = tok2indx[tok1]
    tok2_indx = tok2indx[tok2]
        
    row_indxs.append(tok1_indx)
    col_indxs.append(tok2_indx)
    dat_values.append(sg_count)
    
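# pass the shape explicitly so that tokens without any skipgram still get a (zero) row
wwcnt_mat = sparse.csr_matrix((dat_values, (row_indxs, col_indxs)), shape=(len(tok2indx), len(tok2indx)))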
print(wwcnt_mat.shape)

## Token similarity
In the token matrix, each row corresponds to a token; the row is the word vector. The numbers in the matrix show how often the token appears together with other tokens.
The following method calculates token similarity as a cosine of the angle between two word vectors.
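
For intuition, the cosine of the angle between two vectors $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it in plain Python on made-up co-occurrence rows; the notebook itself relies on `cosine_similarity` from `sklearn`, and the helper `cosine` and the toy rows are assumptions of this example.

```python
import math

def cosine(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence rows over a 3-token vocabulary.
row_a = [2, 0, 1]
row_b = [4, 0, 2]  # same contexts as row_a, only twice as frequent
row_c = [0, 3, 0]  # shares no context with row_a

print(cosine(row_a, row_b), cosine(row_a, row_c))
```

Because the cosine ignores vector length, a frequent and a rare token can still be judged similar as long as they occur in the same contexts in the same proportions.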

In [None]:
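def ww_sim(token, matrix, topn=10):
    """Return the topn tokens most similar to token as (token, similarity) pairs."""
    if token not in tok2indx:
        # unknown token: return an empty list so the return type stays consistent
        return []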
    indx = tok2indx[token]
    if isinstance(matrix, sparse.csr_matrix):
        v1 = matrix.getrow(indx)
    else:
        v1 = matrix[indx:indx+1, :]
    sims = cosine_similarity(matrix, v1).flatten()
    sindxs = np.argsort(-sims)
    sim_word_scores = [(indx2tok[sindx], sims[sindx]) for sindx in sindxs[0:topn]]
    return sim_word_scores

In [None]:
import pprint
if selected_language == "cs":
    pprint.pprint(ww_sim('obézní', wwcnt_mat))
    pprint.pprint(ww_sim('hezká', wwcnt_mat))
else:
    pprint.pprint(ww_sim('ugly', wwcnt_mat))
    pprint.pprint(ww_sim('girl', wwcnt_mat))

## Calculate sentiment for OOVs
Out-of-vocabulary (OOV) terms are one of the problems of our first attempt. Let's try to expand the Opinion Lexicon with similar words. You can see that we now know the sentiment of words *not* present in the Opinion Lexicon.
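
The idea can be sketched on toy data before looking at the real implementation: each neighbour of a token contributes its lexicon score weighted by its similarity. The neighbour list, the scores, and the helper `propagated_sentiment` below are hypothetical and serve only as an illustration.

```python
# Hypothetical nearest neighbours of an out-of-lexicon token,
# given as (token, similarity) pairs.
neighbours = [("lovely", 0.8), ("awful", 0.3), ("table", 0.5)]

# Toy lexicon scores following the notebook's convention.
toy_scores = {"lovely": 2.00001, "awful": -2}

def propagated_sentiment(neighbours, scores):
    # Neighbours missing from the lexicon contribute 0.
    return sum(sim * scores.get(tok, 0) for tok, sim in neighbours)

score = propagated_sentiment(neighbours, toy_scores)
print(score)
```

Here the strongly similar positive neighbour outweighs the weakly similar negative one, so the token ends up with a positive score even though it is not in the lexicon itself.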

In [None]:
def get_sentiment(token, score_word_dict, ww_matrix):
#    if token in score_word_dict.keys(): # alternative: calculate from vectors iff the word is not in the lexicon
#        return score_word_dict[token]
    sentiment = score_word_dict.get(token, 0)
    k = ww_sim(token, ww_matrix)
    if not k:
        return 0
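    # ww_sim returns (token, similarity) pairs: weight each neighbour's lexicon score by its similarity
    for sim_token, similarity in k:
        sentiment += similarity * score_word_dict.get(sim_token, 0)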
    return sentiment
if selected_language=="cs":
    print('andrej', get_sentiment('andrej', score_word_dict, wwcnt_mat), score_word_dict.get('andrej'))
else:
    print('donald', get_sentiment('donald', score_word_dict, wwcnt_mat), score_word_dict.get('donald'))

In [None]:
def get_sentiment_sequence(tokens, score_word_dict, ww_matrix):
    sentiment = 0
    for token in tokens:
        sentiment += get_sentiment(token, score_word_dict, ww_matrix)
    return sentiment
df[settings['explanation']][1], get_sentiment_sequence(explanations[1], score_word_dict, wwcnt_mat)

In [None]:
# This operation takes time. It calculates sentiment for each word in all explanations from the matrix. It could be optimized e.g. by not calculating the same word multiple times.
# We calculate the time spent using the time magic.
def calculate_scores(explanations, score_word_dict, wwcnt_mat):
    scores=[]
    for d in explanations:
        score = get_sentiment_sequence(d, score_word_dict, wwcnt_mat)
        scores.append(score)
    return scores
%time scores = calculate_scores(explanations, score_word_dict, wwcnt_mat)
df['feeling_score_wv'] = scores
df.head()

In [None]:
df.sort_values(by=['feeling_score_wv']).head()

## Result 2

**TASK 3**: Calculate the same statistics as for Result 1. How did word vectors improve the number of recognized words in the explanations?

## Evaluation
For this course, we manually annotated sentiment for 400 Czech explanations. We will compare the sentiment recognized by our methods with the manual annotation.

In [None]:
!wget http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/OpinionSentiment/cestina20_annotation.csv

In [None]:
annotation = None
if selected_language=="cs":
    dfa = pd.read_csv('cestina20_annotation.csv', sep=';')
    annotation = pd.merge(df, dfa, on='word') #df.loc[:len(dfa)-1,]
else:
    print("no ground truth (manual annotations) available")

In [None]:
annotation.head()

**TASK 4 (cs)**: Insert code to calculate the confusion matrix for both `feeling_score_lexicon` and `feeling_score_wv` against `annotation`. Assume that the opinion recognition is correct if the score == 0 or the score has the same polarity as the annotation. Which sentiment recognition is more accurate? Why?

**TASK 4 (en)**: Go through 100 classifications and try to annotate the sentiment manually. Using -1 for negative, 0 for neutral, and 1 for positive is enough, i.e. do not consider *intensity*, only *polarity*.