Polish sentiment analysis using Keras and Word2vec
This blog post is devoted to sentiment analysis of the Polish language, a problem in the field of natural language processing, implemented using machine learning techniques and recurrent neural networks.
What is Sentiment Analysis?
Sentiment analysis is a natural language processing (NLP) problem where the text is understood and the underlying intent is predicted. In this post, I will show you how you can predict the sentiment of Polish language texts as either positive, neutral or negative with the use of Python and Keras Deep Learning library.
Introduction to the basics of NLP
Word2vec is a group of related models that are used to produce word embeddings. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space [1].
Word embedding is the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Word and phrase embeddings, when used as the underlying input representation, have been shown to boost performance in NLP tasks such as syntactic parsing and sentiment analysis.
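To build some intuition for what these vectors capture, here is a minimal sketch of querying a pre-trained model with gensim. It assumes the 'nkjp.txt' vectors used later in this post and that the example words are in its vocabulary:

# Minimal sketch: querying pre-trained word2vec vectors with gensim.
# Assumes 'nkjp.txt' (the Polish vectors used later in this post) is available.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('nkjp.txt', binary=False)

# Words sharing contexts in the corpus end up close together, so the
# nearest neighbours of 'dobry' ("good") should be similar adjectives.
print(vectors.most_similar('dobry', topn=5))
print(vectors.similarity('dobry', 'świetny'))  # cosine similarity of two words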
Why is sentiment analysis for the Polish language difficult?
- syntax – relationships between words in a sentence can often be expressed in several ways, which leads to different interpretations of the text,
- semantics – the same word can have many meanings, depending on the context,
- pragmatics – the occurrence of metaphors, tautologies, irony, etc.,
- diacritic marks such as: ą, ć, ę, ł, ń, ó, ś, ź, ż,
- homonyms – words with the same linguistic form but different, unrelated meanings,
- synonyms – different words with the same or very similar meaning,
- idioms – expressions whose meaning differs from the one suggested by their constituent parts and the rules of syntax,
- more than 150k words in the basic dictionary.
Data sources used for the project
Data was collected from various sources:
- Opineo – Polish service with all reviews from online shops
- Twitter – Polish current top hashtags from political news
- Twitter – Polish Election Campaign 2015
- Polish Academy of Sciences HateSpeech project
- YouTube – comments from various videos
Download the text data for Polish sentiment analysis from our Google Drive.
The Polish word embeddings were downloaded from the Polish Academy of Sciences.
Let’s get to what we like the most – the code…
First of all, we will be defining all of the libraries and functions we will need:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from livelossplot import PlotLossesKeras
np.random.seed(7)
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM
from keras.layers.embeddings import Embedding
from keras.utils import np_utils
from keras.preprocessing import sequence
from gensim.models import Word2Vec, KeyedVectors, word2vec
import gensim
from gensim.utils import simple_preprocess
from keras.utils import to_categorical
import pickle
import h5py
from time import time
Then load our dataset with simple preprocessing of data:
filename = 'Data/Dataset.csv'
dataset = pd.read_csv(filename, delimiter=",")

# Delete unused column
del dataset['length']

# Delete all NaN values from columns=['description', 'rate']
dataset = dataset[dataset['description'].notnull() & dataset['rate'].notnull()]

# Set all strings to lower case
dataset['description'] = dataset['description'].str.lower()
Split data into training (60%), test (20%) and validation (20%) sets:
X = dataset['description']
y = dataset['rate']

# First split off 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split off 20% of the full data as the validation set (0.25 x 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
Print the shapes of the train, test and validation sets for X and y:
print("X_train shape: " + str(X_train.shape)) print("X_test shape: " + str(X_test.shape)) print("X_val shape: " + str(X_val.shape)) print("y_train shape: " + str(y_train.shape)) print("y_test shape: " + str(y_test.shape)) print("y_val shape: " + str(y_val.shape))
Load the pre-trained Polish Word2vec model from the Polish Academy of Sciences:
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('nkjp.txt', binary=False)
embedding_matrix = word2vec_model.vectors  # .syn0 / .wv.syn0 on older gensim versions
print('Shape of embedding matrix: ', embedding_matrix.shape)
Vectorize X_train, X_test and X_val into 2D tensors:
top_words = embedding_matrix.shape[0]
mxlen = 50
nb_classes = 3

tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(X_train)
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test = tokenizer.texts_to_sequences(X_test)
sequences_val = tokenizer.texts_to_sequences(X_val)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

X_train = sequence.pad_sequences(sequences_train, maxlen=mxlen)
X_test = sequence.pad_sequences(sequences_test, maxlen=mxlen)
X_val = sequence.pad_sequences(sequences_val, maxlen=mxlen)

y_train = np_utils.to_categorical(y_train, nb_classes)
y_test = np_utils.to_categorical(y_test, nb_classes)
y_val = np_utils.to_categorical(y_val, nb_classes)
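One caveat worth flagging: the tokenizer assigns its own integer ids, which do not necessarily match the row order of the pre-trained word2vec matrix. A minimal sketch of building a matrix whose rows follow the tokenizer's word_index is shown below (it assumes gensim 3.x attribute names; the zero-vector fallback for out-of-vocabulary words is one possible choice):

# Sketch: re-index the pre-trained vectors so that row i of the matrix
# corresponds to the word with tokenizer id i. Out-of-vocabulary words
# fall back to a zero vector here; a random vector is another common choice.
embedding_dim = embedding_matrix.shape[1]
aligned_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in word2vec_model.vocab:  # use word2vec_model.key_to_index on gensim >= 4
        aligned_matrix[i] = word2vec_model[word]
# aligned_matrix can then be passed to the Embedding layer as weights=[aligned_matrix]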
We define our LSTM (Long Short-Term Memory) network.
Long Short-Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behaviour, not something they struggle to learn! The key to LSTMs is the cell state, a kind of conveyor belt that runs straight through the entire chain of the network.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
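For reference, the standard gate equations (in the usual notation, where \(\sigma\) is the sigmoid function, \(\odot\) is the element-wise product, and \([h_{t-1}, x_t]\) is the concatenation of the previous hidden state and the current input) are:

\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(new cell state)} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(new hidden state)}
\end{aligned}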
batch_size = 32
nb_epoch = 12

embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            trainable=False)

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(128, recurrent_dropout=0.5, dropout=0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.summary()
We just need to compile the model and we will be ready to train it. When we compile the model, we declare the optimizer (Adam, SGD, etc.) and the loss function. To fit the model, all we have to do is declare the number of epochs and the batch size.
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
rnn = model.fit(X_train, y_train,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                validation_data=(X_val, y_val))
score = model.evaluate(X_test, y_test)
print("Test Loss: %.2f%%" % (score[0] * 100))
print("Test Accuracy: %.2f%%" % (score[1] * 100))
The last step is to save our trained model together with the word index, so we can use it later to predict the sentiment of new sentences.
print('Save model...')
model.save('Models/finalsentimentmodel.h5')
print('Saved model to disk...')

print('Save word index...')
with open('Models/finalwordindex.pkl', 'wb') as output:
    pickle.dump(word_index, output)
print('Saved word index to disk...')
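To illustrate the round trip, here is a minimal sketch of how the saved model and word index could be reused on a new sentence. The file paths match the ones above; the manual index lookup (with unknown words skipped) is just one simple choice, not the only way:

# Sketch: load the saved artefacts and score one new sentence.
from keras.models import load_model

model = load_model('Models/finalsentimentmodel.h5')
with open('Models/finalwordindex.pkl', 'rb') as f:
    word_index = pickle.load(f)

sentence = "bardzo dobry produkt, polecam"  # "very good product, I recommend it"
# Map words to the ids the tokenizer assigned during training;
# unknown words are simply skipped here. mxlen must match training.
ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
padded = sequence.pad_sequences([ids], maxlen=mxlen)
probabilities = model.predict(padded)[0]
print('Predicted class:', np.argmax(probabilities))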
We can also monitor overfitting using training graphs.
Plots for the training and validation process (loss and accuracy):
plt.figure(0)
plt.plot(rnn.history['acc'], 'r')
plt.plot(rnn.history['val_acc'], 'g')
plt.xticks(np.arange(0, nb_epoch + 1, nb_epoch / 5))
plt.rcParams['figure.figsize'] = (8, 6)
plt.xlabel("Num of Epochs")
plt.ylabel("Accuracy")
plt.title("Training vs Validation Accuracy")
plt.legend(['train', 'validation'])

plt.figure(1)
plt.plot(rnn.history['loss'], 'r')
plt.plot(rnn.history['val_loss'], 'g')
plt.xticks(np.arange(0, nb_epoch + 1, nb_epoch / 5))
plt.rcParams['figure.figsize'] = (8, 6)
plt.xlabel("Num of Epochs")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend(['train', 'validation'])

plt.show()
As we can see in the charts above, the neural network learns quickly, getting good results.
Apply Precision-Recall
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.
Recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.
Both precision and recall are therefore based on an understanding and measure of relevance.
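Written in terms of true positives (TP), false positives (FP) and false negatives (FN), the two measures are:

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}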
# Apply Precision-Recall
from sklearn.metrics import recall_score, precision_score, f1_score

y_pred = model.predict(X_test)

# Convert one-hot y_test and the predicted probabilities into 1D class arrays
yy_true = [np.argmax(i) for i in y_test]
yy_scores = [np.argmax(i) for i in y_pred]

print("Recall: " + str(recall_score(yy_true, yy_scores, average='weighted')))
print("Precision: " + str(precision_score(yy_true, yy_scores, average='weighted')))
print("F1 Score: " + str(f1_score(yy_true, yy_scores, average='weighted')))
Applying and visualizing the confusion matrix
A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa).
# Apply Confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

Y_pred = model.predict(X_test, verbose=2)
y_pred = np.argmax(Y_pred, axis=1)

cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred)
# Number of test samples per true class
for ix in range(3):
    print(ix, cm[ix].sum())
print(cm)

# Visualizing of confusion matrix
import seaborn as sn

df_cm = pd.DataFrame(cm, range(3), range(3))
plt.figure(figsize=(10, 7))
sn.set(font_scale=1.4)
sn.heatmap(df_cm, annot=False)
sn.set_context("poster")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.savefig('Plots/confusionMatrix.png')
plt.show()
Wordcloud
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.
from wordcloud import WordCloud
from many_stop_words import get_stop_words

stop_words = get_stop_words('pl')

# Join all descriptions into one string instead of str(Series),
# which would only use a truncated preview of the column
text = ' '.join(dataset['description'].astype(str))

wordcloud = WordCloud(
    background_color='white',
    stopwords=stop_words,
    max_words=200,
    max_font_size=40,
    random_state=42
).generate(text)

fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
The word cloud above shows the most common words in our dataset.
# ROC Curve
from sklearn.metrics import roc_curve, auc
from itertools import cycle

# Compute ROC curve and ROC area for each class.
# Note: yy_scores contains hard argmax predictions; using the predicted
# probabilities (y_pred from model.predict) would give smoother curves.
n_classes = 3
lw = 2
y_true_dummies = np.array(pd.get_dummies(yy_true))
y_score_dummies = np.array(pd.get_dummies(yy_scores))

fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_true_dummies[:, i], y_score_dummies[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area over all classes at once
fpr["micro"], tpr["micro"], _ = roc_curve(y_true_dummies.ravel(), y_score_dummies.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Compute macro-average ROC curve and ROC area
# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure(figsize=(8, 5))
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'.format(roc_auc["macro"]),
         color='green', linestyle=':', linewidth=4)

colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', color='red', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.savefig('Plots/ROCcurve.png')
plt.show()
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity or recall.
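The two rates are defined analogously to precision and recall:

TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}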
Results
Test Accuracy: 97.65%
Test Loss: 6.56%
Recall score: 0.9761942865880075
Precision score: 0.9757310701772396
F1 score: 0.9758853714533414
Some statistics
- Our model was trained on a dataset consisting of about 1,000,000 rows, each containing at least 30 characters.
- Training time:
For epochs = 12: 220 minutes on a MacBook Pro (i7, 2.5 GHz, 16 GB RAM, 512 GB SSD).
Summary
After working through this post, you have learned:
- What sentiment analysis is
- How to use the Keras library in NLP projects
- How to use Word2vec
- How to preprocess data for NLP projects
- Why the Polish language is difficult for NLP
- How to implement an LSTM network
- What a word embedding is
- How to visualize text with a word cloud
- How to interpret a confusion matrix, precision, recall, F1 score and ROC curve
You can download the code from the Ermlab GitHub repository.
How to run our project?
Go to the README in the project repository here.
If you have any questions about the project or this post, please ask your question in the comments.
Resources
- Mikolov, Tomas; et al. “Efficient Estimation of Word Representations in Vector Space”. arXiv:1301.3781
- Twitter web scrapper & sentiment analysis of Polish government parties
- Pre-trained word vectors of 30+ languages
- An implementation of different neural networks to classify tweet’s sentiments
- Predict Sentiment From Movie Reviews Using Deep Learning
- http://dsmodels.nlp.ipipan.waw.pl/
- Chinese Shopping Reviews sentiment analysis
- Prediction of Amazon review scores with a deep recurrent neural network using LSTM modules
- Classify the sentiment of sentences from the Rotten Tomatoes dataset
- Understanding LSTM Networks
- https://keras.io/