
Polish sentiment analysis using Keras and Word2vec


This blog post is devoted to sentiment analysis of Polish-language text, a problem in the field of natural language processing, implemented using machine learning techniques and recurrent neural networks.

What is Sentiment Analysis?

Sentiment analysis is a natural language processing (NLP) problem where the text is understood and the underlying intent is predicted. In this post, I will show you how to predict the sentiment of Polish-language texts as positive, neutral or negative using Python and the Keras deep learning library.

Introduction to the basics of NLP

Word2vec is a group of related models that are used to produce word embeddings. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space [1].

Word embedding is the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Word and phrase embeddings, when used as the underlying input representation, have been shown to boost performance in NLP tasks such as syntactic parsing and sentiment analysis.

Why is sentiment analysis for the Polish language difficult?

  • syntax – relationships between words in a sentence can be often specified in several ways, which leads to different interpretations of the text,
  • semantics – the same word can have many meanings, depending on the context,
  • pragmatics – the occurrence of metaphors, tautologies, ironies, etc.
  • diacritic marks such as: ą, ć, ę, ł, ń, ó, ś, ź, ż,
  • homonyms – words that share the same linguistic form but have different, unrelated meanings,
  • synonyms – different words with the same or very similar meaning,
  • idioms – expressions whose meaning cannot be derived from their constituent parts and the rules of syntax,
  • more than 150k words in the basic dictionary.

Data sources used for the project

Data was collected from various sources:

  1. Opineo – Polish service with all reviews from online shops
  2. Twitter – Polish current top hashtags from political news
  3. Twitter – Polish Election Campaign 2015
  4. Polish Academy of Science HateSpeech project
  5. YouTube – comments from various videos

Download the text data for Polish sentiment analysis from our Google Drive.

The Polish word embeddings were downloaded from the Polish Academy of Science.

Let’s go to what we like the most – code…

First of all, we will be defining all of the libraries and functions we will need:

Then load our dataset with simple preprocessing of data:
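The dataset itself is not shown in this post, so the sketch below uses a tiny in-memory CSV as a stand-in; the column names (`description`, `rate`) and the separator are assumptions, and in practice you would call `pd.read_csv` on the downloaded file:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV file.
csv_data = io.StringIO(
    "description;rate\n"
    "bardzo dobry produkt, polecam wszystkim znajomym serdecznie;1\n"
    "fatalna obsluga, nie polecam nikomu tego sklepu internetowego;-1\n"
    ";0\n"
)
df = pd.read_csv(csv_data, sep=";")
df = df.dropna(subset=["description"])             # drop rows with empty text
df = df[df["description"].str.len() >= 30]         # keep texts of >= 30 characters
df["description"] = df["description"].str.lower()  # simple normalisation
X, y = df["description"].values, df["rate"].values
```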

Split the data into training (60%), test (20%) and validation (20%) sets:

Print the shapes of X (train, test, validate) and y (train, test, validate):
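A 60/20/20 split can be done with two calls to scikit-learn's `train_test_split`; the placeholder arrays below stand in for the real vectorized texts and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the vectorized texts and labels.
X = np.arange(1000).reshape(500, 2)
y = np.arange(500) % 3

# First carve off 60% for training, then split the remaining 40%
# in half into test and validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

for name, arr in [("X_train", X_train), ("X_test", X_test), ("X_val", X_val),
                  ("y_train", y_train), ("y_test", y_test), ("y_val", y_val)]:
    print(name, arr.shape)
```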

Load existing Polish Word2vec model taken from Polish Academy of Science:

Vectorize X_train and X_test to 2D tensor:
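This step is typically done with a Keras `Tokenizer` plus `pad_sequences`; the dependency-free equivalent below shows what actually happens — each text becomes a fixed-length row of word ids (the `maxlen` value is an assumption):

```python
import numpy as np

texts = ["bardzo dobry produkt",
         "nie polecam tego sklepu internetowego",
         "dobry sklep"]
maxlen = 4   # assumed maximum sequence length

# Build a word index (1-based; 0 is reserved for padding).
word_index = {}
for text in texts:
    for word in text.split():
        word_index.setdefault(word, len(word_index) + 1)

# Turn each text into a row of word ids: truncate long texts,
# left-pad short ones (the pad_sequences default).
X = np.zeros((len(texts), maxlen), dtype=int)
for i, text in enumerate(texts):
    seq = [word_index[w] for w in text.split()][-maxlen:]
    X[i, maxlen - len(seq):] = seq
print(X.shape)
```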

We define our LSTM (Long Short-Term Memory) model.

Long Short-Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behaviour, not something they struggle to learn! The key to LSTMs is the cell state, a kind of conveyor belt that runs straight through the entire chain of cells.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
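A minimal sketch of such an architecture is given below. The layer sizes are assumptions, and the original model additionally initialises the `Embedding` layer with the pre-trained Polish word2vec matrix (via its `weights`/`embeddings_initializer` argument):

```python
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dense

maxlen, vocab_size, embed_dim = 40, 10000, 100   # assumed hyper-parameters

model = Sequential([
    Input(shape=(maxlen,)),
    Embedding(input_dim=vocab_size, output_dim=embed_dim),
    LSTM(128, dropout=0.2),                      # gated recurrent layer
    Dense(3, activation="softmax"),              # positive / neutral / negative
])
```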

We just need to compile the model and we will be ready to train it. When we compile the model, we declare the optimizer (Adam, SGD, etc.)  and the loss function. To fit the model, all we have to do is declare the number of epochs and the batch size.
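End to end, compiling and fitting might look like the sketch below. The data here is synthetic so the snippet runs standalone, the model is shrunk, and only one epoch is run (the post trains for 12):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dense
from keras.utils import to_categorical

# Synthetic stand-ins for the padded sequences and one-hot labels.
X_train = np.random.randint(1, 100, size=(64, 10))
y_train = to_categorical(np.random.randint(0, 3, size=64), num_classes=3)

model = Sequential([
    Input(shape=(10,)),
    Embedding(input_dim=100, output_dim=16),
    LSTM(8),
    Dense(3, activation="softmax"),
])

# Declare the optimizer and loss, then the number of epochs and batch size.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=1, batch_size=32, verbose=0)
```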

The last step is to save our pre-trained model with word index. We can then use it later to predict new sentences in the future.
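Saving might look like the following; the file names are illustrative, a tiny stand-in model is used so the snippet runs on its own, and the `.keras` format assumes a reasonably recent Keras (older versions used `.h5`):

```python
import json
from keras.models import Sequential, load_model
from keras.layers import Input, Dense

# Tiny stand-in model; in the post this is the trained LSTM.
model = Sequential([Input(shape=(4,)), Dense(3, activation="softmax")])
word_index = {"dobry": 1, "produkt": 2}   # illustrative word index

model.save("sentiment_lstm.keras")        # native Keras format
with open("word_index.json", "w", encoding="utf-8") as f:
    json.dump(word_index, f, ensure_ascii=False)

# Later, both can be reloaded to score new sentences.
restored = load_model("sentiment_lstm.keras")
```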

We can also monitor overfitting using graphs.
Below are plots of the training and testing process (loss and accuracy):
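The curves come from the `history` object returned by `model.fit`; a dummy history dict stands in below so the snippet is self-contained (older Keras versions use the key `acc` instead of `accuracy`):

```python
import matplotlib
matplotlib.use("Agg")            # headless backend for scripts/servers
import matplotlib.pyplot as plt

# Dummy values standing in for history.history from model.fit.
hist = {"loss": [0.9, 0.5, 0.3], "val_loss": [1.0, 0.6, 0.45],
        "accuracy": [0.55, 0.80, 0.92], "val_accuracy": [0.50, 0.76, 0.88]}

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(hist["loss"], label="train")
ax_loss.plot(hist["val_loss"], label="test")
ax_loss.set_title("Loss")
ax_loss.legend()
ax_acc.plot(hist["accuracy"], label="train")
ax_acc.plot(hist["val_accuracy"], label="test")
ax_acc.set_title("Accuracy")
ax_acc.legend()
fig.savefig("training_curves.png")
```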


As the charts above show, the neural network learns quickly and achieves good results.

Applying precision and recall

Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.

Recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.

Both precision and recall are therefore based on an understanding and measure of relevance.
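With scikit-learn both metrics are one call each; the toy labels below are invented for illustration, and `average="weighted"` is assumed because it yields single multi-class scores like those reported in the Results section:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy predictions for three classes (0=negative, 1=neutral, 2=positive).
y_true = [1, 0, 2, 1, 0, 2, 1]
y_pred = [1, 0, 2, 1, 0, 1, 1]

# Weighted averaging combines the per-class scores by class frequency.
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")
print(precision, recall, f1)
```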

Applying and visualizing the confusion matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa).
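Computing one with scikit-learn is a single call (note that scikit-learn's convention is rows = actual classes, columns = predicted ones); the labels below are invented for illustration, and a heatmap of `cm` (e.g. with `seaborn.heatmap`) gives the usual picture:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for three sentiment classes.
y_true = ["neg", "neu", "pos", "pos", "neg", "neu"]
y_pred = ["neg", "neu", "pos", "neu", "neg", "neu"]
labels = ["neg", "neu", "pos"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)   # rows: actual class, columns: predicted class
```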

Wordcloud

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.


The word cloud above shows the most common words in our dataset.

ROC curve

The ROC (receiver operating characteristic) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity or recall.
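The curve and its area come straight from scikit-learn; the binary scores below are invented for illustration, and for the three-class problem one curve per class (one-vs-rest) would be plotted:

```python
from sklearn.metrics import roc_curve, auc

# Toy binary ground truth and predicted scores.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]

# TPR/FPR pairs at every threshold, then the area under the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(roc_auc)
```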

Results

Test accuracy: 97.65%

Test loss: 6.56%

Recall score: 0.9761942865880075

Precision score: 0.9757310701772396

F1 score: 0.9758853714533414

Some statistics

  • Our model was trained on a dataset of about 1,000,000 rows, each containing at least 30 characters.

  • Training time:

For epochs = 12: 220 minutes on a MacBook Pro (i7 2.5 GHz, 16 GB RAM, 512 GB SSD).


Summary

After working through this post you learned:

  • What sentiment analysis is
  • How to use the Keras library in NLP projects
  • How to use Word2vec
  • How to preprocess data for NLP projects
  • How difficult the Polish language is for NLP
  • How to implement an LSTM
  • What a word embedding is
  • How to visualize results with a word cloud
  • How to interpret a confusion matrix, precision, recall, F1 score and the ROC curve

You can download the code from the Ermlab GitHub repository.

How to run our project?

Go to the README in the project repository here.

If you have any questions about the project or this post, please ask your question in the comments.

Resources

  1. Mikolov, Tomas; et al. “Efficient Estimation of Word Representations in Vector Space”. arXiv:1301.3781
  2. Twitter web scraper & sentiment analysis of Polish government parties
  3. Pre-trained word vectors of 30+ languages
  4. An implementation of different neural networks to classify tweet’s sentiments
  5. Predict Sentiment From Movie Reviews Using Deep Learning
  6. http://dsmodels.nlp.ipipan.waw.pl/
  7. Chinese Shopping Reviews sentiment analysis
  8. Prediction of Amazon review scores with a deep recurrent neural network using LSTM modules
  9. Classify the sentiment of sentences from the Rotten Tomatoes dataset
  10. Understanding LSTM Networks
  11. https://keras.io/