Movie Review Classification Using NLP, GridDB, and Python

Introduction

In this tutorial, we will classify movie reviews by sentiment using an NLP model. This is an application-based tutorial where we will be using a pre-trained LSTM model from the AllenNLP library. The outline of the tutorial is as follows:

  1. Setting up the environment
  2. All about the Dataset
  3. Data Preprocessing
  4. Loading the AllenNLP model
  5. Making predictions
  6. Evaluating the results

The full Jupyter notebook can be found on our GitHub page.

Setting up the environment

This tutorial is carried out in Jupyter Notebook (Anaconda version 4.8.3) with Python version 3.8 on the Windows 10 operating system. The following packages need to be installed before you continue with the code:

  1. Pandas
  2. allennlp
  3. allennlp-models
  4. nltk
  5. scikit-learn

You can install the above-mentioned packages using pip or conda. Simply type pip install package-name or conda install package-name in the command line.
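For example, all five packages can be installed with pip in one go:

pip install pandas allennlp allennlp-models nltk scikit-learn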

To access GridDB’s database through Python, the following packages will be required:

  1. GridDB C-client
  2. SWIG (Simplified Wrapper and Interface Generator)
  3. GridDB Python-client

All About the Dataset

We are using the IMDB Sentiment Analysis Dataset which is available publicly on Kaggle. The format of the dataset is pretty simple – it has 2 attributes:

  1. Movie Review (string)
  2. Sentiment Label (int) – Binary

A label ‘0’ represents a negative movie review whereas ‘1’ represents a positive one. Since we will be using a pre-trained model, there is no need to download the train and validation datasets; we will use only the test dataset, which has 5000 instances. Once you download the dataset, put it in the same working directory.

Now let’s go ahead and load the dataset into our Python environment.

Loading the Data

GridDB has made it easy to work with data, as we can directly query the database using its Python client and load the result into a pandas dataframe.

import griddb_python as griddb
import pandas as pd

# Query the movie_review_test container and load it into a dataframe
sql_statement = 'SELECT * FROM movie_review_test'
movie_review_test = pd.read_sql_query(sql_statement, cont)

The cont variable holds the information for the container in which your data is stored. A detailed tutorial on reading from and writing to GridDB using pandas is available on the blog.
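If you have not created the cont handle yet, the sketch below shows one way it might be obtained with the GridDB Python client. The host, port, cluster name, and credentials are placeholders for your own cluster settings, and depending on your client version you may need to query the container directly (e.g. cont.query("select *")) rather than going through pandas.

import griddb_python as griddb

# Placeholder cluster settings -- replace these with your own
factory = griddb.StoreFactory.get_instance()
gridstore = factory.get_store(
    host="239.0.0.1",              # notification address (placeholder)
    port=31999,                    # notification port (placeholder)
    cluster_name="defaultCluster",
    username="admin",
    password="admin"
)

# Fetch the container that stores the test data
cont = gridstore.get_container("movie_review_test")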

Alternatively, if you have the CSV file, you can use the read_csv() function of pandas. The outcome will be the same in both scenarios.

import pandas as pd

movie_review_test = pd.read_csv("movie_review_test.csv")

Let’s print out the first five rows to get a little sneak peek into our data.

movie_review_test.head()
text label
0 I always wrote this series off as being a comp… 0
1 1st watched 12/7/2002 – 3 out of 10(Dir-Steve … 0
2 This movie was so poorly written and directed … 0
3 The most interesting thing about Miryang (Secr… 1
4 when i first read about “berlin am meer” i did… 0
len(movie_review_test)
5000

Data Preprocessing

Data Preprocessing is an important step to avoid getting any unexpected behaviour from the machine learning model. Null values or missing values tend to mess with the overall results if not dealt with properly. Let’s see if our data contains any null values.

movie_review_test.isna().sum()
text     0
label    0
dtype: int64

Great! Fortunately, we have zero null/missing values in our test dataset. However, if you do encounter null values, consider dropping or replacing them before moving further, for example:
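Either of the following lines would handle missing entries, by dropping incomplete rows or by replacing a missing review with an empty string:

# Drop any rows where the review text or label is missing
movie_review_test = movie_review_test.dropna(subset=['text', 'label'])

# Alternatively, replace missing review text with an empty string
movie_review_test['text'] = movie_review_test['text'].fillna('')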

Removing Punctuation and Stop Words

Punctuation and stop words only inflate the total word count of a text. They do not contribute to model learning and mostly act as noise, so it is important to remove them before the training step. In our case, although there is no training step, we still want to make sure that the input we provide is valid and appropriate. You can extend this step to the training dataset as well.

Various libraries provide a list of stop words. We’ll be using the nltk library for this task. Note that the list of stop words varies from package to package, so you might get a slightly different result if you’re using some other library, say spaCy.

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')   # fetch the stop-word list if it is not already downloaded
stop = stopwords.words('english')
len(stop)
179
type(stop)
list

We now have a list of 179 stopwords. You can add some custom words to the list as well. In fact, let’s go ahead and add a couple of words to the stopwords list.

extra_words = ['Yeah', 'Okay']
for word in extra_words:
    if word not in stop:
        stop.append(word)
len(stop)
181

Alternatively, you can use extend() to append all the items of a list at once, as shown below. The if condition inside the for loop just makes sure we’re not adding the same word twice.
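For example, the loop above can be collapsed into a single call that skips words already present:

# Same effect as the loop: append only the words not already in the list
stop.extend(word for word in extra_words if word not in stop)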

movie_review_test['text'] = movie_review_test['text'].apply(
    lambda words: ' '.join(word for word in words.split() if word not in stop)
)
movie_review_test.head()
text label
0 I always wrote series complete stink-fest Jim … 0
1 1st watched 12/7/2002 – 3 10(Dir-Steve Purcell… 0
2 This movie poorly written directed I fell asle… 0
3 The interesting thing Miryang (Secret Sunshine… 1
4 first read “berlin meer” expect much. thought … 0

As we can see, lowercase stop words such as ‘off’, ‘as’, and ‘being’ have been removed. Capitalized words like ‘I’ and ‘The’ survive because the match is case-sensitive, which is one more reason to lowercase the text. Let’s go ahead and do that and remove the punctuation as well.

movie_review_test['text'] = movie_review_test['text'].str.lower()
movie_review_test['text'] = movie_review_test['text'].str.replace(r'[^\w\s]', '', regex=True)
movie_review_test.head()
text label
0 i always wrote series complete stinkfest jim b… 0
1 1st watched 1272002 3 10dirsteve purcell typi… 0
2 this movie poorly written directed i fell asle… 0
3 the interesting thing miryang secret sunshine … 1
4 first read berlin meer expect much thought rig… 0

Now that our data is ready to be used, let’s load up our model and start making some predictions!

Loading the AllenNLP Model

AllenNLP has made available a lot of machine learning models targeting different problem statements. We will be using the GloVe-LSTM binary classifier for our movie review dataset. As per the official documentation, the model achieved an overall accuracy of 87% on the Stanford Sentiment Treebank. A live demo of the model is available on AllenNLP’s official website.

Let’s go ahead and load our predictor.

from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/basic_stanford_sentiment_treebank-2020.06.09.tar.gz")
error loading _jsonnet (this is expected on Windows), treating C:\Users\SHRIPR~2\AppData\Local\Temp\tmpfjmtd8u3\config.json as plain json

Note that these models can be heavy, and if you have a GPU-enabled system, you can simply pass the argument cuda_device=0 to the from_path() call above, for example:
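Assuming a CUDA-capable GPU is available as device 0:

# Load the model onto GPU 0 instead of the CPU
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/basic_stanford_sentiment_treebank-2020.06.09.tar.gz",
    cuda_device=0
)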

To check if the predictor works fine, let’s pass a sample text review and see what kind of output we get.

sample_review = "This movie was so great. I laughed and cried, a lot!"
predictor.predict(sample_review)
{'logits': [...], 'probs': [0.98..., 0.01...], 'token_ids': [...], 'label': '1', 'tokens': [...]}

As we can see, the predictor returns a dictionary with 5 keys – logits, probs, token_ids, label, and tokens. Since we know the sample review is a positive one, we can say that the model correctly returned the label '1'.

In addition to the label, the probs list also tells us the confidence score, or probability, of each label, which in our case is ‘1’ or ‘0’. The first item of the probs list, i.e. the probability of label ‘1’, is 0.98 (or 98%), which implies that the model was 98% confident that the review was positive.
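You can pull out just the fields you need from the returned dictionary:

result = predictor.predict(sample_review)
print(result['label'])   # '1' for a positive review, '0' for a negative one
print(result['probs'])   # probability of each label; the first item corresponds to label '1'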

Now that we know the predictor is working fine, it is time to make some predictions.

Making Predictions

We’ll define a predict function that takes a movie review and returns the label as an integer. Note that the original labels are of type int; it’ll be easier to compare the actual and predicted values if they’re of the same data type.

def predict_review(movie_review):
    # Run the predictor and cast the returned label ('0'/'1') to an int
    return int(predictor.predict(movie_review)['label'])
movie_review_test['predicted_label'] = movie_review_test['text'].apply(predict_review)
movie_review_test.head()
text label predicted_label
0 i always wrote series complete stinkfest jim b… 0 1
1 1st watched 1272002 3 10dirsteve purcell typi… 0 0
2 this movie poorly written directed i fell asle… 0 0
3 the interesting thing miryang secret sunshine … 1 1
4 first read berlin meer expect much thought rig… 0 1

Now we simply need to calculate the accuracy of our model. The prediction cell took 6 minutes to execute for 5000 instances because it was running on a CPU, and these models can be heavy. If you plan to run the code on larger datasets, consider using a GPU.

Evaluating the results

AllenNLP has its own set of metrics for evaluation. For the sake of simplicity, we’ll be using the scikit-learn library. You can find more information on AllenNLP metrics here.

from sklearn.metrics import accuracy_score
actual = movie_review_test['label']
predicted = movie_review_test['predicted_label']
accuracy = accuracy_score(actual, predicted)
accuracy
0.7208

Our model has an overall accuracy of 72% on the test dataset. That’s decent for starters, right? You can save the predictions to a CSV file using the dataframe’s to_csv() method. Go ahead and try the code for yourself.
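For example (the file name here is just a placeholder):

# Save the reviews along with the actual and predicted labels
movie_review_test.to_csv("movie_review_predictions.csv", index=False)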

Happy coding!

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.