Topic Modeling with LDA Using Python and GridDB

In natural language processing, topic modeling assigns topics to a corpus based on the words it contains. Because the text is unlabeled, it is an unsupervised technique. In a world filled with data, it is increasingly important to categorize documents by topic. For example, a company that receives hundreds of reviews needs to know which categories of reviews matter most and which matter least.

Topics act as keywords that describe a document: when we think of a topic related to economics, words like stock market, USD, inflation, and GDP come to mind. Topic models automatically detect such topics from the words appearing in a document, and that is exactly the problem we tackle here.

LDA (Latent Dirichlet Allocation)

"Latent" means hidden, something that has yet to be discovered. "Dirichlet" indicates that the Dirichlet distribution is assumed to govern the distribution of topics across documents and of words within topics. "Allocation" refers to the process of assigning something, in this case topics, to documents.
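To build some intuition for the Dirichlet assumption, the snippet below is a purely illustrative sketch (not part of the tutorial pipeline): it samples topic proportions for a few documents from a Dirichlet distribution with NumPy. Each row sums to 1, which is exactly how LDA views a document, namely as a mixture of topics.

import numpy as np

# Illustrative only: topic mixtures for 3 documents over 5 topics.
# A small alpha (< 1) yields sparse mixtures, i.e. each document is
# dominated by just a few topics, which is the typical LDA assumption.
alpha = [0.1] * 5
doc_topic_mixtures = np.random.dirichlet(alpha, size=3)

print(doc_topic_mixtures)
print(doc_topic_mixtures.sum(axis=1))  # every row sums to 1.0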

In this tutorial, we’ll generate topics from the reviews in the dataset described below. This tells us what users are talking about, what they focus on, and perhaps where app developers should improve.

The outline of the tutorial is as follows:

  1. Prerequisites and Environment setup
  2. Dataset overview
  3. Importing required libraries
  4. Loading the dataset
  5. Data Cleaning and Preprocessing
  6. Building and Training a Machine Learning Model
  7. Conclusion

1. Prerequisites and Environment setup

This tutorial is carried out in Anaconda Navigator (Python version 3.8.3) on the Windows operating system. The following packages need to be installed before you continue with the tutorial:

  1. Pandas
  2. NumPy
  3. Sklearn (scikit-learn)
  4. nltk
  5. re (part of the Python standard library, no separate installation needed)
  6. griddb_python
  7. spacy
  8. gensim

You can install these packages in Conda’s virtual environment using conda install package-name. If you are using Python directly via the terminal/command prompt, pip install package-name will do the job.

GridDB installation

This tutorial covers two ways of loading the dataset: using GridDB and using pandas. To access GridDB from Python, the following packages also need to be installed beforehand:

  1. GridDB C-client
  2. SWIG (Simplified Wrapper and Interface Generator)
  3. GridDB Python Client

2. Dataset Overview

Google Play Store Apps Dataset: web-scraped data of 10,000 Play Store apps for analyzing the Android market.

It can be downloaded from here (https://www.kaggle.com/datasets/lava18/google-play-store-apps/version/5).

3. Importing Required Libraries

import griddb_python as griddb
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

4. Loading the Dataset

Let’s proceed and load the dataset into our notebook.

4.a Using GridDB

Toshiba GridDB™ is a highly scalable NoSQL database best suited for IoT and Big Data. GridDB is built on the principle of offering a versatile data store that is optimized for IoT, highly scalable, tuned for high performance, and highly reliable.

A CSV file can be cumbersome for storing large amounts of data. GridDB serves as a great alternative: it is an open-source, in-memory NoSQL database that scales easily as your data grows. If you are new to GridDB, a tutorial on reading and writing to GridDB can be useful.

Assuming that you have already set up your database, we will now write an SQL query in Python to load our dataset.

sql_statement = ('SELECT * FROM googleplaystore_user_reviews')
dataset = pd.read_sql_query(sql_statement, cont)

Note that the cont variable holds the container where our data is stored. Replace googleplaystore_user_reviews with the name of your own container. More information can be found in the tutorial on reading and writing to GridDB.
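If you have not created that connection yet, a minimal sketch along the lines of the GridDB Python client samples is shown below; the host, port, cluster name, and credentials are placeholders that you must replace with your own GridDB settings.

import griddb_python as griddb

factory = griddb.StoreFactory.get_instance()

# Placeholder connection settings: replace with your own GridDB configuration
gridstore = factory.get_store(
    host="239.0.0.1",
    port=31999,
    cluster_name="defaultCluster",
    username="admin",
    password="admin"
)

# Get the container that holds the review data
cont = gridstore.get_container("googleplaystore_user_reviews")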

When it comes to IoT and Big Data use cases, GridDB clearly stands out among other databases in the Relational and NoSQL space. Overall, GridDB offers multiple reliability features for mission-critical applications that require high availability and data retention.

4.b Using pandas read_csv

Alternatively, pandas can read the CSV file directly with read_csv(), which opens the file and returns its contents as a dataframe. Both methods lead to the same result, since either way the data ends up loaded as a pandas dataframe.

df = pd.read_csv("googleplaystore_user_reviews.csv")
df = df.dropna(subset=["Translated_Review"])
df.head()
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You I like eat delicious food. That’s I’m cooking … Positive 1.00 0.533333
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.25 0.288462
3 10 Best Foods for You Works great especially going grocery store Positive 0.40 0.875000
4 10 Best Foods for You Best idea us Positive 1.00 0.300000
5 10 Best Foods for You Best way Positive 1.00 0.300000

The first five rows printed above with the head() function give a feel for the data: each record is a translated review together with its sentiment and polarity scores. Rows with a missing Translated_Review are dropped, since they carry no text to model.
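Before cleaning the text, it can also be useful to get a quick sense of the dataset’s size and sentiment balance. The optional check below uses only standard pandas calls:

print(df.shape)                        # number of reviews and columns
print(df["Sentiment"].value_counts())  # distribution of sentiment labels
print(df["App"].nunique())             # number of distinct apps reviewed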

5. Data Cleaning and Preprocessing

We clean the data by removing emails, new line characters, and quotes.

# Convert to list
data = df.Translated_Review.values.tolist()
# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]
pprint(data[:1])
['I like eat delicious food. Thats Im cooking food myself, case "10 Best '
 'Foods" helps lot, also "Best Before (Shelf Life)"']

We now need to tokenize each sentence into a list of words, eliminating punctuation and unnecessary characters. We then lemmatize the tokens, reducing each word to its root form (its lemma), similar in spirit to stemming, which strips prefixes and suffixes. The advantage is that we reduce the total number of unique words in the dictionary.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations
        
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    # Keep only tokens with the allowed part-of-speech tags and replace each with its lemma
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

data_words = list(sent_to_words(data))
print(data_words[:1])

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
data_lemmatized = lemmatization(data_words, allowed_postags=["NOUN", "VERB"]) #select noun and verb
print(data_lemmatized[:2])
[['like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'myself', 'case', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'before', 'shelf', 'life']]
['eat food s m cook food case food help lot shelf life', 'help eat exercise basis']

As input, the LDA algorithm requires a document-word matrix, which we create using CountVectorizer.

vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,                        # ignore words appearing in fewer than 10 reviews
                             stop_words='english',             # remove English stop words
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}')  # keep tokens of 3+ characters
data_vectorized = vectorizer.fit_transform(data_lemmatized)
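It is worth checking the shape and sparsity of the resulting document-word matrix; since each short review contains only a handful of vocabulary words, the matrix should be extremely sparse. A small optional check:

# data_vectorized is a sparse matrix of shape (number of reviews, vocabulary size)
print("Shape:", data_vectorized.shape)

# Fraction of non-zero entries; expect a very small percentage
nonzero = data_vectorized.nnz
total = data_vectorized.shape[0] * data_vectorized.shape[1]
print("Non-zero entries: {:.2f}%".format(100 * nonzero / total))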

6. Building and Training a Machine Learning Model

We have everything we need to build a Latent Dirichlet Allocation (LDA) model. In order to construct the LDA model, let’s initialize one and then call fit_transform().

Based on prior knowledge of the dataset, I have set n_components (the number of topics) to 20 in this example. This number will be tuned with a grid search later on.

# Build the LDA model
lda_model = LatentDirichletAllocation(n_components=20,          # number of topics
                                      max_iter=10,
                                      learning_method='online',
                                      random_state=100,
                                      batch_size=128,
                                      evaluate_every=-1,
                                      n_jobs=-1)
lda_output = lda_model.fit_transform(data_vectorized)
print(lda_model)  # Model attributes
LatentDirichletAllocation(learning_method='online', n_components=20, n_jobs=-1,
                          random_state=100)

Diagnose model performance with perplexity and log-likelihood

A model with a high log-likelihood and a low perplexity (exp(-1. * log-likelihood per word)) is considered good.

# Log Likelihood: the higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))
# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))
# See model parameters
pprint(lda_model.get_params())
Log Likelihood:  -2127623.32986425
Perplexity:  1065.3272644698702
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 20,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
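As a rough sanity check (this simply mirrors the formula quoted above rather than being a required step), the perplexity reported by scikit-learn should be close to the exponential of the negative log-likelihood per word, where the word count is the total number of tokens in the document-word matrix:

# exp(-log-likelihood per word) should roughly reproduce lda_model.perplexity()
total_words = data_vectorized.sum()
approx_perplexity = np.exp(-lda_model.score(data_vectorized) / total_words)
print("Approximate perplexity:", approx_perplexity)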

Use GridSearch to determine the best LDA model.

n_components (the number of topics) is the most important tuning parameter for LDA models. I will also search over learning_decay, which controls the learning rate in the online learning method. Beyond these, learning_offset (which downweights early iterations and should be > 1) and max_iter can also be considered as search parameters, although adding them can consume a lot of time and resources.

# Define Search Param
search_params = {'n_components': [10, 20], 'learning_decay': [0.5, 0.9]}
# Init the Model
lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(data_vectorized)
GridSearchCV(error_score='raise',
             estimator=LatentDirichletAllocation(learning_method=None,
                                                 n_jobs=1),
             n_jobs=1,
             param_grid={'learning_decay': [0.5, 0.9],
                         'n_components': [10, 20]},
             return_train_score='warn')
# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))
Best Model's Params:  {'learning_decay': 0.9, 'n_components': 10}
Best Log Likelihood Score:  -432616.36669435585
Model Perplexity:  764.0439579711182
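To see how all the candidate models compare rather than just the winner, the cross-validation results collected by GridSearchCV can be placed in a dataframe. This optional sketch assumes the model object fitted above:

# Mean cross-validated log-likelihood for every parameter combination tried
cv_results = pd.DataFrame(model.cv_results_)
print(cv_results[["param_n_components", "param_learning_decay", "mean_test_score"]])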

A logical way to determine which topic a document belongs to is to see which topic contributed the most to it and assign the document to that topic. The table below shows each document’s topic contributions, with the most dominant topic in its own column.

# Create the Document-Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]
docnames = ["Doc" + str(i) for i in range(len(data))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic["dominant_topic"] = dominant_topic
# Styling: highlight and bold topic weights above 0.1
def color_green(val):
    color = "green" if val > .1 else "black"
    return "color: {col}".format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return "font-weight: {weight}".format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics
  Topic0 Topic1 Topic2 Topic3 Topic4 Topic5 Topic6 Topic7 Topic8 Topic9 dominant_topic
Doc0 0.010000 0.010000 0.010000 0.760000 0.010000 0.010000 0.010000 0.010000 0.010000 0.160000 3
Doc1 0.020000 0.020000 0.020000 0.820000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 3
Doc2 0.030000 0.030000 0.030000 0.030000 0.030000 0.030000 0.770000 0.030000 0.030000 0.030000 6
Doc3 0.550000 0.050000 0.050000 0.050000 0.050000 0.050000 0.050000 0.050000 0.050000 0.050000 0
Doc4 0.050000 0.050000 0.050000 0.550000 0.050000 0.050000 0.050000 0.050000 0.050000 0.050000 3
Doc5 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0
Doc6 0.030000 0.030000 0.700000 0.030000 0.030000 0.030000 0.030000 0.030000 0.030000 0.030000 2
Doc7 0.030000 0.030000 0.030000 0.030000 0.030000 0.030000 0.250000 0.030000 0.030000 0.550000 9
Doc8 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0
Doc9 0.010000 0.010000 0.010000 0.010000 0.790000 0.120000 0.010000 0.010000 0.010000 0.010000 4
Doc10 0.850000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0
Doc11 0.020000 0.020000 0.220000 0.620000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 3
Doc12 0.030000 0.030000 0.030000 0.520000 0.030000 0.270000 0.030000 0.030000 0.030000 0.030000 3
Doc13 0.020000 0.020000 0.020000 0.380000 0.020000 0.020000 0.020000 0.020000 0.020000 0.460000 9
Doc14 0.850000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0.020000 0
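It is also informative to count how many reviews fall under each dominant topic; a heavily skewed distribution can be a hint that the number of topics should be revisited. A short optional check:

# Number of documents assigned to each dominant topic
topic_distribution = df_document_topic["dominant_topic"].value_counts().sort_index()
print(topic_distribution)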
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names_out()
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()
aap abandon ability abuse accept access accessory accident accommodation accomplish yardage yay year yesterday yoga youtube zip zombie zone zoom
Topic0 0.102649 0.102871 56.001281 0.103583 0.107420 0.132561 12.712732 0.102863 0.102585 0.102685 8.642076 0.102612 153.232551 0.102522 0.496217 0.106992 0.211912 0.140018 0.177780 0.104975
Topic1 0.101828 0.102233 1.148602 0.102127 0.103543 558.310169 0.102997 2.594090 0.102651 0.110221 0.525860 0.102106 6.075186 20.135445 0.102284 0.106246 0.103076 0.108334 0.122234 0.102741
Topic2 0.103196 0.107593 0.107848 0.104019 0.103053 0.126004 0.106085 0.117876 9.979474 0.108507 0.366334 0.102367 5.066123 0.103931 31.039314 0.107878 0.102303 0.102200 0.128228 0.104907
Topic3 0.102564 0.107112 2.022397 12.968156 0.102692 0.130003 0.113959 1.838441 0.101579 8.345948 0.105286 0.103549 7.478397 0.104231 24.234774 0.118099 0.123212 0.128494 29.086953 0.103109
Topic4 0.102634 0.102345 76.332226 0.102486 41.139452 0.118419 0.115930 0.142032 0.103316 0.104292 0.409518 0.102979 737.692499 0.600751 0.116092 0.102262 0.108881 0.102011 0.115584 0.513135

5 rows × 2273 columns

# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords
Word 0 Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 Word 7 Word 8 Word 9 Word 10 Word 11 Word 12 Word 13 Word 14
Topic 0 phone make add app think picture version month work minute thing look list home number
Topic 1 email send news check price bug access color customer order make message service app camera
Topic 2 love app look date book lose guy family switch music recipe information quality feel change
Topic 3 fix way day money need buy star make lot start spend help rate like track
Topic 4 use pay want account user year fix note log error recommend problem app star option
Topic 5 feature thank hate learn photo text job search suck help tab tool weight weather group
Topic 6 work screen video need notification device wish thing option set store choose type food item
Topic 7 game play level fun player watch make enjoy start graphic thing win character score lose
Topic 8 time update try review crash know let problem page load waste want app need version
Topic 9 say card people time work tell download help datum issue happen support thing know want

In this step, we infer a label for each topic from its keywords. For Topic 3, for example, people mention “money”, “buy”, and “spend”, so we conclude that this topic is about “Card Payment”. Next, we add the 10 topic labels we inferred to the dataframe.

Topics = ["Update Version/Fix Crash Problem","Download/Internet Access","Learn and Share","Card Payment","Notification/Support", 
          "Account Problem", "Device/Design/Password", "Language/Recommend/Screen Size", "Graphic/ Game Design/ Level and Coin", "Photo/Search"]
df_topic_keywords["Topics"]=Topics
df_topic_keywords
Word 0 Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 Word 7 Word 8 Word 9 Word 10 Word 11 Word 12 Word 13 Word 14 Topics
Topic 0 phone make add app think picture version month work minute thing look list home number Update Version/Fix Crash Problem
Topic 1 email send news check price bug access color customer order make message service app camera Download/Internet Access
Topic 2 love app look date book lose guy family switch music recipe information quality feel change Learn and Share
Topic 3 fix way day money need buy star make lot start spend help rate like track Card Payment
Topic 4 use pay want account user year fix note log error recommend problem app star option Notification/Support
Topic 5 feature thank hate learn photo text job search suck help tab tool weight weather group Account Problem
Topic 6 work screen video need notification device wish thing option set store choose type food item Device/Design/Password
Topic 7 game play level fun player watch make enjoy start graphic thing win character score lose Language/Recommend/Screen Size
Topic 8 time update try review crash know let problem page load waste want app need version Graphic/ Game Design/ Level and Coin
Topic 9 say card people time work tell download help datum issue happen support thing know want Photo/Search

To predict the topic of a new piece of text, you need to run it through the same routine of transformations used on the training data. In our case, the order is: sent_to_words() –> lemmatization() –> vectorizer.transform() –> best_lda_model.transform(). These transformations must be applied in the same order, so to simplify things let’s combine them into a predict_topic() function.

# Define a function to predict the topic for a given text document
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def predict_topic(text, nlp=nlp):
    global sent_to_words
    global lemmatization
    # Step 1: Clean with simple_preprocess
    mytext_2 = list(sent_to_words(text))
    # Step 2: Lemmatize
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    # Step 3: Vectorize transform
    mytext_4 = vectorizer.transform(mytext_3)
    # Step 4: LDA transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), 1:14].values.tolist()
    # Step 5: Infer the topic label
    infer_topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), -1]
    return infer_topic, topic, topic_probability_scores
# Predict the topic
mytext = ["Very Useful in diabetes age 30. I need control sugar. thanks Good deal"]
infer_topic, topic, prob_scores = predict_topic(text = mytext)
print(topic)
print(infer_topic)
['way', 'day', 'money', 'need', 'buy', 'star', 'make', 'lot', 'start', 'spend', 'help', 'rate', 'like']
Card Payment

Finally, we predict topics for the reviews in the original dataset.

def apply_predict_topic(text):
    text = [text]
    infer_topic, topic, prob_scores = predict_topic(text = text)
    return(infer_topic)
df["Topic_key_word"]= df['Translated_Review'].apply(apply_predict_topic)
df.head()
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity Topic_key_word
0 10 Best Foods for You I like eat delicious food. That’s I’m cooking … Positive 1.00 0.533333 Card Payment
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.25 0.288462 Card Payment
3 10 Best Foods for You Works great especially going grocery store Positive 0.40 0.875000 Device/Design/Password
4 10 Best Foods for You Best idea us Positive 1.00 0.300000 Notification/Support
5 10 Best Foods for You Best way Positive 1.00 0.300000 Card Payment

7. Conclusion

In this tutorial, we used Google Play Store reviews to generate topics with LDA. We examined two ways to import the data: (1) GridDB and (2) pandas read_csv. For large datasets, GridDB provides an excellent way to load data into your notebook, as it is open-source and highly scalable. Download GridDB today!

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.
