Topic Modeling with LDA Using Python and GridDB

In natural language processing, topic modeling discovers the topics present in a corpus based on the words its documents contain. Because text data is usually unlabeled, topic modeling is an unsupervised technique. As the volume of text data grows, categorizing documents by topic becomes increasingly important. For example, a company that receives hundreds of reviews needs to know which categories of reviews matter most and which matter least.

Topics can be described by keywords. For example, when we think of a topic related to economics, we think of words such as stock market, USD, inflation, and GDP. Topic models automatically detect topics based on the words that appear in a document. The problem we will tackle here is topic modeling.

LDA - (Latent Dirichlet Allocation)

The word “latent” means hidden: the topics are not observed directly and have yet to be discovered. “Dirichlet” indicates that the model assumes Dirichlet distributions govern both the distribution of topics in each document and the distribution of words in each topic. “Allocation” refers to the process of assigning something, in this case topics to documents and words.
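
To make the role of the Dirichlet distribution more concrete, here is a minimal sketch that draws topic mixtures for a single document using only the standard library. The function name sample_dirichlet and the alpha values are illustrative assumptions, not part of the tutorial's pipeline. A small alpha tends to concentrate a document on a few topics, while a large alpha spreads it across all of them.

```python
import random

def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) distribution."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
sparse_mix = sample_dirichlet(alpha=0.1, n_topics=5, rng=rng)
flat_mix = sample_dirichlet(alpha=10.0, n_topics=5, rng=rng)

print([round(p, 2) for p in sparse_mix])  # typically dominated by one or two topics
print([round(p, 2) for p in flat_mix])    # typically close to uniform
```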

In this tutorial, we’ll generate topics from the user reviews in the dataset described below. This tells us what users are talking about, what they focus on, and perhaps where app developers should improve.

The outline of the tutorial is as follows:

Prerequisites and Environment setup
Dataset overview
Importing required libraries
Loading the dataset
Data Cleaning and Preprocessing
Building and Training a Machine Learning Model
Conclusion

1. Prerequisites and Environment setup

This tutorial is carried out in Anaconda Navigator (Python version – 3.8.3) on Windows Operating System. The following packages need to be installed before you continue with the tutorial –

Pandas
NumPy
scikit-learn (imported as sklearn)
nltk
re (part of the Python standard library; no installation needed)
griddb_python
spacy
gensim

You can install these packages in a Conda virtual environment using conda install package-name. If you are using Python directly from a terminal or command prompt, pip install package-name will do the job.
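
Before continuing, it can help to confirm which packages are already available. The sketch below is one possible check using the standard library; the list REQUIRED and the helper missing_packages are names introduced here for illustration. It only reports what is missing and does not install anything.

```python
import importlib.util

# Import names for the third-party packages listed above.
# scikit-learn is imported as "sklearn"; "re" ships with Python itself.
REQUIRED = ["pandas", "numpy", "sklearn", "nltk", "spacy", "gensim", "griddb_python"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```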

GridDB installation

This tutorial covers two methods for loading the dataset: GridDB and pandas. To access GridDB from Python, the following packages also need to be installed beforehand:

GridDB C-client
SWIG (Simplified Wrapper and Interface Generator)
GridDB Python Client

2. Dataset Overview

Google Play Store Apps Dataset: web-scraped data of 10,000 Play Store apps for analyzing the Android market. This tutorial uses the user reviews file, googleplaystore_user_reviews.csv.

It can be downloaded from here (https://www.kaggle.com/datasets/lava18/google-play-store-apps/version/5).

3. Importing Required Libraries

import griddb_python as griddb
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

4. Loading the Dataset

Let’s proceed and load the dataset into our notebook.

4.a Using GridDB

Toshiba GridDB™ is a highly scalable NoSQL database best suited for IoT and Big Data. GridDB is designed to offer a versatile data store that is optimized for IoT, provides high scalability, is tuned for high performance, and ensures high reliability.

Storing large amounts of data in a CSV file can be cumbersome. GridDB is a good alternative because it is an open-source, in-memory NoSQL database that scales well. If you are new to GridDB, a tutorial on reading and writing to GridDB can be useful.

Assuming that you have already set up your database, we will now write the SQL query in Python to load our dataset.

sql_statement = ('SELECT * FROM googleplaystore_user_reviews')
dataset = pd.read_sql_query(sql_statement, cont)

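Note that the cont variable holds the container information where our data is stored. Replace googleplaystore_user_reviews with the name of your container. The rest of the tutorial refers to the dataframe as df, so assign dataset to df if you load the data this way. More information can be found in the tutorial on reading and writing to GridDB.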

When it comes to IoT and Big Data use cases, GridDB clearly stands out among other databases in the Relational and NoSQL space. Overall, GridDB offers multiple reliability features for mission-critical applications that require high availability and data retention.

4.b Using pandas read_csv

pandas provides the read_csv() function, which reads a CSV file and returns its contents as a dataframe. Both of the above methods lead to the same result, because either way the data ends up in a pandas dataframe. Here we also drop the rows that have no review text.

df = pd.read_csv("googleplaystore_user_reviews.csv")
df = df.dropna(subset=["Translated_Review"])

df.head()

	App	Translated_Review	Sentiment	Sentiment_Polarity	Sentiment_Subjectivity
0	10 Best Foods for You	I like eat delicious food. That’s I’m cooking …	Positive	1.00	0.533333
1	10 Best Foods for You	This help eating healthy exercise regular basis	Positive	0.25	0.288462
3	10 Best Foods for You	Works great especially going grocery store	Positive	0.40	0.875000
4	10 Best Foods for You	Best idea us	Positive	1.00	0.300000
5	10 Best Foods for You	Best way	Positive	1.00	0.300000

Once the dataset is loaded, we can explore it. The output above shows the first 5 rows of the dataset, printed with the head() function.
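
The row labels in the output above jump from 1 to 3, presumably because dropna() removed a row without review text while keeping the original labels of the remaining rows. The sketch below reproduces this on a hypothetical three-row sample that only borrows the column names; the app name and review texts are made up.

```python
import io
import pandas as pd

# Hypothetical three-row sample with the same column names as the dataset.
csv_text = """App,Translated_Review,Sentiment
Demo App,Works great,Positive
Demo App,,
Demo App,Best way,Positive
"""

sample = pd.read_csv(io.StringIO(csv_text))
cleaned = sample.dropna(subset=["Translated_Review"])

print(list(cleaned.index))                         # original labels are kept
print(list(cleaned.reset_index(drop=True).index))  # labels renumbered from 0
```

Calling reset_index(drop=True) is optional here; it only matters if later code relies on consecutive row labels.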

5. Data Cleaning and Preprocessing

Clean the data by removing email addresses, newline characters, and single quotes.

# Convert to list
data = df.Translated_Review.values.tolist()
# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]
pprint(data[:1])

['I like eat delicious food. Thats Im cooking food myself, case "10 Best '
 'Foods" helps lot, also "Best Before (Shelf Life)"']
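
To see what each substitution does, the three patterns can be applied to a single string. This is a small sketch on a made-up review; the helper name clean_review is introduced here for illustration only.

```python
import re

def clean_review(text):
    """Apply the three substitutions used above, in the same order."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop any token containing "@"
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace runs, including newlines
    text = re.sub(r"\'", "", text)          # drop straight single quotes
    return text

# Hypothetical review text, not taken from the dataset.
raw = "Contact me at john@example.com\nIt's   great!"
print(clean_review(raw))  # Contact me at Its great!
```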

We now need to tokenize each sentence into a list of words, eliminating all punctuation and unnecessary characters. We then lemmatize the words. Lemmatization reduces a word to its dictionary form, known as the lemma; stemming is a cruder alternative that strips prefixes and suffixes to obtain a word stem. The advantage is that we reduce the total number of unique words in the dictionary.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes accent marks; punctuation is dropped during tokenization

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

data_words = list(sent_to_words(data))
print(data_words[:1])

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
data_lemmatized = lemmatization(data_words, allowed_postags=["NOUN", "VERB"]) #select noun and verb
print(data_lemmatized[:2])

[['like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'myself', 'case', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'before', 'shelf', 'life']]
['eat food s m cook food case food help lot shelf life', 'help eat exercise basis']
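
The token list above no longer contains "I" or "10". The sketch below is a rough standard-library approximation of the tokenization step, written to illustrate why; it is not gensim's own code. It assumes that tokens are lowercased runs of letters and that tokens shorter than 2 or longer than 15 characters are discarded, which matches the default length limits of simple_preprocess as far as I know. The function name rough_preprocess is hypothetical.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    """Stdlib approximation of the tokenization step (not gensim's own code)."""
    # Strip accent marks, which is the effect the deacc=True flag is meant to have.
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Keep runs of letters only, so digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", no_accents)
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

print(rough_preprocess('I like eat delicious food. "10 Best Foods" café'))
# ['like', 'eat', 'delicious', 'food', 'best', 'foods', 'cafe']
```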

As input, the LDA algorithm requires a document-word matrix, which we build using CountVectorizer.

vectorizer = CountVectorizer(analyzer='word',       
                             min_df=10,
                             stop_words='english',             
                             lowercase=True,                   
                             token_pattern='[a-zA-Z0-9]{3,}') 
data_vectorized = vectorizer.fit_transform(data_lemmatized)

6. Machine Learning Model Building

We have everything we need to build a Latent Dirichlet Allocation (LDA) model. In order to construct the LDA model, let’s initialize one and then call fit_transform().

Based on my prior knowledge of the dataset, I have set n_components (the number of topics) to 20 in this example. This number will be tuned with grid search later on.

# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=20, max_iter=10, learning_method='online',
                                      random_state=100, batch_size=128, evaluate_every=-1,
                                      n_jobs=-1)
lda_output = lda_model.fit_transform(data_vectorized)
print(lda_model)  # Model attributes

LatentDirichletAllocation(learning_method='online', n_components=20, n_jobs=-1,
                          random_state=100)
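
The value returned by fit_transform() is the document-topic matrix: one row per document and one column per topic, with each row summing to 1. The following minimal sketch shows this on a hypothetical six-document corpus; the corpus, the choice of two topics, and the batch learning method are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus with two obvious themes (games and payments).
docs = ["game play level fun", "play game win score", "pay card money account",
        "card pay money buy", "game level player fun", "account money pay card"]

counts = CountVectorizer().fit_transform(docs)
demo_lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                     random_state=0)
doc_topic = demo_lda.fit_transform(counts)   # one row per document, one column per topic

print(np.round(doc_topic, 2))
print(doc_topic.sum(axis=1))                 # each row sums to 1
```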

Diagnose model performance with perplexity and log-likelihood

A good model has a high log-likelihood and a low perplexity, where perplexity = exp(-1. * log-likelihood per word).

# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))
# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))
# See model parameters
pprint(lda_model.get_params())

Log Likelihood:  -2127623.32986425
Perplexity:  1065.3272644698702
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 20,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
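
The relationship between the two numbers can be checked with a few lines of arithmetic. This sketch assumes that "per word" means dividing by the total number of word tokens in the document-word matrix, as the formula above suggests; the function name is introduced here for illustration.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood per word token)."""
    return math.exp(-log_likelihood / n_tokens)

# Hypothetical check: if every one of 100 tokens had probability 1/8,
# the total log-likelihood is 100 * ln(1/8) and the perplexity is about 8.
toy_ll = 100 * math.log(1 / 8)
print(perplexity_from_log_likelihood(toy_ll, 100))

# Inverting the formula with the values printed above gives the token
# count that the two reported numbers imply.
reported_ll = -2127623.32986425
reported_perplexity = 1065.3272644698702
implied_tokens = -reported_ll / math.log(reported_perplexity)
print(round(implied_tokens))
```

Read this way, a perplexity of 8 means the model is, on average, as uncertain as if it were choosing uniformly among 8 words for each token.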

Use GridSearch to determine the best LDA model.

n_components (the number of topics) is the most important tuning parameter for LDA models. I will also search over learning_decay, which controls the learning rate. In addition, learning_offset (which downweights early iterations and should be greater than 1) and max_iter can be considered as search parameters. This process can consume a lot of time and resources.

# Define Search Param
search_params = {'n_components': [10, 20], 'learning_decay': [0.5, 0.9]}
# Init the Model
lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(data_vectorized)

GridSearchCV(error_score='raise',
             estimator=LatentDirichletAllocation(learning_method=None,
                                                 n_jobs=1),
             n_jobs=1,
             param_grid={'learning_decay': [0.5, 0.9],
                         'n_components': [10, 20]},
             return_train_score='warn')

# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.9, 'n_components': 10}
Best Log Likelihood Score:  -432616.36669435585
Model Perplexity:  764.0439579711182
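
When no scoring argument is given, GridSearchCV uses the estimator's own score() method on each held-out fold and averages the results, so the best score is computed on a fraction of the data and is not directly comparable with the log-likelihood of the full dataset. To compare all parameter combinations rather than only the best one, the cv_results_ attribute can be turned into a dataframe. The sketch below does this on a hypothetical ten-document corpus with a smaller grid; the corpus and the grid values are assumptions chosen so the example runs quickly.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical mini-corpus; ten documents so a 5-fold split has two per fold.
docs = ["game play level fun", "play game win score", "pay card money account",
        "card pay money buy", "game level player fun", "account money pay card",
        "update crash fix bug", "fix bug update version", "crash update load page",
        "play fun game score"]
counts = CountVectorizer().fit_transform(docs)

search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, learning_method="online", random_state=0),
    param_grid={"n_components": [2, 3], "learning_decay": [0.5, 0.9]},
    cv=5,
)
search.fit(counts)

# One row per parameter combination, best rank first.
results = pd.DataFrame(search.cv_results_)
columns = ["param_n_components", "param_learning_decay", "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```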

A logical way to decide which topic a document belongs to is to find the topic that contributes the most to it and assign the document to that topic. The table below highlights the major topics for each document and stores the most dominant topic in its own column.

# Create Document — Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]
docnames = ["Doc" + str(i) for i in range(len(data))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic["dominant_topic"] = dominant_topic
# Styling
def color_green(val):
 color = "green" if val > .1 else "black"
 return "color: {col}".format(col=color)
def make_bold(val):
 weight = 700 if val > .1 else 400
 return "font-weight: {weight}".format(weight=weight)
# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

	Topic0	Topic1	Topic2	Topic3	Topic4	Topic5	Topic6	Topic7	Topic8	Topic9	dominant_topic
Doc0	0.010000	0.010000	0.010000	0.760000	0.010000	0.010000	0.010000	0.010000	0.010000	0.160000	3
Doc1	0.020000	0.020000	0.020000	0.820000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	3
Doc2	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	0.770000	0.030000	0.030000	0.030000	6
Doc3	0.550000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	0
Doc4	0.050000	0.050000	0.050000	0.550000	0.050000	0.050000	0.050000	0.050000	0.050000	0.050000	3
Doc5	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0
Doc6	0.030000	0.030000	0.700000	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	2
Doc7	0.030000	0.030000	0.030000	0.030000	0.030000	0.030000	0.250000	0.030000	0.030000	0.550000	9
Doc8	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0.100000	0
Doc9	0.010000	0.010000	0.010000	0.010000	0.790000	0.120000	0.010000	0.010000	0.010000	0.010000	4
Doc10	0.850000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0
Doc11	0.020000	0.020000	0.220000	0.620000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	3
Doc12	0.030000	0.030000	0.030000	0.520000	0.030000	0.270000	0.030000	0.030000	0.030000	0.030000	3
Doc13	0.020000	0.020000	0.020000	0.380000	0.020000	0.020000	0.020000	0.020000	0.020000	0.460000	9
Doc14	0.850000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0.020000	0
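
Doc5 and Doc8 in the table above have the same weight for every topic, yet they are assigned to topic 0. This happens because np.argmax returns the first index when several values tie. The sketch below reproduces this with three rows copied from the table and adds a simple flag for such rows; the flag is an illustrative addition, not part of the tutorial's code.

```python
import numpy as np

# Three rows copied from the table above (Doc0, Doc5, Doc7), ten topics each.
doc_topic = np.array([
    [0.01, 0.01, 0.01, 0.76, 0.01, 0.01, 0.01, 0.01, 0.01, 0.16],
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.25, 0.03, 0.03, 0.55],
])

dominant = np.argmax(doc_topic, axis=1)
# A row whose largest weight equals the uniform share 1/n_topics carries no topic signal.
uninformative = np.isclose(doc_topic.max(axis=1), 1.0 / doc_topic.shape[1])

print(dominant.tolist())       # [3, 0, 9]
print(uninformative.tolist())  # [False, True, False]
```

A uniform row usually means that none of the document's words survived preprocessing and vectorization, so its topic 0 label should not be read as a real assignment.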

# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names_out()
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()

	aap	abandon	ability	abuse	accept	access	accessory	accident	accommodation	accomplish	…	yardage	yay	year	yesterday	yoga	youtube	zip	zombie	zone	zoom
Topic0	0.102649	0.102871	56.001281	0.103583	0.107420	0.132561	12.712732	0.102863	0.102585	0.102685	…	8.642076	0.102612	153.232551	0.102522	0.496217	0.106992	0.211912	0.140018	0.177780	0.104975
Topic1	0.101828	0.102233	1.148602	0.102127	0.103543	558.310169	0.102997	2.594090	0.102651	0.110221	…	0.525860	0.102106	6.075186	20.135445	0.102284	0.106246	0.103076	0.108334	0.122234	0.102741
Topic2	0.103196	0.107593	0.107848	0.104019	0.103053	0.126004	0.106085	0.117876	9.979474	0.108507	…	0.366334	0.102367	5.066123	0.103931	31.039314	0.107878	0.102303	0.102200	0.128228	0.104907
Topic3	0.102564	0.107112	2.022397	12.968156	0.102692	0.130003	0.113959	1.838441	0.101579	8.345948	…	0.105286	0.103549	7.478397	0.104231	24.234774	0.118099	0.123212	0.128494	29.086953	0.103109
Topic4	0.102634	0.102345	76.332226	0.102486	41.139452	0.118419	0.115930	0.142032	0.103316	0.104292	…	0.409518	0.102979	737.692499	0.600751	0.116092	0.102262	0.108881	0.102011	0.115584	0.513135

5 rows × 2273 columns

# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

	Word 0	Word 1	Word 2	Word 3	Word 4	Word 5	Word 6	Word 7	Word 8	Word 9	Word 10	Word 11	Word 12	Word 13	Word 14
Topic 0	phone	make	add	app	think	picture	version	month	work	minute	thing	look	list	home	number
Topic 1	email	send	news	check	price	bug	access	color	customer	order	make	message	service	app	camera
Topic 2	love	app	look	date	book	lose	guy	family	switch	music	recipe	information	quality	feel	change
Topic 3	fix	way	day	money	need	buy	star	make	lot	start	spend	help	rate	like	track
Topic 4	use	pay	want	account	user	year	fix	note	log	error	recommend	problem	app	star	option
Topic 5	feature	thank	hate	learn	photo	text	job	search	suck	help	tab	tool	weight	weather	group
Topic 6	work	screen	video	need	notification	device	wish	thing	option	set	store	choose	type	food	item
Topic 7	game	play	level	fun	player	watch	make	enjoy	start	graphic	thing	win	character	score	lose
Topic 8	time	update	try	review	crash	know	let	problem	page	load	waste	want	app	need	version
Topic 9	say	card	people	time	work	tell	download	help	datum	issue	happen	support	thing	know	want
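
The keyword table relies on one idiom inside show_topics(): negating the weights before calling argsort() so that the indices come back from the largest weight to the smallest. Here is a minimal sketch with a made-up five-word vocabulary and made-up weights for a single topic.

```python
import numpy as np

# Hypothetical vocabulary and weights for a single topic.
keywords = np.array(["crash", "game", "level", "pay", "play"])
topic_weights = np.array([0.2, 9.5, 4.1, 0.3, 7.8])

# Negating the weights makes argsort return indices from largest to smallest.
top_locs = (-topic_weights).argsort()[:3]
print(keywords.take(top_locs).tolist())  # ['game', 'play', 'level']
```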

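In this step, we assign a label to each topic based on its keywords. For Topic 3, reviewers mention “money”, “buy”, and “spend”, so we conclude that this topic is about “Card Payment”. Next, add the 10 topic labels we inferred to the dataframe. Because the labels are assigned by hand, check each one against its keyword row before reusing the list; the topics can change between runs.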

Topics = ["Update Version/Fix Crash Problem","Download/Internet Access","Learn and Share","Card Payment","Notification/Support", 
          "Account Problem", "Device/Design/Password", "Language/Recommend/Screen Size", "Graphic/ Game Design/ Level and Coin", "Photo/Search"]
df_topic_keywords["Topics"]=Topics
df_topic_keywords

	Word 0	Word 1	Word 2	Word 3	Word 4	Word 5	Word 6	Word 7	Word 8	Word 9	Word 10	Word 11	Word 12	Word 13	Word 14	Topics
Topic 0	phone	make	add	app	think	picture	version	month	work	minute	thing	look	list	home	number	Update Version/Fix Crash Problem
Topic 1	email	send	news	check	price	bug	access	color	customer	order	make	message	service	app	camera	Download/Internet Access
Topic 2	love	app	look	date	book	lose	guy	family	switch	music	recipe	information	quality	feel	change	Learn and Share
Topic 3	fix	way	day	money	need	buy	star	make	lot	start	spend	help	rate	like	track	Card Payment
Topic 4	use	pay	want	account	user	year	fix	note	log	error	recommend	problem	app	star	option	Notification/Support
Topic 5	feature	thank	hate	learn	photo	text	job	search	suck	help	tab	tool	weight	weather	group	Account Problem
Topic 6	work	screen	video	need	notification	device	wish	thing	option	set	store	choose	type	food	item	Device/Design/Password
Topic 7	game	play	level	fun	player	watch	make	enjoy	start	graphic	thing	win	character	score	lose	Language/Recommend/Screen Size
Topic 8	time	update	try	review	crash	know	let	problem	page	load	waste	want	app	need	version	Graphic/ Game Design/ Level and Coin
Topic 9	say	card	people	time	work	tell	download	help	datum	issue	happen	support	thing	know	want	Photo/Search

Once the topic model is built, new text must pass through the same sequence of transformations before its topic can be predicted. In our case the order is: sent_to_words() -> lemmatization() -> vectorizer.transform() -> best_lda_model.transform(). To simplify this, let’s combine these steps into a predict_topic() function.

# Define function to predict topic for a given text document.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def predict_topic(text, nlp=nlp):
    # Step 1: Clean with simple_preprocess
    mytext_2 = list(sent_to_words(text))
    # Step 2: Lemmatize
    # Note: the training data above was lemmatized with NOUN and VERB only
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    # Step 3: Vectorize transform
    mytext_4 = vectorizer.transform(mytext_3)
    # Step 4: LDA Transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    # Keywords of the dominant topic (columns Word 1 to Word 13)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), 1:14].values.tolist()
    # Step 5: Infer Topic (the last column holds the label)
    infer_topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), -1]
    return infer_topic, topic, topic_probability_scores
# Predict the topic
mytext = ["Very Useful in diabetes age 30. I need control sugar. thanks Good deal"]
infer_topic, topic, prob_scores = predict_topic(text = mytext)
print(topic)
print(infer_topic)

['way', 'day', 'money', 'need', 'buy', 'star', 'make', 'lot', 'start', 'spend', 'help', 'rate', 'like']
Card Payment
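
As an alternative way to keep the vectorizing and LDA steps in a fixed order, scikit-learn's Pipeline can bundle them into one object. This is a sketch of that option, not the approach used in this tutorial; the corpus is hypothetical, and lemmatization would still need to be applied to the text beforehand.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus of already lemmatized reviews.
docs = ["game play level fun", "play game win score", "pay card money account",
        "card pay money buy", "game level player fun", "account money pay card"]

# The pipeline stores the fitted vectorizer and model together, so new text
# always passes through both steps in the order used for training.
topic_pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, learning_method="batch", random_state=0),
)
topic_pipeline.fit(docs)

scores = topic_pipeline.transform(["pay money card"])
print(scores.round(2))       # topic probabilities for the new review
print(int(scores.argmax()))  # index of the dominant topic
```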

Final predictions for the reviews in the original dataset.

def apply_predict_topic(text):
    text = [text]
    infer_topic, topic, prob_scores = predict_topic(text = text)
    return(infer_topic)
df["Topic_key_word"]= df['Translated_Review'].apply(apply_predict_topic)
df.head()

	App	Translated_Review	Sentiment	Sentiment_Polarity	Sentiment_Subjectivity	Topic_key_word
0	10 Best Foods for You	I like eat delicious food. That’s I’m cooking …	Positive	1.00	0.533333	Card Payment
1	10 Best Foods for You	This help eating healthy exercise regular basis	Positive	0.25	0.288462	Card Payment
3	10 Best Foods for You	Works great especially going grocery store	Positive	0.40	0.875000	Device/Design/Password
4	10 Best Foods for You	Best idea us	Positive	1.00	0.300000	Notification/Support
5	10 Best Foods for You	Best way	Positive	1.00	0.300000	Card Payment
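
With a topic label on every review, a natural next step is to count how often each label occurs. The sketch below does this for the five labels shown above using the standard library; on the full dataframe, df["Topic_key_word"].value_counts() should give the same kind of summary.

```python
from collections import Counter

# Labels copied from the five rows shown above.
labels = ["Card Payment", "Card Payment", "Device/Design/Password",
          "Notification/Support", "Card Payment"]

topic_counts = Counter(labels)
for label, count in topic_counts.most_common():
    print(f"{label}: {count}")
```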

7. Conclusion

In this tutorial, we used Google Play Store reviews to generate topics with LDA. We examined two ways to import our data: (1) GridDB and (2) pandas read_csv. For large datasets, GridDB is a good alternative for importing data into your notebook because it is open-source and highly scalable. Download GridDB today!

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.