Overview¶
In this analysis, we use GridDB to analyze ecommerce data from a UK-based online retail store that sells all-occasion gifts. The store's customer base consists primarily of wholesalers. More details on the dataset are provided in the section 'About the Dataset'.
We start by downloading the dataset and loading it into a Python dataframe. We then load the data from the dataframe into a GridDB container and use GridDB's query and processing engine to run complex queries that derive valuable insights and drive actionable outcomes from the stored data. GridDB's high-speed data processing and ACID transactions let users handle large-scale data workloads efficiently and effectively, while its support for in-memory and disk-based storage, its distributed architecture, and its column-oriented storage model ensure optimal performance, scalability, and reliability for diverse use cases across industries. We also perform a Market Basket Analysis to identify the items frequently bought together. Resources are provided throughout the article to learn more about GridDB. Let's load the basic libraries, learn more about the dataset, and explore some of GridDB's SQL features and functions.
Importing Libraries¶
#Libraries for basic data manipulation
import os
import pandas as pd
import numpy as np
#Libraries for the webAPI requests
import http.client
http.client.HTTPConnection.debuglevel = 1
import json
import requests
#libraries for the JayDeBeAPI
import jaydebeapi
import urllib.parse
#library to calculate time taken and to change date formats
import time
from datetime import datetime
#Library for loading files into dataframes
import dask.dataframe as dd
#Libraries for graphs
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
from IPython.display import Image, Markdown,HTML
#libraries for Market Basket Analysis
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
os.chdir("./") # Change this to your working directory
About the Dataset (with Instructions to Download)¶
The dataset is obtained from https://www.kaggle.com/datasets/carrie1/ecommerce-data/data. To download the dataset, click the downward arrow in the 'Data Card' tab, as shown below. Note that you will need to sign in to Kaggle to download the dataset. You can use an existing Google/Yahoo/Facebook account to sign in, or register for a free Kaggle account using your email address. The dataset is transactional in nature and contains all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based retail store whose customer base consists primarily of wholesalers.
# Specify the path to your image file
current_directory = os.getcwd()
# Construct the image path using os.path.join()
image_path = os.path.join(current_directory, 'Ecommerce_Dataset.png')
width = 700
display(Markdown("## Downloading the E-Commerce Dataset"))
# Display the image
display(Image(filename=image_path, width=width))
Downloading the E-Commerce Dataset¶
The dataset downloads as a compressed zip folder (.zip), which can be uncompressed using any archival tool such as 7-Zip or WinZip. Note the location where the file is unarchived; you will use this location in the code below.
The dataset has the following columns:
- InvoiceNo - uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation
- StockCode - uniquely assigned to each distinct product
- Description - details of the product
- Quantity - quantity of each product (item) per transaction
- InvoiceDate - the day and time when each transaction was generated
- UnitPrice - product price per unit, in sterling
- CustomerID - uniquely assigned to each customer
- Country - country where each customer resides
Loading the Dataset into Python¶
We load the downloaded dataset into a Python dataframe named 'ecommerce_data'. Since the data file is large, we use the 'dask' library to load it. We then convert the dask dataframe to a pandas dataframe (pandas_ecommerce_df).
#I unarchived the downloaded data file (.zip) to a folder with name 'archive'.
#Please modify the OS path below to specify the location where you have unarchived the data file.
dataset_dir = os.path.join(os.getcwd(), 'archive','data.csv')
ecommerce_data = dd.read_csv(dataset_dir, encoding='latin-1', assume_missing=True,
                             dtype={'InvoiceNo': 'object', 'StockCode': 'object',
                                    'Description': 'object', 'Quantity': 'int32',
                                    'UnitPrice': 'float', 'CustomerID': 'object',
                                    'Country': 'object'})
pandas_ecommerce_df = ecommerce_data.compute() #converting a dask dataframe to a pandas dataframe
pandas_ecommerce_df['InvoiceDate'] = pd.to_datetime(pandas_ecommerce_df['InvoiceDate'], format='mixed')
Data Cleaning & Conversion¶
GridDB supports timestamp values in the format "%Y-%m-%dT%H:%M:%S.000Z". We use the strptime and strftime functions in Python to convert the 'InvoiceDate' column to the format suitable for ingestion into GridDB.
# Convert Pandas Timestamp objects to strings
pandas_ecommerce_df['InvoiceDate'] = pandas_ecommerce_df['InvoiceDate'].astype(str)
# Parse the original timestamp string into a datetime object
pandas_ecommerce_df['InvoiceDate'] = pandas_ecommerce_df['InvoiceDate'].apply(
lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
)
# Format the datetime object into the desired string format
pandas_ecommerce_df['formatted_timestamp'] = pandas_ecommerce_df['InvoiceDate'].apply(
lambda x: x.strftime("%Y-%m-%dT%H:%M:%S.000Z")
)
We then delete the original column which had a different timestamp format and rename the new column 'formatted_timestamp' to 'InvoiceDate'.
# Delete the original 'InvoiceDate' column, which holds the old format
del pandas_ecommerce_df['InvoiceDate']
# Rename the column 'formatted_timestamp' to 'InvoiceDate'
pandas_ecommerce_df.rename(columns={'formatted_timestamp': 'InvoiceDate'}, inplace=True)
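As an aside, if 'InvoiceDate' were still a pandas datetime column, the same conversion could be done in one vectorized step (a minimal sketch, equivalent to the strptime/strftime round trip above):
# Vectorized alternative (sketch), assuming 'InvoiceDate' is still a datetime column:
# pandas_ecommerce_df['InvoiceDate'] = pandas_ecommerce_df['InvoiceDate'].dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")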
Here, we treat missing values, as NaNs cannot be inserted directly into the GridDB database. They are converted to Nulls so that they can be ingested into GridDB. This is consistent with the behavior of many relational database systems.
#Finding Nulls in the dataframe
pandas_ecommerce_df.head() # Preview the pandas DataFrame
null_values = pandas_ecommerce_df.isnull()
null_values_summary = null_values.any(axis=0) # columns
columns_with_null_values = null_values_summary[null_values_summary == True].index
print("Columns with null values:")
print(columns_with_null_values)
#Treating Nulls (NaNs)
#NaNs cannot be inserted into the database; they are converted to Nulls here.
pandas_ecommerce_df = pandas_ecommerce_df.where(pandas_ecommerce_df.notna(), 'Null')
Columns with null values: Index(['Description', 'CustomerID'], dtype='object')
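If true SQL NULLs are preferred over the literal string 'Null', an alternative (a sketch; it assumes the JDBC driver maps Python None to SQL NULL in parameterized inserts) is:
# Alternative (sketch): use Python None so that parameterized inserts send SQL NULL
# instead of the literal string 'Null'.
# pandas_ecommerce_df = pandas_ecommerce_df.where(pandas_ecommerce_df.notna(), None)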
Subsetting the Dataset¶
A subset of the dataframe is created by considering only specific countries of interest.
# Assuming pandas_ecommerce_df is your DataFrame containing the data
countries_of_interest = ['USA', 'Japan', 'Germany', 'Canada', 'Australia', 'Hong Kong', 'France']
# Subsetting the DataFrame based on the countries of interest
final_ecommerce_df = pandas_ecommerce_df[pandas_ecommerce_df['Country'].isin(countries_of_interest)]
Connecting to GridDB using the JayDeBeAPI¶
In this case, we use the JayDeBeAPI to connect to the GridDB cluster of interest. Note that the sslMode here is set to 'PREFERRED'.
notification_provider = "https://dbaasshareextconsta.blob.core.windows.net/dbaas-share-extcon-blob/trial2002.json?sv=2015-04-05&sr=b&st=2024-01-16T00%3A00%3A00.0000000Z&se=2074-01-reeww16T00%3A00%3A00.0000000Z&sp=r&sigwefw=39wefweU8P%2B%2BqzKAcUQNRw6K54dZHBhyojfiFmgHTmcVOHlY%3D"
np_encode = urllib.parse.quote(notification_provider)
cluster_name = "XXXX" ## Specify the cluster name here
cn_encode = urllib.parse.quote(cluster_name)
database_name = "XXXX" ## Specify the database name here
dbn_encode = urllib.parse.quote(database_name)
sslMode = "&sslMode=PREFERRED" #sslMode should be PREFERRED and connectionRoute should be PUBLIC
sm_encode = urllib.parse.quote(sslMode)
username = 'XXXX'
password = 'XXXX'
#Construct the JDBC URL to be used to connect to the GridDB database. The format is -
# jdbc:gs:///clusterName/databaseName?notificationProvider=theNotificationProviderURL&sslMode=PREFERRED
url = "jdbc:gs:///" + cn_encode + "/" + dbn_encode + "?notificationProvider=" + np_encode + sm_encode
#print("JDBC URL is " + url)
conn = jaydebeapi.connect("com.toshiba.mwcloud.gs.sql.Driver",
url,
{'user': username, 'password': password,
'connectionRoute':'PUBLIC',
'loginTimeout': '20'}
, "gridstore-jdbc-5.2.0.jar") #ensure to provide the correct location of the gridstore-jdbc JAR library
print('success!')
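Before moving on, a quick sanity check (a minimal sketch; it assumes the cluster accepts a trivial standalone SELECT) confirms that queries can be executed over the new connection:
# Sanity check (sketch): run a trivial query over the new connection
with conn.cursor() as curs:
    curs.execute("SELECT 1")
    print(curs.fetchall())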
Creating the Container in GridDB to load data¶
The advantage of using the JayDeBeAPI to ingest data into GridDB is that we can use relational SQL syntax even though we are working with a NoSQL database. In this case, we create a container in GridDB using a 'CREATE TABLE' statement, which creates a 'Collection' container. To create a timeseries container, you simply declare the timestamp column as the primary key (a sketch is shown after the CREATE TABLE statement below). Refer to this resource to learn more about creating a timeseries container using the JayDeBeAPI.
sql_query1 = (f"""
CREATE TABLE Ecommerce_Data
(
InvoiceNo STRING,
StockCode STRING,
Description STRING,
Quantity INTEGER,
InvoiceDate TIMESTAMP,
UnitPrice FLOAT,
CustomerID STRING,
Country STRING
)
""")
with conn.cursor() as cursor:
cursor.execute(sql_query1)
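For reference, here is a sketch of how a timeseries container could be created by declaring the TIMESTAMP column as the primary key. The table name and columns below are hypothetical; the USING TIMESERIES clause follows GridDB's SQL reference.
# Sketch (hypothetical table): a timeseries container is created by making
# the TIMESTAMP column the primary key and adding USING TIMESERIES.
sql_ts_sketch = ("CREATE TABLE Ecommerce_Hourly_Sales "
                 "(ts TIMESTAMP PRIMARY KEY, total_sales DOUBLE) "
                 "USING TIMESERIES")
# with conn.cursor() as cursor:
#     cursor.execute(sql_ts_sketch)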
Using the JayDeBeAPI to perform Row Registration¶
With the JayDeBeAPI, a simple INSERT SQL statement can be used to ingest data into a GridDB container within a cluster. This process is also known as Row Registration. If we were to use the WebAPI instead, we would have to construct the data as a JSON object and pass it in an HTTP request.
#Note the use of the GridDB TIMESTAMP function in the insert SQL below
#The TIMESTAMP function converts the specified value into a timestamp
sql_query = "INSERT INTO Ecommerce_Data (InvoiceNo, StockCode, Description, Quantity, UnitPrice, CustomerID, Country, InvoiceDate) VALUES (?,?,?,?,?,?,?,TIMESTAMP(?))"
# Use a loop to insert each row of data and record the time taken
cursor = conn.cursor()
try:
start_time = time.time() # Record the start time
for index, row in final_ecommerce_df.iterrows():
# Execute the SQL query with parameters
#print(tuple(row))
try:
cursor.execute(sql_query, (
row['InvoiceNo'],
row['StockCode'],
row['Description'],
row['Quantity'],
row['UnitPrice'],
row['CustomerID'],
row['Country'],
row['InvoiceDate']
))
except Exception as e1:
print(row)
print(e1)
break
#break
end_time = time.time() # Record the end time
execution_time = end_time - start_time
print(f"Time taken for insertion: {execution_time:.6f} seconds")
conn.commit() # Commit the changes
except Exception as e:
# Handle any exceptions that may occur during execution
print("Error:", e)
conn.rollback() # Rollback changes if an error occurs
finally:
cursor.close() # Close the cursor
Time taken for insertion: 8083.310104 seconds
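Row-by-row insertion took over two hours here. A batched alternative (a sketch; jaydebeapi implements the DB-API executemany() call, and actual throughput depends on the driver and cluster) could look like this:
# Batched alternative (sketch): insert all rows with a single executemany() call
params = [(row['InvoiceNo'], row['StockCode'], row['Description'], row['Quantity'],
           row['UnitPrice'], row['CustomerID'], row['Country'], row['InvoiceDate'])
          for _, row in final_ecommerce_df.iterrows()]
cursor = conn.cursor()
try:
    cursor.executemany(sql_query, params)
    conn.commit()
finally:
    cursor.close()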
Analysis¶
What are the Top 2 items sold by Country every Friday?¶
To build this query, we use window functions in GridDB; specifically, we use ROW_NUMBER() with an OVER (PARTITION BY ...) clause. Refer to this resource to learn more. This is similar to the RANK() OVER (PARTITION BY ...) function available on most database platforms.
#day of week - Sunday-0, Saturday-6, Friday-5
sql_select_query1 = ''' SELECT
COUNTRY,
DESCRIPTION,
Qty
FROM
(
SELECT Day_of_Week, COUNTRY, DESCRIPTION, Qty, ROW_NUMBER()
OVER(PARTITION BY Day_of_Week, COUNTRY ORDER BY Qty Desc) As Rn
FROM
(
SELECT EXTRACT(DAY_OF_WEEK,InvoiceDate) AS Day_of_Week,
COUNTRY,
DESCRIPTION,
SUM(Quantity) as Qty
FROM Ecommerce_Data
WHERE STRFTIME('%w', InvoiceDate) ='5'
GROUP BY 1,2,3
ORDER BY 1,2,4 desc, 3
) query_1
) query_2
WHERE Rn <= 2
ORDER BY 1
'''
cursor = conn.cursor()
cursor.execute(sql_select_query1)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Top_Selling_weekends_df = pd.DataFrame(results, columns=column_names)
cursor.close()
#BOLD = '\033[1m'
#RESET = '\033[0m'
#print(BOLD + "Top 2 Items sold by Country every Friday" + RESET)
display(HTML("<h3 style='text-align: left;'>Top 2 Items sold by Country every Friday</h3>"))
sorted_df = Top_Selling_weekends_df.sort_values(by='Qty', ascending=False)
sorted_df
Top 2 Items sold by Country every Friday
| | Country | Description | Qty |
|---|---|---|---|
| 3 | France | RABBIT NIGHT LIGHT | 1260 |
| 4 | France | MINI PAINT SET VINTAGE | 504 |
| 5 | Germany | PACK OF 12 RED APPLE TISSUES | 432 |
| 0 | Australia | RABBIT NIGHT LIGHT | 336 |
| 6 | Germany | SPACEBOY LUNCH BOX | 318 |
| 1 | Australia | RED TOADSTOOL LED NIGHT LIGHT | 288 |
| 9 | Japan | JAZZ HEARTS PURSE NOTEBOOK | 96 |
| 10 | Japan | PACK OF 20 SPACEBOY NAPKINS | 96 |
| 7 | Hong Kong | ETCHED GLASS COASTER | 48 |
| 8 | Hong Kong | POLKADOT PEN | 48 |
| 11 | USA | SET/10 BLUE POLKADOT PARTY CANDLES | 48 |
| 12 | USA | SET/10 IVORY POLKADOT PARTY CANDLES | 48 |
| 2 | Canada | POSTAGE | 1 |
Insight(s):
- While night lights dominate Friday sales in France and Australia, party candles sell the most in the USA on Fridays.
- Similarly, the sales of napkins and tissues are highest in Japan and Germany, respectively.
- Stationery sells the most on Fridays in Hong Kong and Japan.
Given an Invoice Number, what are the items frequently sold together?¶
Here, we use the STRFTIME() function in GridDB. The STRFTIME function, short for 'String Format Time', takes a timestamp and formats it into a string as per the format provided. Refer to this GridDB resource to learn more. We also use subqueries here, along with GridDB's aggregation and character functions.
Note that here we exclude the item 'Postage' because this item skews the results and analysis.
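As a quick illustration of the date filter (a sketch, run against the container we loaded earlier):
# Illustration (sketch): STRFTIME('%Y-%m', ...) formats a timestamp as 'YYYY-MM',
# which is what the IN ('2010-12','2011-12') filter below matches against.
with conn.cursor() as curs:
    curs.execute("SELECT STRFTIME('%Y-%m', InvoiceDate) FROM Ecommerce_Data LIMIT 1")
    print(curs.fetchall()) # e.g. [('2011-12',)]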
sql_select_query2 = '''SELECT
InvoiceNo,
Description,
TotalQuantity
FROM
(
SELECT
InvoiceNo,
Description,
SUM(Quantity) AS TotalQuantity,
ROW_NUMBER() OVER(PARTITION BY InvoiceNo ORDER BY 3 DESC) AS ItemRank
FROM Ecommerce_Data
WHERE STRFTIME('%Y-%m', InvoiceDate) IN ('2010-12','2011-12')
AND LOWER(Description) NOT LIKE '%postage%'
GROUP BY InvoiceNo, Description
) query_1;
''' #WHERE ItemRank <= 5
cursor = conn.cursor()
cursor.execute(sql_select_query2)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Invoice_Description_df = pd.DataFrame(results, columns=column_names)
cursor.close()
#subset_df = Invoice_Description_df.iloc[:20]
df2 = pd.crosstab(Invoice_Description_df['InvoiceNo'], Invoice_Description_df['Description'])
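To see what this basket matrix looks like, consider a toy example (a sketch with made-up invoices):
# Toy example (sketch): pd.crosstab builds an invoice-by-item count matrix
toy = pd.DataFrame({'InvoiceNo': ['1001', '1001', '1002'],
                    'Description': ['JAM JAR', 'JAM MAKING SET', 'JAM JAR']})
print(pd.crosstab(toy['InvoiceNo'], toy['Description']))
# Description  JAM JAR  JAM MAKING SET
# InvoiceNo
# 1001               1               1
# 1002               1               0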
Now that we have the data required for the Market Basket analysis, let's move on to perform the analysis.
Market-Basket Analysis¶
We use the data gathered from GridDB to perform a Market-Basket Analysis, identifying the items frequently sold together. In this case, we use a sufficiently large lift threshold of 5 and above. We do this because we have a large number of transactions and are only interested in rules that exhibit a strong association between the items in those transactions. A lift value of 5 or higher indicates that the likelihood of the items being purchased together is at least five times greater than if their purchases were random and independent.
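For reference, lift is defined as lift(A→B) = support(A∪B) / (support(A) × support(B)). A quick worked example with made-up numbers:
# Worked example (sketch, made-up numbers):
# if 2% of invoices contain both items, while the items individually appear on
# 10% and 4% of invoices, then lift = 0.02 / (0.10 * 0.04) = 5.0
support_a, support_b, support_ab = 0.10, 0.04, 0.02
print(support_ab / (support_a * support_b)) # 5.0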
# Perform one-hot encoding: the crosstab holds counts, so convert them to booleans
# (apriori expects a boolean/0-1 basket matrix; pd.get_dummies would leave the
# numeric counts unchanged here and trigger a deprecation warning in mlxtend)
encoded_df = df2 > 0
# Find frequent itemsets using Apriori
frequent_itemsets = apriori(encoded_df, min_support=0.05, use_colnames=True)
# Generate association rules
# We consider a sufficiently big threshold for lift.
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=5)
# Display the association rules
#rules
sorted_rules = rules.sort_values(["lift","confidence","support"],axis = 0, ascending = False)
# Convert the frozenset values in the antecedents/consequents columns to comma-separated strings
sorted_rules["antecedents"] = sorted_rules["antecedents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
sorted_rules["consequents"] = sorted_rules["consequents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
subset = sorted_rules[['antecedents', 'consequents']]
pd.set_option('display.max_colwidth', None)
# Print the title
display(HTML("<h3 style='text-align: center;'>Top 100 Association Rules</h3>"))
#display(subset.head(100))
display(HTML(subset.head(100).to_html(index=False))) #hide index
Top 100 Association Rules
| antecedents | consequents |
|---|---|
| ALARM CLOCK BAKELIKE RED, ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE GREEN |
| ALARM CLOCK BAKELIKE GREEN | ALARM CLOCK BAKELIKE RED, ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE GREEN | ALARM CLOCK BAKELIKE RED |
| ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE GREEN |
| LUNCH BOX WITH CUTLERY RETROSPOT | STRAWBERRY LUNCH BOX WITH CUTLERY |
| ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE RED |
| STRAWBERRY LUNCH BOX WITH CUTLERY | LUNCH BOX WITH CUTLERY RETROSPOT |
| ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE PINK |
| JAM JAR WITH PINK LID | JAM MAKING SET PRINTED |
| PLASTERS IN TIN CIRCUS PARADE, RED TOADSTOOL LED NIGHT LIGHT | PLASTERS IN TIN WOODLAND ANIMALS |
| JAM MAKING SET PRINTED | JAM JAR WITH PINK LID |
| PLASTERS IN TIN WOODLAND ANIMALS | PLASTERS IN TIN CIRCUS PARADE, RED TOADSTOOL LED NIGHT LIGHT |
| ALARM CLOCK BAKELIKE GREEN | ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE GREEN |
| ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE RED |
| ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE RED |
| PLASTERS IN TIN CIRCUS PARADE | RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TIN WOODLAND ANIMALS |
| RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TIN WOODLAND ANIMALS | PLASTERS IN TIN CIRCUS PARADE |
| PLASTERS IN TIN CIRCUS PARADE | PLASTERS IN TIN STRONGMAN |
| PLASTERS IN TIN STRONGMAN | PLASTERS IN TIN CIRCUS PARADE |
| PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TIN WOODLAND ANIMALS | RED TOADSTOOL LED NIGHT LIGHT |
| STRAWBERRY LUNCH BOX WITH CUTLERY | PLASTERS IN TIN WOODLAND ANIMALS |
| PLASTERS IN TIN WOODLAND ANIMALS | STRAWBERRY LUNCH BOX WITH CUTLERY |
| RED TOADSTOOL LED NIGHT LIGHT | PLASTERS IN TIN CIRCUS PARADE, PLASTERS IN TIN WOODLAND ANIMALS |
| ROUND SNACK BOXES SET OF 4 FRUITS | ROUND SNACK BOXES SET OF4 WOODLAND |
| ROUND SNACK BOXES SET OF4 WOODLAND | ROUND SNACK BOXES SET OF 4 FRUITS |
Insight(s): As mentioned above, many of this store's customers are wholesalers, and they exhibit purchasing patterns in which they frequently buy multiple varieties of the same item together. This behavior is indicative of how wholesalers stock up on related products to meet the demands of their own customers or to replenish inventory efficiently.
- For example, green and pink alarm clocks are often purchased along with red alarm clocks. In other words, customers frequently purchase alarm clocks in different colors and patterns in one transaction to be able to meet their own customers' needs and preferences.
- Jam-making sets are purchased along with jam jars. This is an interesting correlation, indicating that wholesalers who buy jam-making sets also buy the jam jars needed to use with them, perhaps because their customers ask for a complete jam-making set.
- LED night lights are often purchased with plasters.
- Lunch boxes are also purchased with plasters. This may indicate that the customers who visit wholesalers have kids who are drawn to cute night lights and lunch boxes but also need plasters; the association seems to stem from the needs of families with children, where safety and comfort are paramount. It could also mean that wholesalers offer attractive packaged deals on such items, knowing the demand for these combinations.
What is the average size of an invoice by country, considering that the customer base consists primarily of wholesalers?¶
By 'invoice size', we mean the number of items per invoice. To answer the question above, we use GridDB's aggregation functions.
sql_select_query3 = '''SELECT
Country,
SUM(Quantity)/COUNT(DISTINCT InvoiceNo) AS "Avg. Invoice Size"
FROM Ecommerce_Data
GROUP BY 1
ORDER BY 2 DESC
'''
cursor = conn.cursor()
cursor.execute(sql_select_query3)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Avg_Invoice_Size = pd.DataFrame(results, columns=column_names)
cursor.close()
fig = ff.create_table(Avg_Invoice_Size)
fig.update_layout(
title_text = '<b>Average Invoice Size by Country</b>',
title_x=0.5, # Center align the title horizontally
title_font_size=20,
margin = {'t':40, 'b':50})
fig.show()
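Finally, once the analysis is complete, release the database connection:
# Close the GridDB connection now that all queries are done
conn.close()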