Overview¶
In this analysis, we use GridDB to analyze e-commerce data for a UK-based online retail store that sells all-occasion gifts. The store's customer base consists primarily of wholesalers. More details on the dataset are provided in the section 'About the Dataset'.
We start by downloading the dataset into a Python dataframe and then load the data from the dataframe into a GridDB container. We then use GridDB's query and processing engine to run complex queries on the stored data, deriving valuable insights and driving actionable outcomes. GridDB's high-speed data processing and ACID transactions let users handle large-scale data workloads efficiently and effectively, while its support for in-memory and disk-based storage, distributed architecture, and column-oriented storage model ensures optimal performance, scalability, and reliability for diverse use cases across industries. In this analysis, we also perform a Market Basket Analysis to identify the items frequently bought together. Let's load the basic libraries, learn more about the dataset, and explore some of GridDB's SQL features and functions; resources are provided throughout the article to learn more about GridDB.
Importing Libraries¶
#Libraries for basic data manipulation
import os
import pandas as pd
import numpy as np
#Libraries for the webAPI requests
import http
http.client.HTTPConnection.debuglevel = 1
import json
import requests
#libraries for the JayDeBeAPI
import jaydebeapi
import urllib.parse
#library to calculate time taken and to change date formats
import time
from datetime import datetime
#Library for loading files into dataframes
import dask.dataframe as dd
#Libraries for graphs
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
from IPython.display import Image, Markdown,HTML
#libraries for Market Basket Analysis
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
os.chdir("./") # Change this to your working directory
About the Dataset (with Instructions to Download)¶
The dataset is obtained from https://www.kaggle.com/datasets/carrie1/ecommerce-data/data. To download the dataset, click the downward arrow in the 'Data Card' tab, as shown below. Note that you will need to sign in to Kaggle to download the dataset; you can use an existing Google/Yahoo/Facebook account, or register for a free Kaggle account with your email address. The dataset is transactional in nature and contains all transactions that occurred between 01/12/2010 and 09/12/2011 in a UK-based retail store whose customer base consists primarily of wholesalers.
# Specify the path to your image file
current_directory = os.getcwd()
# Construct the image path using os.path.join()
image_path = os.path.join(current_directory, 'Ecommerce_Dataset.png')
width = 700
display(Markdown("## Downloading the E-Commerce Dataset"))
# Display the image
display(Image(filename=image_path, width=width))
Downloading the E-Commerce Dataset¶
The dataset downloads as a compressed zip file (.zip), which can be extracted using any archival tool such as 7-Zip or WinZip. Note the location where you extract the file; you will use it in the code below.
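If you prefer to stay in Python, here is a minimal sketch using the standard library's zipfile module; it assumes the download was saved as 'archive.zip' in the working directory, so adjust the names to match your setup.
import zipfile
# Extract the downloaded archive into a folder named 'archive'
with zipfile.ZipFile('archive.zip') as zf:
    zf.extractall('archive')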
The dataset has the following columns-
- InvoiceNo - uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation
- StockCode - uniquely assigned to each distinct product
- Description - details of the product
- Quantity - quantities of each product (item) per transaction
- InvoiceDate - the day and time when each transaction was generated
- UnitPrice - product price per unit, in pounds sterling (£)
- CustomerID - uniquely assigned to each customer
- Country - country where each customer resides
Loading the Dataset into Python¶
We download the dataset and load it into a Python dataframe named 'ecommerce_data'. Since the data file is large, we use the 'dask' library to load it, and then convert the Dask dataframe to a pandas dataframe (pandas_ecommerce_df).
#I unarchived the downloaded data file (.zip) to a folder with name 'archive'.
#Please modify the OS path below to specify the location where you have unarchived the data file.
dataset_dir = os.path.join(os.getcwd(), 'archive','data.csv')
ecommerce_data = dd.read_csv(dataset_dir, encoding='latin-1', assume_missing=True,
                             dtype={'InvoiceNo':'object','StockCode':'object','Description':'object',
                                    'Quantity':'int32','UnitPrice':'float',
                                    'CustomerID':'object','Country':'object'})
pandas_ecommerce_df = ecommerce_data.compute() # converting the Dask dataframe to a pandas dataframe
# Parse the 'InvoiceDate' column as datetime (the file mixes several date formats)
pandas_ecommerce_df['InvoiceDate'] = pd.to_datetime(pandas_ecommerce_df['InvoiceDate'], format='mixed')
Data Cleaning & Conversion¶
GridDB supports timestamp values of the format "%Y-%m-%dT%H:%M:%S.000Z". We use strptime and strftime functions in Python to convert the 'InvoiceDate' column to the format suitable for ingestion into GridDB.
# Convert pandas Timestamp objects to strings
pandas_ecommerce_df['InvoiceDate'] = pandas_ecommerce_df['InvoiceDate'].astype(str)
# Parse the original timestamp string into a datetime object
pandas_ecommerce_df['InvoiceDate'] = pandas_ecommerce_df['InvoiceDate'].apply(
    lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
)
# Format the datetime object into the desired string format
pandas_ecommerce_df['formatted_timestamp'] = pandas_ecommerce_df['InvoiceDate'].apply(
    lambda x: x.strftime("%Y-%m-%dT%H:%M:%S.000Z")
)
We then delete the original column which had a different timestamp format and rename the new column 'formatted_timestamp' to 'InvoiceDate'.
# Delete the original 'InvoiceDate' column
del pandas_ecommerce_df['InvoiceDate']
# Rename the column 'formatted_timestamp' to 'InvoiceDate'
pandas_ecommerce_df.rename(columns={'formatted_timestamp': 'InvoiceDate'}, inplace=True)
Next, we treat missing values, since NaNs cannot be directly inserted into the GridDB database. We replace them with the string 'Null' so that the affected rows can still be ingested into GridDB, which is consistent with the behavior of many relational database systems.
#Finding Nulls in a dataframe
pandas_ecommerce_df.head() # Preview the pandas DataFrame
null_values = pandas_ecommerce_df.isnull()
null_values_summary = null_values.any(axis=0) # columns
columns_with_null_values = null_values_summary[null_values_summary == True].index
print("Columns with null values:")
print(columns_with_null_values)
#Treating Nulls (NaNs)
#NaNs cannot be inserted into the database, so we replace them with the string 'Null'.
pandas_ecommerce_df = pandas_ecommerce_df.where(pandas_ecommerce_df.notna(), 'Null')
Columns with null values: Index(['Description', 'CustomerID'], dtype='object')
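As a quick, optional sanity check, we can confirm that no NaNs remain after the replacement:
# Verify that no NaNs remain after replacing them with 'Null'
print(pandas_ecommerce_df.isna().any().any()) # expected output: False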
Subsetting the Dataset¶
A subset of the dataframe is created by considering only specific countries of interest.
# Assuming pandas_ecommerce_df is your DataFrame containing the data
countries_of_interest = ['USA', 'Japan', 'Germany', 'Canada', 'Australia', 'Hong Kong', 'France']
# Subsetting the DataFrame based on the countries of interest
final_ecommerce_df = pandas_ecommerce_df[pandas_ecommerce_df['Country'].isin(countries_of_interest)]
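As a quick check on the subset, we can look at the number of rows per country:
# Number of rows per country in the subset
print(final_ecommerce_df['Country'].value_counts())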
Connecting to GridDB using the JayDeBeAPI¶
In this case, we use the JayDeBeAPI to connect to the GridDB cluster of interest. Note that the SSL mode here is set to 'PREFERRED'.
notification_provider = "https://dbaasshareextconsta.blob.core.windows.net/dbaas-share-extcon-blob/trial2002.json?sv=2015-04-05&sr=b&st=2024-01-16T00%3A00%3A00.0000000Z&se=2074-01-reeww16T00%3A00%3A00.0000000Z&sp=r&sigwefw=39wefweU8P%2B%2BqzKAcUQNRw6K54dZHBhyojfiFmgHTmcVOHlY%3D"
np_encode = urllib.parse.quote(notification_provider)
cluster_name = "XXXX" ## Specify the cluster name here
cn_encode = urllib.parse.quote(cluster_name)
database_name = "XXXX" ## Specify the database name here
dbn_encode = urllib.parse.quote(database_name)
sslMode = "&sslMode=PREFERRED" #sslMode should be PREFERRED and connectionRoute should be PUBLIC
sm_encode = urllib.parse.quote(sslMode)
username = 'XXXX'
password = 'XXXX'
#Construct the JDBC URL to be used to connect to the GridDB database. The format is -
# jdbc:gs:///clusterName/databaseName?notificationProvider=theNotificationProviderURL&sslMode=PREFERRED
url = "jdbc:gs:///" + cn_encode + "/" + dbn_encode + "?notificationProvider=" + np_encode + sm_encode
#print("JDBC URL is " + url)
conn = jaydebeapi.connect("com.toshiba.mwcloud.gs.sql.Driver",
                          url,
                          {'user': username, 'password': password,
                           'connectionRoute': 'PUBLIC',
                           'loginTimeout': '20'},
                          "gridstore-jdbc-5.2.0.jar") # ensure you provide the correct location of the gridstore-jdbc JAR library
print('success!')
Creating the Container in GridDB to load data¶
The advantage of using the JayDeBeAPI to ingest data into GridDB is that we can use relational SQL syntax even though we are working with a NoSQL database. In this case, we create a container in GridDB using the 'CREATE TABLE' statement, which creates a 'Collection' container. To create a time-series container, you can simply specify the timestamp column as the primary key (a sketch is shown after the code below). Refer to this resource to learn more about time-series container creation using the JayDeBeAPI.
sql_query1 = (f"""
CREATE TABLE Ecommerce_Data
(
InvoiceNo STRING,
StockCode STRING,
Description STRING,
Quantity INTEGER,
InvoiceDate TIMESTAMP,
UnitPrice FLOAT,
CustomerID STRING,
Country STRING
)
""")
with conn.cursor() as cursor:
    cursor.execute(sql_query1)
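As mentioned above, a time-series container can be created by making the TIMESTAMP column the primary key. Below is a minimal, untested sketch; the table name 'Ecommerce_Events' is purely illustrative and is not used elsewhere in this analysis, and in GridDB's SQL the USING TIMESERIES clause marks the container as a time series.
# Sketch only: a time-series container keyed on the TIMESTAMP column
sql_ts_query = ("""
CREATE TABLE Ecommerce_Events
(
InvoiceDate TIMESTAMP PRIMARY KEY,
InvoiceNo STRING,
Quantity INTEGER
) USING TIMESERIES
""")
#with conn.cursor() as cursor:
#    cursor.execute(sql_ts_query)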
Using the JayDeBeAPI to perform Row Registration¶
With the JayDeBeAPI, a simple INSERT SQL statement can be used to ingest data into a GridDB container within a cluster. This process is also known as Row Registration. If we were to use the WebAPI instead, we would have to create the data as a JSON object and pass it in an HTTP request; a sketch of that alternative is shown below.
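For comparison, here is a minimal sketch of Row Registration through the WebAPI. The base URL below is a placeholder (an assumption you must adapt to your deployment), and the sample row's values follow the container's column order from the CREATE TABLE statement above.
# Hypothetical WebAPI endpoint - replace the host, cluster, and database placeholders with your own
base_url = "https://<webapi-host>/griddb/v2/<clusterName>/dbs/<databaseName>"
# One sample row; the timestamp is already in GridDB's expected format
rows = [["536365", "85123A", "WHITE HANGING HEART T-LIGHT HOLDER", 6,
         "2010-12-01T08:26:00.000Z", 2.55, "17850", "United Kingdom"]]
response = requests.put(base_url + "/containers/Ecommerce_Data/rows",
                        headers={"Content-Type": "application/json"},
                        auth=(username, password),
                        data=json.dumps(rows))
print(response.status_code)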
#Note the use of the GridDB TIMESTAMP function in the insert SQL below
#The TIMESTAMP function converts the specified value into a timestamp
sql_query = "INSERT INTO Ecommerce_Data (InvoiceNo, StockCode, Description, Quantity, UnitPrice, CustomerID, Country, InvoiceDate) VALUES (?,?,?,?,?,?,?,TIMESTAMP(?))"
# Use a loop to insert each row of data and record the time taken
cursor = conn.cursor()
try:
    start_time = time.time() # Record the start time
    for index, row in final_ecommerce_df.iterrows():
        # Execute the SQL query with parameters
        try:
            cursor.execute(sql_query, (
                row['InvoiceNo'],
                row['StockCode'],
                row['Description'],
                row['Quantity'],
                row['UnitPrice'],
                row['CustomerID'],
                row['Country'],
                row['InvoiceDate']
            ))
        except Exception as e1:
            print(row)
            print(e1)
            break
    end_time = time.time() # Record the end time
    execution_time = end_time - start_time
    print(f"Time taken for insertion: {execution_time:.6f} seconds")
    conn.commit() # Commit the changes
except Exception as e:
    # Handle any exceptions that may occur during execution
    print("Error:", e)
    conn.rollback() # Roll back changes if an error occurs
finally:
    cursor.close() # Close the cursor
Time taken for insertion: 8083.310104 seconds
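Inserting one row at a time over JDBC is slow, as the timing above shows. As a possible optimization (a sketch, assuming batching suits your workload), jaydebeapi also provides executemany(), which submits all parameter tuples as one batch using the same sql_query as above:
# Batch insertion sketch using executemany() with the same INSERT statement
params = [(row['InvoiceNo'], row['StockCode'], row['Description'], row['Quantity'],
           row['UnitPrice'], row['CustomerID'], row['Country'], row['InvoiceDate'])
          for _, row in final_ecommerce_df.iterrows()]
cursor = conn.cursor()
try:
    cursor.executemany(sql_query, params)
    conn.commit() # Commit the batch
except Exception as e:
    print("Error:", e)
    conn.rollback()
finally:
    cursor.close()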
Analysis¶
What are the Top 2 items sold by Country every Friday?¶
To build this query, we use window functions in GridDB; specifically, ROW_NUMBER() with an OVER (PARTITION BY ...) clause. Refer to this resource to learn more. This is similar to the RANK() OVER (PARTITION BY ...) function available on most database platforms.
#day of week - Sunday-0, Saturday-6, Friday-5
sql_select_query1 = ''' SELECT
COUNTRY,
DESCRIPTION,
Qty
FROM
(
SELECT Day_of_Week, COUNTRY, DESCRIPTION, Qty, ROW_NUMBER()
OVER(PARTITION BY Day_of_Week, COUNTRY ORDER BY Qty Desc) As Rn
FROM
(
SELECT EXTRACT(DAY_OF_WEEK,InvoiceDate) AS Day_of_Week,
COUNTRY,
DESCRIPTION,
SUM(Quantity) as Qty
FROM Ecommerce_Data
WHERE STRFTIME('%w', InvoiceDate) ='5'
GROUP BY 1,2,3
ORDER BY 1,2,4 desc, 3
) query_1
) query_2
WHERE Rn <= 2
ORDER BY 1
'''
cursor = conn.cursor()
cursor.execute(sql_select_query1)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Top_Selling_weekends_df = pd.DataFrame(results, columns=column_names)
cursor.close()
display(HTML("<h3 style='text-align: left;'>Top 2 Items sold by Country every Friday</h3>"))
sorted_df = Top_Selling_weekends_df.sort_values(by='Qty', ascending=False)
sorted_df
Top 2 Items sold by Country every Friday
| | Country | Description | Qty |
|---|---|---|---|
| 3 | France | RABBIT NIGHT LIGHT | 1260 |
| 4 | France | MINI PAINT SET VINTAGE | 504 |
| 5 | Germany | PACK OF 12 RED APPLE TISSUES | 432 |
| 0 | Australia | RABBIT NIGHT LIGHT | 336 |
| 6 | Germany | SPACEBOY LUNCH BOX | 318 |
| 1 | Australia | RED TOADSTOOL LED NIGHT LIGHT | 288 |
| 9 | Japan | JAZZ HEARTS PURSE NOTEBOOK | 96 |
| 10 | Japan | PACK OF 20 SPACEBOY NAPKINS | 96 |
| 7 | Hong Kong | ETCHED GLASS COASTER | 48 |
| 8 | Hong Kong | POLKADOT PEN | 48 |
| 11 | USA | SET/10 BLUE POLKADOT PARTY CANDLES | 48 |
| 12 | USA | SET/10 IVORY POLKADOT PARTY CANDLES | 48 |
| 2 | Canada | POSTAGE | 1 |
Insight(s):
- Night lights dominate Friday sales in France and Australia, while party candles sell the most in the USA on Fridays.
- Similarly, sales of napkins and tissues peak in Japan and Germany, respectively.
- Stationery sells the most in Hong Kong and Japan on Fridays.
Given an Invoice Number, what are the items frequently sold together?¶
Here, we use the STRFTIME() function in GridDB. The STRFTIME function, short for 'String From Time' or 'String Format Time', takes a timestamp and formats it into a string as per the format provided. Refer to this GridDB resource to learn more. We also use subqueries here, along with GridDB's aggregation and character functions.
Note that here we exclude the item 'Postage' because this item skews the results and analysis.
sql_select_query2 = '''SELECT
InvoiceNo,
Description,
TotalQuantity
FROM
(
SELECT
InvoiceNo,
Description,
SUM(Quantity) AS TotalQuantity,
ROW_NUMBER() OVER(PARTITION BY InvoiceNo ORDER BY 3 DESC) AS ItemRank
FROM Ecommerce_Data
WHERE STRFTIME('%Y-%m', InvoiceDate) IN ('2010-12','2011-12')
AND LOWER(Description) NOT LIKE '%postage%'
GROUP BY InvoiceNo, Description
) query_1;
'''
cursor = conn.cursor()
cursor.execute(sql_select_query2)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Invoice_Description_df = pd.DataFrame(results, columns=column_names)
cursor.close()
#subset_df = Invoice_Description_df.iloc[:20]
df2 = pd.crosstab(Invoice_Description_df['InvoiceNo'], Invoice_Description_df['Description'])
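The crosstab produces one row per invoice and one column per item description, with each cell counting how often that item appears on the invoice. A quick, optional look at its dimensions:
# One row per invoice, one column per item description
print(df2.shape)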
Now that we have the data required for the Market Basket analysis, let's move on to perform the analysis.
Market-Basket Analysis¶
We use the data gathered from GridDB to perform a Market Basket Analysis and identify the items frequently sold together. In this case, we use a sufficiently large lift threshold of 5, because we have a large number of transactions and are only interested in rules that exhibit a strong association between the items in a transaction. A lift value of 5 or higher indicates that the likelihood of the items being purchased together is at least five times greater than if their purchases were random and independent.
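To make the lift threshold concrete, here is a small illustration with hypothetical support values (not taken from this dataset):
# Lift = support(A and B) / (support(A) * support(B))
support_a = 0.06 # fraction of invoices containing item A (hypothetical)
support_b = 0.05 # fraction of invoices containing item B (hypothetical)
support_ab = 0.03 # fraction of invoices containing both items (hypothetical)
lift = support_ab / (support_a * support_b)
print(round(lift, 2)) # 10.0, well above our threshold of 5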
# Encode the basket matrix as booleans (True if the item appears on the invoice); apriori expects bool data
encoded_df = df2.astype(bool)
# Find frequent itemsets using Apriori
frequent_itemsets = apriori(encoded_df, min_support=0.05, use_colnames=True)
# Generate association rules
# We consider a sufficiently big threshold for lift.
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=5)
# Display the association rules
#rules
sorted_rules = rules.sort_values(["lift","confidence","support"],axis = 0, ascending = False)
# Convert the frozenset objects in both columns to comma-separated strings
sorted_rules["antecedents"] = sorted_rules["antecedents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
sorted_rules["consequents"] = sorted_rules["consequents"].apply(lambda x: ', '.join(list(x))).astype("unicode")
subset = sorted_rules[['antecedents', 'consequents']]
pd.set_option('display.max_colwidth', None)
# Print the title
display(HTML("<h3 style='text-align: center;'>Top 100 Association Rules</h3>"))
#display(subset.head(100))
display(HTML(subset.head(100).to_html(index=False))) #hide index
Top 100 Association Rules
| antecedents | consequents |
|---|---|
| ALARM CLOCK BAKELIKE RED , ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE GREEN |
| ALARM CLOCK BAKELIKE GREEN | ALARM CLOCK BAKELIKE RED , ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE GREEN | ALARM CLOCK BAKELIKE RED |
| ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE GREEN |
| LUNCH BOX WITH CUTLERY RETROSPOT | STRAWBERRY LUNCH BOX WITH CUTLERY |
| ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE RED |
| STRAWBERRY LUNCH BOX WITH CUTLERY | LUNCH BOX WITH CUTLERY RETROSPOT |
| ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE PINK |
| JAM JAR WITH PINK LID | JAM MAKING SET PRINTED |
| PLASTERS IN TIN CIRCUS PARADE , RED TOADSTOOL LED NIGHT LIGHT | PLASTERS IN TIN WOODLAND ANIMALS |
| JAM MAKING SET PRINTED | JAM JAR WITH PINK LID |
| PLASTERS IN TIN WOODLAND ANIMALS | PLASTERS IN TIN CIRCUS PARADE , RED TOADSTOOL LED NIGHT LIGHT |
| ALARM CLOCK BAKELIKE GREEN | ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE GREEN |
| ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE RED |
| ALARM CLOCK BAKELIKE GREEN, ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE RED | ALARM CLOCK BAKELIKE PINK |
| ALARM CLOCK BAKELIKE PINK | ALARM CLOCK BAKELIKE RED |
| PLASTERS IN TIN CIRCUS PARADE | RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TIN WOODLAND ANIMALS |
| RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TIN WOODLAND ANIMALS | PLASTERS IN TIN CIRCUS PARADE |
| PLASTERS IN TIN CIRCUS PARADE | PLASTERS IN TIN STRONGMAN |
| PLASTERS IN TIN STRONGMAN | PLASTERS IN TIN CIRCUS PARADE |
| PLASTERS IN TIN CIRCUS PARADE , PLASTERS IN TIN WOODLAND ANIMALS | RED TOADSTOOL LED NIGHT LIGHT |
| STRAWBERRY LUNCH BOX WITH CUTLERY | PLASTERS IN TIN WOODLAND ANIMALS |
| PLASTERS IN TIN WOODLAND ANIMALS | STRAWBERRY LUNCH BOX WITH CUTLERY |
| RED TOADSTOOL LED NIGHT LIGHT | PLASTERS IN TIN CIRCUS PARADE , PLASTERS IN TIN WOODLAND ANIMALS |
| ROUND SNACK BOXES SET OF 4 FRUITS | ROUND SNACK BOXES SET OF4 WOODLAND |
| ROUND SNACK BOXES SET OF4 WOODLAND | ROUND SNACK BOXES SET OF 4 FRUITS |
Insight(s): As mentioned above, many of this store's customers are wholesalers, and they exhibit purchasing patterns where they frequently buy multiple varieties of the same item together. This behavior reflects how wholesalers stock up on related products to meet the demands of their own customers or to replenish inventory efficiently.
- For example, green and pink alarm clocks are often purchased along with red alarm clocks. In other words, customers frequently purchase alarm clocks of different colors and patterns (in one transaction) to be able to meet their own customers' needs and preferences.
- Jam-making sets are purchased along with jam jars. This is an interesting correlation, indicating that wholesalers who buy jam-making sets also buy the jam jars needed to use with them, perhaps because their own customers ask for a complete jam-making set.
- LED Night Lights are often purchased with Plasters.
- Lunchboxes are also purchased with plasters. This may indicate that customers who visit wholesalers have children who like cute night lights and lunchboxes but also need plasters; the association seems to stem from the needs of families with children, where safety and comfort are paramount. It could also mean that wholesalers offer packaged deals on such items, knowing the demand for these combinations.
What is the average size of an invoice by country, considering that the customer base consists primarily of wholesalers?¶
By 'invoice size', we mean the number of items per invoice. To answer the above question, we use GridDB's aggregation functions.
sql_select_query3 = '''SELECT
Country,
SUM(Quantity)/COUNT(DISTINCT InvoiceNo) AS "Avg. Invoice Size"
FROM Ecommerce_Data
GROUP BY 1
ORDER BY 2 DESC
'''
cursor = conn.cursor()
cursor.execute(sql_select_query3)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Avg_Invoice_Size = pd.DataFrame(results, columns=column_names)
cursor.close()
fig = ff.create_table(Avg_Invoice_Size)
fig.update_layout(
    title_text = '<b>Average Invoice Size by Country</b>',
    title_x=0.5, # Center-align the title horizontally
    title_font_size=20,
    margin = {'t':40, 'b':50})
fig.show()
Insights:
- Australia has the largest average invoice size, with wholesalers buying around 1210 items on average.
- It is followed by Japan and Canada, with around 900 and 460 items per invoice, respectively.
- The USA and Germany have the lowest average invoice sizes, at around 147 and 194 items per invoice, respectively.
What is the average unit price of cake-based products?¶
We use GridDB's conditional expressions, character functions and string operators in this case. Refer to this resource to learn more about Character Functions.
sql_select_query4 = '''SELECT
CASE WHEN LOWER(Description) like '%stand%' THEN 'Cake-Stand'
WHEN LOWER(Description) like '%tin%' THEN 'Cake-Tin'
WHEN LOWER(Description) like '%lace%' THEN 'Cake-Lace'
WHEN LOWER(Description) like '%candle%' THEN 'Cake Candle Set'
WHEN LOWER(Description) like '%case%' THEN 'Cake Cases'
WHEN LOWER(Description) like '%plate%' THEN 'Cake Plates'
WHEN LOWER(Description) like '%teapot%' THEN 'Cake Teapots'
WHEN LOWER(Description) like '%decor%' THEN 'Cake Decor'
WHEN LOWER(Description) like '%napkin%' OR LOWER(Description) like '%towel%' THEN 'Cake Napkins/Towels'
WHEN (LOWER(Description) like '%fork%' OR LOWER(Description) like '%chopsticks%') THEN 'Cake Forks/Chopsticks'
WHEN (LOWER(Description) like '%apron%' OR LOWER(Description) like '%frill%') THEN 'Cake Aprons/Frills'
WHEN LOWER(Description) like '%bake%' THEN 'Bake A Cake'
ELSE 'Others'
END AS Category,
ROUND(AVG(UnitPrice),2) AS Avg_Unit_Price
FROM Ecommerce_Data
WHERE (LOWER(Description) like '%cake%')
GROUP BY 1
ORDER BY 2 DESC
'''
cursor = conn.cursor()
cursor.execute(sql_select_query4)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Avg_Unit_Price = pd.DataFrame(results, columns=column_names)
cursor.close()
fig = ff.create_table(Avg_Unit_Price)
fig.update_layout(
    title_text = '<b>Average Unit Price of Cake-based products</b>',
    title_x=0.5, # Center-align the title horizontally
    title_font_size=16,
    margin = {'t':40, 'b':30})
fig.show()
Insight(s): Cake stands have the highest average unit price at £10.57, while cake tins cost £6.05. Cake plates and decor cost a little over £3, and cake forks and napkins cost a little over £1.
What is the breakdown of cancelled transactions by country?¶
Here, we use character comparison operators in GridDB. Refer to this resource to learn more.
sql_select_query5 = '''SELECT
Country,
count(InvoiceNo) as cancelled_transactions
FROM Ecommerce_Data
WHERE lower(InvoiceNo) like '%c%'
GROUP BY 1
ORDER BY 2 DESC
'''
cursor = conn.cursor()
cursor.execute(sql_select_query5)
column_names = [desc[0] for desc in cursor.description]
results = cursor.fetchall()
Cancelled_Transactions = pd.DataFrame(results, columns=column_names)
cursor.close()
# 'Cancelled_Transactions' is the DataFrame holding the query results
country_cancelled = Cancelled_Transactions.groupby('Country')['cancelled_transactions'].sum()
# Sort the data by cancelled transactions in descending order
country_cancelled = country_cancelled.sort_values(ascending=False)
# Plotting the bar chart
plt.figure(figsize=(12, 6)) # Adjust the figure size as needed
bars = country_cancelled.plot(kind='bar', color='skyblue')
# Adding values on top of bars
for bar in bars.patches:
    plt.annotate(format(bar.get_height(), '.0f'),
                 (bar.get_x() + bar.get_width() / 2,
                  bar.get_height()), ha='center', va='center',
                 size=10, xytext=(0, 8),
                 textcoords='offset points')
# Adding titles and labels
plt.title('Cancelled Transactions by Country')
plt.xlabel('Country')
plt.ylabel('Cancelled Transactions')
# Show the plot
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better visibility
plt.tight_layout()
plt.show()
Insight(s): Germany has the largest number of cancelled transactions, followed by France and the USA; Hong Kong has the fewest.
Concluding Remarks¶
In this analysis, we used GridDB's fast query engine to run queries and retrieve data in a seamless manner. We then used Python to visualize and analyze the results. We also learnt how to create a container and ingest data into GridDB using the JayDeBeAPI. Some insights that stand out:
- Cake stands have the highest average unit price at £10.57, while cake tins cost £6.05. Cake plates and decor cost a little over £3, and cake forks and napkins cost a little over £1.
- Australia has the largest average invoice size, followed by Japan and Canada, while the USA and Germany have the lowest. This means that wholesalers in Australia and Japan stock up far more than those in the USA and Germany, which may be due to several socio-economic factors that warrant further analysis.
- From the Market Basket Analysis, we learnt that wholesalers prefer buying the same item in different colors and patterns to be able to cater to their customers' preferences.
- On Fridays, wholesalers in France and Australia stocked up on night lights, while party candles sold heavily in the USA. Similarly, wholesalers stocked up on napkins and tissues in Japan and Germany on Fridays, and stationery sold the most in Hong Kong and Japan.
- Finally, Germany had the largest number of cancelled transactions whereas Hong Kong had the fewest. It is interesting to note that Hong Kong also has a higher average invoice size than Germany, so cancelled transactions and average invoice size may be correlated; for instance, Hong Kong being a major commercial hub might see larger transactions whose customers are less likely to cancel than those in Germany. However, further statistical analysis is needed to confirm any correlation.
In essence, this analysis highlights the use of GridDB as a powerful database for data-driven insights and for effective decision-making. The analysis also showcases the performance of GridDB's fast query engine combined with Python's versatility in gathering results from GridDB and providing interesting visuals to further substantiate findings.
If you have any questions about the blog, please create a Stack Overflow post here: https://stackoverflow.com/questions/ask?tags=griddb.
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.
