How do you know your business is going in the right direction?
With so much data at hand, it is easy to lose sight of the actual problem while cleaning and analysing it. To be honest, I was in the same place. As a Data Analyst/Data Scientist, I would look at the data first and try to make sense of it. But that not only consumed more time, it also made it difficult to explain why I made certain decisions.
A recent project on Business Intelligence taught me the power of problem-solving. For this tutorial, we will be exploring a dataset of a global store. To keep it general, I am going to assume our problem statement here is How to grow our business. Then, we will work through the problem statement in a tree-like framework. This will help structure our analysis and also yield specific KPIs.
Framework
The framework breaks this down into two major strategic approaches:
1. Increase Our Profit
There are a few ways to boost profits. This framework highlights three core strategies:
➤ Increase the Quantity Sold
More units sold usually means more revenue. To do this effectively:
- Ensure demand is met – make sure your supply chain and stock levels can support customer demand (we would need additional data to assess this).
- Focus on region-specific performance – some regions may outperform others, and identifying where sales are strongest helps target efforts better.
➤ Sell More Products with High Profit Margins
Not all products are created equal. Shifting focus toward high-margin items can improve profitability.
- Use product-level data to identify top performers.
- Understand regional trends to see where high-margin products sell best.
➤ Reduce the Shipping Cost
Shipping and logistics can quietly erode margins, so optimizing this is crucial.
- Analyze which products incur the highest shipping costs.
- Check which regions are driving those costs and explore smarter routing or fulfillment solutions.
2. Decrease Our Discount
While discounts can drive volume, excessive discounting eats into profits. Two key tactics can help manage this:
➤ Identify Best-Selling Products
Best-sellers may not need heavy discounting to move.
- Reassess discount policies at the product level.
- Consider performance variations across regions.
➤ Uncover Bottlenecks
Inefficiencies or slow-moving inventory often trigger unnecessary discounts.
- Identifying and addressing bottlenecks can reduce dependency on markdowns.
Now that we have a basic framework in place, let us look at each aspect and decide on the metrics.
What metrics are we tracking?
- Key Performance Indicators (KPIs):
- total sales
- total profit
- average shipping cost
- average discount
- Profitability:
- products with most profit
- regions with most profit
- Discount:
- best selling products (do not need discounts)
- bottlenecks due to discounts (low or negative profit margin)
- Market segmentation: Identifying customer patterns
- quantity sold by region
- profit by region
- shipping cost by region
- discount by region
Pre-requisites
The following libraries are required for the code below to execute:
- Pandas
- GridDB Python Client
- Seaborn
- Matplotlib
GridDB’s GitHub Page covers the installation in detail. Please go through it to be able to interact with the GridDB Server in Python.
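For the Python packages, pip is usually the quickest route. The GridDB client has also been published to PyPI for some platforms, but treat the GitHub instructions as the authoritative reference if the pip install fails:

pip install pandas seaborn matplotlib
pip install griddb_python  # availability varies by platform; build from source otherwise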
Great! Let’s go ahead and import the libraries.
Importing Libraries
# This step is optional. If you see import errors even though the Python client
# is installed, it is likely a path issue and can be resolved as follows:
import sys
sys.path.append("/home/shripriya/python_client")
# sys.path.append("<path_to_python_client>")
import griddb_python as griddb
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Dataset
We will be using an open-source SuperStore dataset that can be downloaded from here. Once we download it and place it in the same folder as the code, we can use pandas to read and clean it.
Reading the Dataframe
superstore_df = pd.read_excel("global_superstore_2016.xlsx", sheet_name="Orders")
superstore_df.head()
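As an aside, if you are unsure which sheets your copy of the workbook contains (sheet names can vary a little between versions of this dataset), pandas can list them before you commit to one. A quick sketch:

# List the workbook's sheet names first
xlsx = pd.ExcelFile("global_superstore_2016.xlsx")
print(xlsx.sheet_names)  # we expect an "Orders" sheet here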
We can see that our dataset has a lot of information on the orders being placed. Not all of it is useful here – this is where our framework comes in handy. So, we will go ahead and drop some columns that are not needed.
Exploratory Data Analysis (EDA)
# Drop columns not needed for our analysis: Order ID, Postal Code, Order Priority
cols = ["Order ID", "Postal Code", "Order Priority"]
superstore_df.drop(columns=cols, inplace=True)
superstore_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 51290 non-null int64
1 Order Date 51290 non-null datetime64[ns]
2 Ship Date 51290 non-null datetime64[ns]
3 Ship Mode 51290 non-null object
4 Customer ID 51290 non-null object
5 Customer Name 51290 non-null object
6 Segment 51290 non-null object
7 City 51290 non-null object
8 State 51290 non-null object
9 Country 51290 non-null object
10 Region 51290 non-null object
11 Market 51290 non-null object
12 Product ID 51290 non-null object
13 Category 51290 non-null object
14 Sub-Category 51290 non-null object
15 Product Name 51290 non-null object
16 Sales 51290 non-null float64
17 Quantity 51290 non-null int64
18 Discount 51290 non-null float64
19 Profit 51290 non-null float64
20 Shipping Cost 51290 non-null float64
dtypes: datetime64[ns](2), float64(4), int64(2), object(13)
memory usage: 8.2+ MB
Since there are no null values, we do not need to worry about imputation. The date columns are already parsed as datetimes, but we will convert them to Python datetime objects so that they map cleanly onto GridDB’s TIMESTAMP type before inserting the data.
# Prepare the DataFrame
superstore_df["Order Date"] = pd.to_datetime(superstore_df["Order Date"]).dt.to_pydatetime()
superstore_df["Ship Date"] = pd.to_datetime(superstore_df["Ship Date"]).dt.to_pydatetime()
# Rename columns to match GridDB naming (no spaces or dashes)
superstore_df.columns = [col.replace(" ", "_").replace("-", "_") for col in superstore_df.columns]
/tmp/ipykernel_1024/114528052.py:2: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result
superstore_df["Order Date"] = pd.to_datetime(superstore_df["Order Date"]).dt.to_pydatetime()
/tmp/ipykernel_1024/114528052.py:3: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result
superstore_df["Ship Date"] = pd.to_datetime(superstore_df["Ship Date"]).dt.to_pydatetime()
Let us verify that the column names have been updated and the datatypes are as expected.
superstore_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row_ID 51290 non-null int64
1 Order_Date 51290 non-null datetime64[ns]
2 Ship_Date 51290 non-null datetime64[ns]
3 Ship_Mode 51290 non-null object
4 Customer_ID 51290 non-null object
5 Customer_Name 51290 non-null object
6 Segment 51290 non-null object
7 City 51290 non-null object
8 State 51290 non-null object
9 Country 51290 non-null object
10 Region 51290 non-null object
11 Market 51290 non-null object
12 Product_ID 51290 non-null object
13 Category 51290 non-null object
14 Sub_Category 51290 non-null object
15 Product_Name 51290 non-null object
16 Sales 51290 non-null float64
17 Quantity 51290 non-null int64
18 Discount 51290 non-null float64
19 Profit 51290 non-null float64
20 Shipping_Cost 51290 non-null float64
dtypes: datetime64[ns](2), float64(4), int64(2), object(13)
memory usage: 8.2+ MB
Okay, everything looks great! We can now put our clean data into GridDB.
Putting the data into GridDB
Before putting the data into GridDB, we need to create something called containers. Containers are like tables in any database. We define our schema, which includes the name and data type of each column. Once we create the container, it is very easy to insert the data using the GridDB Python client.
Container Creation
factory = griddb.StoreFactory.get_instance()
host = "127.0.0.1:10001"
cluster = "myCluster"
db_user = "admin"
db_password = "admin"
try:
    gridstore = factory.get_store(notification_member=host, cluster_name=cluster, username=db_user, password=db_password)
    print("Connection successful")

    conInfo = griddb.ContainerInfo(
        name="sales_data",
        column_info_list=[
            ["Row_ID", griddb.Type.LONG],
            ["Order_Date", griddb.Type.TIMESTAMP],
            ["Ship_Date", griddb.Type.TIMESTAMP],
            ["Ship_Mode", griddb.Type.STRING],
            ["Customer_ID", griddb.Type.STRING],
            ["Customer_Name", griddb.Type.STRING],
            ["Segment", griddb.Type.STRING],
            ["City", griddb.Type.STRING],
            ["State", griddb.Type.STRING],
            ["Country", griddb.Type.STRING],
            ["Region", griddb.Type.STRING],
            ["Market", griddb.Type.STRING],
            ["Product_ID", griddb.Type.STRING],
            ["Category", griddb.Type.STRING],
            ["Sub_Category", griddb.Type.STRING],
            ["Product_Name", griddb.Type.STRING],
            ["Sales", griddb.Type.DOUBLE],
            ["Quantity", griddb.Type.INTEGER],
            ["Discount", griddb.Type.DOUBLE],
            ["Profit", griddb.Type.DOUBLE],
            ["Shipping_Cost", griddb.Type.DOUBLE]
        ],
        type=griddb.ContainerType.COLLECTION
    )
    container = gridstore.put_container(conInfo)
    print("Container creation successful")
except griddb.GSException as e:
    for i in range(e.get_error_stack_size()):
        print("[", i, "]")
        print("Error Code:", e.get_error_code(i))
        print("Location:", e.get_location(i))
        print("Message:", e.get_message(i))
Connection successful
Container creation successful
If executed correctly, the above code should print out Connection successful followed by Container creation successful. If not, the exception message will help you troubleshoot. More information on the error codes is available here.
Once our container is in place, let us put the data into it using the following code.
Inserting the data
container.put_rows(superstore_df)
print("Data insertion successful")
Data insertion successful
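As an optional first check, a row count can catch obvious insertion problems early. A minimal sketch, reusing the container object returned by put_container and the client’s aggregation result API (the same pattern we wrap in a helper function later):

# Optional sanity check: the container should hold as many rows as the DataFrame
query = container.query("SELECT COUNT(*)")
rs = query.fetch()
if rs.has_next():
    row_count = rs.next().get(type=griddb.Type.LONG)
    print("Rows in container:", row_count)  # expect 51290 for this dataset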
Beyond a quick row count, it is important to verify that our data has been inserted correctly before moving on to the data analysis. Let’s go ahead and do that.
Reading the data
try:
    gridstore = factory.get_store(notification_member=host, cluster_name=cluster, username=db_user, password=db_password)
    cont = gridstore.get_container("sales_data")
    if cont is None:
        print("Does not exist")
    print("Connection successful")

    query_str = "SELECT *"
    query = cont.query(query_str)
    rs = query.fetch()
    df = rs.fetch_rows()
    print(df.info())
except griddb.GSException as e:
    for i in range(e.get_error_stack_size()):
        print("[", i, "]")
        print(e.get_error_code(i))
        print(e.get_location(i))
        print(e.get_message(i))
Connection successful
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row_ID 51290 non-null int64
1 Order_Date 51290 non-null datetime64[ns]
2 Ship_Date 51290 non-null datetime64[ns]
3 Ship_Mode 51290 non-null object
4 Customer_ID 51290 non-null object
5 Customer_Name 51290 non-null object
6 Segment 51290 non-null object
7 City 51290 non-null object
8 State 51290 non-null object
9 Country 51290 non-null object
10 Region 51290 non-null object
11 Market 51290 non-null object
12 Product_ID 51290 non-null object
13 Category 51290 non-null object
14 Sub_Category 51290 non-null object
15 Product_Name 51290 non-null object
16 Sales 51290 non-null float64
17 Quantity 51290 non-null int64
18 Discount 51290 non-null float64
19 Profit 51290 non-null float64
20 Shipping_Cost 51290 non-null float64
dtypes: datetime64[ns](2), float64(4), int64(2), object(13)
memory usage: 8.2+ MB
None
Everything looks great! We can now go ahead with our analysis.
Analysis
I will define a helper function that takes the container, a query string, and the output data type as input and returns the result of the query. This avoids writing the same code over and over again.
def fetch_data(container, query_str, data_type):
    # Run a TQL aggregation query and return its single scalar result
    query = container.query(query_str)
    row_set = query.fetch()
    if row_set.has_next():
        result = row_set.next()
        return result.get(type=data_type)
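For example, a hypothetical one-liner to pull the largest profit recorded on a single order line (other aggregates such as COUNT(*) work the same way, with a matching GridDB type):

# Largest single-order-line profit, fetched via the helper
max_profit = fetch_data(cont, "SELECT MAX(Profit) FROM sales_data", griddb.Type.DOUBLE)
print(f"Max single-order profit: ${max_profit:,.2f}")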
1. Key Performance Indicators (KPIs)
Referencing the framework, we are tracking 4 KPIs:
- Total Sales
- Total Profit
- Avg Shipping Cost
- Avg Discount
total_sales = fetch_data(cont, 'SELECT SUM(Sales) FROM sales_data', griddb.Type.DOUBLE)
total_profit = fetch_data(cont, 'SELECT SUM(Profit) FROM sales_data', griddb.Type.DOUBLE)
avg_shipping = fetch_data(cont, 'SELECT AVG(Shipping_Cost) FROM sales_data', griddb.Type.DOUBLE)
avg_discount = fetch_data(cont, 'SELECT AVG(Discount) FROM sales_data', griddb.Type.DOUBLE)
print(f"🔢 KPIs")
print(f"Total Sales: ${total_sales:,.2f}")
print(f"Total Profit: ${total_profit:,.2f}")
print(f"Avg Shipping Cost: ${avg_shipping:.2f}")
print(f"Avg Discount: {avg_discount:.2%}")
🔢 KPIs
Total Sales: $12,642,501.91
Total Profit: $1,467,457.29
Avg Shipping Cost: $26.48
Avg Discount: 14.29%
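As a quick aside, the two totals above also give us a derived KPI that the framework cares about implicitly – the overall profit margin. A small sketch:

# Overall profit margin derived from the totals above
overall_margin = total_profit / total_sales
print(f"Overall Profit Margin: {overall_margin:.2%}")  # ≈ 11.61% with the numbers above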
Let’s go ahead and plot these.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
fig.suptitle("Key Performance Indicators (KPIs)", fontsize=20, fontweight='bold')

axes = axes.flatten()
for ax in axes:
    ax.axis('off')

# KPIs
kpi_labels = ["Total Sales", "Total Profit", "Avg Shipping Cost", "Avg Discount"]
kpi_values = [
    f"${total_sales:,.2f}",
    f"${total_profit:,.2f}",
    f"${avg_shipping:.2f}",
    f"{avg_discount:.2%}"
]
kpi_colors = ["#4CAF50", "#2196F3", "#FF9800", "#9C27B0"]

# Fill KPI cards
for i, ax in enumerate(axes):
    ax.text(0.5, 0.6, kpi_labels[i], fontsize=16, ha='center', va='center', weight='bold', color='gray')
    ax.text(0.5, 0.4, kpi_values[i], fontsize=24, ha='center', va='center', weight='bold', color=kpi_colors[i])
    ax.set_facecolor("#f7f7f7")
    ax.set_frame_on(True)
    ax.patch.set_linewidth(1)
    ax.patch.set_edgecolor("#ddd")

# Add vertical and horizontal dividers
fig.subplots_adjust(hspace=0.3, wspace=0.3)
fig_width, fig_height = fig.get_size_inches()

# Vertical line in the center
fig.lines.append(plt.Line2D([0.5, 0.5], [0.05, 0.95], color="lightgray", linewidth=1, linestyle="--", transform=fig.transFigure))
# Horizontal line in the center
fig.lines.append(plt.Line2D([0.05, 0.95], [0.5, 0.5], color="lightgray", linewidth=1, linestyle="--", transform=fig.transFigure))

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
For the next set of metrics, we will be using multiple columns of the database. Let’s retrieve them as a DataFrame so that we can easily leverage pandas groupby and aggregation.
query_str="SELECT *"
query = cont.query(query_str)
rs = query.fetch()
df = rs.fetch_rows()
2. Profitability
We are tracking 2 key things here:
- products with most profit
- regions with most profit
top_products_profit = df.groupby('Product_Name')['Profit'].sum().sort_values(ascending=False).head(10)
region_profit = df.groupby('Region')['Profit'].sum().sort_values(ascending=False)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.barplot(x=top_products_profit.values, y=top_products_profit.index,
hue=top_products_profit.index, palette="viridis", legend=False, ax=axes[0])
axes[0].set_title("Top 10 Products by Profit")
axes[0].set_xlabel("Profit")
axes[0].set_ylabel("Product")
sns.barplot(x=region_profit.values, y=region_profit.index,
hue=region_profit.index, palette="magma", legend=False, ax=axes[1])
axes[1].set_title("Profit by Region")
axes[1].set_xlabel("Profit")
axes[1].set_ylabel("Region")
plt.tight_layout()
plt.show()
3. Discount
Similarly, for the second branch in our tree framework – Discount, we will be tracking 2 metrics:
- Best selling products (products that do not need discounts)
- Bottlenecks due to discounts (products with low or negative profit margin)
top_sales_no_discount = df[df['Discount'] == 0].groupby('Product_Name')['Sales'].sum().sort_values(ascending=False).head(10)
df['Profit_Margin'] = df['Profit'] / df['Sales']
low_margin = df[(df['Discount'] > 0.3) & (df['Profit_Margin'] < 0)].groupby('Product_Name')['Profit'].sum().sort_values().head(10)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.barplot(x=top_sales_no_discount.values, y=top_sales_no_discount.index,
hue=top_sales_no_discount.index, palette="cubehelix", legend=False, ax=axes[0])
axes[0].set_title("Top Selling Products Without Discount")
axes[0].set_xlabel("Sales")
axes[0].set_ylabel("Product")
sns.barplot(x=low_margin.values, y=low_margin.index,
hue=low_margin.index, palette="coolwarm", legend=False, ax=axes[1])
axes[1].set_title("High Discount, Low/Negative Profit")
axes[1].set_xlabel("Profit")
axes[1].set_ylabel("Product")
plt.tight_layout()
plt.show()
4. Market Segmentation
Lastly, we also want to look at market segmentation, which we can break down into four views:
- Quantity sold by region
- Profit by region
- Shipping cost by region
- Discount by region
quantity_by_region = df.groupby('Region')['Quantity'].sum().sort_values(ascending=False)
shipping_by_region = df.groupby('Region')['Shipping_Cost'].mean().sort_values(ascending=False)
discount_by_region = df.groupby('Region')['Discount'].mean().sort_values(ascending=False)
fig, axes = plt.subplots(2, 2, figsize=(18, 12))
# Plot 1: Quantity Sold by Region
sns.barplot(x=quantity_by_region.values, y=quantity_by_region.index,
hue=quantity_by_region.index, palette="crest", legend=False, ax=axes[0, 0])
axes[0, 0].set_title("Quantity Sold by Region")
axes[0, 0].set_xlabel("Quantity")
axes[0, 0].set_ylabel("Region")
# Plot 2: Profit by Region
sns.barplot(x=region_profit.values, y=region_profit.index,
hue=region_profit.index, palette="flare", legend=False, ax=axes[0, 1])
axes[0, 1].set_title("Profit by Region")
axes[0, 1].set_xlabel("Profit")
axes[0, 1].set_ylabel("Region")
# Plot 3: Shipping Cost by Region
sns.barplot(x=shipping_by_region.values, y=shipping_by_region.index,
hue=shipping_by_region.index, palette="rocket", legend=False, ax=axes[1, 0])
axes[1, 0].set_title("Shipping Cost by Region")
axes[1, 0].set_xlabel("Shipping Cost")
axes[1, 0].set_ylabel("Region")
# Plot 4: Discount by Region
sns.barplot(x=discount_by_region.values, y=discount_by_region.index,
hue=discount_by_region.index, palette="ch:start=.2,rot=-.3", legend=False, ax=axes[1, 1])
axes[1, 1].set_title("Average Discount by Region")
axes[1, 1].set_xlabel("Discount")
axes[1, 1].set_ylabel("Region")
plt.tight_layout()
plt.show()
Insights
The problem statement was how to grow our business. We focused on increasing our profits and reducing our discounts. So what insights can we draw from the visualizations above?
- Certain smartphones are among the most profitable products for our business, so I would focus on whether their demand is being met and how we can improve the customer experience in that category.
- Stores in Africa and Central Asia are facing overall losses, so we should consider scaling back operations there. More data is needed before acting on this.
- The second point is reinforced when we look at the average discount in Africa and Asia – discounts there are among the highest while profit margins are negative. These regions are therefore contributing negatively to our overall business.
- Western Europe appears to be our biggest market, with higher demand and higher profit margins, so we should double down here to grow our business.
Conclusion
In this tutorial, we developed an end-to-end BI solution for a superstore. We started with problem-solving and critical thinking, followed up with a detailed analysis, and finished with actionable insights. This is close to how a Data/Business Analyst would approach a business problem in the real world.
We also saw that GridDB can serve as a centralized data repository, and with GridDB’s Python client, integrating with pandas for analysis was effortless.
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.