Exploring customers’ personalities using Python

In this tutorial, we will explore a dataset containing customer information such as Marital Status, Education, Income, etc. Insight into the kinds of customers a firm deals with helps it provide better, more customized services. Customer analysis and segmentation therefore play a crucial role in a business.

The outline for this tutorial is as follows:

  1. About the Dataset
  2. Importing the necessary libraries
  3. Exploratory Data Analysis
  4. Plotting the correlation
  5. Encoding the data
  6. Scaling the data
  7. Dimensionality reduction
  8. Clustering
  9. Visualizing the clusters
  10. Observation
  11. Conclusion

1. About the Dataset

The dataset used for this tutorial is publicly available on Kaggle. The total number of instances (or rows) is 2240, whereas the total number of attributes (or columns) is 29. As mentioned above, each attribute corresponds to a customer trait relevant for customer classification, such as Marital Status, Income, Customer ID, etc.

Go ahead and download the dataset! We will next be importing libraries to begin our analysis.

2. Importing the necessary libraries

import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

In case you run into a package installation error, you can install it by typing pip install package-name in the command line. Alternatively, if you’re using a conda virtual environment, you can type conda install package-name.

3. Loading our dataset and Exploratory Data Analysis (EDA)

Let’s go ahead and load our CSV file.

Using GridDB

To store large amounts of data, a CSV file can be cumbersome and chaotic. GridDB serves as a great alternative, as it is an open-source, highly scalable database. It is optimized for IoT and Big Data, so you can store your time-series data effectively. It utilizes an in-memory data architecture along with parallel processing for higher performance. You can download GridDB from its official website.

By using the GridDB-Python client, we can load our data directly into the Python environment as a dataframe.

import griddb_python as griddb
import pandas as pd

sql_statement = 'SELECT * FROM marketing_campaign'
data = pd.read_sql_query(sql_statement, cont)

Note that the cont variable holds the connection and container information for where our data is stored. Replace marketing_campaign with the name of your container. You can also refer to the tutorial on reading and writing to GridDB for more information.
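For completeness, here is a minimal sketch of how the cont object might be obtained with the GridDB Python client. The host, port, cluster name, and credentials below are placeholders that depend entirely on your own GridDB setup:

factory = griddb.StoreFactory.get_instance()

# Placeholder connection details - adjust these to your own GridDB configuration
gridstore = factory.get_store(
    host="239.0.0.1",
    port=31999,
    cluster_name="defaultCluster",
    username="admin",
    password="admin"
)

# Fetch the container that stores the marketing_campaign data
cont = gridstore.get_container("marketing_campaign")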

Using Pandas

If you do not have GridDB, you can load the file using Pandas.

data = pd.read_csv('marketing_campaign.csv', sep="\t")

We’ll display the first five rows using the head command to get a feel for our data.

data.head()
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 5 0 0 0 0 0 0 3 11 0

5 rows × 29 columns

len(data)
2240

As mentioned earlier, we have a total of 2240 instances. Let’s drop any rows with missing values, since they can cause problems during mathematical operations.

data = data.dropna()
len(data)
2216

We had 24 instances with missing values in our dataset. Let’s also convert the Dt_Customer attribute to the DateTime format, as it will later be used to calculate how long a customer has been active.

data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"])
dates = []
for i in data["Dt_Customer"]:
    i = i.date()
    dates.append(i) 

We will create a new attribute, Customer_dur, to capture the duration for which a customer has been associated with the firm. For the sake of simplicity, we will take the enrollment date of the most recent customer and use it as the reference point for calculating the duration for all the others.

days = []
d1 = max(dates) 
for i in dates:
    delta = d1 - i
    days.append(delta)
data["Customer_dur"] = days
data["Customer_dur"] = pd.to_numeric(data["Customer_dur"], errors="coerce")

Great! Now let’s have a look at the next attribute, Marital_Status. From the head command, it seems like there are more than two categories.

data["Marital_Status"].value_counts()
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64

There are 8 categories for this attribute. However, we will group them together to simplify our segmentation process by creating a binary attribute, Living_With, that takes one of two values: Partner or Alone.

data["Living_With"]=data["Marital_Status"].replace({"Married":"Partner", "Together":"Partner", "Absurd":"Alone", "Widow":"Alone", "YOLO":"Alone", "Divorced":"Alone", "Single":"Alone",})

Now that we have this derived feature, Marital_Status is redundant, so let’s go ahead and drop it. We’ll also drop some columns that contain campaign-acceptance, complaint, and response information, as they are not relevant for our customer segmentation.

cols_del = ['Marital_Status', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
data = data.drop(cols_del, axis=1)

Let’s have a look at how many categories the Education attribute holds.

data["Education"].value_counts()
Graduation    1116
PhD            481
Master         365
2n Cycle       200
Basic           54
Name: Education, dtype: int64

We will repeat the same grouping for the Education attribute. In this case, the categories will be Undergraduate, Graduate, and Postgraduate.

data["Education"]=data["Education"].replace({"Basic":"Undergraduate","2n Cycle":"Undergraduate", "Graduation":"Graduate", "Master":"Postgraduate", "PhD":"Postgraduate"})

We will be creating some new features derived from the original ones for easy calculations in the later sections. These features are as follows:

  1. Age: can be derived from Year_Birth.
  2. Spent: total amount spent across all product categories (wine, fruits, meat, fish, sweets, and gold).
  3. Children: sum of Kidhome and Teenhome.
  4. Family_Size: We’ll use the Living_With attribute along with the Children attribute.
  5. Is_Parent: Binary attribute with values 0 or 1 derived from the Children attribute.

data["Age"] = 2021-data["Year_Birth"]
data["Spent"] = data["MntWines"]+ data["MntFruits"]+ data["MntMeatProducts"]+ data["MntFishProducts"]+ data["MntSweetProducts"]+ data["MntGoldProds"]
data["Children"]=data["Kidhome"]+data["Teenhome"]
data["Family_Size"] = data["Living_With"].replace({"Alone": 1, "Partner":2})+ data["Children"]
data["Is_Parent"] = np.where(data.Children> 0, 1, 0)

As before, we’ll drop the now-redundant attributes.

to_drop = ["Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"]
data = data.drop(to_drop, axis=1)

After all these changes, our data looks like this:

data.describe()
Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Customer_dur Age Spent Children Family_Size Is_Parent
count 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2.216000e+03 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000
mean 52247.251354 0.441787 0.505415 49.012635 305.091606 26.356047 166.995939 37.637635 27.028881 43.965253 4.085289 2.671029 5.800993 5.319043 4.423735e+16 52.179603 607.075361 0.947202 2.592509 0.714350
std 25173.076661 0.536896 0.544181 28.948352 337.327920 39.793917 224.283273 54.752082 41.072046 51.815414 2.740951 2.926734 3.250785 2.425359 2.008532e+16 11.985554 602.900476 0.749062 0.905722 0.451825
min 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 25.000000 5.000000 0.000000 1.000000 0.000000
25% 35303.000000 0.000000 0.000000 24.000000 24.000000 2.000000 16.000000 3.000000 1.000000 9.000000 2.000000 0.000000 3.000000 3.000000 2.937600e+16 44.000000 69.000000 0.000000 2.000000 0.000000
50% 51381.500000 0.000000 0.000000 49.000000 174.500000 8.000000 68.000000 12.000000 8.000000 24.500000 4.000000 2.000000 5.000000 6.000000 4.432320e+16 51.000000 396.500000 1.000000 3.000000 1.000000
75% 68522.000000 1.000000 1.000000 74.000000 505.000000 33.000000 232.250000 50.000000 33.000000 56.000000 6.000000 4.000000 8.000000 7.000000 5.927040e+16 62.000000 1048.000000 1.000000 3.000000 1.000000
max 666666.000000 2.000000 2.000000 99.000000 1493.000000 199.000000 1725.000000 259.000000 262.000000 321.000000 27.000000 28.000000 13.000000 20.000000 9.184320e+16 128.000000 2525.000000 3.000000 5.000000 1.000000

8 rows × 21 columns

4. Plotting the correlation

We will now plot these newly created features pair-wise, using Is_Parent as the hue (classification) attribute.

To_Plot = [ "Income", "Age", "Spent", "Is_Parent"]
# pairplot creates its own figure, so an explicit plt.figure() call is not needed
sns.pairplot(data[To_Plot], hue="Is_Parent")
plt.show()

We can see that we have a few outliers:

  1. Age > 100 is highly unlikely, which suggests that some birth-year entries are outdated. The bulk of the data lies below 90, so we’ll eliminate the few instances above that threshold.

  2. Income > 600000, which occurs in only one instance.

We’ll go ahead and remove these outliers.

data = data[(data["Age"]<90)]
data = data[(data["Income"]<600000)]
len(data)
2212

5. Encoding the data

Categorical variables need to be encoded before a machine learning task, as string values cannot be used directly in mathematical operations. Let’s print out the categorical attributes in our dataset.

s = (data.dtypes == 'object')
object_cols = list(s[s].index)

object_cols
['Education', 'Living_With']

We will use scikit-learn’s LabelEncoder, which encodes categorical values as integer classes from 0 to n-1. In our case, Education has three categories, so its values will be encoded as 0, 1, and 2. Similarly, Living_With is a binary attribute, so its values will be 0 or 1.

LE=LabelEncoder()
for i in object_cols:
    data[i]=data[[i]].apply(LE.fit_transform)
data.head()
Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Customer_dur Living_With Age Spent Children Family_Size Is_Parent
0 0 58138.0 0 0 58 635 88 546 172 88 10 4 7 83894400000000000 0 64 1617 0 1 0
1 0 46344.0 1 1 38 11 1 6 2 1 1 2 5 10800000000000000 0 67 27 2 3 1
2 0 71613.0 0 0 26 426 49 127 111 21 2 10 4 40780800000000000 1 56 776 0 2 0
3 0 26646.0 1 0 26 11 4 20 10 3 0 4 6 5616000000000000 1 37 53 1 3 1
4 1 58293.0 1 0 94 173 43 118 46 27 3 6 5 27734400000000000 1 40 422 1 3 1

5 rows × 23 columns
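As an aside, LabelEncoder assigns the integer labels in alphabetical order of the class names, so the mapping for Education can be reproduced with a small illustrative sketch (not part of the main pipeline):

le_edu = LabelEncoder()
le_edu.fit(["Undergraduate", "Graduate", "Postgraduate"])
# Prints {'Graduate': 0, 'Postgraduate': 1, 'Undergraduate': 2}
print(dict(zip(le_edu.classes_, le_edu.transform(le_edu.classes_))))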

6. Scaling the data

It is also important to note that these numerical values differ in scale. This could lead to a biased model that gives far more weight to one attribute than another. Therefore, it is important to bring them onto a similar scale.

StandardScaler removes the mean and scales each feature to unit variance. More information on how the scaling is calculated can be found in the scikit-learn documentation.

scaler = StandardScaler()
scaler.fit(data)
scaled_data = pd.DataFrame(scaler.transform(data),columns= data.columns)
scaled_data.head()
Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Customer_dur Living_With Age Spent Children Family_Size Is_Parent
0 -0.893586 0.287105 -0.822754 -0.929699 0.310353 0.977660 1.552041 1.690293 2.453472 1.483713 2.503607 -0.555814 0.692181 1.973583 -1.349603 1.018352 1.676245 -1.264598 -1.758359 -1.581139
1 -0.893586 -0.260882 1.040021 0.908097 -0.380813 -0.872618 -0.637461 -0.718230 -0.651004 -0.634019 -0.571340 -1.171160 -0.132545 -1.665144 -1.349603 1.274785 -0.963297 1.404572 0.449070 0.632456
2 -0.893586 0.913196 -0.822754 -0.929699 -0.795514 0.357935 0.570540 -0.178542 1.339513 -0.147184 -0.229679 1.290224 -0.544908 -0.172664 0.740959 0.334530 0.280110 -1.264598 -0.654644 -1.581139
3 -0.893586 -1.176114 1.040021 -0.929699 -0.795514 -0.872618 -0.561961 -0.655787 -0.504911 -0.585335 -0.913000 -0.555814 0.279818 -1.923210 0.740959 -1.289547 -0.920135 0.069987 0.449070 0.632456
4 0.571657 0.294307 1.040021 -0.929699 1.554453 -0.392257 0.419540 -0.218684 0.152508 -0.001133 0.111982 0.059532 -0.132545 -0.822130 0.740959 -1.033114 -0.307562 0.069987 0.449070 0.632456

5 rows × 23 columns
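As an optional sanity check, the scaled columns should now have a mean of approximately zero and a standard deviation of approximately one:

# Means should be ~0 and standard deviations ~1 after standardization
print(scaled_data.mean().round(3))
print(scaled_data.std().round(3))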

7. Dimensionality reduction

We now have the scaled data, but the number of columns is still large. More columns mean higher dimensionality, which makes the data harder to work with. For the sake of simplicity, we will reduce the data to 3 dimensions, as some of the features are redundant with one another.

Principal Component Analysis (PCA) is one of the most popular techniques for dimensionality reduction, as it retains as much of the original variance as possible while reducing the number of dimensions.

pca = PCA(n_components=3)
pca.fit(scaled_data)
pca_data = pd.DataFrame(pca.transform(scaled_data), columns=(["c1","c2", "c3"]))
pca_data.describe().T
count mean std min 25% 50% 75% max
c1 2212.0 -2.549698e-16 2.878377 -5.969394 -2.538494 -0.780421 2.383290 7.444304
c2 2212.0 -3.924929e-17 1.706839 -4.312236 -1.328300 -0.158123 1.242307 6.142677
c3 2212.0 6.936384e-17 1.221957 -3.530349 -0.828741 -0.021947 0.799472 6.614546
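Optionally, it is worth checking how much of the total variance the three components retain; the exact numbers will depend on your run:

# Fraction of the total variance explained by each principal component
print(pca.explained_variance_ratio_)
print("Total variance retained:", round(pca.explained_variance_ratio_.sum(), 3))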

8. Clustering

Now that our data has been cleaned and reduced to 3 dimensions, we can go ahead and divide it into clusters. For this tutorial, we will be using Agglomerative Clustering from the scikit-learn library. Agglomerative Clustering is a hierarchical, bottom-up method that repeatedly merges the closest clusters according to a linkage distance. More information can be found in the scikit-learn documentation.

The number of clusters is chosen as 4. For real-world models, however, there are approaches such as the Elbow method to estimate a suitable number of clusters for a given dataset, as sketched below.
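As a rough sketch of the Elbow method (not part of the original workflow), we can plot the within-cluster sum of squares for a range of cluster counts. KMeans is used here only because AgglomerativeClustering does not expose an inertia value:

from sklearn.cluster import KMeans

# Inertia (within-cluster sum of squares) for k = 2..10;
# the "elbow" of this curve suggests a reasonable number of clusters
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(pca_data)
    inertias.append(km.inertia_)

plt.plot(range(2, 11), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()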

ac = AgglomerativeClustering(n_clusters=4)
customer_ac = ac.fit_predict(pca_data)

Let’s add the cluster labels to our dataframes.

pca_data["Clusters"] = customer_ac
data["Clusters"]= customer_ac

9. Visualizing the clusters

Great! It is now time to plot our clusters and see how they look in 3D.

x =pca_data["c1"]
y =pca_data["c2"]
z =pca_data["c3"]

fig = plt.figure(figsize=(8,8))
ax = plt.subplot(111, projection='3d')
ax.scatter(x, y, z, s=20, c=pca_data["Clusters"])
plt.show()

We will now plot the pairwise joint plots of the Spent attribute against the other attributes we created earlier in this tutorial. This will give us an insight into how each attribute relates to the final clusters and what kind of customers belong to each one.

columns = [ "Customer_dur", "Age", "Family_Size", "Is_Parent", "Education","Living_With"]
h = data["Clusters"]
for col in columns:
    # jointplot creates its own figure, so an explicit plt.figure() call is not needed
    sns.jointplot(x=data[col], y=data["Spent"], hue=h, kind="kde")
    plt.show()

10. Observation

Some key insights from the above joint plots:

  1. Cluster 0: Certainly a parent, 40-70 age group, family size of 2-4.
  2. Cluster 1: Not a parent, 30-80 age group, family size of 1-2.
  3. Cluster 2: The majority of customers are parents, all age groups from 20-80, family size of 1-3.
  4. Cluster 3: Certainly a parent, 35-75 age group, family size of 2-5.

The Customer_dur attribute spans all the clusters rather than being specific to one, which is why its distribution appears widely spread.
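These qualitative cluster profiles can also be cross-checked numerically with a simple per-cluster aggregation of the original features; a minimal sketch:

# Average profile of each cluster across a few key features
cluster_profile = data.groupby("Clusters")[["Age", "Family_Size", "Is_Parent", "Children", "Spent", "Income"]].mean().round(2)
print(cluster_profile)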

Note: The seaborn library was not updated to its latest version on the system where this tutorial was executed, so some features, such as the hue-aware contours, did not behave as expected. With an up-to-date environment, the above graphs will be much clearer, with distinct contour lines and cluster labels.

11. Conclusion

In this tutorial, we explored customer profiles and saw how they can inform a firm’s business decisions. We also segmented the customers using Agglomerative Clustering and, in the end, identified some key features corresponding to each cluster. Finally, we introduced GridDB as an alternative for storing large amounts of data in a highly efficient and scalable way!

The code in this tutorial was inspired by a notebook on Kaggle.

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.
