In this tutorial, we will explore the Heart Failure Prediction dataset, which is publicly available on Kaggle. We will use GridDB to see how we can extract the data. Later, we will perform some Exploratory Data Analysis. Finally, we will build a Machine Learning model for making future predictions. The outline of this tutorial is as follows:
- Setting up your environment
- Introduction to the dataset
- Importing the necessary libraries
- Loading the Dataset
- Exploratory Data Analysis
- Handling categorical variables
- Machine Learning Model
- Model Evaluation
- Conclusion
- References
You can read the Jupyter file here: https://github.com/griddbnet/Blogs/blob/main/Heart%20Failure%20Prediction.ipynb
1. Setting up your environment
The following tutorial is carried out in Jupyter Notebooks (Anaconda version 4.8.3) with Python version 3.8 on the Windows 10 operating system. The following packages need to be installed before executing the code: NumPy, Pandas, Matplotlib, Seaborn, Plotly, and scikit-learn. Each package's documentation covers the installation steps. Alternatively, if you are using a command line, simply type pip install package-name. Or, in the case of Anaconda, conda install package-name also works.
While loading the dataset, this tutorial will cover two methods – Using GridDB as well as Using Pandas. To access GridDB using Python, the following packages also need to be installed beforehand:
- GridDB C-client
- SWIG (Simplified Wrapper and Interface Generator)
- GridDB Python Client
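Once these are installed, a quick way to confirm that the client is importable from Python is a minimal check like the one below. This is only a sketch and assumes the C client, SWIG, and the Python client were built and installed successfully:
import griddb_python as griddb
# Obtaining the store factory is enough to confirm that the native bindings load.
factory = griddb.StoreFactory.get_instance()
print("GridDB Python client loaded:", factory is not None)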
2. Introduction to the dataset
Cardiovascular disease is one of the leading causes of death worldwide. Therefore, if machine learning can help predict heart failure, the contribution would be significant. The dataset used in this tutorial was developed by Davide Chicco and Giuseppe Jurman and published in BMC Medical Informatics and Decision Making. It has been open-sourced and can be downloaded from Kaggle.
The data contains a total of 918 instances (or rows) with 12 attributes (or columns). Out of these 12 attributes, 5 are categorical and 7 are numerical in nature. Let’s now go ahead and import the necessary libraries.
3. Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
import plotly.express as px
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import plot_confusion_matrix
If the installation was successful, the above cell should execute without any error messages or warnings. However, if you do encounter an error:
- Recheck that the installation was successful. If not, execute pip install package-name again.
- Check that the versions of the installed packages are compatible with your Anaconda/system version.
4. Loading the dataset
4.1 Using GridDB
GridDB is a scalable, in-memory NoSQL database which makes it easier for you to store large amounts of data. Using GridDB's Python client, we can load our data directly into the Python environment as a pandas dataframe. If you are new to GridDB, a tutorial on reading and writing to GridDB can be useful.
Assuming that you have already set up your database, we will now write the SQL query in Python to load our dataset.
import griddb_python as griddb
sql_statement = ('SELECT * FROM heart_failure_prediction')
heart_dataset = pd.read_sql_query(sql_statement, cont)
The cont variable holds the container information where the data is stored.
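For completeness, here is a hedged sketch of how a container reference like cont might be obtained with the GridDB Python client. The host, port, cluster name, and credentials below are placeholders, not part of the original notebook; replace them with the settings of your own installation:
factory = griddb.StoreFactory.get_instance()
# Placeholder connection details for a default multicast setup.
gridstore = factory.get_store(host="239.0.0.1", port=31999,
                              cluster_name="defaultCluster",
                              username="admin", password="admin")
# Fetch the container that stores the heart failure data.
cont = gridstore.get_container("heart_failure_prediction")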
4.2 Using Pandas
Alternatively, we can use the pandas read_csv() function. Note that both methods result in the same output, as both load the data in the form of a pandas dataframe.
heart_dataset = pd.read_csv('heart.csv')
5. Exploratory Data Analysis
Let us first determine the shape of our dataset, i.e. the number of rows and the number of columns.
heart_dataset.shape
(918, 12)
We will now display the first five rows of our data using the pandas head() function to get a gist of what our data looks like.
heart_dataset.head()
|  | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
Great! There is a mix of categorical and numerical values in this dataset. Note that we cannot pass categorical variables directly to our machine learning model; we will have to encode them before model training. Let us go ahead and check the data types of our attributes.
heart_dataset.dtypes
Age int64
Sex object
ChestPainType object
RestingBP int64
Cholesterol int64
FastingBS int64
RestingECG object
MaxHR int64
ExerciseAngina object
Oldpeak float64
ST_Slope object
HeartDisease int64
dtype: object
Five of the attributes have the data type object, which signifies that they are categorical in nature, while the rest are either float or int and can be passed directly during model training.
We will also check for null values (if any), as they can produce errors during mathematical operations.
heart_dataset.isna().sum()
Age 0
Sex 0
ChestPainType 0
RestingBP 0
Cholesterol 0
FastingBS 0
RestingECG 0
MaxHR 0
ExerciseAngina 0
Oldpeak 0
ST_Slope 0
HeartDisease 0
dtype: int64
Fortunately, we do not have any null values. We will now explore the categorical variables before moving on to the Machine Learning part.
categorical_cols= heart_dataset.select_dtypes(include=['object'])
categorical_cols.columns
Index(['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], dtype='object')
for cols in categorical_cols.columns:
print(cols,'-', len(categorical_cols[cols].unique()),'Labels')
Sex - 2 Labels
ChestPainType - 4 Labels
RestingECG - 3 Labels
ExerciseAngina - 2 Labels
ST_Slope - 3 Labels
Since the data comes as a single CSV file, it is better to split our dataset into train and test sets so that we can keep the test dataset aside for calculating the accuracy in later stages. We use a 70:30 ratio for train:test. The split is random, which minimizes any bias or skewness, and the random_state argument makes it reproducible.
train, test = train_test_split(heart_dataset,test_size=0.3,random_state= 1234)
labels = [x for x in train.ChestPainType.value_counts().index]
values = train.ChestPainType.value_counts()
The distribution of data by Chest Pain Type:
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(
title_text="Distribution of data by Chest Pain Type (in %)")
fig.update_traces()
fig.show()
The distribution of data by gender, further divided by whether or not a person has heart disease:
fig=px.histogram(heart_dataset,
x="HeartDisease",
color="Sex",
hover_data=heart_dataset.columns,
title="Distribution of Heart Diseases by Gender",
barmode="group")
fig.show()
Try experimenting with other categorical variables using the histogram or pie functions.
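For instance, the same donut-style pie chart can be drawn for ST_Slope by reusing the pattern above (a minimal variation, not part of the original notebook):
labels = [x for x in train.ST_Slope.value_counts().index]
values = train.ST_Slope.value_counts()
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(title_text="Distribution of data by ST Slope (in %)")
fig.show()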
6. Handling categorical variables
We saw that 2 of the 5 categorical attributes, Sex and ExerciseAngina, are binary, i.e. they take only two values. We can therefore encode them manually as 0 and 1. For the other attributes, we will use an encoding function.
train['Sex'] = np.where(train['Sex'] == "M", 0, 1)
train['ExerciseAngina'] = np.where(train['ExerciseAngina'] == "N", 0, 1)
test['Sex'] = np.where(test['Sex'] == "M", 0, 1)
test['ExerciseAngina'] = np.where(test['ExerciseAngina'] == "N", 0, 1)
Pandas may emit a SettingWithCopyWarning for each of the four assignments above ("A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead", see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy). The warning is expected here because train and test are slices of the original dataframe; it can be avoided by creating explicit copies right after the split, e.g. train = train.copy().
train.head()
|  | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | 57 | 0 | ASY | 156 | 173 | 0 | LVH | 119 | 1 | 3.0 | Down | 1 |
480 | 58 | 0 | ATA | 126 | 0 | 1 | Normal | 110 | 1 | 2.0 | Flat | 1 |
512 | 35 | 0 | NAP | 123 | 161 | 0 | ST | 153 | 0 | -0.1 | Up | 0 |
634 | 40 | 0 | TA | 140 | 199 | 0 | Normal | 178 | 1 | 1.4 | Up | 0 |
412 | 56 | 0 | ASY | 125 | 0 | 1 | Normal | 103 | 1 | 1.0 | Flat | 1 |
For attributes with 3 or more labels, we will use the pandas get_dummies function. It creates a new attribute per label. For instance, ChestPainType has 4 labels, so 4 new attributes will be created, each taking a value of either 0 or 1.
train=pd.get_dummies(train)
test=pd.get_dummies(test)
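Note that applying get_dummies to train and test separately yields matching columns here only because every label happens to appear in both splits. As a defensive measure (a sketch added for illustration, not part of the original notebook), the test columns can be aligned to the training columns explicitly:
# Reorder test columns to match train; any label missing from test
# becomes an all-zero indicator column.
test = test.reindex(columns=train.columns, fill_value=0)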
train.head()
|  | Age | Sex | RestingBP | Cholesterol | FastingBS | MaxHR | ExerciseAngina | Oldpeak | HeartDisease | ChestPainType_ASY | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_LVH | RestingECG_Normal | RestingECG_ST | ST_Slope_Down | ST_Slope_Flat | ST_Slope_Up |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
578 | 57 | 0 | 156 | 173 | 0 | 119 | 1 | 3.0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
480 | 58 | 0 | 126 | 0 | 1 | 110 | 1 | 2.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
512 | 35 | 0 | 123 | 161 | 0 | 153 | 0 | -0.1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
634 | 40 | 0 | 140 | 199 | 0 | 178 | 1 | 1.4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
412 | 56 | 0 | 125 | 0 | 1 | 103 | 1 | 1.0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
test.head()
|  | Age | Sex | RestingBP | Cholesterol | FastingBS | MaxHR | ExerciseAngina | Oldpeak | HeartDisease | ChestPainType_ASY | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_LVH | RestingECG_Normal | RestingECG_ST | ST_Slope_Down | ST_Slope_Flat | ST_Slope_Up |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
581 | 48 | 0 | 140 | 208 | 0 | 159 | 1 | 1.5 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
623 | 60 | 0 | 140 | 293 | 0 | 170 | 0 | 1.2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
60 | 49 | 0 | 100 | 253 | 0 | 174 | 0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
613 | 58 | 0 | 140 | 385 | 1 | 135 | 0 | 0.3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
40 | 54 | 1 | 150 | 230 | 0 | 130 | 0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
train.shape
(642, 19)
test.shape
(276, 19)
The total number of attributes has increased because of the encoding.
We will now divide both the training and test sets into X and Y. X represents the set of independent variables/attributes which determine the outcome of the dependent variable, Y. In our case, the dependent (or response) variable is HeartDisease.
x_train=train.drop(['HeartDisease'], axis=1)
x_test=test.drop(['HeartDisease'], axis=1)
y_train=train['HeartDisease']
y_test=test['HeartDisease']
print(x_train.shape)
print(x_test.shape)
(642, 18)
(276, 18)
7. Machine Learning Model
Let us now build a Logistic Regression model with the following parameters:
- max_iter=10000: the maximum number of iterations taken for the solver to converge. The default is 100 iterations.
- penalty='l2': the norm used for the penalty. Options include None, 'l1', 'l2', and 'elasticnet'. The default is 'l2', so we do not have to provide it explicitly.
There are several other parameters available for the function, including class_weight, random_state, etc. The official documentation with usage and default parameters can be found here.
lr = LogisticRegression(max_iter=10000)
model1=lr.fit(x_train, y_train)
print("Train accuracy:",model1.score(x_train, y_train))
Train accuracy: 0.8566978193146417
The training accuracy is approximately 85.7%, which seems a decent start. Let us go ahead and make predictions for the test dataset.
8. Model Evaluation
print("Test accuracy:",model1.score(x_test,y_test))
Test accuracy: 0.894927536231884
The test accuracy is nearly 89.5%, which is higher than expected. Great! We can now store the predictions using the predict method on the test dataset.
lrpred = lr.predict(x_test)
8.1 Classification Report
The classification_report function, from scikit-learn's metrics module, is used for model evaluation. It outputs the following:
- Precision: defined as True Positives / (True Positives + False Positives)
- Recall: defined as True Positives / (True Positives + False Negatives)
- F1 Score: the weighted harmonic mean of precision and recall; the 1 signifies that both get equal weightage.
- Support: the number of occurrences of each class in the ground truth.
Note that classification_report expects its arguments in the order (y_true, y_pred); in the call below the predictions are passed first, which effectively swaps the precision and recall columns and computes support over the predicted labels.
print(classification_report(lrpred,y_test))
precision recall f1-score support
0 0.85 0.90 0.88 114
1 0.93 0.89 0.91 162
accuracy 0.89 276
macro avg 0.89 0.90 0.89 276
weighted avg 0.90 0.89 0.90 276
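The individual scores for the positive class can also be computed directly with scikit-learn's metric functions (a small sanity-check sketch, not part of the original notebook, using the conventional (y_true, y_pred) argument order):
from sklearn.metrics import precision_score, recall_score, f1_score
# Scores for the positive class (HeartDisease = 1).
print("Precision:", precision_score(y_test, lrpred))
print("Recall:", recall_score(y_test, lrpred))
print("F1 score:", f1_score(y_test, lrpred))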
8.2 Confusion Matrix
The confusion matrix is another metric used for evaluating your classifier. By definition, each entry (i, j) in the confusion matrix represents the number of observations that are actually in group i but classified under group j by your model. Explore more of the parameters that can be customized for the confusion matrix here.
displr = plot_confusion_matrix(lr, x_test, y_test,cmap=plt.cm.OrRd , values_format='d')
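Note that plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2. If you are on a newer version, a roughly equivalent call (an assumption about your environment, not part of the original notebook) is:
from sklearn.metrics import ConfusionMatrixDisplay
# Plots the confusion matrix directly from the fitted estimator and test data.
displr = ConfusionMatrixDisplay.from_estimator(lr, x_test, y_test,
                                               cmap=plt.cm.OrRd, values_format='d')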
9. Conclusion
In this tutorial, we covered how we can use GridDB and Python to build a classifier for the Heart Failure Prediction dataset. We covered two ways to access our data: using GridDB and using Pandas. GridDB is an efficient choice when dealing with large amounts of data, as it is highly scalable and open source. Install GridDB today!
10. References
- https://www.kaggle.com/fedesoriano/heart-failure-prediction
- https://www.kaggle.com/sisharaneranjana/machine-learning-to-the-fore-to-save-lives
- https://www.kaggle.com/durgancegaur/a-guide-to-any-classification-problem
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.