Predictive Maintenance with Python and GridDB

Every asset has a life cycle and therefore needs regular maintenance. Repairing too early wastes resources, while repairing too late risks a breakdown. Deciding when to repair is therefore an important problem.

Predictive maintenance forecasts the probability that a fixed asset will break down. It matters for businesses of every size, from a large company predicting the breakdown of motors to a small business predicting the breakdown of printers. It can also help save lives, for example by predicting the likelihood of factory machine breakdowns or even gas leaks.

Traditionally, predictive modelling is done with feature engineering and simple regression models, but these methods are difficult to reuse. We will use a more advanced LSTM model instead. LSTMs can use sequences of data to make predictions on a rolling basis. The sequence can be as short as 5 data points or as long as 100. For the data backend, we will use GridDB, which is highly scalable and ensures high reliability. Installing GridDB is simple and is well documented here. To check out the python-GridDB client, please refer to this video.


Let us set up GridDB first!

Quick setup of GridDB Python Client on Ubuntu 20.04:

  • Install GridDB

Download and install the deb from here.

  • Install C client

Download and install the Ubuntu package from here.

  • Install requirements

1) Swig

tar xvfz v4.0.2.tar.gz 
cd swig-4.0.2 

2) Install python client

wget \ 
unzip . 

Make sure you have python-dev installed for the corresponding Python version. We will use Python 3.8 for this post.

3) We also need to point to the correct locations

export CPATH=$CPATH:<python header file directory path> 
export LIBRARY_PATH=$LIBRARY_PATH:<c client library file directory path> 


We can also use GridDB with Docker, as shown here.

Python libraries

Next, we install the Python libraries. Installing numpy, keras, tensorflow, scikit-learn (imported as sklearn) and pandas is a simple pip install.

pip install keras 
pip install numpy 
pip install tensorflow 
pip install pandas 
pip install scikit-learn

Predictive Modelling

Step 1: Downloading Dataset

We use a subset of the NASA turbofan dataset that can be downloaded from this Kaggle project. Each row has the unit number, the time in cycles, three operational settings and 21 sensor measurements. The train and test files record the cycles run so far, and the truth file gives the number of cycles each test unit has left before failure.

Step 2: Importing Libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score

from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Activation
from keras.callbacks import EarlyStopping

Step 3: Data Loading and Processing

Loading Data

dataset_train = pd.read_csv('/content/PM_train.txt',sep=' ',header=None).dropna(axis=1)
dataset_test  = pd.read_csv('/content/PM_test.txt',sep=' ',header=None).dropna(axis=1)
dataset_truth = pd.read_csv('/content/PM_truth.txt',sep=' ',header=None).dropna(axis=1)

Alternatively, we can use GridDB to get this data frame.

import griddb_python as griddb

# Connect to the cluster; host, port, cluster_name, uname and pwd are
# placeholders for your own connection settings
factory = griddb.StoreFactory.get_instance()
gridstore = factory.get_store(host=host, port=port,
            cluster_name=cluster_name, username=uname,
            password=pwd)

# Initialize container
conInfo = griddb.ContainerInfo("attrition",
                    [["id", griddb.Type.LONG],
              .... #for all 23 variables
                    ],
                    griddb.ContainerType.COLLECTION, True)
cont = gridstore.put_container(conInfo)
cont.create_index("id", griddb.IndexType.DEFAULT)

Now we rename the columns for easy identification.

features_col_name = ['os1','os2','os3','s1','s2','s3','s4','s5','s6','s7','s8','s9','s10','s11','s12','s13','s14','s15','s16','s17','s18','s19','s20','s21']
col_names = ['id','cycletime'] + features_col_name
# rename the train and test columns
dataset_train.columns = col_names
dataset_test.columns = col_names

We do the same for the truth file.
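The code for the truth file is not shown here. As a minimal sketch, assuming the truth file holds a single column with one row per test unit in id order, the column can be named rul and an id column derived from the row position. The name rul matches how the column is used in the next step; the toy values are made up.

```python
import pandas as pd

# Toy stand-in for dataset_truth: one remaining-cycle value per unit (made-up numbers)
dataset_truth = pd.DataFrame([[112], [98], [69]])

# Name the single column and derive unit ids 1..n from the row order (assumption)
dataset_truth.columns = ['rul']
dataset_truth['id'] = dataset_truth.index + 1

print(dataset_truth)
```

Having an explicit id column is what lets the truth values be merged onto the test rows by unit.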


Next, we generate the labels. We want to predict failure within the next 15 cycles. The training data is structured such that the last recorded cycle of each unit is its point of failure. In the test set, however, the records stop before failure, and the remaining cycles are given in the truth dataset. So, for the test set, we first take the cycles run so far and add the cycles left from the truth dataset to get the total cycle count at failure. We then subtract the current cycle from that total to get the time to failure.

# get cycles left for train: last cycle of each unit minus the current cycle
dataset_train['ttf'] = dataset_train.groupby(['id'])['cycletime'].transform('max') - dataset_train['cycletime']

# last observed cycle per unit in the test data
rul = dataset_test.groupby('id')['cycletime'].max().reset_index()
# cycle at failure = remaining cycles from the truth file + last observed cycle
# (relies on dataset_truth rows being in the same unit order as rul)
dataset_truth['rtf'] = dataset_truth['rul'] + rul['cycletime']
dataset_test = dataset_test.merge(dataset_truth, on=['id'], how='left')
dataset_test['ttf'] = dataset_test['rtf'] - dataset_test['cycletime']
dataset_test.drop('rtf', axis=1, inplace=True)
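To make the test-set arithmetic concrete, here is a small worked example on made-up data. It is only a sketch: the toy frames and the helper column last_cycle are hypothetical, and it merges the truth values by id rather than by row position, which is a slightly more defensive variant of the same calculation.

```python
import pandas as pd

# Toy test set: unit 1 observed for 3 cycles, unit 2 for 2 cycles (made-up data)
toy_test = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                         'cycletime': [1, 2, 3, 1, 2]})
# Toy truth: cycles remaining after the last observation of each unit
toy_truth = pd.DataFrame({'id': [1, 2], 'rul': [4, 10]})

# Last observed cycle per unit, joined to the truth values by id
last_cycle = (toy_test.groupby('id')['cycletime'].max()
              .rename('last_cycle').reset_index())
toy_truth = toy_truth.merge(last_cycle, on='id')
# Cycle at which each unit fails = remaining cycles + last observed cycle
toy_truth['rtf'] = toy_truth['rul'] + toy_truth['last_cycle']

# Time to failure for every test row
toy_test = toy_test.merge(toy_truth[['id', 'rtf']], on='id', how='left')
toy_test['ttf'] = toy_test['rtf'] - toy_test['cycletime']
print(toy_test[['id', 'cycletime', 'ttf']])
```

Unit 1 fails at cycle 3 + 4 = 7, so its three rows have 6, 5 and 4 cycles left; unit 2 fails at cycle 2 + 10 = 12, leaving 11 and 10.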

Next, we assign labels based on the prediction period.

# prediction window of 15, as stated above
period = 15
dataset_train['label'] = dataset_train['ttf'].apply(lambda x: 1 if x <= period else 0)
dataset_test['label'] = dataset_test['ttf'].apply(lambda x: 1 if x <= period else 0)

Next, we scale the data, as LSTMs work best when the input features share a common range.
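The scaling code is not shown here. A minimal sketch using the MinMaxScaler imported earlier might look like the following. The toy frames and the feature_cols list are assumptions for illustration; in the article's setting the columns would be features_col_name. The scaler is fitted on the training data only and then reused for the test data, so no information from the test set leaks into training.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames standing in for dataset_train / dataset_test (made-up values)
train = pd.DataFrame({'id': [1, 1, 1],
                      'os1': [0.0, 5.0, 10.0],
                      's1': [100.0, 150.0, 200.0]})
test = pd.DataFrame({'id': [1, 1],
                     'os1': [2.5, 7.5],
                     's1': [125.0, 250.0]})

feature_cols = ['os1', 's1']  # hypothetical; the id column is left unscaled

# Fit on the training features only, then apply the same scaling to the test set
scaler = MinMaxScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
test[feature_cols] = scaler.transform(test[feature_cols])

print(train)
print(test)
```

Test values that lie outside the range seen in training end up outside [0, 1] after scaling, which is expected with this approach.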


Next, we choose how many data points the LSTM sees at a time. We use a window of 50 cycles, so we group the data for each machine into sequences of length 50.

def gen_sequence(id_df, seq_length, seq_cols):
    # pad the start with zero rows so early cycles still yield full-length windows
    df_zeros=pd.DataFrame(np.zeros((seq_length-1, id_df.shape[1])),columns=id_df.columns)
    id_df = pd.concat([df_zeros, id_df], ignore_index=True)
    data_array = id_df[seq_cols].values
    num_elements = data_array.shape[0]
    la = []  # collected windows
    for start, stop in zip(range(0, num_elements-seq_length), range(seq_length, num_elements)):
        la.append(data_array[start:stop, :])
    return np.array(la)

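# window length from the text above; using all feature columns is an assumption
seq_length = 50
seq_cols = features_col_name

# generate train data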
X_train=np.concatenate(list(list(gen_sequence(dataset_train[dataset_train['id']==id], seq_length, seq_cols))
                         for id in dataset_train['id'].unique()))
y_train=np.concatenate(list(list(gen_sequence(dataset_train[dataset_train['id']==id], seq_length,['label'])) 
                         for id in dataset_train['id'].unique())).max(axis =1)

# generate test data
X_test=np.concatenate(list(list(gen_sequence(dataset_test[dataset_test['id']==id], seq_length, seq_cols)) 
                         for id in dataset_test['id'].unique()))

y_test=np.concatenate(list(list(gen_sequence(dataset_test[dataset_test['id']==id], seq_length, ['label'])) 
                        for id in dataset_test['id'].unique())).max(axis =1)
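To make the shape of the LSTM input concrete, here is a simplified sketch of sliding windows on a toy array. The helper make_windows is hypothetical and is not the gen_sequence function used in this post: it does no padding and keeps every full-length window. The point is only that the result is a 3-D array of (windows, window length, features), which is the layout a Keras LSTM layer expects.

```python
import numpy as np

def make_windows(values, seq_length):
    """Return all full-length sliding windows over the rows of a 2-D array."""
    n = values.shape[0]
    return np.array([values[start:start + seq_length]
                     for start in range(n - seq_length + 1)])

# Toy machine history: 6 cycles, 2 features (made-up numbers)
history = np.arange(12, dtype=float).reshape(6, 2)
windows = make_windows(history, seq_length=3)

print(windows.shape)  # (number of windows, window length, number of features)
print(windows[0])     # the first window covers cycles 0, 1 and 2
```

Consecutive windows overlap by all but one row, so each new cycle produces one new training sample.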

Step 4: Prediction

Next, we start the prediction process.


We first create an LSTM model in Keras. For that, we use the LSTM layer.

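nb_features = X_train.shape[2]
timestamp = seq_length  # window length fed to the LSTM

model = Sequential()
# LSTM layer; the unit count and dropout rate here are illustrative assumptions
model.add(LSTM(
         input_shape=(timestamp, nb_features),
         units=100))
model.add(Dropout(0.2))

model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])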



The model is compiled with binary_crossentropy as the loss and is evaluated on accuracy. Next, we fit the network.

# fit the network
model.fit(X_train, y_train, epochs=10, batch_size=200, validation_split=0.05, verbose=1,
          callbacks = [EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto')])

We train the model for up to 10 epochs; early stopping ends training as soon as the validation loss stops improving. The output of the first epoch is shown below.
Epoch 1/10
98/98 [==============================] - 28s 250ms/step - loss: 0.1596 - accuracy: 0.9394 - val_loss: 0.0611 - val_accuracy: 0.9679

Evaluation and Predictions

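Finally, we evaluate the model on the test set by comparing the predicted classes with the ground truth labels.

y_prob = model.predict(X_test)
# threshold the sigmoid output at 0.5 (argmax over a single unit would always give 0)
y_classes = (y_prob > 0.5).astype(int)
print('Accuracy of model on test data: ',accuracy_score(y_test,y_classes))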

Accuracy of model on test data:  0.9744536780547861
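Accuracy alone can be misleading when one class dominates, so it may help to look at precision, recall and the confusion matrix as well, all of which were imported earlier. The sketch below uses made-up labels and probabilities in place of y_test and the model output; the 0.5 threshold is an assumption that can be tuned.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Toy labels and sigmoid outputs standing in for y_test and model.predict(X_test)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([[0.1], [0.2], [0.7], [0.3], [0.9], [0.4]])

# Threshold the probabilities to get class predictions
y_pred = (y_prob > 0.5).astype(int).ravel()

print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
```

For maintenance, recall on the failure class is often the number to watch, since a missed failure is usually more costly than an unnecessary inspection.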

We can also calculate the probability of failure for every machine as follows:

machine_id = 1

failure prob is 0.15824139
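The code behind this step is not shown here. One way to do it, sketched under assumptions, is to take the most recent window of readings for a machine and pass it to the model. The helpers last_window and failure_probability and the stand-in dummy_predict are all hypothetical; with the model from this post, predict_fn would be model.predict and machine_rows would be the scaled feature columns of one machine from the test set.

```python
import numpy as np

def last_window(machine_rows, seq_length):
    """Return the most recent seq_length rows, zero-padded at the start if needed."""
    n_features = machine_rows.shape[1]
    padded = np.vstack([np.zeros((seq_length - 1, n_features)), machine_rows])
    # Add a batch dimension: shape (1, seq_length, n_features)
    return padded[-seq_length:][np.newaxis, :, :]

def failure_probability(predict_fn, machine_rows, seq_length):
    """Failure probability for the latest window; predict_fn mimics model.predict."""
    probs = predict_fn(last_window(machine_rows, seq_length))
    return float(np.ravel(probs)[-1])

def dummy_predict(batch):
    """Stand-in for a trained model: sigmoid of each window's mean value."""
    mean = batch.mean(axis=(1, 2))
    return (1.0 / (1.0 + np.exp(-mean))).reshape(-1, 1)

# Toy machine history: 4 cycles, 3 features, all zeros (made-up data)
rows = np.zeros((4, 3))
prob = failure_probability(dummy_predict, rows, seq_length=5)
print('failure prob is', prob)
```

Passing the prediction function in as an argument keeps the sketch independent of Keras, so the same helper works with any model that returns one probability per window.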

Now we can play around with the prediction period, the LSTM sequence length and the number of variables used to get even better results.


In this post, we learned how to train an LSTM predictive maintenance model with Keras, Python and GridDB. We can get a predictive accuracy of ~97% with a few lines of code.