Neural networks have taken the world of machine learning and predictive modelling by storm over the last five years. They can learn complex relationships in data and have been shown to work for a variety of applications, from finance to robotics. Inspired by the human brain, neural networks work on the principle of signal transmission from one neuron to another. A neural network comprises three types of node layers — an input layer, one or more hidden layers, and an output layer. Each node is an artificial neuron that connects to others through a nonlinear activation function and has an associated weight and threshold. A neuron is activated only if its output is above the specified threshold value, and only then is the data passed along to the next layer of the network.
Toshiba GridDB is a highly scalable database optimized for IoT and Big Data. GridDB lets us collect, store, and query large amounts of data easily. Moreover, GridDB is highly scalable and ensures high reliability, so it can serve as a great database for neural network training and inference. Installing GridDB is pretty simple and is well documented here. To check out the Python GridDB client, please refer to this video.
In this post we will train a simple neural network based classification model in Python with GridDB. We will use Keras, an easy-to-use, free, open-source Python library for developing and evaluating deep learning models.
Setup
Let’s setup GridDB first!
Quick setup of GridDB Python Client on Ubuntu 20.04:
- Install GridDB: Download and install the deb package from here.
- Install the C client: Download and install the Ubuntu package from here.
- Install requirements:
1) SWIG
wget https://github.com/swig/swig/archive/refs/tags/v4.0.2.tar.gz
tar xvfz v4.0.2.tar.gz
cd swig-4.0.2
./autogen.sh
./configure
make
sudo make install
2) Install python client
wget \
https://github.com/griddb/python_client/archive/refs/tags/0.8.4.zip
unzip 0.8.4.zip
Make sure you have python-dev installed for the corresponding Python version. We will use Python 3.8 for this post.
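On Ubuntu 20.04 the development headers can be installed with apt; a quick sketch, assuming the system Python 3.8:
sudo apt-get update
sudo apt-get install -y python3.8-dev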
3) We also need to point the build to the correct header and library locations
export CPATH=$CPATH:<python header file directory path>
export LIBRARY_PATH=$LIBRARY_PATH:<c client library file directory path>
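For example, on a default Ubuntu 20.04 setup with Python 3.8 the exports might look like the following, after which the client is built with make and added to PYTHONPATH; the exact directories are assumptions and depend on where Python and the GridDB C client are installed on your system:
export CPATH=$CPATH:/usr/include/python3.8
export LIBRARY_PATH=$LIBRARY_PATH:/usr/lib
cd python_client-0.8.4
make
export PYTHONPATH=$PYTHONPATH:$(pwd)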
We can also use GridDB with Docker, as shown here.
Python libraries
Next we install the Python libraries. Installing matplotlib, numpy, keras, tensorflow, and pandas is a simple pip install.
pip install keras
pip install numpy
pip install tensorflow
pip install matplotlib
pip install pandas
Prediction
Step 1: Downloading Dataset
We will use a publicly available dataset from Kaggle. For this post we have picked the mobile price classification dataset. The aim is to classify cellphones into four price categories. Below is a description of the dataset.
Column | Description |
---|---|
battery_power | Total energy a battery can store at one time, measured in mAh |
blue | Has Bluetooth or not |
clock_speed | Speed at which the microprocessor executes instructions |
dual_sim | Has dual SIM support or not |
fc | Front camera megapixels |
four_g | Has 4G or not |
int_memory | Internal memory in gigabytes |
m_dep | Mobile depth in cm |
mobile_wt | Weight of the mobile phone |
n_cores | Number of processor cores |
pc | Primary camera megapixels |
px_height | Pixel resolution height |
px_width | Pixel resolution width |
ram | Random access memory in megabytes |
sc_h | Screen height of the mobile in cm |
sc_w | Screen width of the mobile in cm |
talk_time | Longest time that a single battery charge will last during continuous talk time |
three_g | Has 3G or not |
touch_screen | Has a touch screen or not |
wifi | Has WiFi or not |
price_range | The target variable, with values of 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost) |
Step 2: Importing Libraries
We first import the relevant libraries, i.e., pandas for loading the dataset, matplotlib for visualisations, and tensorflow for the deep learning model.
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
Step 3: Data Loading and Processing
Loading Data
First we load the data using the read_csv function in pandas. Note that the dataset is already separated into train and test files.
dataframe = pd.read_csv('train.csv')
Alternatively, we can use GridDB to get this dataframe.
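Below is a minimal sketch of how the same dataframe could be read from a GridDB container with the Python client; the connection parameters and the container name mobile_price_train are assumptions for illustration:
import griddb_python as griddb

# connect to the GridDB cluster (host, port, cluster name, and credentials are assumed)
factory = griddb.StoreFactory.get_instance()
gridstore = factory.get_store(host="239.0.0.1", port=31999,
                              cluster_name="defaultCluster",
                              username="admin", password="admin")

# the container is assumed to mirror the columns of train.csv
container = gridstore.get_container("mobile_price_train")
query = container.query("select *")
rs = query.fetch()

# collect all rows and build a pandas dataframe with the dataset's column order
rows = []
while rs.has_next():
    rows.append(rs.next())

columns = ["battery_power", "blue", "clock_speed", "dual_sim", "fc", "four_g",
           "int_memory", "m_dep", "mobile_wt", "n_cores", "pc", "px_height",
           "px_width", "ram", "sc_h", "sc_w", "talk_time", "three_g",
           "touch_screen", "wifi", "price_range"]
dataframe = pd.DataFrame(rows, columns=columns)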
Feature Encoding
We first split the columns into numeric and categorical features. The numeric features are normalized, and the categorical features are converted to the pandas category data type. We also create the list of input features and remove the target from it.
numeric_feats = ["mobile_wt", "m_dep", "int_memory", "fc", "clock_speed", "talk_time","n_cores", "sc_w", "sc_h", "ram", "px_width","px_height","pc", "battery_power"]
dataframe[numeric_feats] = dataframe[numeric_feats].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
categorical_feats = ["blue", "four_g", "dual_sim", "wifi", "touch_screen", "three_g","price_range"]
dataframe[categorical_feats] = dataframe[categorical_feats].astype("category")
features = numeric_feats + categorical_feats
features.remove("price_range")
We can now describe the dataset and check the distribution.
dataframe[numeric_feats].describe()
dataframe[categorical_feats].describe()
Splitting Data into Validation and Train
We then set aside a small validation set to evaluate on during training, to make sure we do not overfit the data. We take 20% of the data as the validation set.
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)
print("Number of training samples:", len(train_dataframe))
print("Number of validation samples:", len(val_dataframe))
Preparing data for Keras
Since this is a multiclass classification problem, we have to convert the target to a one-hot encoding. We do that with tf.keras.utils.to_categorical. As Keras takes in NumPy arrays, we convert the pandas dataframes to NumPy.
Y_train = tf.keras.utils.to_categorical(train_dataframe["price_range"], num_classes=4)
Y_val = tf.keras.utils.to_categorical(val_dataframe["price_range"], num_classes=4)
X_train = train_dataframe[features].values
X_val = val_dataframe[features].values
Step 4: Prediction
Next we start the prediction process.
Initializing
We will create a simple model for our prediction. We start with a dense layer of 12 units; note that the input dimension is 20, one for each feature. Then we add a dropout layer that helps with overfitting. Finally we add one more hidden layer and end with a softmax layer, which gives us four probability scores, one for each class. We can play around with the model configuration, adding or deleting layers.
# define the keras model
model = keras.Sequential()
model.add(layers.Dense(12, input_dim=20, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(4, activation='softmax'))
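We can print a summary of the architecture to verify the layer shapes and parameter counts:
model.summary()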
Training
Next we compile the model. We use categorical cross entropy as the loss and evaluate on accuracy. We could also use AUC or any other metric we see fit.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
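For example, a sketch of compiling with an additional AUC metric (the rest of this post sticks with accuracy):
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy', tf.keras.metrics.AUC(name='auc')])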
Next we train the model for 200 epochs.
num_epochs = 200
history = model.fit(X_train,
Y_train,
epochs=num_epochs,
validation_data=(X_val, Y_val))
Epoch 1/200
50/50 [==============================] - 0s 2ms/step - loss: 0.1648 - accuracy: 0.9431 - val_loss: 0.2359 - val_accuracy: 0.9125
Epoch 2/200
50/50 [==============================] - 0s 2ms/step - loss: 0.1871 - accuracy: 0.9388 - val_loss: 0.1986 - val_accuracy: 0.9175
Epoch 3/200
....
Epoch 199/200
50/50 [==============================] - 0s 1ms/step - loss: 0.1245 - accuracy: 0.9575 - val_loss: 0.4880 - val_accuracy: 0.8075
Epoch 200/200
50/50 [==============================] - 0s 1ms/step - loss: 0.1267 - accuracy: 0.9600 - val_loss: 0.4585 - val_accuracy: 0.8150
Evaluation
Next we plot the loss and the accuracy for train and validation. Ideally the loss should go down and the accuracy should go up.
history = history.history
import matplotlib.pyplot as plt
%matplotlib inline
epochs = list(range(num_epochs))
loss = history["loss"]
val_loss = history["val_loss"]
plt.plot(epochs, loss, 'bo', label="Training Loss")
plt.plot(epochs, val_loss, 'b', label="Validation Loss")
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss Value')
plt.legend()
plt.show()
epochs = list(range(num_epochs))
accuracy = history["accuracy"]
val_accuracy = history["val_accuracy"]
plt.plot(epochs, accuracy, 'bo', label="Accuracy")
plt.plot(epochs, val_accuracy, 'b', label="Val Accuracy")
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Note that the validation accuracy goes down while the training accuracy stays roughly constant, which may imply that the model is overfitting. In that case we can try several strategies: reduce the number of epochs, use a different loss function, augment the data, and so on.
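One simple option is early stopping, which halts training once the validation loss stops improving. Below is a minimal sketch; the callback and the patience value are illustrative and not part of the original training run:
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                           patience=10,
                                           restore_best_weights=True)
history = model.fit(X_train, Y_train,
                    epochs=num_epochs,
                    validation_data=(X_val, Y_val),
                    callbacks=[early_stop])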
Predictions
Finally we load the testing data, apply the same preprocessing, and predict. Note that Keras returns probabilities for each class, so we take the argmax to get the predicted class.
# make predictions with the keras model
test = pd.read_csv('test.csv')
test[numeric_feats] = test[numeric_feats].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
if "price_range" in categorical_feats:
    categorical_feats.remove("price_range")
test[categorical_feats] = test[categorical_feats].astype("category")
feats = test[features].values
predictions = model.predict(feats)
predictions = np.argmax(predictions, axis=1)
Conclusion
In this article we learnt how to train a simple neural network for a classification task.
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.