In this tutorial, we will explore a housing dataset using Python. We will first prune the dataset as per our needs. Later, we will see how we can build a Machine Learning model to fit the dataset and make future predictions.
The outline of the tutorial is as follows:
- Prerequisites
- About the Dataset
- Importing the Libraries
- Loading the Dataset
- Data Preprocessing
- Data Normalization
- Splitting the Dataset
- Building the Model
- Making Predictions
- Model Evaluation
- Conclusion
1. Prerequisites
This tutorial is executed using Jupyter Notebooks (Anaconda version 4.8.3) with Python version 3.8 on the Windows 10 operating system. The following packages need to be installed before executing the code:
- pandas
- scikit-learn
If you are using Anaconda, packages can be installed in multiple ways, such as through the user interface, the command line, or Jupyter Notebooks. The most conventional way to install a Python package is via pip: if you are using the command line or the terminal, type pip install package-name. Another way to install a package is to run conda install package-name within the Anaconda environment.
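For example, to install the packages used in this tutorial (pandas and scikit-learn, which provides the sklearn module), the commands would look roughly like the following; adjust them to your own environment:

pip install pandas scikit-learn
# or, inside an Anaconda environment:
conda install pandas scikit-learn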
Also, note that we will cover two methods to load our dataset in the Python environment: using Pandas and using GridDB. For using GridDB within the Python environment, the following packages are required:
- GridDB C-client
- SWIG (Simplified Wrapper and Interface Generator)
- GridDB Python Client
2. About the Dataset
We will be using a snapshot of the Melbourne Housing Dataset which has been scraped from public resources and is now available on Kaggle. The dataset has been preprocessed to some extent and contains a total of 13580 instances. The number of attributes present in the dataset is 21. The dependent variable is the price of the property while the other 20 attributes are independent. Let us now get started on the code.
3. Importing the Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
The above cell should execute without any output if you installed the libraries successfully. In case you encounter an error, try the following:
- Reconfirm that the installation was successful. If not, try executing pip install package-name again.
- Check whether your system is compatible with the package version.
4. Loading the Dataset
Using GridDB
GridDB is an open-source time-series database designed for handling large amounts of data. It is optimized for IoT and is highly efficient because of its in-memory architecture. Since dealing with files locally can lead to integration issues in a professional environment, using a reliable database becomes important. GridDB provides that reliability and scalability with fault tolerance.
Moreover, with GridDB’s Python client, it has become much easier to access the database and manipulate data directly within the coding environment. Learn more about the GridDB WebAPI here.
Let us now go ahead and load our dataset.
import griddb_python as griddb

sql_statement = ('SELECT * FROM melb_data')
# `container` refers to a GridDB container/connection object obtained beforehand;
# a minimal connection sketch is shown after the next paragraph.
dataset = pd.read_sql_query(sql_statement, container)
The dataset variable will now have the data in the form of a pandas dataframe. If you are new to GridDB, a tutorial on how to insert data in GridDB might be helpful.
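If you have not yet established a connection, a minimal sketch of obtaining a store and container with the GridDB Python client is shown below. The host, port, cluster name, and credentials are placeholders for your own setup, and the exact connection parameters may differ depending on your GridDB configuration.

import griddb_python as griddb

# obtain a store factory and connect to the cluster (placeholder values)
factory = griddb.StoreFactory.get_instance()
gridstore = factory.get_store(
    host="239.0.0.1",
    port=31999,
    cluster_name="defaultCluster",
    username="admin",
    password="admin"
)

# get a handle to the container holding the housing data
container = gridstore.get_container("melb_data")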
Using Pandas
Another way to load the dataset is to use pandas directly.
dataset = pd.read_csv("melb_data.csv")
5. Data Preprocessing
Great! Now that we have our dataset, let’s see what it actually looks like –
dataset.head()
len(dataset)
13580
As we can see, there are a lot of columns. Let’s go ahead and print out the column names to get a better idea of the independent and dependent attributes.
dataset.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
'Longtitude', 'Regionname', 'Propertycount'],
dtype='object')
dataset.describe()
The output of the describe function conveys that the attributes have very different scales. Therefore, we will need to normalize the data before building our model.
Before normalization, we will be taking a subset of the attributes which seem to be directly correlated to the price.
dataset = dataset[["Rooms", "Price", "Bedroom2", "Bathroom","Landsize", "BuildingArea", "YearBuilt"]]
We also need to make sure that our data does not contain any null values before proceeding to model building.
dataset.isna().sum()
Rooms 0
Price 0
Bedroom2 0
Bathroom 0
Landsize 0
BuildingArea 6450
YearBuilt 5375
dtype: int64
As we can see, two of the attributes (BuildingArea and YearBuilt) contain several null values. Let’s go ahead and drop those instances.
dataset = dataset.dropna()
len(dataset)
6858
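Note that dropping the rows with missing values roughly halves the dataset. An alternative, not used in this tutorial, would be to impute the missing values instead, for example with scikit-learn’s SimpleImputer. A minimal sketch:

from sklearn.impute import SimpleImputer

# fill missing values with the median of each column (an illustrative choice)
imputer = SimpleImputer(strategy="median")
dataset_imputed = pd.DataFrame(imputer.fit_transform(dataset), columns=dataset.columns)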
We will now create a new attribute called HouseAge. Its values can be derived by subtracting the YearBuilt attribute from the current year. This is helpful because we do not have to deal with dates anymore. All the attributes are now numerical in nature, which will help us with the Machine Learning part later on.
dataset['HouseAge'] = 2022 - dataset["YearBuilt"].astype(int)
dataset.head()
Great! The YearBuilt attribute is not needed anymore. So, let’s go ahead and drop that.
dataset = dataset.drop("YearBuilt", axis=1)
dataset.head()
6. Data Normalization
As we saw before, the values of the attributes have different scales, which can lead to a disparity because features with larger values will dominate the smaller ones. Therefore, it is important to bring all the values down to one scale. For that, we will be using Min-Max Normalization. It is one of the most common techniques: the minimum value converts to 0, the maximum value converts to 1, and all the other values are spread out between 0 and 1, i.e. x_normalized = (x - min) / (max - min).
There are ready-made methods for normalization (such as scikit-learn’s MinMaxScaler), but they convert the dataframe into a NumPy array, so we lose the column names. For that reason, we will define our own method which takes in a dataframe and returns a new, normalized dataframe.
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
df = normalize(dataset)
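For reference, a roughly equivalent approach with scikit-learn’s MinMaxScaler, wrapping the resulting NumPy array back into a dataframe to keep the column names, could look like this sketch:

from sklearn.preprocessing import MinMaxScaler

# scale every column to [0, 1] and restore the column names and index
scaler = MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(dataset), columns=dataset.columns, index=dataset.index)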
Let’s have a look at our normalized dataframe.
df.head()
As we can see, all the values lie between 0 and 1. It is now time to split our dataset into train and test.
7. Splitting the Dataset
We will be doing a 70-30 train-test split. In the case of smaller datasets, one can also do an 80-20 split.
train, test = train_test_split(df, test_size=0.3)
len(train)
4800
len(test)
2058
Let’s now separate the dependent and independent variables.
train_y = train[["Price"]]
train_x = train.drop(["Price"], axis=1)
test_y = test[["Price"]]
test_x = test.drop(["Price"], axis=1)
train_x.head()
train_y.head()
8. Building the Model
We will use a Linear Regression model in this case. Since this is a simple dataset, a Linear Regression model should do the trick. To build a more sophisticated model, one can also try Decision Trees (a short sketch follows the model fitting below).
Explore more about Linear Regression with GridDB and Python here.
model = LinearRegression()
model.fit(train_x, train_y)
LinearRegression()
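As mentioned above, a Decision Tree could be swapped in with minimal changes. A minimal sketch using scikit-learn’s DecisionTreeRegressor follows; the hyperparameters shown are illustrative, not tuned:

from sklearn.tree import DecisionTreeRegressor

# a deeper tree can capture more structure but is more prone to overfitting
tree_model = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_model.fit(train_x, train_y)
tree_predictions = tree_model.predict(test_x)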
9. Making Predictions
Let us now make predictions on our test dataset.
predictions = model.predict(test_x)
predictions
array([[0.0890521 ],
[0.06244483],
[0.13166691],
...,
[0.09182388],
[0.20981148],
[0.1077662 ]])
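Note that these predictions are on the normalized 0-1 scale. If you want them back in the original price units, you can invert the min-max transformation using the Price column of the un-normalized data (a small sketch, assuming dataset still holds the original, pre-normalization values):

# invert the min-max scaling for the Price column
price_min = dataset["Price"].min()
price_max = dataset["Price"].max()
predicted_prices = predictions * (price_max - price_min) + price_min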
10. Model Evaluation
To quantify how good our predictions are, we can use one of the several metrics provided by the sklearn library. We will be using the mean_absolute_error metric, which is one of the most common metrics for regression models.
mean_absolute_error(predictions, test_y)
0.035125149637253696
Great! Our model has a mean absolute error of about 0.035 on the normalized price scale, which is not a bad start for a Linear Regression model.
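If you would like to report additional metrics, scikit-learn also provides, for example, the root mean squared error and the R² score. A small illustrative sketch:

from sklearn.metrics import mean_squared_error, r2_score

# RMSE penalizes large errors more heavily than MAE
rmse = mean_squared_error(test_y, predictions, squared=False)
# R² measures the proportion of variance in the target explained by the model
r2 = r2_score(test_y, predictions)
print(rmse, r2)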
11. Conclusion
In this tutorial, we saw how we can build a Machine Learning model for a housing dataset. In the beginning, we covered two methods for loading our dataset into the environment: GridDB and Pandas. We also pruned the dataset as per our needs. Later on, we used the LinearRegression model provided by the sklearn library to fit our dataset.
Learn more about real-time predictions with GridDB and Python here.
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.