Building a Linear Regression Model for Housing Data using Python and GridDB

In this tutorial, we will explore a housing dataset using Python. We will first prune the dataset as per our needs. Later, we will see how can we build a Machine Learning model to fit our dataset and make future predictions.

The outline of the tutorial is as follows:

Pre-requisites
About the Dataset
Importing the Libraries
Loading the Dataset
Data Preprocessing
Data Normalization
Splitting the Dataset
Building the Model
Making Predictions
Model Evaluation
Conclusion

1. Prerequisites

This tutorial is executed using Jupyter Notebooks (Anaconda version 4.8.3) with Python version 3.8 on Windows 10 Operating system. The following packages need to be installed before the code execution:

If you are using Anaconda, packages can be installed through multiple ways such as the User Interface, Command Line, or Jupyter Notebooks. The most conventional way to install a python package is via pip. If you are using the command line or the terminal, type pip install package-name. Another way to install a package is through conda install package-name within the Anaconda environment.

Also, note that we will cover two methods to load our dataset in the python environment – Using Pandas and GridDB. For using GridDB within the python environment, the following packages are required:

GridDB C-client
SWIG (Simplified Wrapper and Interface Generator)
GridDB Python Client

2. About the Dataset

We will be using a snapshot of the Melbourne Housing Dataset which has been scraped from public resources and is now available on Kaggle. The dataset has been preprocessed to some extent and contains a total of 13580 instances. The number of attributes present in the dataset is 21. The dependent variable is the price of the property while the other 20 attributes are independent. Let us now get started on the code.

3. Importing the Libraries

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_absolute_error

The above cell should execute without any output if you successfully installed the libraries. In case you encounter an error, try the following :

Reconfirm if the installation was successful. If not, try executing pip install package-name again.
Check if your system is compatible with the package version.

4. Loading the Dataset

Using GridDB

GridDB is an open-source time-series database designed for handling large amounts of data. It is optimized for IoT and is highly efficient because of its in-memory architecture. Since dealing with files locally can lead to integration issues in a professional environment, using a reliable database becomes important. GridDB provides that reliability and scalability with fault tolerance.

Moreover, with GridDB’s python client, it has become much easier to include the database and manipulate it directly within the coding environment. Learn more about the GriDB WebAPI here.

Let us now go ahead and load our dataset.

import griddb_python as griddb

sql_statement = ('SELECT * FROM melb_data')
dataset = pd.read_sql_query(sql_statement, container)

The dataset variable will now have the data in the form of a pandas dataframe. If you are new to GridDB, a tutorial on how to insert data in GridDB might be helpful.

Using Pandas

Another way to load the dataset is using the pandas directly.

dataset = pd.read_csv("melb_data.csv")

5. Data Preprocessing

Great! Now that we have our dataset, let’s see what it actually looks like –

dataset.head()

	Suburb	Address	Rooms	Type	Price	Method	SellerG	Date	Distance	Postcode	…	Bathroom	Car	Landsize	BuildingArea	YearBuilt	CouncilArea	Lattitude	Longtitude	Regionname	Propertycount
0	Abbotsford	85 Turner St	2	h	1480000.0	S	Biggin	3/12/2016	2.5	3067.0	…	1.0	1.0	202.0	NaN	NaN	Yarra	-37.7996	144.9984	Northern Metropolitan	4019.0
1	Abbotsford	25 Bloomburg St	2	h	1035000.0	S	Biggin	4/02/2016	2.5	3067.0	…	1.0	0.0	156.0	79.0	1900.0	Yarra	-37.8079	144.9934	Northern Metropolitan	4019.0
2	Abbotsford	5 Charles St	3	h	1465000.0	SP	Biggin	4/03/2017	2.5	3067.0	…	2.0	0.0	134.0	150.0	1900.0	Yarra	-37.8093	144.9944	Northern Metropolitan	4019.0
3	Abbotsford	40 Federation La	3	h	850000.0	PI	Biggin	4/03/2017	2.5	3067.0	…	2.0	1.0	94.0	NaN	NaN	Yarra	-37.7969	144.9969	Northern Metropolitan	4019.0
4	Abbotsford	55a Park St	4	h	1600000.0	VB	Nelson	4/06/2016	2.5	3067.0	…	1.0	2.0	120.0	142.0	2014.0	Yarra	-37.8072	144.9941	Northern Metropolitan	4019.0

5 rows × 21 columns

len(dataset)

As we can see there are a lot of columns, let’s go ahead and print out the column names to get a better idea of the independent and dependent attributes.

dataset.columns

    Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
           'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
           'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
           'Longtitude', 'Regionname', 'Propertycount'],
          dtype='object')

dataset.describe()

	Rooms	Price	Distance	Postcode	Bedroom2	Bathroom	Car	Landsize	BuildingArea	YearBuilt	Lattitude	Longtitude	Propertycount
count	13580.000000	1.358000e+04	13580.000000	13580.000000	13580.000000	13580.000000	13518.000000	13580.000000	7130.000000	8205.000000	13580.000000	13580.000000	13580.000000
mean	2.937997	1.075684e+06	10.137776	3105.301915	2.914728	1.534242	1.610075	558.416127	151.967650	1964.684217	-37.809203	144.995216	7454.417378
std	0.955748	6.393107e+05	5.868725	90.676964	0.965921	0.691712	0.962634	3990.669241	541.014538	37.273762	0.079260	0.103916	4378.581772
min	1.000000	8.500000e+04	0.000000	3000.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1196.000000	-38.182550	144.431810	249.000000
25%	2.000000	6.500000e+05	6.100000	3044.000000	2.000000	1.000000	1.000000	177.000000	93.000000	1940.000000	-37.856822	144.929600	4380.000000
50%	3.000000	9.030000e+05	9.200000	3084.000000	3.000000	1.000000	2.000000	440.000000	126.000000	1970.000000	-37.802355	145.000100	6555.000000
75%	3.000000	1.330000e+06	13.000000	3148.000000	3.000000	2.000000	2.000000	651.000000	174.000000	1999.000000	-37.756400	145.058305	10331.000000
max	10.000000	9.000000e+06	48.100000	3977.000000	20.000000	8.000000	10.000000	433014.000000	44515.000000	2018.000000	-37.408530	145.526350	21650.000000

The output of the describe function conveys that the value of each attribute has a different scale. Therefore, we will need to normalize it before building our model.

Before normalization, we will be taking a subset of the attributes which seem to be directly correlated to the price.

dataset = dataset[["Rooms", "Price", "Bedroom2", "Bathroom","Landsize", "BuildingArea", "YearBuilt"]]

We also need to make sure that our data does not contain any null values before proceeding to model building.

dataset.isna().sum()

    Rooms              0
    Price              0
    Bedroom2           0
    Bathroom           0
    Landsize           0
    BuildingArea    6450
    YearBuilt       5375
    dtype: int64

As we can see, the two attributes contain several null values. Let’s go ahead and drop those instances.

dataset = dataset.dropna()

len(dataset)

We will now create a new attribute called HouseAge. The values of the attribute can be derived by subtracting the current year from the YearBuilt attribute. This is helpful because we do not have to deal with dates anymore. All the attributes are now numerical in nature which will help us with the Machine Learning part later on.

dataset['HouseAge'] = 2022 - dataset["YearBuilt"].astype(int)

dataset.head()

	Rooms	Price	Bedroom2	Bathroom	Landsize	BuildingArea	YearBuilt	HouseAge
1	2	1035000.0	2.0	1.0	156.0	79.0	1900.0	122
2	3	1465000.0	3.0	2.0	134.0	150.0	1900.0	122
4	4	1600000.0	3.0	1.0	120.0	142.0	2014.0	8
6	3	1876000.0	4.0	2.0	245.0	210.0	1910.0	112
7	2	1636000.0	2.0	1.0	256.0	107.0	1890.0	132

Great! The YearBuilt attribute is not needed anymore. So, let’s go ahead and drop that.

dataset = dataset.drop("YearBuilt", axis=1)

dataset.head()

	Rooms	Price	Bedroom2	Bathroom	Landsize	BuildingArea	HouseAge
1	2	1035000.0	2.0	1.0	156.0	79.0	122
2	3	1465000.0	3.0	2.0	134.0	150.0	122
4	4	1600000.0	3.0	1.0	120.0	142.0	8
6	3	1876000.0	4.0	2.0	245.0	210.0	112
7	2	1636000.0	2.0	1.0	256.0	107.0	132

6. Data Normalization

As we saw before, the values of the attributes have different scales which can lead to a disparity as the features larger in value will dominate over the smaller ones. Therefore, it is important to bring all the values down to one scale. For that, we will be using the Min-Max Normalization. It is one of the most common techniques where the minimum values coverts to 0 while the maximum value converts to 1. All the other values spread out between 0 and 1.

There are direct methods present for normalization but they convert the dataframe into a NumPy array. Hence, we lose the column names. For that reason, we will define our own method which takes in a dataframe and returns a new normalized dataframe.

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

df = normalize(dataset)

Let’s have a look at our normalized dataframe.

df.head()

	Rooms	Price	Bedroom2	Bathroom	Landsize	BuildingArea	HouseAge
1	0.142857	0.101928	0.222222	0.000000	0.004216	0.025386	0.143552
2	0.285714	0.150412	0.333333	0.142857	0.003622	0.048201	0.143552
4	0.428571	0.165633	0.333333	0.000000	0.003243	0.045630	0.004866
6	0.285714	0.196753	0.444444	0.142857	0.006622	0.067481	0.131387
7	0.142857	0.169692	0.222222	0.000000	0.006919	0.034383	0.155718

As we can see, all the values lie between 0 and 1. It is now time to split our dataset into train and test.

7. Splitting the dataset

We will be doing a 70-30 train-test split. In the case of smaller datasets, one can also do 80-20.

train, test = train_test_split(df, test_size=0.3)

len(train)

len(test)

Let’s now separate the dependent and independent variables.

train_y = train[["Price"]]
train_x = train.drop(["Price"], axis=1)
test_y = test[["Price"]]
test_x = test.drop(["Price"], axis=1)

train_x.head()

	Rooms	Bedroom2	Bathroom	Landsize	BuildingArea	HouseAge
4860	0.285714	0.333333	0.142857	0.222054	0.041774	0.027981
3434	0.285714	0.333333	0.142857	0.018459	0.064267	0.027981
6048	0.285714	0.333333	0.285714	0.005973	0.049807	0.008516
9918	0.142857	0.222222	0.000000	0.007135	0.031170	0.131387
7855	0.142857	0.222222	0.000000	0.000000	0.030848	0.155718

train_y.head()

	Price
4860	0.103619
3434	0.064494
6048	0.055136
9918	0.174879
7855	0.053219

8. Building the Model

We will use a Linear Regression model, in this case. Since this is a simple dataset, a Linear Regression Model should do the trick. To build a more sophisticated model, one can also try Decision Trees.

Explore more about Linear Regression with GridDB and Python here.

model =  LinearRegression()
model.fit(train_x, train_y)

    LinearRegression()

9. Making Predictions

Let us now make predictions on our test dataset.

predictions = model.predict(test_x)

predictions

    array([[0.0890521 ],
           [0.06244483],
           [0.13166691],
           ...,
           [0.09182388],
           [0.20981148],
           [0.1077662 ]])

10. Model Evaluation

To quantify how good our predictions are, there are several metrics provided by the sklearn library. We will be using the mean_absolute_error metric which is one of the most common metrics used for Linear Regression Models.

mean_absolute_error(predictions, test_y)

    0.035125149637253696

Great! Our model has a mean absolute error of 0.03 which is not a bad start for a Linear Regression Model.

11. Conclusion

In this tutorial, we saw how can we build a Machine Learning Model for a housing dataset. In the beginning, we covered two methods for loading our dataset into the environment – GridDB and Pandas. We also pruned the dataset as per our needs. Later on, we used the Linear Regression function provided by the sklearn library to fit our dataset.

Learn more about real-time predictions with GridDB and Python here.

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.

E-commerce Data Analysis using GridDB

Using SQL Batch Inserts with...

Analyzing Movies on IMDB using...

Building a Blog Voting Web...