In this tutorial, we will explore a housing dataset using Python. We will first prune the dataset as per our needs. Later, we will see how we can build a Machine Learning model to fit our dataset and make future predictions.
The outline of the tutorial is as follows:
- About the Dataset
- Importing the Libraries
- Loading the Dataset
- Data Preprocessing
- Data Normalization
- Splitting the Dataset
- Building the Model
- Making Predictions
- Model Evaluation
This tutorial is executed using Jupyter Notebooks (Anaconda version 4.8.3) with Python version 3.8 on the Windows 10 operating system. The following packages need to be installed before executing the code:
If you are using Anaconda, packages can be installed in multiple ways, such as through the user interface, the command line, or Jupyter Notebooks. The most conventional way to install a Python package is via pip. If you are using the command line or the terminal, type pip install package-name. Another way to install a package is through conda install package-name within the Anaconda environment.
Also, note that we will cover two methods to load our dataset into the Python environment: using GridDB and using pandas. For using GridDB within the Python environment, the griddb_python package is required.
2. About the Dataset
We will be using a snapshot of the Melbourne Housing Dataset which has been scraped from public resources and is now available on Kaggle. The dataset has been preprocessed to some extent and contains a total of 13580 instances. The number of attributes present in the dataset is 21. The dependent variable is the price of the property while the other 20 attributes are independent. Let us now get started on the code.
3. Importing the Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
The above cell should execute without any output if you installed the libraries successfully. In case you encounter an error, try the following:
- Reconfirm that the installation was successful. If not, try executing pip install package-name again.
- Check whether your system is compatible with the package version.
4. Loading the Dataset
GridDB is an open-source time-series database designed for handling large amounts of data. It is optimized for IoT and is highly efficient because of its in-memory architecture. Since dealing with files locally can lead to integration issues in a professional environment, using a reliable database becomes important. GridDB provides that reliability and scalability with fault tolerance.
Let us now go ahead and load our dataset.
import griddb_python as griddb

# container is assumed to hold the GridDB connection set up beforehand
sql_statement = ('SELECT * FROM melb_data')
dataset = pd.read_sql_query(sql_statement, container)
The dataset variable will now hold the data in the form of a pandas dataframe. If you are new to GridDB, a tutorial on how to insert data in GridDB might be helpful.
Another way to load the dataset is by using pandas directly.
dataset = pd.read_csv("melb_data.csv")
5. Data Preprocessing
Great! Now that we have our dataset, let's see what it actually looks like.
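The first rows of a dataframe can be previewed with pandas' head method. Here is a minimal, self-contained sketch on a toy stand-in (illustrative values only; the real dataset has 21 columns):

```python
import pandas as pd

# Toy stand-in for the loaded Melbourne dataset (illustrative values only)
dataset = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
    "YearBuilt": [1900.0, 1995.0, 2014.0],
})

# head() shows the first rows of the dataframe (5 by default)
print(dataset.head())
```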
As we can see there are a lot of columns, let’s go ahead and print out the column names to get a better idea of the independent and dependent attributes.
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'], dtype='object')
The output of the describe function conveys that the values of each attribute lie on a different scale. Therefore, we will need to normalize them before building our model.
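The summary just mentioned comes from pandas' DataFrame.describe(); a self-contained sketch on toy values (not the real dataset) shows the kind of output it produces:

```python
import pandas as pd

# Toy stand-in: note the very different scales of the two columns
dataset = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [1035000.0, 1465000.0, 1600000.0, 850000.0],
})

# describe() reports count, mean, std, min, quartiles, and max per column
summary = dataset.describe()
print(summary)
```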
Before normalization, we will be taking a subset of the attributes which seem to be directly correlated to the price.
dataset = dataset[["Rooms", "Price", "Bedroom2", "Bathroom","Landsize", "BuildingArea", "YearBuilt"]]
We also need to make sure that our data does not contain any null values before proceeding to model building.
Rooms              0
Price              0
Bedroom2           0
Bathroom           0
Landsize           0
BuildingArea    6450
YearBuilt       5375
dtype: int64
As we can see, two attributes, BuildingArea and YearBuilt, contain several null values. Let's go ahead and drop those instances.
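The null counts above come from chaining isnull() with sum(); a self-contained sketch on toy data (None marks a missing value):

```python
import pandas as pd

# Toy stand-in with deliberately missing values
dataset = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, None, 150.0],
    "YearBuilt": [None, None, 2014.0],
})

# isnull() flags missing cells; sum() counts them per column
null_counts = dataset.isnull().sum()
print(null_counts)
```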
dataset = dataset.dropna()
We will now create a new attribute called HouseAge. Its values can be derived by subtracting the YearBuilt attribute from the current year. This is helpful because we do not have to deal with dates anymore. All the attributes are now numerical in nature, which will help us with the Machine Learning part later on.
dataset['HouseAge'] = 2022 - dataset["YearBuilt"].astype(int)
The YearBuilt attribute is not needed anymore. So, let's go ahead and drop it.
dataset = dataset.drop("YearBuilt", axis=1)
6. Data Normalization
As we saw before, the values of the attributes have different scales, which can lead to a disparity, as features larger in value will dominate over the smaller ones. Therefore, it is important to bring all the values down to one scale. For that, we will be using Min-Max Normalization. It is one of the most common techniques: the minimum value converts to 0, the maximum value converts to 1, and all the other values spread out between 0 and 1.
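Concretely, Min-Max Normalization maps each value x to (x - min) / (max - min); a tiny worked example:

```python
values = [10.0, 15.0, 20.0]
mn, mx = min(values), max(values)

# (x - min) / (max - min): minimum -> 0, maximum -> 1
scaled = [(v - mn) / (mx - mn) for v in values]
print(scaled)  # [0.0, 0.5, 1.0]
```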
There are direct methods available for normalization, but they convert the dataframe into a NumPy array, and hence we lose the column names. For that reason, we will define our own method, which takes in a dataframe and returns a new, normalized dataframe.
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
df = normalize(dataset)
Let’s have a look at our normalized dataframe.
As we can see, all the values lie between 0 and 1. It is now time to split our dataset into training and testing sets.
7. Splitting the Dataset
We will be doing a 70-30 train-test split. In the case of smaller datasets, one can also use a larger training fraction, such as an 80-20 split.
train, test = train_test_split(df, test_size=0.3)
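A note on train_test_split: passing random_state makes the split reproducible across runs. A self-contained sketch on a toy frame (the random_state value is arbitrary):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame of 10 rows standing in for the real dataset
toy = pd.DataFrame({"feature": range(10), "Price": range(10)})

# 70-30 split; random_state fixes the shuffle so the split is repeatable
train, test = train_test_split(toy, test_size=0.3, random_state=42)
print(len(train), len(test))  # 7 3
```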
Let’s now separate the dependent and independent variables.
train_y = train[["Price"]]
train_x = train.drop(["Price"], axis=1)
test_y = test[["Price"]]
test_x = test.drop(["Price"], axis=1)
8. Building the Model
We will use a Linear Regression model in this case. Since this is a simple dataset, a Linear Regression model should do the trick. To build a more sophisticated model, one can also try Decision Trees.
Explore more about Linear Regression with GridDB and Python here.
model = LinearRegression()
model.fit(train_x, train_y)
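For reference, the Decision Tree alternative mentioned above could be sketched with sklearn's DecisionTreeRegressor. Toy arrays stand in for train_x and train_y here, and the max_depth value is an assumption, not a tuned choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the training data
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# A shallow tree; on real data, depth should be tuned to avoid overfitting
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict([[2.0]]))
```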
9. Making Predictions
Let us now make predictions on our test set.
predictions = model.predict(test_x)
array([[0.0890521 ], [0.06244483], [0.13166691], ..., [0.09182388], [0.20981148], [0.1077662 ]])
10. Model Evaluation
To quantify how good our predictions are, there are several metrics provided by the sklearn library. We will be using the mean_absolute_error metric, which is one of the most common metrics for Linear Regression models.
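The metric itself is simply the average of the absolute differences between true and predicted values; a self-contained sketch with toy numbers (on the normalized scale, as in this tutorial):

```python
from sklearn.metrics import mean_absolute_error

# Toy true vs. predicted values (illustrative only)
y_true = [0.10, 0.20, 0.30]
y_pred = [0.12, 0.18, 0.33]

# mean of |true - pred| over all samples
mae = mean_absolute_error(y_true, y_pred)
print(mae)
```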
Great! Our model has a mean absolute error of 0.03, which is not a bad start for a Linear Regression model. Note that this error is on the normalized price scale, since the model was fit on the normalized dataframe.
In this tutorial, we saw how we can build a Machine Learning model for a housing dataset. In the beginning, we covered two methods for loading our dataset into the environment: GridDB and pandas. We also pruned the dataset as per our needs. Later on, we used the Linear Regression model provided by the sklearn library to fit our dataset.
Learn more about real-time predictions with GridDB and Python here.