In this tutorial, we will explore a housing dataset using Python. We will first prune the dataset as per our needs. Later, we will see how we can build a Machine Learning model to fit our dataset and make future predictions.
The outline of the tutorial is as follows:
- About the Dataset
- Importing the Libraries
- Loading the Dataset
- Data Preprocessing
- Data Normalization
- Splitting the Dataset
- Building the Model
- Making Predictions
- Model Evaluation
This tutorial is executed using Jupyter Notebooks (Anaconda version 4.8.3) with Python version 3.8 on the Windows 10 operating system. The following packages need to be installed before executing the code:
If you are using Anaconda, packages can be installed in multiple ways, such as through the user interface, the command line, or Jupyter Notebooks. The most conventional way to install a Python package is via pip. If you are using the command line or the terminal, type pip install package-name. Another way to install a package is through conda install package-name within the Anaconda environment.
Also, note that we will cover two methods to load our dataset into the Python environment: using GridDB and using pandas. For using GridDB within the Python environment, the griddb_python package is required.
2. About the Dataset
We will be using a snapshot of the Melbourne Housing Dataset which has been scraped from public resources and is now available on Kaggle. The dataset has been preprocessed to some extent and contains a total of 13580 instances. The number of attributes present in the dataset is 21. The dependent variable is the price of the property while the other 20 attributes are independent. Let us now get started on the code.
3. Importing the Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
The above cell should execute without any output if you installed the libraries successfully. In case you encounter an error, try the following:
- Reconfirm that the installation was successful. If not, try executing pip install package-name again.
- Check whether your system is compatible with the package version.
4. Loading the Dataset
GridDB is an open-source time-series database designed for handling large amounts of data. It is optimized for IoT and is highly efficient because of its in-memory architecture. Since dealing with files locally can lead to integration issues in a professional environment, using a reliable database becomes important. GridDB provides that reliability and scalability with fault tolerance.
Let us now go ahead and load our dataset.
import griddb_python as griddb

# container is assumed to hold the GridDB connection set up beforehand
sql_statement = ('SELECT * FROM melb_data')
dataset = pd.read_sql_query(sql_statement, container)
The dataset variable will now hold the data in the form of a pandas dataframe. If you are new to GridDB, a tutorial on how to insert data in GridDB might be helpful.
Another way to load the dataset is by using pandas directly.
dataset = pd.read_csv("melb_data.csv")
5. Data Preprocessing
Great! Now that we have our dataset, let's see what it actually looks like.
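The first rows of a dataframe can be previewed with pandas' head method. Here is a minimal, self-contained sketch on a toy stand-in (illustrative values only; the real dataset has 21 columns):

```python
import pandas as pd

# Toy stand-in for the loaded Melbourne dataset (illustrative values only)
dataset = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
    "YearBuilt": [1900.0, 1995.0, 2014.0],
})

# head() shows the first rows of the dataframe (5 by default)
print(dataset.head())
```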
As we can see there are a lot of columns, let’s go ahead and print out the column names to get a better idea of the independent and dependent attributes.
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'], dtype='object')
The output of the describe function conveys that the values of each attribute lie on a different scale. Therefore, we will need to normalize them before building our model.
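The summary just mentioned comes from pandas' DataFrame.describe(); a self-contained sketch on toy values (not the real dataset) shows the kind of output it produces:

```python
import pandas as pd

# Toy stand-in: note the very different scales of the two columns
dataset = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [1035000.0, 1465000.0, 1600000.0, 850000.0],
})

# describe() reports count, mean, std, min, quartiles, and max per column
summary = dataset.describe()
print(summary)
```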
Before normalization, we will be taking a subset of the attributes which seem to be directly correlated to the price.
dataset = dataset[["Rooms", "Price", "Bedroom2", "Bathroom","Landsize", "BuildingArea", "YearBuilt"]]
We also need to make sure that our data does not contain any null values before proceeding to model building.
Rooms              0
Price              0
Bedroom2           0
Bathroom           0
Landsize           0
BuildingArea    6450
YearBuilt       5375
dtype: int64
As we can see, two attributes, BuildingArea and YearBuilt, contain several null values. Let's go ahead and drop those instances.
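The null counts above come from chaining isnull() with sum(); a self-contained sketch on toy data (None marks a missing value):

```python
import pandas as pd

# Toy stand-in with deliberately missing values
dataset = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, None, 150.0],
    "YearBuilt": [None, None, 2014.0],
})

# isnull() flags missing cells; sum() counts them per column
null_counts = dataset.isnull().sum()
print(null_counts)
```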
dataset = dataset.dropna()
We will now create a new attribute called HouseAge. Its values can be derived by subtracting the YearBuilt attribute from the current year. This is helpful because we do not have to deal with dates anymore. All the attributes are now numerical in nature, which will help us with the Machine Learning part later on.
dataset['HouseAge'] = 2022 - dataset["YearBuilt"].astype(int)
The YearBuilt attribute is not needed anymore. So, let's go ahead and drop it.
dataset = dataset.drop("YearBuilt", axis=1)
6. Data Normalization
As we saw before, the values of the attributes have different scales, which can lead to a disparity, as features larger in value will dominate over the smaller ones. Therefore, it is important to bring all the values down to one scale. For that, we will be using Min-Max Normalization. It is one of the most common techniques: the minimum value converts to 0, the maximum value converts to 1, and all the other values spread out between 0 and 1.
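Concretely, Min-Max Normalization maps each value x to (x - min) / (max - min); a tiny worked example:

```python
values = [10.0, 15.0, 20.0]
mn, mx = min(values), max(values)

# (x - min) / (max - min): minimum -> 0, maximum -> 1
scaled = [(v - mn) / (mx - mn) for v in values]
print(scaled)  # [0.0, 0.5, 1.0]
```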
There are direct methods available for normalization, but they convert the dataframe into a NumPy array, and hence we lose the column names. For that reason, we will define our own method, which takes in a dataframe and returns a new, normalized dataframe.
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
df = normalize(dataset)
Let’s have a look at our normalized dataframe.
As we can see, all the values lie between 0 and 1. It is now time to split our dataset into training and testing sets.
7. Splitting the Dataset
We will be doing a 70-30 train-test split. In the case of smaller datasets, one can also use a larger training fraction, such as an 80-20 split.
train, test = train_test_split(df, test_size=0.3)
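A note on train_test_split: passing random_state makes the split reproducible across runs. A self-contained sketch on a toy frame (the random_state value is arbitrary):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame of 10 rows standing in for the real dataset
toy = pd.DataFrame({"feature": range(10), "Price": range(10)})

# 70-30 split; random_state fixes the shuffle so the split is repeatable
train, test = train_test_split(toy, test_size=0.3, random_state=42)
print(len(train), len(test))  # 7 3
```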
Let’s now separate the dependent and independent variables.
train_y = train[["Price"]]
train_x = train.drop(["Price"], axis=1)
test_y = test[["Price"]]
test_x = test.drop(["Price"], axis=1)
8. Building the Model
We will use a Linear Regression model in this case. Since this is a simple dataset, a Linear Regression model should do the trick. To build a more sophisticated model, one can also try Decision Trees.
Explore more about Linear Regression with GridDB and Python here.
model = LinearRegression()
model.fit(train_x, train_y)
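For reference, the Decision Tree alternative mentioned above could be sketched with sklearn's DecisionTreeRegressor. Toy arrays stand in for train_x and train_y here, and the max_depth value is an assumption, not a tuned choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the training data
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# A shallow tree; on real data, depth should be tuned to avoid overfitting
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict([[2.0]]))
```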
9. Making Predictions
Let us now make predictions on our test set.
predictions = model.predict(test_x)
array([[0.0890521 ], [0.06244483], [0.13166691], ..., [0.09182388], [0.20981148], [0.1077662 ]])
10. Model Evaluation
To quantify how good our predictions are, there are several metrics provided by the sklearn library. We will be using the mean_absolute_error metric, which is one of the most common metrics for Linear Regression models.
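The metric itself is simply the average of the absolute differences between true and predicted values; a self-contained sketch with toy numbers (on the normalized scale, as in this tutorial):

```python
from sklearn.metrics import mean_absolute_error

# Toy true vs. predicted values (illustrative only)
y_true = [0.10, 0.20, 0.30]
y_pred = [0.12, 0.18, 0.33]

# mean of |true - pred| over all samples
mae = mean_absolute_error(y_true, y_pred)
print(mae)
```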
Great! Our model has a mean absolute error of 0.03, which is not a bad start for a Linear Regression model. Note that this error is on the normalized price scale, since the model was fit on the normalized dataframe.
In this tutorial, we saw how we can build a Machine Learning model for a housing dataset. In the beginning, we covered two methods for loading our dataset into the environment: GridDB and pandas. We also pruned the dataset as per our needs. Later on, we used the Linear Regression model provided by the sklearn library to fit our dataset.
Learn more about real-time predictions with GridDB and Python here.