Normalization in Java

In this blog post, we will demonstrate normalizing data for machine learning with Java and GridDB.

Source code for this blog can be found on our Github page: https://github.com/griddbnet/Blogs

$ git clone https://github.com/griddbnet/Blogs.git --branch normalization

Before we get started, Let’s understand normalization first.

What is Normalization?

Normalization is a commonly used data preparation technique in Machine Learning that changes the values of columns in the dataset using a common scale. Normalization is required when a range of characteristics are present in the dataset.

In our project, we’re going to use the dataset house.csv that looks something like this:


housesize,lotsize,bedrooms,granite,bathroom,sellingprice
3529,9191,6,0,0,205000 
3247,10061,5,1,1,224900 
4032,10150,5,0,1,197900 
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000

You can see that the dataset contains large numbers with different ranges, making it difficult to perform data analytics operations.

With normalization, we can simplify this dataset by changing the column values on a common scale, such as ranging column values between 0 and 1. Looks something like this.


housesize,lotsize,bedrooms,granite,bathroom,sellingprice
0.571507,0.080533,0.5,1,1,224900
0.107533,0.459595,0,1,0,189900
0.729258,1,1,1,1,325000
0.725437,0,1,0,0,205000
1,0.088772,0.5,0,1,197900
0,0.03786,0,0,1,195000
0.427402,0.016107,0.5,0,1,230000

You may notice that the column sellingprice remains unchanged. This is because the classifier must know which is our outcome variable.

That’s why we set the column sellingprice as our class index for the dataset before passing it to the classifier. You will observe it practically later in this article.

The following formula defines normalization mathematically.

Xn = (X – Xmin) / (Xmax – Xmin)

Where,
Xn = Normalized value.
Xmin = Minimum value of a feature.
Xmax = Maximum value of a feature.

Types of Normalization

Many types of normalization are available in machine learning. But the following types are widely used:

Min-Max Scaling: This type of normalization subtracts the minimum value from the highest value of each column and divides it by the range. Each new column has a minimum value of 0 and a maximum value of 1.

Standardization Scaling: Also known as Z-score normalization centers a variable at zero and standardizes the variance at one. It first subtracts the mean of each observation and then divides it by the standard deviation.

That’s enough for the theory. Let’s dive into something practical.

How to use Normalization?

Project Setup

Our project implements a basic level of normalization in Java and GridDB. Here is the project organization.

To run this project, your system should have:
– Java (Latest version)
– GridDB

Required JAR files

We need two JAR files gridstore-5.5.0.jar and weka-3.7.0.jar. Here, gridstore-5.5.0.jar is an official library for GridDB, and weka-3.7.0.jar is a Machine Learning library used for data analysis.

Download the Weka JAR from the following URL:

http://www.java2s.com/Code/Jar/w/weka.htm

Download the gridstore.jar from here:

https://mvnrepository.com/artifact/com.github.griddb/gridstore-jdbc/5.5.0

Import Packages

For this project, we need some basic Java packages, such as:


import java.util.Properties;
import java.util.Scanner;
import java.io.File;
import java.io.Reader;
import java.io.StringReader;

Include the following packages to connect and operate GridDB:


import com.toshiba.mwcloud.gs.Collection;
import com.toshiba.mwcloud.gs.GSException;
import com.toshiba.mwcloud.gs.GridStore;
import com.toshiba.mwcloud.gs.GridStoreFactory;
import com.toshiba.mwcloud.gs.Query;
import com.toshiba.mwcloud.gs.RowKey;
import com.toshiba.mwcloud.gs.RowSet;

For data analysis, include these packages from Weka:


import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

Well, that’s the initial step.

Let’s move on to our next step.

Creating a Class for Storing Data

Our project stores data in a GridDB container. So, we need to define the container schema as a static class:


static class House {
    @RowKey int housesize;
    int lotsize;
    int bedrooms;
    int granite;
    int bathroom;
    int sellingprice;
}

The static class House consists of six integer-type columns with a RowKey housesize.

Creating a GridDB Connection

We have the schema class ready. It’s time to connect our program with GridDB.

For creating GridDB connection, we have to initiate a Properties instance defining the credentials for GridDB installation. Have a look at the following code:


Properties props = new Properties();
props.setProperty("notificationMember", "127.0.0.1:10001");
props.setProperty("clusterName", "myCluster");
props.setProperty("user", "admin");
props.setProperty("password", "admin");
GridStore store = GridStoreFactory.getInstance().getGridStore(props);

You may need to change these connection specifics based on your GridDB installation.

Once you have successfully created your connection, it’s time to create a collection.


Collection coll = store.putCollection("House", House.class);

Here, we created a collection object coll for the container House.

Push Data to GridDB

Now we have our GridDB schema and our connection is ready to go.

Let’s put some data to it,

In this purpose, we will retrive data from the house.csv file and push them to GridDB. The following code block achieves this task,


File file1 = new File("house.csv");
Scanner scan = new Scanner(file1);
String data = scan.next();

while (scan.hasNext()){
    String scanData = scan.next();
    String dataList[] = scanData.split(",");
    String housesize = dataList[0];
    String lotsize = dataList[1];
    String bedrooms = dataList[2];
    String granite = dataList[3];
    String bathroom = dataList[4];
    String sellingprice = dataList[5];
    
    House hs = new House();
    
    hs.housesize = Integer.parseInt(housesize);
    hs.lotsize = Integer.parseInt(lotsize);
    hs.bedrooms = Integer.parseInt(bedrooms);
    hs.granite = Integer.parseInt(granite);
    hs.bathroom = Integer.parseInt(bathroom);
    hs.sellingprice = Integer.parseInt(sellingprice);
    coll.put(hs);
}

We created an object of our schema class House to push data to the container.

Retrieve Data from GridDB

Let’s pull data from the GridDB container and make the data available for analysis. Use the following code.


Query query = coll.query("select *");
RowSet rs = query.fetch(false);

Query data from GridDB is similar to general SQL queries. Here, we used the query select * to retrieve all the data from the GridDB container.

We’ll save these data into a string for our next step.


String datastr = "@RELATION house\n\n@ATTRIBUTE houseSize NUMERIC\n@ATTRIBUTE lotSize NUMERIC\n@ATTRIBUTE bedrooms NUMERIC\n@ATTRIBUTE granite NUMERIC\n@ATTRIBUTE bathroom NUMERIC\n@ATTRIBUTE sellingPrice NUMERIC\n\n@DATA\n";

while (rs.hasNext()) {
House hs = rs.next();
datastr = datastr + hs.housesize+","+hs.lotsize+","+hs.bedrooms+","+hs.granite+","+hs.bathroom+","+hs.sellingprice+"\n";
}

Notice that we initialize the string datastr with the arff file structure.

Let’s see what the data looks like before normalization,


System.out.println("==== Before Normalization ===\n"+datastr);

Creating Weka Instances

We will use the Weka Machine Learning library for normalizing data. Hence, we need to convert our data into Weka Instances.

Do you remember we stored data in the string datastr in our previous step?

Yes, we’re going to pass that string in our Weka instances.


Reader dataString = new StringReader(datastr);
Instances dataset = new Instances(dataString);
dataset.setClassIndex(dataset.numAttributes()-1);

The above code creates a Weka Instance.

Normalization

Let’s perform normalization to our dataset,


Normalize normalize = new Normalize();
normalize.setInputFormat(dataset);
Instances newdata = Filter.useFilter(dataset, normalize);

Next, we are going to visualize data after normalization


System.out.println("==== After Normalization ===\n"+newdata);

Lastly, we just applied linear regression to our normalized data.


LinearRegression lr = new LinearRegression();
lr.buildClassifier(newdata);
Evaluation lreval = new Evaluation(newdata);
lreval.evaluateModel(lr, newdata);
System.out.println(lreval.toSummaryString());

That’s all about the project. Let’s test it.

Compile and Execute

If you aren’t using any IDE, you may need to include the JAR files through the classpath:


export CLASSPATH=$CLASSPATH:/Home/User/Download/gridstore.jar

export CLASSPATH=$CLASSPATH:/Home/User/Download/weka-3.7.0.jar

These paths may differ based on your setup.

Now, you can compile the code using this command:


javac NormalizationJava.java

Then execute the code:


java NormalizationJava

Showing the Output

Running the code, you’ll have the following output:


==== Before Normalization ===
@RELATION house

@ATTRIBUTE houseSize NUMERIC
@ATTRIBUTE lotSize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingPrice NUMERIC

@DATA
3247,10061,5,1,1,224900
2397,14156,4,1,0,189900
3536,19994,6,1,1,325000
3529,9191,6,0,0,205000
4032,10150,5,0,1,197900
2200,9600,4,0,1,195000
2983,9365,5,0,1,230000

==== After Normalization ===
@relation house-weka.filters.unsupervised.attribute.Normalize-S1.0-T0.0

@attribute houseSize numeric
@attribute lotSize numeric
@attribute bedrooms numeric
@attribute granite numeric
@attribute bathroom numeric
@attribute sellingPrice numeric

@data
0.571507,0.080533,0.5,1,1,224900
0.107533,0.459595,0,1,0,189900
0.729258,1,1,1,1,325000
0.725437,0,1,0,0,205000
1,0.088772,0.5,0,1,197900
0,0.03786,0,0,1,195000
0.427402,0.016107,0.5,0,1,230000

Correlation coefficient                  0.9945
Mean absolute error                   4053.821 
Root mean squared error               4578.4125
Relative absolute error                 13.1339 %
Root relative squared error             10.51   %
Total Number of Instances                7     

BUILD SUCCESSFUL (total time: 1 second)

Wrapping Up

That’s all for normalization in Java with GridDB.

This article demonstrates applying normalization to the dataset for better analysis. As we mentioned, normalization is helpful when the dataset contains a range of characteristics.

Finally, make sure you close the query, container, and the GridDB database.


query.close();
coll.close();
store.close();

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.