Analyzing a Time Series Real Estate Dataset with GridDB and Java

Dataset and Environment Setup

In this article we will discuss how to analyze and ingest a time series dataset with GridDB and Java. The data we will be analyzing is an open dataset that contains real estate property sales details. You can download the dataset from this link

First of all, let’s take a look at the structure of the dataset. You can have a proper idea on the dataset by referring to the following table.

The reason why we have used GridDB in this implementation is because it has unique features that make it ideal for time series data. This article describes clearly about time series data and GridDB.

Before starting implementation, you need to set up the GridDB server with a Java client. If you haven’t set it up yet, you can follow this quick start guide shown here.

In this article, we will not be focusing on how to Connect, get GridStore and store data. We will be mainly focusing on analyzing the dataset. Now let’s move to the implementation.

Declare column names and get gridstore instance

First we need to declare the attributes of the dataset as a static inner class as follows.

static class Sales{
                @RowKey Date salesdate;
                int MA;
                String type;
                int bedrooms;
        }

According to the dataset, there should be four attributes. Here we have used the same names as the column names for readability. @RowKey syntax is used to identify the row key of the document.

Next, it is required to get the gridstore instance and create a timeseries. To get gridstore, set a number of properties such as notification address, notification port, username name and password.

Read dataset and Store

Now, the data should be read from the dataset and preprocess in order to store in the database.

In our dataset, the date is in the format of “dd/MM/yyyy”. But the date should be stored as a timestamp. Since this dataset doesn’t contain any time given, let’s set the time as “00:00”. Other column values can be stored without doing any change.

File csvFile = new File("ma_lga_12345.csv");
                Scanner sc = new Scanner(csvFile);
                String data = sc.next();
 
                while (sc.hasNext()){
                        String scData = sc.next();
                        String dataList[] = scData.split(",");
                        String salesdate = dataList[0];
                        String MA = dataList[1];
                        String type = dataList[2];
                        String bedrooms = dataList[3];
                        
                        Sales sales = new Sales();
    
                        sales.salesdate = convertDateToTimeStamp(salesdate);
                        sales.MA = Integer.parseInt(MA);
                        sales.type = type;
                        sales.bedrooms = Integer.parseInt(bedrooms);
                        ts.append(sales);
                 }

The above code reads the CSV file line by line extracting relevant data and then creates a sales object. Then the created sales object has been appended to the database.

The ‘converDateToTimeStamp’ method is used to convert the sales Date in the dataset to timestamp as follows.

static Date convertDateToTimeStamp(date){
             
            String OLD_FORMAT = "dd/MM/yyyy";
            String NEW_FORMAT = "yyyy/MM/dd";
            SimpleDateFormat format = new SimpleDateFormat("yyyy/MM/ddHH:mm:ss");          
            SimpleDateFormat sdf = new SimpleDateFormat(OLD_FORMAT);
            Date d = sdf.parse(salesdate);
            sdf.applyPattern(NEW_FORMAT);
            String newDateString = sdf.format(d);
            String datetimes = newDateString +"00"+":00:00";
            Date dates = format.parse(datetimes);
            Long dt = dates.getTime();
            return new Date(dt);
         }

We have used the SimpleDateFormat package in java in converting date to timestamp. By now we have stored all the data we need in the database.

Analyzing data

From this point, we can pay attention to analyzing the dataset we just prepared. These data analyzing techniques that have been applied on real estate sales data can be generalized to use in several other problem scenarios as well.

Retrieving data in a given time range

First, let’s see how we can extract the specified range of time series elements. As an example, let’s take 4 months as the time range. So what we need to do is, extracting the data from the current timestamp to 4 months before the current date. Read this code and understand the implementation of the logic as well as the syntax.

 Date now = TimestampUtils.current();
 Date before = TimestampUtils.add(now, -4, TimeUnit.MONTH);
 RowSet rs = ts.query(before, now).fetch();
while (rs.hasNext()) {
           Sales sales = new Sales();
           sales = rs.next();
           System.out.println( "Sales Date=" + TimestampUtils.format(sales.salesdate) +
                              " MA =" + sales.MA +
                              " Type" + sales.type +
                              " Bedrooms" + sales.bedrooms);
 }

In the first line of this code, we have taken the current timestamp. In order to do that, we have used the TimestampUtils method in the implementation. Then we need to reduce four months of time from the current time. “TimestampUtils.add” method can be used for that. If you want to add a particular time to the current time, you only need to remove the “-” sign in front of the time in the function.

If you need to change the time unit, you simply have to mention the time unit as TimeUnit.MONTH, TimeUnit.HOURS or any other time unit according to the time range you need to add or subtract.

Now, we have two timestamps. First one is the current time. Second one is the timestamp four months back. Querying to get the data in this time range is quite easy.

RowSet rs = ts.query(before, now).fetch();

This gives you all the rows of data between the specified time range. However, you may get more than one row as the output of the above code. So you should read each data row one by one as implemented in the code. Once you get the row and extract it to the variables, you can either print it or apply any operation on the data.

Query the database

Now, let’s discuss how to write a query which includes multiple conditions on attributes same as “SQL queries” and extract data.

Assume that a user needs to get the first 20 unit type house details in decreasing order of “MA” value which are smaller than 50000$ . First let’s see how we can write the query for this scenario.

Select * from sales01 where type=’unit’ and MA <5000 order by MA desc limit 20

If you are familiar with SQL, you may know that this is how we would write SQL queries for this scenario. It’s important to mention that although this query and the syntax of most other queries are similar to SQL, there are some differences between the query language which is used in GridDB (TQL) and SQL.

Let’s see how to get an output from the query we just wrote.

Query query = ts.query("select * from sales01" +
      " where type='unit' and MA < 50000 order by MA desc limit 20");
RowSet res = query.fetch();

The code above shows you how to get the result from any query. Similar to the previous example, you need to check whether there are any resultant rows remaining and apply relevant operation on extracted data.

Computing the average of extracted data

Finally let’s take a look at how we can get an average value of a particular data value which is in a specific time range.

We are going to get the average “MA” value between 4 months time period, where, from 2 months before the given date to 2 months after the given date.

Assume that “salesTime” is a timestamp given by the user. .

Date start = TimestampUtils.add(salesTime, -2, TimeUnit.MONTH);
Date end = TimestampUtils.add(salesTime, 2, TimeUnit.MONTH);
AggregationResult avg = ts.aggregate(start, end, "MA",  
Aggregation.AVERAGE);
 
System.out.println(“avg=" + avg.getDouble());     

As we discussed previously, starting time and ending time can be obtained with the “TimestampUtils.add” method. Now, we need to get the average value of “MA” within the specified time range. For that, we can use the following aggregation method.

AggregationResult avg =ts.aggregate(start, end, "MA", Aggregation.AVERAGE);

Once we get the average value to ‘avg’ variable we can apply suitable operations on it such as storing the value in the database, pass the value to another function as a parameter or simply print the average in std out.

Conclusion

Great, that’s pretty much about it. In this article we discussed some simple methods on how we can analyze real estate sales data in Java. We used GridDB for manipulating this time series data set. Try to get more familiar with these technologies, add more complex methods to analyse data and improve the functionalities further.