Aggregation Functions in GridDB

Aggregation Values and Why They are Important

Aggregation is a way of summarizing data based on certain criteria. Familiar examples include statistics such as minimum value, average value or the sum of all values. We usually use these results for making conclusions or inferences about data and they are also great for observing patterns or trends in your data over time. Other aggregation results such as standard deviation and variance give a means for estimating the reliability of a data sample. An example would be how a high standard deviation or variance could mean that there is a high spread of the measurements in a data set. This can make these measurements less certain or reliable. GridDB provides several ways and functions for developers to aggregate data through their API.

Aggregation in GridDB

GridDB provides various built-in API functions that can be used to collect aggregation data. In the Java API for GridDB, aggregation methods return objects of type AggregationResult. These objects are usually in the form of a single result instead of a set of rows from a container. Aggregation results that GridDB can collect include minimum, maximum, average values, standard deviation, number of rows, sum of all rows, and variance. You can fetch these results using the .aggregate method for TimeSeries containers or through TQL queries for Collections.

Performing Aggregation with TQL Queries

TQL is an abbreviation for Terminology Query Language. It is GridDB’s simplified version of SQL that is used to make standard queries and data operations into containers. TQL also provides the ability to obtain the execution plans or analysis on query statements. Use standard strings to express TQL and issue them with .query or gsQuery functions from the GridDB API.

TQL can be used perform aggregation queries on both collections and TimeSeries containers. Use either the .fetch() gsFetch functions to obtain the aggregation results and use the methods mentioned earlier to obtain the number.

Let’s say we have a Collection in Java with a Row schema in the form of a Java class called ClimateReading. The variable name climateContainer is cast to this collection. We wish to obtain both the standard deviation and the average of the “temperature” column in this container. We start by forming TQL queries as strings and use the query and fetch functions from the GridDB API to obtain the aggregation results.

Container<String,ClimateReading> climateContainer = store.getCollection("climate",ClimateReading.class); 
String averageTql = "SELECT AVG(temperature) FROM climate";
String standardDeviationTql>= "SELECT STDDEV(temperature) FROM climate";
Query<AggregationResult> averageQuery = climateContainer.query(averageTql,AggregationResult.class);
Query<AggregationResult> standardDeviationQuery = climateContainer.query(standardDeviationTql, AggregationResult.class);
RowSet<AggregationResult> rowSet = aggregationQuery.fetch();//Set of one row which has the result
RowSet<AggregationResult> stdDeviationSet = deviationQuery.fetch();
AggregationResult average = rowSet.next();
AggregationResult stdDev = stdDeviationSet.next();
System.out.println("The average temperature is " + average.getDouble());
System.out.println("The standard deviation of the temperature is " + stdDev.getDouble());

Output

The average temperature is 73.56
The standard deviation of the temperature is 4.62

Aggregation on TimeSeries Containers

To retrieve an aggregation result for a TimeSeries container in Java simply use the .aggregate method. Select the column you want to query and the type of aggregation you wish to calculates and two dates you want to set as your starting and ending dates. These four parameters make up your aggregation query. Once this query is fetched, an AggregationResult is returned. Use either .getDouble() or .getLong() calls to retrieve the numbers from AggregationResult objects.

In the example below there is a TimeSeries container set to the variable timeSeries. It has a numeric column called “light” that we wish to obtain the average for. We also want to limit the rows we include in our calculation to be between a set starting date and ending date. If you do not want to have a starting date or ending date on the query, simply set the first or second parameter of your query to null. These dates are cast as Date objects in the variables start and end respectively.

// Get average from earliest record to latest record in the container on the ‘light’ column
AggregationResult averageResult = timeSeries.aggregate(start,end,”light”,Aggregation.AVERAGE);
System.out.println("The average light value is " + averageResult.getDouble()); 

Output:

The average light value is 29.82

Time Weighted Averages

Sometimes when obtaining the average of a column from a database, the normal average that we think of (sum of all values divided by the number of values) is not enough. An aggregation result that is better suited for time-related data is a time average. A time weighted average not only factors all the values of a measurement, but also the amount of time that is between them.

One example could be a sensor that records measurements at different or sporadic time spans. This sensor might have periods where there were 3 seconds between its recordings. After a few hours however, this time span between readings could have increased to 6 seconds or more. One way to account for these discrepancies is to weigh the average itself by the time values of all the entries in the table along with the time spans between their adjacent rows. GridDB provides the ability to obtain time-weighted averages for TimeSeries containers.

Calculating a Time Weighted Average

Time averages are aggregation values that are unique to TimeSeries containers. You obtain these results with the TIME_AVG TQL query or the WEIGHTED_AVERAGE aggregation. Weighted averages in GridDB are calculated by giving every value in a row a weighted value that is half the time span in seconds between the two rows adjacent to it (the rows before and after).

Let’s say we want to collect the weighted average of a column called light in a TimeSeries container. In one row, we can call row 10, it has a time stamp of June 22nd 8:23:43 and a light value of 23. The row preceding it was inserted 4 seconds before it. The following row was inserted 10 seconds after row 10.

The weighted value of row 10 is:

(10 + 4 ) / 2 = 7 

In a container with 5 rows that were inserted at the below times:

Row     Time     Value                    Weighted-Value
0          0        20                                10
1         10        30             (10 + 15 ) / 2 = 12.5
2         25        50                 (15 + 5) / 2 = 10
3         30        70               (5 + 30) / 2 = 17.5
4         60        60                                30

The weighted average is the sum of all values divided by the sum of all weighted values.

Sum of Values:		20 + 30 + 50 + 70  + 60 = 230   
Sum of Weights: 	10 + 12.5 + 17.5 + 10   =  50   
Sum of Values / Sum of Weights:         230/50  = 4.6

We are have a weighted average of 4.6.

More information about the aggregation operations in GridDB can be found here. Learn more about GridDB APIs at the GridDB API reference page.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.