Aggregation Values and Why They are Important
Aggregation is a way of summarizing data based on certain criteria. Familiar examples include statistics such as minimum value, average value or the sum of all values. We usually use these results for making conclusions or inferences about data and they are also great for observing patterns or trends in your data over time. Other aggregation results such as standard deviation and variance give a means for estimating the reliability of a data sample. An example would be how a high standard deviation or variance could mean that there is a high spread of the measurements in a data set. This can make these measurements less certain or reliable. GridDB provides several ways and functions for developers to aggregate data through their API.
Aggregation in GridDB
GridDB provides various built-in API functions that can be used to collect aggregation data. In the Java API for GridDB, aggregation methods return objects of type
AggregationResult. These objects are usually in the form of a single result instead of a set of rows from a container. Aggregation results that GridDB can collect include minimum, maximum, average values, standard deviation, number of rows, sum of all rows, and variance. You can fetch these results using the
.aggregate method for TimeSeries containers or through TQL queries for Collections.
Performing Aggregation with TQL Queries
TQL is an abbreviation for Terminology Query Language. It is GridDB’s simplified version of SQL that is used to make standard queries and data operations into containers. TQL also provides the ability to obtain the execution plans or analysis on query statements. Use standard strings to express TQL and issue them with
gsQuery functions from the GridDB API.
TQL can be used perform aggregation queries on both collections and TimeSeries containers. Use either the
gsFetch functions to obtain the aggregation results and use the methods mentioned earlier to obtain the number.
Let’s say we have a Collection in Java with a Row schema in the form of a Java class called ClimateReading. The variable name
climateContainer is cast to this collection. We wish to obtain both the standard deviation and the average of the “temperature” column in this container. We start by forming TQL queries as strings and use the query and fetch functions from the GridDB API to obtain the aggregation results.
Container<String,ClimateReading> climateContainer = store.getCollection("climate",ClimateReading.class); String averageTql = "SELECT AVG(temperature) FROM climate"; String standardDeviationTql>= "SELECT STDDEV(temperature) FROM climate"; Query<AggregationResult> averageQuery = climateContainer.query(averageTql,AggregationResult.class); Query<AggregationResult> standardDeviationQuery = climateContainer.query(standardDeviationTql, AggregationResult.class); RowSet<AggregationResult> rowSet = aggregationQuery.fetch();//Set of one row which has the result RowSet<AggregationResult> stdDeviationSet = deviationQuery.fetch(); AggregationResult average = rowSet.next(); AggregationResult stdDev = stdDeviationSet.next(); System.out.println("The average temperature is " + average.getDouble()); System.out.println("The standard deviation of the temperature is " + stdDev.getDouble());
The average temperature is 73.56 The standard deviation of the temperature is 4.62
Aggregation on TimeSeries Containers
To retrieve an aggregation result for a TimeSeries container in Java simply use the
.aggregate method. Select the column you want to query and the type of aggregation you wish to calculates and two dates you want to set as your starting and ending dates. These four parameters make up your aggregation query. Once this query is fetched, an AggregationResult is returned. Use either
.getLong() calls to retrieve the numbers from AggregationResult objects.
In the example below there is a TimeSeries container set to the variable
timeSeries. It has a numeric column called “light” that we wish to obtain the average for. We also want to limit the rows we include in our calculation to be between a set starting date and ending date. If you do not want to have a starting date or ending date on the query, simply set the first or second parameter of your query to
null. These dates are cast as Date objects in the variables
// Get average from earliest record to latest record in the container on the ‘light’ column AggregationResult averageResult = timeSeries.aggregate(start,end,”light”,Aggregation.AVERAGE); System.out.println("The average light value is " + averageResult.getDouble());
The average light value is 29.82
Time Weighted Averages
Sometimes when obtaining the average of a column from a database, the normal average that we think of (sum of all values divided by the number of values) is not enough. An aggregation result that is better suited for time-related data is a time average. A time weighted average not only factors all the values of a measurement, but also the amount of time that is between them.
One example could be a sensor that records measurements at different or sporadic time spans. This sensor might have periods where there were 3 seconds between its recordings. After a few hours however, this time span between readings could have increased to 6 seconds or more. One way to account for these discrepancies is to weigh the average itself by the time values of all the entries in the table along with the time spans between their adjacent rows. GridDB provides the ability to obtain time-weighted averages for TimeSeries containers.
Calculating a Time Weighted Average
Time averages are aggregation values that are unique to
TimeSeries containers. You obtain these results with the
TIME_AVG TQL query or the
WEIGHTED_AVERAGE aggregation. Weighted averages in GridDB are calculated by giving every value in a row a weighted value that is half the time span in seconds between the two rows adjacent to it (the rows before and after).
Let’s say we want to collect the weighted average of a column called light in a TimeSeries container. In one row, we can call row 10, it has a time stamp of June 22nd 8:23:43 and a light value of 23. The row preceding it was inserted 4 seconds before it. The following row was inserted 10 seconds after row 10.
The weighted value of row 10 is:
(10 + 4 ) / 2 = 7
In a container with 5 rows that were inserted at the below times:
Row Time Value Weighted-Value 0 0 20 10 1 10 30 (10 + 15 ) / 2 = 12.5 2 25 50 (15 + 5) / 2 = 10 3 30 70 (5 + 30) / 2 = 17.5 4 60 60 30
The weighted average is the sum of all values divided by the sum of all weighted values.
Sum of Values: 20 + 30 + 50 + 70 + 60 = 230 Sum of Weights: 10 + 12.5 + 17.5 + 10 = 50 Sum of Values / Sum of Weights: 230/50 = 4.6
We are have a weighted average of 4.6.