Advanced SQL Queries for Anomaly Detection and Business Reports

Introduction and Purpose

The Use Case

Although we often think of the Internet of Things as something that coordinates moving entities. In fact, many companies in the industry deal with stationary devices. One example is a smart metering system that measures power consumption in a building facility and delivers time series data in the granularity that allows developing power-saving strategies.

Objectives

We will create a business report that compares power consumption in the last seven days split per day of the week against the average consumption per day of the week in the whole previous period of time.

This approach aims to detect anomalies, unveil power consumption patterns, and routinely monitor increases and decreases in power consumption that may result in an unintended high cost.

Methods

The article will explain how to build an advanced SQL query that transforms time series data from the GridDB database into an aggregation.

The query will use a few essential SQL time functions. A detailed commentary on their utilization will be provided.

The tutorial will walk you through the process of creating a query with multiple nested subqueries and explain how to combine them without losing an overview of the whole structure.

Prerequisites

To use the materials from this article, you need to install an SQL client, for instance, SQLWorkbench/J as described in one of the previous GridDB blog posts here.

Tutorial

Raw Data

The smart meters produce a timestamp twice in a minute. The database has only two columns: datetime in the timestamp format and power in a float format.

Many single entries in the database have very small values in the power column, and there are 2,880 entries being produced each day. It is quite challenging to make any meaningful conclusions without an aggregation.

On the other side, it seems quite intuitive that power consumption varies depending on the day of the week: we do not need much power during the week at our homes, but on the weekend, we use a lot of it. On the contrary, the office building would demonstrate high consumption on the weekdays and low consumption on Saturday and Sunday. That’s why we stick to the day of the week as an important dimension in our report.

Let us first check the raw data, though.

SELECT 
    datetime, 
    power 
FROM 
  power2 
LIMIT 10
datetime power
0 2020-12-01 04:00:23 0.0000000000000000120000043
1 2020-12-01 04:00:53 0.0000000000000000120000043
2 2020-12-01 04:01:23 0.0000000000000000120000043
3 2020-12-01 04:01:53 0.0000000000000000120000043
4 2020-12-01 04:02:23 0.0000000000000000120000043
5 2020-12-01 04:02:53 0.0000000000000000120000043
6 2020-12-01 04:03:23 0.0000000000000000120000043
7 2020-12-01 04:03:53 0.0000000000000000120000043
8 2020-12-01 04:04:23 0.0000000000000000120000043
9 2020-12-01 04:04:53 0.0000000000000000120000043

Extracting day of the week

GridDB supports the native SQL EXTRACT() function that helps to separate components of the date, including year, month, day, day of week, etc.

EXTRACT() returns a value in the numeric format. Since we will be using day of week (further referred to as DOW) for aggregating the data, we directly convert it into the string format. An aggregation dimension is not allowed in any format that can be aggregated itself.

We simply wrap one function into the other in the query to get the necessary result.

SELECT 
     CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW,
     power
FROM 
  power2
LIMIT 10
DOW power
0 2 0.0000000000000000120000043
1 2 0.0000000000000000120000043
2 2 0.0000000000000000120000043
3 2 0.0000000000000000120000043
4 2 0.0000000000000000120000043
5 2 0.0000000000000000120000043
6 2 0.0000000000000000120000043
7 2 0.0000000000000000120000043
8 2 0.0000000000000000120000043
9 2 0.0000000000000000120000043

It’s time for the first subquery: SQL won’t let us group by DOW right here after we’ve just created this new column. We “nest” the first query into the new one that we further use for aggregation.

SELECT
    DOW,
    SUM(power) as power
FROM 
    (SELECT 
        power, 
        CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
    FROM power2
    ) AS TEST 
GROUP BY DOW
DOW power
0 0 14781000.21
1 1 10222437.23
2 2 6996719.71
3 3 17502268.22
4 4 15799145.37
5 5 15348122.85
6 6 14864972.17

Average Consumption per Each Day of Week

We need to keep in mind that we have summed up power consumption for the whole period of time available in the database. We cannot really compare it against only one-week consumption. Nor can we just calculate an average instead of the sum, since this would produce a mean value for the single entries in the table but not an average per day of week.

To solve this problem, we need to count how many times each day of week appears in the data to use it as a divider. We count the number of rows with the SQL native COUNT() function. By doing so, we get the number of timestamps produced on a day, and then we divide it by 2 * 60 * 24 to get the number of days.

SELECT
    DOW,
    COUNT(power) / (2 * 60 * 24) AS count_days,
    SUM(power) as power
FROM 
    (SELECT 
        power, 
        CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
    FROM power2
    ) AS TEST 
GROUP BY DOW
DOW count_days power
0 0 3 14781000.21
1 1 3 10222437.23
2 2 4 6996719.71
3 3 4 17502268.22
4 4 3 15799145.37
5 5 4 15348122.85
6 6 3 14864972.17

We do not need one more output column in the result, so we move the expression counting the unique DOWs and calculate the real average power consumption per day of week straightforwardly in the main query.


SELECT
    DOW,
    SUM(power) / (COUNT(power) / (2 * 60 * 24)) as power
FROM 
    (SELECT 
        power, 
        CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
    FROM power2
    ) AS TEST 
GROUP BY DOW
DOW power
0 0 4927000.07
1 1 3407479.08
2 2 1749179.93
3 3 4375567.06
4 4 5266381.79
5 5 3837030.71
6 6 4954990.72

Filter the Last Seven Days

Now, when we are set with the first part of our comparison, let’s move to finding out power consumption per day of week for the last seven days. We can use the previous query but we do not need to calculate the number of days: Each DOW appears only once.

Instead, we have to limit the data to the last seven days. How to look back into the past?

We use NOW() as our starting point. NOW() produces the timestamp with the date and time of the moment when the query is performed.

We further utilize the TIMESTAMP_ADD() function that is supported by GridDB. This one transforms a timestamp value. TIMESTAMP_ADD() can change different parts of it: year, month, day, hour, second, and millisecond. Although it looks like it can only add something, it can also do the reverse thing. For this purpose, we specify the argument as a negative number.

Simple magic!


SELECT
    DOW,
    SUM(power) as power
FROM 
    (SELECT 
        power, 
        CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
    FROM power2
    WHERE datetime > TIMESTAMP_ADD(DAY, NOW(), -7)
    AND datetime < = NOW()
    ) AS TEST 
GROUP BY DOW
</code>
DOW power
0 0 3332113.50
1 1 3164761.58
2 2 0.00
3 3 1413780.69
4 4 3025183.71
5 5 3062583.10
6 6 3252764.84

We have added a double WHERE clause to filter the data. Just in case the data have some strange values with the timestamps lying in the future, we sort them out by specifying the upper limit as NOW().

Match the Results by Using the SQL Join

We will use the SQL JOIN command to glue two queries together and match power consumption for both time periods. To differentiate between the origins of the columns in our output result, we will do some renaming. Each column in the first query will get _alltime at the end of it and the columns in the second query will receive _lastweek as a label.

SQL JOIN requires us to give aliases to the subqueries, too, to be able to specify the key: A column that will be used to match values from both subqueries.

SELECT * FROM

    (SELECT
        DOW as DOW_alltime,
        SUM(power) / (COUNT(power) / (2 * 60 * 24)) as power_alltime
    FROM 
        (SELECT 
            power, 
            CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
        FROM power2
        ) AS TEST_alltime 
    GROUP BY DOW_alltime) as ALLTIME

JOIN

    (SELECT
        DOW as DOW_lastweek,
        SUM(power) as power_lastweek
    FROM 
        (SELECT 
            power, 
            CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
        FROM power2
        WHERE datetime > TIMESTAMP_ADD(DAY, NOW(), -7)
        AND datetime < = NOW()
        ) AS TEST_lastweek 
    GROUP BY DOW_lastweek) as LASTWEEK

ON 

ALLTIME.DOW_alltime = LASTWEEK.DOW_lastweek
</code>
DOW_alltime power_alltime DOW_lastweek power_lastweek
0 0 4927000.07 0 3332113.50
1 1 3407479.08 1 3164761.58
2 2 1749179.93 2 0.00
3 3 4375567.06 3 1413780.69
4 4 5266381.79 4 3025183.71
5 5 3837030.71 5 3062583.10
6 6 4954990.72 6 3252764.84

We are set to proceed to the final report!

The Final Report

We will do some prettifying to make the final table more readable and possible anomalies more eye-catching:

  • get rid of one of the DOW-columns since they are otherwise identical
  • rename DOW values to make them look human
  • calculate a delta consumption as the difference between the last-week consumption and the overall average consumption.
  • add a column with only two values – either “increased” or “decreased” – to make the differences stand out and easy to classify
  • last but not least, change the order of the columns, starting with the DOW on the very left, followed by the new semi-grouping column, and then by all numeric columns

Just look at the monster query we’ve got in the end!

SELECT 

    CASE DOW_alltime
        WHEN '1' THEN 'Monday'
        WHEN '2' THEN 'Tuesday'
        WHEN '3' THEN 'Wednesday'
        WHEN '4' THEN 'Thursday'
        WHEN '5' THEN 'Friday'
        WHEN '6' THEN 'Saturday'
        WHEN '0' THEN 'Sunday'
    end
    as day_of_week,
    case 
        when (power_alltime - power_lastweek) > 0 then 'descreased'
        when (power_alltime - power_lastweek) < 0 then 'increased'
        else 'the same'
    end as status,
    power_alltime,
    power_lastweek,
    power_alltime - power_lastweek as power_usage_delta

FROM

    (SELECT
        SUM(power) / (COUNT(power) / (2 * 60 * 24)) as power_alltime,
        DOW as DOW_alltime
    FROM 
        (SELECT 
            power, 
            CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
        FROM power2
        ) AS TEST_alltime 
    GROUP BY DOW_alltime) as ALLTIME

JOIN

    (SELECT
        SUM(power) as power_lastweek,
        DOW as DOW_lastweek
    FROM 
        (SELECT 
            power, 
            CAST(EXTRACT(DAY_OF_WEEK, datetime) as STRING) AS DOW 
        FROM power2
        WHERE datetime > TIMESTAMP_ADD(DAY, NOW(), -7)
        AND datetime < = NOW()
        ) AS TEST_lastweek 
    GROUP BY DOW_lastweek) as LASTWEEK

ON 

ALLTIME.DOW_alltime = LASTWEEK.DOW_lastweek
</code>
day_of_week status power_alltime power_lastweek power_usage_delta
0 Sunday descreased 4927000.07 3332113.50 1594886.57
1 Monday descreased 3407479.08 3164761.58 242717.50
2 Tuesday descreased 1749179.93 0.00 1749179.93
3 Wednesday descreased 4375567.06 1413780.69 2961786.36
4 Thursday descreased 5266381.79 3025183.71 2241198.08
5 Friday descreased 3837030.71 3062583.10 774447.61
6 Saturday descreased 4954990.72 3252764.84 1702225.88

This kind of report helps to keep an eye on any undesirable changes in the power consumption patterns.

It allows to pick up anomalies, such as unexpectedly high power consumption on a particular day of week, quickly. The report can be exported and sent to any interested persons to help them recognize the need for action and facilitate decision-making.

Advantages of Using Open-Source Tools for Building Business Reports

Despite that the numerous paid business insights tools are available nowadays, old (like SQL) and new (like GridDB) open-source ones can provide the same quality and even more flexibility for building sophisticated reports.

GridDB supports a wide range of SQL commands with the focus on time functions. GridDB can host vast amounts of data without query performance suffering a millisecond of a delay.

A Short Afterward

We hope that this tutorial also helped you to catch the logic of writing advanced SQL queries and may encourage you to use them on a routine basis.

Happy querying!

If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.