Blog
GridDB Cloud Quick Start Guide
GridDB Cloud v2.0 has officially been released, has a new free tier, and is officially available worldwide. In this quick start guide, you will learn
GridDB Cloud v2.0 has officially been released, has a new free tier, and is officially available worldwide. In this quick start guide, you will learn how to insert IoT data into the GridDB Cloud, learn how to test the viability of your connection, and learn the basic CRUD commands (Create, Read, Update, Delete). Contents 0 Preparation 1 Whitelisting Your IP Address 2 Adding GridDB Users and Granting DB Access 3 Checking your GridDB Connection 4 Creating your First Time Series & Collection Containers 5 CRUD with GridDB Cloud (Create, Read, Update, Delete) 6 Ingesting Sample IoT Data 7 Running Data Analysis Preparation Before we jump into how to use the GridDB Cloud, let’s go over some basic preparation. Sign Up for GridDB Cloud Free Plan If you would like to sign up for a GridDB Cloud Free instance, you can do so in the following link: https://form.ict-toshiba.jp/download_form_griddb_cloud_freeplan_e. Clone GitHub Repository To follow along and run these sample code snippets, please clone the repository here: $ git clone https://github.com/griddbnet/Blogs.git –branch cloud-quick-start-worldwide Source Code Overview We have prepared some basic HTTP Requests that can help jump-start your journey with GridDB Cloud. These requests are shared via three different programming interfaces: CLI scripts (aka bash, in the bash/ dir), node.js, and python. For example, the first command will be to make sure the connection between your machine and the cloud can be made; to do so, we will run an HTTP Requests to a specific endpoint using just bash (with cURL), and then with node.js/python scripts. Set Up For Running Sample Code To ensure your HTTP Requests have the proper connection details, copy the env.example file (from the GitHub Repo shared in the section above) and rename it to .env. You must fill in your personal variables (ie. your GridDB Cloud Web API endpoint as well as your user/pass combo encoded into base64). To get proper creds, you will need to gather the username/password combination into base 64separated by a colon; for example: admin:admin becomes YWRtaW46YWRtaW4=. You can encode those values by using the following website: https://www.base64encode.org/ The WEBAPI Url can found on the main page of your GridDB Dashboard. Here is an example of a .env file. export GRIDDB_WEBAPI_URL=”https://cloud51e32re97.griddb.com:443/griddb/v2/gs_clustermfcloud5314927/dbs/ZV8YUerterlQ8″ export USER_PASS=”TTAxZ2FYMFrewwerZrRy1pc3JrerehZWw6avdfcvxXNyYWVs” Now run $ source .env to load these values into your environment so that you can run these scripts from your CLI. First Steps with GridDB Cloud Your GridDB Cloud instance can be communicated with via HTTP Requests; every action needed to interact with GridDB will require formulating and issuing an HTTP Request with different URLs, parameters, methods, and payload bodies. 1. Whitelisting Your IP Address If you haven’t already, please whitelist your public IP address in the network settings of your GridDB Cloud Management dashboard. You can find your own IP Address by using a simple Google Search: “What is my IP Address?” Or you can go here: https://whatismyipaddress.com/ Go to the network tab and add in your IP address. Note: CIDR Ranges are compatible, so please feel free to add your own if you know it. 2. Adding GridDB Users and Granting DB Access Next, we should create a new GridDB User. From the side panel, click the icon which says GridDB User. From this page, click CREATE DATABASE USER. This user’s name and password will be attached to all of our HTTP Requests as a Basic Authorization Header when using the Web API. Once you create the new user, you will also need to grant access to your database. Click on the user from the table of users in GridDB Users page and from this page, grant access to your database (either READ or ALL). Now we can move on to making actual HTTP requests. 3. Checking your GridDB Connection Let’s start with a sanity check and make sure that we can reach out to the GridDB Cloud instance. With your .env file made and ready to go, you can make sure you get the variables loaded into your environment using the following command: $ source .env. And now run the bash script: If you see a status code of 200, that’s good! That means your connection was succesful. If you are met with a status code of 403 Forbidden, that likely means you have not whitelisted your machine’s IP Address in the network tab. Likewise, a returned code of 401 Unauthorized likely means your credentials are incorrect — please make sure you use user:pass encoded into base64! Check Connection URL Endpoint The Web API uses a base url which we will use and expand upon to build out our requests. The base url looks like this: https://cloud<number>.griddb.com/griddb/v2/<clusterName>/dbs/<database name> To check that our connection exists, we can append the following to our base url /checkConnection. Because we are not sending any data back to the server, we will use the GET HTTP method. Lastly, we need to include basic authentication in our HTTP Request’s headers. For this, we will need to include our username and password encoded into base64. With all that said, here is the final result https://cloud51ergege97.griddb.com/griddb/v2/gs_clustermfcloud51fgerge97/dbs/B2vderewDSJy/checkConnection We can now use this URL with any number of programming languages to communicate with our database. And as a note, the examples in this article will contain the URLs and credentials hardcoded into the example, but the source code (linked above) uses environment variables. cURL Request To check our connection with cURL, you can use the following command (check_connection.sh) curl -i –location ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/checkConnection’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ Because it’s a GET request, it’s rather simple and we only needed to add in the authorization header. You should be able to run this and get an HTTP Response of 200. Python Request Here is that same request written in Python # check_connection.py import requests url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/checkConnection” payload = {} headers = { ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“GET”, url, headers=headers, data=payload) print(response.status_code) node.js Request //checkConnection.js const request = require(‘request’); const options = { ‘method’: ‘GET’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/checkConnection’, ‘headers’: { ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ } }; request(options, function (error, response) { if (error) throw new Error(error); console.log(“Response Status Code: “, response.statusCode); }); 4. Creating your First Time Series & Collection Containers With our connection firmly established, we can create our first containers — both Collection and Time Series — of which are similar to relational tables. You can read more about that here: GridDB Data Model. The URL suffix looks like this: /containers. This request can sometimes require a multitude of data and can have a big range, therefore this request will require an HTTP method of POST. The body of the request requires container name, container type, whether a rowkey exists (bool), and the schema. Let’s first take a look at the structure outside of the context of an HTTP Request and then we will send it inside of a Request body. We will also need to include in our Request’s headers that we are sending a data payload of type JSON like so: ‘Content-Type: application/json’ Time Series Container First, let’s create a Time Series container — we can see here that we select the container type as TIME_SERIES and the first column is of type timestamp. There is also a rowkey section, but this is optional as in a time series container, the rowkey is always the timestamp by default. { “container_name”: “device1”, “container_type”: “TIME_SERIES”, “rowkey”: true, “columns”: [ { “name”: “ts”, “type”: “TIMESTAMP” }, { “name”: “co”, “type”: “DOUBLE” }, { “name”: “humidity”, “type”: “DOUBLE” }, { “name”: “light”, “type”: “BOOL” }, { “name”: “lpg”, “type”: “DOUBLE” }, { “name”: “motion”, “type”: “BOOL” }, { “name”: “smoke”, “type”: “DOUBLE” }, { “name”: “temp”, “type”: “DOUBLE” } ] } Now we simply attach this to the body when we make our Request and we should create our new container. If successful, you should get a status code of 201 (Created). cURL #create_container.sh curl -i -X POST –location ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘{ “container_name”: “device1”, “container_type”: “TIME_SERIES”, “rowkey”: true, “columns”: [ { “name”: “ts”, “type”: “TIMESTAMP” }, { “name”: “co”, “type”: “DOUBLE” }, { “name”: “humidity”, “type”: “DOUBLE” }, { “name”: “light”, “type”: “BOOL” }, { “name”: “lpg”, “type”: “DOUBLE” }, { “name”: “motion”, “type”: “BOOL” }, { “name”: “smoke”, “type”: “DOUBLE” }, { “name”: “temp”, “type”: “DOUBLE” } ] }’ Python #create_container.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers” payload = json.dumps({ “container_name”: “device1”, “container_type”: “TIME_SERIES”, “rowkey”: True, “columns”: [ { “name”: “ts”, “type”: “TIMESTAMP” }, { “name”: “co”, “type”: “DOUBLE” }, { “name”: “humidity”, “type”: “DOUBLE” }, { “name”: “light”, “type”: “BOOL” }, { “name”: “lpg”, “type”: “DOUBLE” }, { “name”: “motion”, “type”: “BOOL” }, { “name”: “smoke”, “type”: “DOUBLE” }, { “name”: “temp”, “type”: “DOUBLE” } ] }) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“POST”, url, headers=headers, data=payload) print(response.status_code) node.js //createContainer.js var request = require(‘request’); var options = { ‘method’: ‘POST’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify({ “container_name”: “device1”, “container_type”: “TIME_SERIES”, “rowkey”: true, “columns”: [ { “name”: “ts”, “type”: “TIMESTAMP” }, { “name”: “co”, “type”: “DOUBLE” }, { “name”: “humidity”, “type”: “DOUBLE” }, { “name”: “light”, “type”: “BOOL” }, { “name”: “lpg”, “type”: “DOUBLE” }, { “name”: “motion”, “type”: “BOOL” }, { “name”: “smoke”, “type”: “DOUBLE” }, { “name”: “temp”, “type”: “DOUBLE” } ] }) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(“Response Status Code: “, response.statusCode); }); Collection Container Now let’s create a collection container. These containers don’t require a time series column (but they are allowed) and also don’t require rowkey to be set to true. Here are some examples: cURL #create_collection.sh curl -i -X POST –location ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘{ “container_name”: “deviceMaster”, “container_type”: “COLLECTION”, “rowkey”: true, “columns”: [ { “name”: “equipment”, “type”: “STRING” }, { “name”: “equipmentID”, “type”: “STRING” }, { “name”: “location”, “type”: “STRING” }, { “name”: “serialNumber”, “type”: “STRING” }, { “name”: “lastInspection”, “type”: “TIMESTAMP” }, { “name”: “information”, “type”: “STRING” } ] }’ Python #create_collection.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers” payload = json.dumps({ “container_name”: “deviceMaster”, “container_type”: “COLLECTION”, “rowkey”: True, “columns”: [ { “name”: “equipment”, “type”: “STRING” }, { “name”: “equipmentID”, “type”: “STRING” }, { “name”: “location”, “type”: “STRING” }, { “name”: “serialNumber”, “type”: “STRING” }, { “name”: “lastInspection”, “type”: “TIMESTAMP” }, { “name”: “information”, “type”: “STRING” } ] }) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“POST”, url, headers=headers, data=payload) print(response.status_code) node.js //createCollection.js var request = require(‘request’); var options = { ‘method’: ‘POST’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify({ “container_name”: “deviceMaster”, “container_type”: “COLLECTION”, “rowkey”: true, “columns”: [ { “name”: “equipment”, “type”: “STRING” }, { “name”: “equipmentID”, “type”: “STRING” }, { “name”: “location”, “type”: “STRING” }, { “name”: “serialNumber”, “type”: “STRING” }, { “name”: “lastInspection”, “type”: “TIMESTAMP” }, { “name”: “information”, “type”: “STRING” } ] }) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.statusCode); }); 5. CRUD with GridDB Cloud (Create, Read, Update, Delete) 5. CRUD with GridDB Cloud (Create, Read, Update, Delete) Next, let’s go over the commands to Create, Read, Update, and Delete. Adding Rows of Data (Create) We have already created some containers before, but to add to that, we will be creating rows of data to add to our container. We can add rows of data directly inside of our containers. The URL suffix: /containers/:container/rows To PUT a row of data into our container, we will need to use the HTTP Method PUT. Similar to before, we will need to specify that our content is JSON and we will include the row data in our Request body. You can add multiple rows at once, you just need to make sure that your payload is formed to accomdate extra rows and that you don’t have a trailing comma on the last row. Let’s add rows to our device1 container. [ [“2024-01-09T10:00:01.234Z”, 0.003551, 50.0, false, 0.00754352, false, 0.0232432, 21.6], [“2024-01-09T11:00:01.234Z”, 0.303551, 60.0, false, 0.00754352, true, 0.1232432, 25.3], [“2024-01-09T12:00:01.234Z”, 0.603411, 70.0, true, 0.00754352, true, 0.4232432, 41.5] ] You of course also need to be sure that your row’s schema matches your container’s. If it doesn’t, you will be met with an error message and a status code of 400 (Bad Request). cURL https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers #add_rows.sh curl –location –request PUT ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/device1/rows’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘[ [“2024-01-09T10:00:01.234Z”, 0.003551, 50.0, false, 0.00754352, false, 0.0232432, 21.6], [“2024-01-09T11:00:01.234Z”, 0.303551, 60.0, false, 0.00754352, true, 0.1232432, 25.3], [“2024-01-09T12:00:01.234Z”, 0.603411, 70.0, true, 0.00754352, true, 0.4232432, 41.5] ]’ Python #add_rows.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/device1/rows” payload = json.dumps([ [ “2024-01-09T10:00:01.234Z”, 0.003551, 50, False, 0.00754352, False, 0.0232432, 21.6 ], [ “2024-01-09T11:00:01.234Z”, 0.303551, 60, False, 0.00754352, True, 0.1232432, 25.3 ], [ “2024-01-09T12:00:01.234Z”, 0.603411, 70, True, 0.00754352, True, 0.4232432, 41.5 ] ]) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“PUT”, url, headers=headers, data=payload) print(response.text) node.js //addRows.js var request = require(‘request’); var options = { ‘method’: ‘PUT’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/device1/rows’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify([ [ “2024-01-09T10:00:01.234Z”, 0.003551, 50, false, 0.00754352, false, 0.0232432, 21.6 ], [ “2024-01-09T11:00:01.234Z”, 0.303551, 60, false, 0.00754352, true, 0.1232432, 25.3 ], [ “2024-01-09T12:00:01.234Z”, 0.603411, 70, true, 0.00754352, true, 0.4232432, 41.5 ] ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.body); }); Querying Container (Read) After writing to our containers, we will want to read from our containers. The URL suffix is exactly the same as before: /:cluster/dbs/:database/containers/:container/rows except now we will be using the POST method request. The data expected by the server in these requests are how we expect our row data returned to us — for example, we can choose a row limit, an offset, any conditions, and a sort method. Here is what that body looks like: { “offset” : 0, “limit” : 100, “condition” : “temp >= 30”, “sort” : “temp desc” } The one caveat with making this Request is that because it is a POST request, you will need to send something in the body of the request. Any of the parameters above will do, but including the limit is likely the easiest option to include and has the added benefit of reducing server strain. If successful, you should get a server response with a status code of 200 (OK) and a body with the data requested. #query_container.sh { “columns”: [ { “name”: “ts”, “type”: “TIMESTAMP”, “timePrecision”: “MILLISECOND” }, { “name”: “co”, “type”: “DOUBLE” }, { “name”: “humidity”, “type”: “DOUBLE” }, { “name”: “light”, “type”: “BOOL” }, { “name”: “lpg”, “type”: “DOUBLE” }, { “name”: “motion”, “type”: “BOOL” }, { “name”: “smoke”, “type”: “DOUBLE” }, { “name”: “temp”, “type”: “DOUBLE” } ], “rows”: [ [ “2024-01-09T12:00:01.234Z”, 0.603411, 70.0, true, 0.00754352, true, 0.4232432, 41.5 ] ], “offset”: 0, “limit”: 100, “total”: 1 } cURL curl -i -X POST –location ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/device1/rows’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘{ “offset” : 0, “limit” : 100, “condition” : “temp >= 30”, “sort” : “temp desc” }’ Python #query_container.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/device1/rows” payload = json.dumps({ “offset”: 0, “limit”: 100, “condition”: “temp >= 30”, “sort”: “temp desc” }) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“POST”, url, headers=headers, data=payload) print(response.text) nodejs //queryContainer.js var request = require(‘request’); var options = { ‘method’: ‘POST’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/device1/rows’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify({ “offset”: 0, “limit”: 100, “condition”: “temp >= 30”, “sort”: “temp desc” }) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.body); }); Updating a Row (Update) To cover updates, adding rows of data can be considered updating, but we can also directly update a row (if the container has rowkeys). The way it works is if you push a row of data to your container which has rowkey set as true, and send up a row of data with a rowkey that already exists in your container, it will update the row with whatever new information is pushed along. Let’s push data to our deviceMaster collection container once to add data, and then again to update the row. Let’s craft our row of data [ [“device1”, “01”, “CA”, “23412”, “2023-12-15T10:45:00.032Z”, “working”] ] Now let’s form our HTTP Requests cURL #update_collection.sh curl -i –location –request PUT ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/deviceMaster/rows’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘[ [“device1”, “01”, “CA”, “23412”, “2023-12-15T10:45:00.032Z”, “working”] ]’ Python #update_collection.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/deviceMaster/rows” payload = json.dumps([ [ “device1”, “01”, “CA”, “23412”, “2023-12-15T10:45:00.032Z”, “working” ] ]) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“PUT”, url, headers=headers, data=payload) print(response.text) node.js //updateCollection.js var request = require(‘request’); var options = { ‘method’: ‘PUT’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/deviceMaster/rows’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify([ [ “device1”, “01”, “CA”, “23412”, “2023-12-15T10:45:00.032Z”, “working” ] ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.body); }); Updating a Row And now with that data in there, if you change any of the values outside of the first one (the rowkey, the device name), it will update that device’s metadata will keeping the row inside of your container. curl -i –location –request PUT ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/deviceMaster/rows’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘[ [“device1”, “01”, “NY”, “23412”, “2023-12-20T10:45:00.032Z”, “working”] ]’ Here we are changing the location and the time of last inspection. If you look at your dashboard, the values will be have been updated. Deleting a Row (Delete) We can delete a row simply by using the appropriate HTTP Method (Delete) and then sending in a valid rowkey and container to our server. Let’s delete our deviceMaster’s lone row. The body of the request looks like this. You can add multiple rowkeys inside here to delete multiple rows at once. [ “device1” ] cURL #delete_row.sh curl -v –location –request DELETE ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/deviceMaster/rows’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘[ “device1” ]’ Python #delete_row.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/deviceMaster/rows” payload = json.dumps([ “device1” ]) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“DELETE”, url, headers=headers, data=payload) print(response.status_code) node.js //deleteRow.js var request = require(‘request’); var options = { ‘method’: ‘DELETE’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/deviceMaster/rows’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify([ “device1” ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.statusCode); }); Deleting a Row from Time Series Container You can also delete the row of a time series container. As stated before, the time stamp will always be the rowkey in a time series container, so here we just add our time stamp and those rows will be deleted. #delete_container.sh curl –location –request DELETE ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers/device1/rows’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘[ “2024-01-09T10:00:01.234Z”, “2024-01-09T12:00:01.234Z” ]’ TQL Next, let’s try running a TQL Query. If you’re unfamiliar, TQL is GridDB’s special query language. Let’s run a simple query. First, this is what the URL looks like base url + /tql The body of the request will need the container name followed by your query statement. We can also query multiple containers at once: [ {“name” : “deviceMaster”, “stmt” : “select * limit 100”, “columns” : null}, {“name” : “device1”, “stmt” : “select * where temp>=24”, “columns” : [“temp”, “co”]}, ] And this is the response of the above query: [ { “columns”: [ { “name”: “equipment”, “type”: “STRING” }, { “name”: “equipmentID”, “type”: “STRING” }, { “name”: “location”, “type”: “STRING” }, { “name”: “serialNumber”, “type”: “STRING” }, { “name”: “lastInspection”, “type”: “TIMESTAMP”, “timePrecision”: “MILLISECOND” }, { “name”: “information”, “type”: “STRING” } ], “results”: [ [ “device1”, “01”, “NY”, “23412”, “2023-12-20T10:45:00.032Z”, “working” ] ], “offset”: 0, “limit”: 100, “total”: 1, “responseSizeByte”: 31 }, { “columns”: [ { “name”: “temp”, “type”: “DOUBLE” }, { “name”: “co”, “type”: “DOUBLE” } ], “results”: [ [ 25.3, 0.303551 ], [ 41.5, 0.603411 ] ], “offset”: 0, “limit”: 1000000, “total”: 2, “responseSizeByte”: 32 } ] cURL #tql.sh curl -i -X POST –location ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/tql’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘ [ {“name” : “deviceMaster”, “stmt” : “select * limit 100”, “columns” : null}, {“name” : “device1”, “stmt” : “select * where temp>=24”, “columns” : [“temp”, “co”]} ] ‘ Python #tql.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/tql” payload = json.dumps([ { “name”: “deviceMaster”, “stmt”: “select * limit 100”, “columns”: None }, { “name”: “device1”, “stmt”: “select * where temp>=24”, “columns”: [ “temp”, “co” ] } ]) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“POST”, url, headers=headers, data=payload) print(response.text) node.js //tql.js var request = require(‘request’); var options = { ‘method’: ‘POST’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/tql’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify([ { “name”: “deviceMaster”, “stmt”: “select * limit 100”, “columns”: null }, { “name”: “device1”, “stmt”: “select * where temp>=24”, “columns”: [ “temp”, “co” ] } ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.body); }); SQL On top of TQL, GridDB and the GridDB Cloud also have SQL functionality. Though the SQL functionality for the GridDB Cloud is limited to reading results (SELECT) and updating some rows (UPDATE). SQL SELECT base url + /sql [ {“type” : “sql-select”, “stmt” : “SELECT * FROM deviceMaster”}, {“type” : “sql-select”, “stmt” : “SELECT temp, co FROM device1 WHERE temp>=24″} ] It is very similar to TQL but we call on the container name from within the query itself, just like “normal” SQL statements. Here is the response body: [ { “columns”: [ { “name”: “equipment”, “type”: “STRING” }, { “name”: “equipmentID”, “type”: “STRING” }, { “name”: “location”, “type”: “STRING” }, { “name”: “serialNumber”, “type”: “STRING” }, { “name”: “lastInspection”, “type”: “TIMESTAMP”, “timePrecision”: “MILLISECOND” }, { “name”: “information”, “type”: “STRING” } ], “results”: [ [ “device1”, “01”, “CA”, “23412”, “2023-12-15T10:45:00.032Z”, “working” ] ], “responseSizeByte”: 47 }, { “columns”: [ { “name”: “temp”, “type”: “DOUBLE” }, { “name”: “co”, “type”: “DOUBLE” } ], “results”: [ [ 25.3, 0.303551 ], [ 41.5, 0.603411 ] ], “responseSizeByte”: 32 } ] cURL #sql_select.sh curl -i -X POST –location ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql’ \ –header ‘Content-Type: application/json’ \ –header ‘Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ \ –data ‘[ {“type” : “sql-select”, “stmt” : “SELECT * FROM deviceMaster”}, {“type” : “sql-select”, “stmt” : “SELECT temp, co FROM device1 WHERE temp>=24”} ] ‘ Python #sql_select.py import requests import json url = “https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql” payload = json.dumps([ { “type”: “sql-select”, “stmt”: “SELECT * FROM deviceMaster” }, { “type”: “sql-select”, “stmt”: “SELECT temp, co FROM device1 WHERE temp>=24” } ]) headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.request(“POST”, url, headers=headers, data=payload) print(response.text) node.js //sqlSelect.js var request = require(‘request’); var options = { ‘method’: ‘POST’, ‘url’: ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql’, ‘headers’: { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’ }, body: JSON.stringify([ { “type”: “sql-select”, “stmt”: “SELECT * FROM deviceMaster” }, { “type”: “sql-select”, “stmt”: “SELECT temp, co FROM device1 WHERE temp>=24” } ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.body); }); SQL SELECT GROUP BY RANGE Because we are using SQL Select, you can use the GridDB’s Group By Range as well. You can learn more about that here: Exploring GridDB’s Group By Range Functionality. We will make our query and group by hours: [ {“type” : “sql-select”, “stmt” : “SELECT temp, co FROM device1 WHERE ts > TO_TIMESTAMP_MS(1594515625984) AND ts < TO_TIMESTAMP_MS(1595040779336) GROUP BY RANGE (ts) EVERY (1, HOUR)"} ] cURL #sql_select_groupby.sh curl -i --location 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql' \ --header 'Content-Type: application/json' \ --header 'Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs' \ --data '[ {"type" : "sql-select", "stmt" : "SELECT temp, co FROM device1 WHERE ts > TO_TIMESTAMP_MS(1594515625984) AND ts < TO_TIMESTAMP_MS(1595040779336) GROUP BY RANGE (ts) EVERY (1, HOUR)"} ] ' Python # sql_select_groupby.py import requests import json url = "https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql" payload = json.dumps([ { "type": "sql-select", "stmt": "SELECT temp, co FROM device1 WHERE ts > TO_TIMESTAMP_MS(1594515625984) AND ts < TO_TIMESTAMP_MS(1595040779336) GROUP BY RANGE (ts) EVERY (1, HOUR)" } ]) headers = { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs', 'User-Agent': 'PostmanRuntime/7.29.0' } response = requests.request("POST", url, headers=headers, data=payload) print(response.text) SQL Insert The base URL is the same as SELECT, but you need to append 'update' base url + /:cluster/dbs/:database/sql/update [ {"stmt" : "insert into deviceMaster(equipment, equipmentID, location, serialNumber, lastInspection, information) values('device2', '02', 'MA', '34412', TIMESTAMP('2023-12-21T10:45:00.032Z'), 'working')"} ] cURL #sql_insert.sh curl -i -X POST --location 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql/update' \ --header 'Content-Type: application/json' \ --header 'Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs' \ --data '[ {"stmt" : "insert into deviceMaster(equipment, equipmentID, location, serialNumber, lastInspection, information) values('\''device2'\'', '\''02'\'', '\''MA'\'', '\''34412'\'', TIMESTAMP('\''2023-12-21T10:45:00.032Z'\''), '\''working'\'')"} ]' Python #sql_insert.py import requests import json url = "https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql/update" payload = json.dumps([ { "stmt": "insert into deviceMaster(equipment, equipmentID, location, serialNumber, lastInspection, information) values('device2', '02', 'MA', '34412', TIMESTAMP('2023-12-21T10:45:00.032Z'), 'working')" } ]) headers = { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs', 'User-Agent': 'PostmanRuntime/7.29.0' } response = requests.request("POST", url, headers=headers, data=payload) print(response.text) node.js //sqlInsert.js var request = require('request'); var options = { 'method': 'POST', 'url': 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql/update', 'headers': { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs' }, body: JSON.stringify([ { "stmt": "insert into deviceMaster(equipment, equipmentID, location, serialNumber, lastInspection, information) values('device2', '02', 'MA', '34412', TIMESTAMP('2023-12-21T10:45:00.032Z'), 'working')" } ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.body); }); SQL Update The base URL is the same as above, but you need to append 'update' base url + /sql/update [ {"stmt" : "update deviceMaster set location = 'LA' where equipmentID = '01'"} ] This command allows you to Update, similar to the NoSQL method described above. We can both update existing rows, or update containers to add new rows. cURL #sql_update.sh curl -i -X POST --location 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql/update' \ --header 'Content-Type: application/json' \ --header 'Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs' \ --data '[ {"stmt" : "update deviceMaster set location = '\''LA'\'' where equipmentID = '\''01'\''"} ]' Python #sql_update.py import requests import json url = "https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql/update" payload = json.dumps([ { "stmt": "update deviceMaster set location = 'LA' where equipmentID = '01'" } ]) headers = { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs', 'User-Agent': 'PostmanRuntime/7.29.0' } response = requests.request("POST", url, headers=headers, data=payload) print(response.text) node.js //sqlUpdate.js var request = require('request'); var options = { 'method': 'POST', 'url': 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/sql/update', 'headers': { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs' }, body: JSON.stringify([ { "stmt": "update deviceMaster set location = 'LA' where equipmentID = '01'" } ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.body); }); Dropping Containers We can also drop containers. base url + /containers The request's body can contain one or multiple container names of which will be dropped once we make our request. [ "deviceMaster" ] If successful, you will receive a status code of 204 (No Content) cURL #delete_container.sh curl -i --location --request DELETE 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers' \ --header 'Content-Type: application/json' \ --header 'Authorization: Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs' \ --data '[ "deviceMaster" ]' Python #delete_container.py import requests import json url = "https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers" payload = json.dumps([ "deviceMaster" ]) headers = { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs', 'User-Agent': 'PostmanRuntime/7.29.0' } response = requests.request("DELETE", url, headers=headers, data=payload) print(response.status_code) node.js //deleteContainer.js var request = require('request'); var options = { 'method': 'DELETE', 'url': 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/containers', 'headers': { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs' }, body: JSON.stringify([ "deviceMaster" ]) }; request(options, function (error, response) { if (error) throw new Error(error); console.log(response.statusCode); }); 6. Ingesting Sample IoT Data Next, let's take a look at ingesting CSV data. We will ingest IoT data from Kaggle. You can download the raw file from their website here: https://www.kaggle.com/datasets/garystafford/environmental-sensor-data-132k. Here is the python script you can use to ingest the data into our device2 container. We have already created our device1 container, but we will use the same schema to create a new container called device2 #data_ingest.py import pandas as pd import numpy as np import json import requests from datetime import datetime as dt, timezone headers = { 'Content-Type': 'application/json', 'Authorization': 'Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs', "User-Agent":"PostmanRuntime/7.29.0" } base_url = 'https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/' data_obj = { "container_name": "device1", "container_type": "TIME_SERIES", "rowkey": True, "columns": [] } input_variables = [ "ts","co","humidity","light","lpg","motion","smoke","temp" ] data_types = [ "TIMESTAMP", "DOUBLE", "DOUBLE", "BOOL", "DOUBLE", "BOOL", "DOUBLE","DOUBLE" ] for variable, data_type in zip(input_variables, data_types): column = { "name": variable, "type": data_type } data_obj["columns"].append(column) #Create Container url = base_url + 'containers' r = requests.post(url, json = data_obj, headers = headers) All of this is pretty straightforward. Next, let's actually ingest the data. To do so, we will use the pandas python library to read the csv file which we will include inside of our body's request to be ingested. The pandas library also allows us to easily target and transform specific columns in our csv data to make it possible to be ingested. And one last note, because the csv file is over eight megabytes, we will break up our data into chunks and send them to our Cloud that way. Download data from here: https://www.kaggle.com/datasets/garystafford/environmental-sensor-data-132k Here is the rest of the code: iot_data = pd.read_csv('iot_telemetry_data.csv') #2023-12-15T10:25:00.253Z iot_data['ts'] = pd.to_datetime(iot_data['ts'], unit='s').dt.strftime("%Y-%m-%dT%I:%M:%S.%fZ") print(iot_data["ts"]) iot_data = iot_data.drop('device', axis=1) #print(iot_data.dtypes) iot_subsets = np.array_split(iot_data, 20) #Ingest Data url = base_url + 'containers/device1/rows' for subset in iot_subsets: #Convert the data in the dataframe to the JSON format iot_subsets_json = subset.to_json(orient='values') request_body_subset = iot_subsets_json r = requests.put(url, data=request_body_subset, headers=headers) print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~') print('_______________',r.text,'___________') print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~') if r.status_code > 299: print(r.status_code) break else: print(‘Success for chunk’) As you run the script, it should be printing our successful messages for each chunk uploaded, complete with how many rows got successfully updated. Once done, you can check out your device1 with an HTTP Request query or through the portal. 7. Running Data Analysis Lastly, let’s do some simple python analysis. We will query our iot data and then use that data to do some simple analysis and charting of our data. #data_analysis.py import pandas as pd import numpy as np import requests import plotly.express as px from IPython.display import Image # ts object # co float64 # humidity float64 # light bool # lpg float64 # motion bool # smoke float64 # temp float64 headers = { ‘Content-Type’: ‘application/json’, ‘Authorization’: ‘Basic TTAxMU1sd0MxYS1pc3ewrwqJhZWw6aXNyYWVs’, “User-Agent”:”PostmanRuntime/7.29.0″ } base_url = ‘https://cloud5197422.griddb.com/griddb/v2/gs_clustermfcloud5197422/dbs/B2vdfewfwDSJy/’ sql_query1 = (f”””SELECT * from device1 WHERE co < 0.0019050147565559603 """) url = base_url + 'sql' request_body = '[{"type":"sql-select", "stmt":"'+sql_query1+'"}]' data_req1 = requests.post(url, data=request_body, headers=headers) myJson = data_req1.json() dataset = pd.DataFrame(myJson[0]["results"],columns=[myJson[0]["columns"][0]["name"],myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"], myJson[0]["columns"][3]["name"],myJson[0]["columns"][4]["name"],myJson[0]["columns"][5]["name"],myJson[0]["columns"][6]["name"],myJson[0]["columns"][7]["name"]]) print(dataset) lowest_col = dataset.sort_values('co', ascending=False).head(20000) scatter_plot = px.scatter(lowest_col, x='ts', y='co', size='co', color='co', color_continuous_scale='plasma', hover_name='co') # Customize the plot scatter_plot.update_layout( title='Data Analysis', xaxis_title='CO2 Emissions', yaxis_title='Time' ) scatter_plot.update_layout(template='plotly_dark') # Show the plot scatter_plot.show() Conclusion And with that, we have shown you how to interact with your GridDB Cloud and do all sorts of database-related
Introduction GridDB is a time series database available both on-premise and on the cloud optimized for IoT and Big Data that was developed by Toshiba Digital Solutions Corporation. It features a unique key-container model designed specifically to handle both metadata and time series data. Its in-memory architecture allows incredible ingestion and query performance. GridDB is also horizontally scalable to improve both performance and reliability. With the rise in popularity of using Large Language Models (LLMs), general purpose models like GPT-4 have attracted the most interest and media coverage, but many other specialized models exist for the purpose of code generation. While many of these models are extremely large and require immense compute power to train and execute, some LLMs can be trained and executed on platforms that are practical and cost effective. Every database uses a slightly different form of SQL and GridDB is no different. In particular, GridDB has different time functions compared to other SQL databases and also uses a unique Key-Container data model where it is encouraged to store homogeneous data in multiple tables. With these differences in mind, the LLM must be fine-tuned for GridDB and experience tells us that an off the shelf LLM would not produce suitable queries. There are both consumer and business use cases for using an LLM to query GridDB. For consumers, the LLM would enable the end user to ask simple questions about their own data such as “When do I consume the most electricity?”. For business analysts and managers, it extends their business intelligence tools allowing for more ad-hoc queries to dive deeper into their organization’s data. In this technical report, we demonstrate how software developers can utilize an LLM to generate queries for use with GridDB’s SQL interface to enhance their application. The process used to create and use an LLM for GridDB is as follows: Determine SQL Generation performance of other models and the feasibility of fine tuning these other models. Find the datasets that were used to train the models selected in Step 1. Filter out any queries that are not supported by GridDB and fine-tune the model to ensure accuracy is still reasonable. Create a data set that uses GridDB specific features: time series range selection and Key-Container. Fine-tune the model with our new GridDB specific dataset and evaluate the accuracy as measured by the percentage of the number of responses that matched the human answer in the evaluation data split. Demonstrate inference within a Python Flask application. Source Code This project’s code can found on the griddbnet GitHub page: https://github.com/griddbnet/sql_llm_model Existing Model Evaluation Closed source, general purpose models such as GPT-4 were immediately dismissed as the bulk of SQL related material they consumed for training would have been for mainstream databases such as Oracle, Postgres, and SQLite. Not only that, but their closed source nature would make it difficult to train or fine-tune the models for GridDB. DeFog.AI’s SQLCoder model based on LLAMA was tested and performed well, but the original models did not support GridDB’s time series SQL semantics. Furthermore, the hardware and time requirements to fine-tune or run inference with SQLCoder was not feasible. Likewise, StarCoder was examined and its reported accuracy was deemed to be significantly poorer than SQLCoder while being just as difficult to fine-tune. The last two models left for consideration were OpenAI’s GPT-2 and Google’s T5-Small models. After fine tuning both with our selected datasets, T5-Small was generally more accurate and trouble-free while performing fine tuning. Other parties had already used GPT-2 and T5-Small to create SQL generating LLMs and between the two, accuracy was better with T5-Small. Dataset Selection Three individual datasets with similar characteristics that had been used to train text to SQL models were found: https://huggingface.co/datasets/b-mc2/sql-create-context https://huggingface.co/datasets/Clinton/Text-to-sql-v1 https://huggingface.co/datasets/knowrohit07/know_sql Between the three datasets, they have nearly 400,000 rows of data. Each row contains: A context or the SQL schema The question to be converted into a SQL query. The answer, the SQL query based on the question and context. For example: { “answer” : “SELECT COUNT(*) FROM head WHERE age > 56”, “question” : “How many heads of the departments are older than 56 ?” “context” : “CREATE TABLE head (age INTEGER)” } Dataset Filtering The first problem seen in these third party datasets is that some of the queries do not work with GridDB. A simple script was created to execute the context statement and then the query statement for each row in the dataset, if the query was executed successfully, then it would be saved to the filtered dataset. for line in fd.readlines(): data = json.loads(line) data[answer_name] = re.sub(‘”‘, ‘\”, data[answer_name]) try: for stmt in data[context_name].split(“;”): stmt = stmt.strip() table = stmt.split(” “)[2] curs.execute(“DROP TABLE IF EXISTS “+table) curs.execute(stmt) except: pass try: curs.execute(data[answer_name]) good=good+1 print(json.dumps({“question”: data[question_name], “context”: data[context_name], “answer”: data[answer_name]})) except: bad=bad+1 Of the nearly 400,000 queries in the original datasets, 170,000 of them functioned in GridDB and were used to perform initial model fine tuning. The dataset was split 80/10/10 for training, validation, and testing and fine tuning ran on top of the base T5-small model. Dataset Creation None of the data in the filtered dataset supports GridDB specific functionality. While GridDB has many unique SQL features not shared with any other database, we wanted to focus on the two most basic. The first is GridDB’s Key-Container data model. Most relational databases store all data in one table, while GridDB recommends splitting data into multiple tables. For example, data for device #1 would be stored in tsdata_1 and data for device #20 in tsdata_20 so the human question of “What is the maximum temperature recorded on device 12?” would be the SQL query of SELECT max(temp) FROM tsdata_12 instead of SELECT max(temp) FROM tsdata WHERE device = ’12′” as would be normal in another database. The second feature we wanted to support was fetching data, in particular aggregations for a given period. GridDB uses the TIMESTAMP() SQL function to compare time series data in a query. For example, to select average temperature in 2024, you would use the query SELECT avg(temp) from tsdata_12 where ts >= TIMESTAMP(‘2024-01-01’) and ts < TIMESTAMP(‘2025-01-01’). To do this, a new tool was developed that could generate a human question with corresponding SQL query answer based on a given template. Iterating on the template with random values for the time period, container identifier, aggregation, etc would build a reasonable sized data set to fine tune the model with. An example input template: { “context” : “CREATE TABLE IF NOT EXISTS devices (device_id INTEGER, ts TIMESTAMP, co DOUBLE, humidity DOUBLE,light BOOL,lpg DOUBLE,motion BOOL,smoke DOUBLE,temp DOUBLE);”, “queries” : [ { “columns” : [“co”, “humidity”, “lpg”, “smoke”, “temp”], “human” : “What is the {HUMAN_AGGREGATE} {COLUMN} ?”, “sql” : “SELECT {AGGREGATE}({COLUMN}) FROM devices;” } ] } Would produce : {“context”: “CREATE TABLE IF NOT EXISTS devices (device_id INTEGER, ts TIMESTAMP, co DOUBLE, humidity DOUBLE,light BOOL,lpg DOUBLE,motion BOOL,smoke DOUBLE,temp DOUBLE);”, “question”: “What is the lowest smoke in 2009 for all devices?”, “answer”: “SELECT MIN(smoke) FROM devices WHERE ts > TIMESTAMP(‘2009-01-01T00:00:00Z’) and ts < TIMESTAMP(‘2010-01-01T00:00:00Z’);”} {“context”: “CREATE TABLE IF NOT EXISTS devices (device_id INTEGER, ts TIMESTAMP, co DOUBLE, humidity DOUBLE,light BOOL,lpg DOUBLE,motion BOOL,smoke DOUBLE,temp DOUBLE);”, “question”: “What is the highest humidity in June 2011 for all devices?”, “answer”: “SELECT MAX(humidity) FROM devices WHERE ts > TIMESTAMP(‘2011-06-01T00:00:00Z’) and ts < TIMESTAMP(‘2011-07-01T00:00:00Z’);”} We created five to six templated queries that both did and did not use the TIMESTAMP() function for five different contexts which both did and did not use multiple tables per the Key-Container model and then with the dataset creation tool, generated 100 different question/answer pairs per query for a total of 3600 queries. As automatic dataset splitting resulted in a disproportionate amount of one context over another in the test dataset, a second test dataset was generated with only a single question/answer pair for each templated query. Fine Tuning For both the filtered data set and the generated GridDB data set, each training data item combined into a single string of the format: Tables: {context} Question: {question} Answer: The above string is then tokenized using the HuggingFace AutoTokenizer and used as the input identifier while the answer is tokenized as the labels. After the dataset has been tokenized, it is trained using HuggingFace’s Trainer library. Additional tokens for < and <= need to be added to the tokenizer otherwise those symbols with the SQL statements would be ignored by the model during training. def tokenize_function(example): start_prompt = “Tables:\n” middle_prompt = “\n\nQuestion:\n” end_prompt = “\n\nAnswer:\n” data_zip = zip(example[‘context’], example[‘question’]) prompt = [start_prompt + context + middle_prompt + question + end_prompt for context, question in data_zip] example[‘input_ids’] = tokenizer(prompt, padding=’max_length’, truncation=True, return_tensors=”pt”).input_ids example[‘labels’] = tokenizer(example[‘answer’], padding=’max_length’, truncation=True, return_tensors=”pt”).input_ids return example finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16) tokenizer = AutoTokenizer.from_pretrained(tok_model_name) tokenizer.add_tokens([‘<=’, ‘<= ‘, ‘ <=’, ‘ <‘, ‘<‘, ‘< ‘, ‘>= ‘, ‘ >=’, ‘>=’]) finetuned_model.resize_token_embeddings(len(tokenizer)) tokenized_datasets = dataset.map(tokenize_function, batched=True) output_dir = f’./sql-training-{str(int(time.time()))}’ training_args = TrainingArguments( output_dir=output_dir, learning_rate=5e-3, num_train_epochs=2, per_device_train_batch_size=8, # batch size per device during training per_device_eval_batch_size=8, # batch size for evaluation weight_decay=0.01, logging_steps=50, evaluation_strategy=’steps’, # evaluation strategy to adopt during training eval_steps=500, # number of steps between evaluation ) trainer = Trainer( model=finetuned_model, args=training_args, train_dataset=tokenized_datasets[‘train’], eval_dataset=tokenized_datasets[‘validation’], ) trainer.train() Using an AMD Ryzen Threadripper 2990WX with an NVIDIA 4070GTX, training took approximately 3-4 hours to complete for the filtered dataset and under an hour to complete for the generated dataset. Evaluation Using either the 10% test split of the training dataset or the generated test dataset, the same tokenization method was used to build input for the model. The output answer was generated for every input and compared using HuggingFace’s ROUGE evaluation library. try: for stmt in data[context_name].split(“;”): stmt = stmt.strip() table = stmt.split(” “)[2] curs.execute(“DROP TABLE IF EXISTS “+table) curs.execute(stmt) except: pass try: curs.execute(data[answer_name]) good=good+1 print(json.dumps({“question”: data[question_name], “context”: data[context_name], “answer”: data[answer_name]})) except: bad=bad+1 This evaluation was performed for both the original filtered data set and also the generated GridDB specific data set and ROUGE metrics were gathered. ROUGE or Recall-Oriented Understudy for Gisting Evaluation is a set of metrics used to evaluate text transformation or summarization models by comparing human generated baseline answer versus the model generated response. Each ROUGE metric varies from 0 to 1, with 1 being a perfect match. Metric Filtered Queries GridDB Specific Queries ROUGE-1 0.9220341258369449 0.893189189189189 ROUGE-2 0.8328271928176021 0.8556992481203007 ROUGE-L 0.9039756047111251 0.8807387387387388 ROUGE-1 measures the overlap of the words between the original and infered answer. ROUGE-2 refers to the overlap of pairs of words between the reference and infered answer. ROUGE-L measures the longest sequence of words between the reference and infered answer that match. Application Implementation There are many ways to integrate the LLM into an application. LLM inference could be performed on the edge on the user’s device which would allow for greater scalability but also much higher end user system requirements. If the inference is performed on the server side, it can be bundled into the current application or as a separate service that communicates with the current application. This would allow inference to run on dedicated high performance instances and thus inference would have minimal impact on the existing application’s performance. We will directly bundle the LLM into our application into the demo for simplicity’s sake. Now adding the code to use the model in your application is straight forward. The context can be fetched using GridDB’s NoSQL API: containers = [] x = 0 while x < gridstore.partition_info.partition_count: containers.extend(gridstore.partition_info.get_container_names(x, 0)) x=x+1 conts_and_schemas = {} for cont in containers: col_list = gridstore.get_container_info(cont).column_info_list schema = {} for row in col_list: schema[row[0]] = type_mapping(row[1]) conts_and_schemas[cont] = schema create_stmts = [] for key in conts_and_schemas: create_stmts.append(create_table_statement(key, conts_and_schemas[key])) return create_stmts Inference for a single question is performed in a similiar fashion to how evalution was performed. model = AutoModelForSeq2SeqLM.from_pretrained(“griddb_model_2_epoch”) tokenizer = AutoTokenizer.from_pretrained(“t5-small”) def translate_to_sql_select(context, question): prompt = f”””Tables: {context} Question: {question} Answer: “”” input_ids = tokenizer.encode(prompt, return_tensors=”pt”) outputs = model.generate(input_ids) sql_query = tokenizer.decode(outputs[0], skip_special_tokens=True) return sql_query Finally a Flask route gets the local context, calls the model, executes the query, and returns the response. @app.route(‘/nlquery’) def nlquery(): question = request.args.get(‘question’) context = get_local_context() query = translate_to_sql_select(context,question) curs = conn.cursor() try: curs.execute(query) rows = curs.fetchall() return json.dumps(rows) else: abort(400, ‘Generated query was not successful’) While the model is easily incorporated into any Flask or other Python application as shown, scalability may be difficult as each LLM invocation takes approximately 500 milliseconds using an AMD Ryzen Threadripper and NVIDIA 4070GTX. There are other projects such as https://github.com/Ki6an/fastT5 that will greatly improve the scalability of the GridDB LLM model. Conclusion We hope the process of creating a training dataset, performing the training and using the resulting LLM within an application to query your data was insightful and educational. Using LLM, end users including IoT device owners, corporate analysts, managers, customer service, and others are able to query data stored in GridDB without having to know SQL. While the queries used to demonstrate the LLM in this project are relatively simple, the model appears to be extensible to other query types and methods. Furthermore, the T5-small model is efficient to train, not requiring large investments in hardware to to train or run inference on. In the future, with a larger, more diverse training dataset and advancements even in the base model performing natural language queries will become even more commonplace and accurate. The source code used in the project is available at https://github.com/griddbnet/sql_llm_model. The finished model can be downloaded from HuggingFace
Have you ever needed to share a piece of code or text quickly and securely with someone else? With the growing need for easy-to-use online text storage solutions like Pastebin and GitHub Gist, developers are looking for efficient ways to store and share text files. In this blog post, we’ll walk you through building your own online text storage service using Spring Boot. By following our step-by-step guide, you’ll learn how to create a powerful and secure platform that allows users to store and share text files effortlessly. Whether you’re a beginner or an experienced developer, this guide will equip you with the skills to build a robust text storage service. We will start from the requirements, high-level design, and finally the implementation using Spring Boot with docker-compose. Source Code Code can be found in the griddbnet github page: $ git clone https://github.com/griddbnet/Blogs.git –branch springboot-snippet Requirements ✅ We will have the following functional requirements: Users enter text data into the system with the title User clicks the snippet URL to view/edit the content The system must generate a unique snippet ID ❎ The out of scope: User management (register, login) User authentication and authorization User choose a custom ID ☑️ The assumptions The system supports only text-based data The content does not expire The read-to-write ratio is around 5:1 ☑️ Capacity On average, each snippet size is around 30 KB with a maximum size of 10 MB. With 10 million snippets per month, and each snippet requiring 30 KB of storage: 30/snippet * 1,000,000 snippets/month * 12 months = 3,600,000,000 KB/year, which is equivalent to 3.6 TB of storage per year. High Level Design Typically, there are two methods for storing text content: utilizing object storage such as Amazon S3 and employing a database. This article streamlines the architecture by storing the text snippets directly within the database. Both the metadata and content are housed in the same database. This strategy enhances the speed of data retrieval. However, storing extensive text in this manner can escalate the database workload, potentially affecting performance as the data volume expands. Data Storage We will use three tables. The first table, “users”, stores information about registered users, like name and email. The second table, named “snippets”, stores information about the text metadata, like titles or creation dates, and the users who created it. Finally, the “storage” table is the vault where the actual text content of the user’s entries is securely stored. A user might create multiple snippets but a paste is owned only by a single user. Project Setup To get started developing you’ll need to install the following softwares Java 17 or later, Maven 3.5+, Docker engine, and your favorite text editor (Intellij IDEA, or VSCode) Create a Spring Boot Project Spring boot offers a fast way to build applications. Spring Boot does not generate code to make edits to your files. Instead, when you start your application, Spring Boot dynamically wires up beans and settings and applies them to your application context. With Spring Boot, we can focus more on the business features and less on the infrastructure. Navigate to start.spring.io. This service pulls in all the dependencies you need for an application and does most of the setup. Click generate, it will generate the Spring Boot project and download it as a zip. Now unzip this project and import it into any IDE.<br> To interact with GridDB, we need to add a GridDB Java Client to this project. Add the following dependency into maven pom.xml. <dependency> <groupId>com.github.griddb</groupId> <artifactId>gridstore</artifactId> <version>5.5.0</version> </dependency> <dependency> <groupId>com.github.f4b6a3</groupId> <artifactId>tsid-creator</artifactId> <version>5.2.5</version> </dependency> Building the Web Application Based on practical considerations and familiarity, we choose Thymeleaf to render the frontend.\ Thymeleaf provides an easy-to-understand and easy-to-debug flow, making it a practical choice.\ Thymeleaf eliminates the need for additional package management, reducing complexity and potential vulnerabilities. The Front-End Displaying a list of snippets, we define a basic HTML table structure with header and body sections.\ Inside the body, the code iterates through a list of snippets using a Thymeleaf loop (th:each).\ The first column shows the snippet title inside an anchor element to create a hyperlink to the snippet details. <div th:if=”${snippets.size() > 0}” class=”container”> <table class=”table table-hover table-responsive-xl table-striped”> <thead class=”thead-light”> <tr> <th scope=”col”>Title</th> <th scope=”col”>Created Time</th> <th scope=”col”>Size</th> </tr> </thead> <tbody class=”table-group-divider”> <tr th:each=”snippet : ${snippets}”> <td> <a class=”small” th:href=”@{/snippets/{id}(id=${snippet.id})}” th:text=”${snippet.title}”></a> </td> <th id=”timeAgo” scope=”row”>[[${snippet.timeAgo}]]</th> <th id=”contentSizeHumanReadable” scope=”row”>[[${snippet.contentSizeHumanReadable}]]</th> </tr> </tbody> </table> </div> And this is what the listing page will look like: Creating a new snippet by using html form. The th:action expression directs the form to POST to the /snippets/save endpoint. The th:object=”${snippet}” declares the model object to use for collecting the form data. <form th:action=”@{/snippets/save}” method=”post” enctype=”multipart/form-data” th:object=”${snippet}” id=”snippetForm” style=”max-width: 550px; margin: 0 auto”> <div class=”p-3″> <div class=”form-group row”> <label class=”col-sm-3 col-form-label” for=”title”>Title</label> <div class=”col-sm-9″> <input type=”text” th:field=”*{title}” required minlength=”2″ maxlength=”128″ class=”form-control” id=”title” /> </div> </div> <div class=”form-group row”> <label class=”col-sm-3 col-form-label” for=”content”>Content</label> <div class=”col-sm-9″> <textarea rows=”20″ cols=”80″ th:field=”*{content}” form=”snippetForm” class=”form-control” id=”content” required/> </div> </div> <div class=”form-group row”> <label class=”col-sm-3 col-form-label” for=”userId”>User</label> <div class=”col-sm-9″> <select th:field=”*{userId}”> <option th:each=”user : ${users}” th:text=”${user.fullName}” th:value=”${user.id}”> </select> </div> </div> <div class=”text-center”> <input type=”submit” value=”Save” class=”btn btn-primary btn-sm mr-2″ /> <input type=”button” value=”Cancel” id=”btnCancel” class=”btn btn-secondary btn-sm” /> </div> </div> </form> And this is what the create snippet page will look like: Data access with GridDB First, we create Java POJO classes that represent the underlying table or container in GridDB. We annotate a class with Lombok @Data, which automatically generates getters for all fields, a useful toString method, and hashCode and equals implementations that check all non-transient fields. Also generate setters for all non-final fields, as well as a constructor. We create the data access class according to the previous database design. @Data public class User { @RowKey String id; String email; String fullName; Date createdAt; } @Data public class Snippet { @RowKey String id; String title; String storageId; String userId; Date createdAt; String contentSizeHumanReadable; } @Data public class Storage { @RowKey String id; Blob content; } <br> Next, we create the GridDBConfig class as a central configuration for database operation. The class will do the following: * Read environment variables for connecting to the GridDB database * Create a GridStore class for managing database connection to the GridDB instance * Create GridDB Collection’s container (Table) to manage a set of rows. The container is a rough equivalent of the table in a relational database. * On creating/updating the Collection we specify the name and object corresponding to the column layout of the collection. Also for each collection, we add an index for a column that is frequently searched and used in the condition of the WHERE section of TQL. @Configuration public class GridDBConfig { @Value(“${GRIDDB_NOTIFICATION_MEMBER}”) private String notificationMember; @Value(“${GRIDDB_CLUSTER_NAME}”) private String clusterName; @Value(“${GRIDDB_USER}”) private String user; @Value(“${GRIDDB_PASSWORD}”) private String password; @Bean public GridStore gridStore() throws GSException { // Acquiring a GridStore instance Properties properties = new Properties(); properties.setProperty(“notificationMember”, notificationMember); properties.setProperty(“clusterName”, clusterName); properties.setProperty(“user”, user); properties.setProperty(“password”, password); GridStore store = GridStoreFactory.getInstance().getGridStore(properties); return store; } @Bean public Collection<String, User> userCollection(GridStore gridStore) throws GSException { Collection<String, User> collection = gridStore.putCollection(“users”, User.class); collection.createIndex(“email”); return collection; } @Bean public Collection<String, Snippet> snippetCollection(GridStore gridStore) throws GSException { Collection<String, Snippet> snippetCollection = gridStore.putCollection(AppConstant.SNIPPETS_CONTAINER, Snippet.class); snippetCollection.createIndex(“userId”); snippetCollection.createIndex(“title”); return snippetCollection; } @Bean public Collection<String, Storage> storageCollection(GridStore gridStore) throws GSException { Collection<String, Storage> storageCollection = gridStore.putCollection(AppConstant.STORAGES_CONTAINER, Storage.class); return storageCollection; } } <br> The Service Layer In the Spring framework, the Service Layer is one of the fundamental architectural layers, primarily responsible for implementing the business logic of an application. SnippetService Method fetchAll : query all the snippets from the collection order by newly created first. Method create: * Is responsible for storing the snippet content in the storage collections, and saving the metadata of the new snippet. * Calculate the size of the content and save that info in a human-readable format. * Generate a snippet ID that is non-predictable, collision-free, and should be readable. * We choose Time-Sorted Unique Identifiers (TSID) as the ID of the snippet. By using TSID, we get a time-sorted identifier including a random component, and can be represented as a 13-character string. @Service @RequiredArgsConstructor public class SnippetService { private final Logger log = LoggerFactory.getLogger(SnippetService.class); private final Collection<String, Snippet> snippetCollection; private final Collection<String, Storage> storagCollection; public List<Snippet> fetchAll() { List<Snippet> snippets = new ArrayList<>(0); try { String tql = “SELECT * FROM ” + AppConstant.SNIPPETS_CONTAINER + ” ORDER BY createdAt DESC”; Query<Snippet> query = snippetCollection.query(tql); RowSet<Snippet> rowSet = query.fetch(); while (rowSet.hasNext()) { snippets.add(rowSet.next()); } } catch (GSException e) { log.error(“Error fetch all snippet”, e); } return snippets; } public void create(CreateSnippet createSnippet) { try { Snippet found = fetchOneByTitle(createSnippet.getTitle()); if (found != null) { return; } Blob content = snippetCollection.createBlob(); content.setBytes(1, createSnippet.getContent().getBytes()); Storage storage = new Storage(); storage.setId(KeyGenerator.next(“obj”)); storage.setContent(content); storagCollection.put(storage.getId(), storage); Snippet snippet = new Snippet(); snippet.setTitle(createSnippet.getTitle()); snippet.setStorageId(storage.getId()); snippet.setUserId(createSnippet.getUserId()); snippet.setCreatedAt(new Date()); snippet.setContentSizeHumanReadable(toHumanReadableByNumOfLeadingZeros( createSnippet.getContent().getBytes().length)); snippetCollection.put(KeyGenerator.next(“sn”), snippet); } catch (GSException e) { e.printStackTrace(); } catch (SQLException e) { e.printStackTrace(); } } } The Controller Layer This layer is responsible for handling incoming HTTP requests. The controller will receive requests, process them, and interact with the previous service layer to retrieve or manipulate data. The SnippetsController handles all HTTP request to /snippets. @Controller @RequestMapping(“/snippets”) @RequiredArgsConstructor public class SnippetsController { private static final Logger log = LoggerFactory.getLogger(SnippetsController.class); private final SnippetService snippetService; private final UserService userService; @GetMapping String snippets(Model model) { List<Snippet> snippets = snippetService.fetchAll(); List<SnippetView> snippetViews = snippets.stream() .map(sn -> SnippetView.builder() .id(sn.getId()) .title(sn.getTitle()) .createdAt(sn.getCreatedAt()) .timeAgo(calculateTimeAgoByTimeGranularity(sn.getCreatedAt(), TimeGranularity.MINUTES)) .contentSizeHumanReadable(sn.getContentSizeHumanReadable()) .build()) .collect(Collectors.toList()); List<User> users = userService.fetchAll(); model.addAttribute(“snippets”, snippetViews); return “snippets”; } @GetMapping(“/new”) String newSnippet(Model model) { List<User> users = userService.fetchAll(); model.addAttribute(“snippet”, new CreateSnippet()); model.addAttribute(“users”, users); return “new_snippet”; } @PostMapping(“/save”) String saveSnippet(@ModelAttribute(“snippet”) CreateSnippet createSnippet) { snippetService.create(createSnippet); return “redirect:/snippets”; } } Method snippets(Model model): * Handle GET request to URL /snippets * Call the service layer to get a list of snippets * Format the creation time into time ago format for each snippet * Return a View (in this case, snippets) for rendering HTML content of snippet list Method newSnippet(Model model): * Handle GET request to URL /snippets/new. * Use a Model object to expose a new Snippet (CreateSnippet) to the view template. CreateSnippet contains fields such as title and content. * Call the service layer to save the Snippet data Running the Project with Docker Compose To spin up the project we will utilize Docker, a popular container engine.\ First, build the docker image using the following command: docker compose -f docker-compose-dev.yml build Run the app: docker compose -f docker-compose-dev.yml up Upon successful execution of the command, you can access the website at http://localhost:8080. Conclusion In conclusion, creating a basic online text storage service using Java Spring Boot and GridDB is a simple and effective process. Spring Boot allows us to concentrate on our core development tasks, while Thymeleaf speeds up template prototyping by not requiring a back-end initially. GridDB efficiently handles the storage of text content using the blob data type. To take your service to the next level, consider adding data compression to save on database storage, building metrics to track the viewer count of each text snippet, and encryption. By implementing these enhancements, you can turn a basic service into a powerful and feature-rich
RabbitMQ is a popular message-queuing system, used in a variety of systems where message delivery are of utmost importance. For our case, we would like to use RabbitMQ to ensure delivery of an on-the-field sensor data to be delivered to GridDB for later processing. Of course, we could always send data from the field to our main server via other means, namely HTTP, but those methods of data transfer can be finicky and unsafe; how often have you tried listening a song via Apple Music through a sparsely connected rural part of the state, only to be met with a connection error and then dead silence? Once that connection is broken, it won’t come back until the entire handshake process occurs again, and all of the data sent in the intermediary is completely lost. The goal of RabbitMQ in the context of this project will be to ensure that even if there are connection issues, the data will persist until it receives acknowledgement from the server that the data has been received and saved into GridDB. The Project The goal of this article is to create a proof-of-concept for a very basic IoT message-queue system; we will have one physical sensor out “in the field” reading data from its environment, pushing the readings onto an exchange which will then push the data onto the queue and then finally into our server. Once that server acknowledges that it has received the entirety of the data, it will remove that value from the queue and move on to the next one (if it exists). To accomplish this, first let’s talk hardware. The Hardware We have set up a Raspberry Pi 4 to connect with an air quality sensor Adafruit PMSA003I Air Quality Breakout via this STEMMA Hat and a STEMMA wire; if you are interested in learning more about this particular sensor, you can read about it in the Docs page provided by adafruit. The data will be received from the queue from an Ubuntu server — the specs are not important. Next, let’s take a look at the software. The Software Of course, we are going to be utilizing RabbitMQ for the pushing and receiving of messages of relevant data. RabbitMQ provides various connectors for many programming languages, so we essentially are free to mix and match as we see fit (which is another stealth benefit of utilizing RabbitMQ for your stack). In our case, because we were already provided with a python library to which we can easily read and translate the raw sensor data, we want to push the payload data with Python. We could receive our payload data on the server with another python script with the aid of the GridDB Python Connector, but we will instead opt to receive with Java as it is GridDB’s native interface and doesn’t require any additional downloads. The Plan Overall, our plan is as follows: Install RabbitMQ onto Ubuntu server Read sensor readings and translate into readable data payloads (python) push data onto an Exchange/Queue of our creation Receive queue with Java (and RabbitMQ) Save received payloads directly GridDB How to Run The python script can be easily run: install the required libraries and then simply run the script: python3 app.py. For Java, because we have some dependencies to outside libraries, we need to reference them (they’re in the lib directory) and then run that way. For example: $ cd lib/ $ export CP=.:amqp-client-5.16.0.jar:slf4j-api-1.7.36.jar:slf4j-simple-1.7.36.jar:gridstore-5.6.0.jar:jackson-databind-2.17.2.jar:jackson-core-2.17.2.jar:jackson-annotations-2.17.2.jar $ java -cp $CP ../Recv.java The order of running these two files is not important; the receive will stay on even if the queue is empty. Prereqs & Getting Started Here are list of needs if you would like to follow this project 1:1 Raspberry Pi STEMMA Hat & Wire (or other means of connecting to board) Python, RabbitMQ, GridDB, Java, & various other libraries You can install RabbitMQ from their download page; instructions are straightforward. The only caveats are you will need to create yourself a new user and set the permissions properly: $ sudo rabbitmqctl add_user username password $ sudo rabbitmqctl set_permissions -p / username “.*” “.*” “.*” The credentials here will be the same ones used when forging the connection between the data sender and the data receiver. One note: I was unsuccessful in trying to use “special characters” in my password when making my connection, so I’d advise to keep the password simple for now (ie. just A-z and integers). Implementation: The Producer Finally, let’s get into specifics. We will first focus on our producer (the raspberry pi) and then move on to the consumer (our server). We will also be setting some configs to ensure our messages are delivered and saved into the database. Python Script for Reading Data We are using a modified version of the python script provided by adafruit to read the sensor data. Essentially, our task is very simple: we read the data, convert to JSON, and push to the Exchange/Queue. First, let’s look at the hardware part of the code; after that we will get into the code for creating and pushing onto a queue to the correct machine. import board import busio from adafruit_pm25.i2c import PM25_I2C reset_pin = None i2c = busio.I2C(board.SCL, board.SDA, frequency=100000) # Connect to a PM2.5 sensor over I2C pm25 = PM25_I2C(i2c, reset_pin) aqdata = pm25.read() This snippet of code is all you need to read/translate the sensor readings. With this, assuming everything is connected properly, we will save the current values into the variable we called aqdata. Python Code to Create and Push Data to RabbitMQ Queue Next, let’s look at the RabbitMQ code. First, we want to establish our connection to our ubuntu server. We will point the address to the IP of the machine and set the port to the default. We will also use the credentials we made earlier on our Ubuntu server import pika credentials = pika.PlainCredentials(‘israel’, ‘israel’) parameters = pika.ConnectionParameters(‘192.168.50.206’, 5672, ‘/’, credentials) connection = pika.BlockingConnection(parameters) channel = connection.channel() Next we want to create and set some parameters for our queue, including how we handle pushing data messages to it. channel.confirm_delivery() channel.queue_declare(queue=’airQuality’, durable=True) By default, RabbitMQ prioritizes throughput above all else, meaning we need to change some default configuration options to ensure our data is being sent — even in the case of a weak connection — to our server (also known as the broker). First, we want to enable confirm delivery. This will produce an exception/error if the producer receives a negative acknowledgement (also referred to as a nack) from our broker. This means if our data is falling off, we will at least have a log of it. Unfortunately for us, there isn’t a very robust handling of failed messages on the Python side; if this were for a production project, we would need to migrate from Python to some other language where you can deal with messages in a variety of ways. Namely, I think, we’d like to add batch processing of messages so that there’s less of a chance of dropped data readings, and an easier time of re-sending dropped efforts. Anyway, working with what we have, the next thing we do is turn on durable which will save the queue in the event of a broker crash/reboot. This means the aqdata won’t need to be re-created but the messages inside of the queue won’t necessarily be saved. After that, we read and send data simultaneously: while True: time.sleep(1) try: aqdata = pm25.read() current_time = datetime.datetime.utcnow().replace(microsecond=0) now = current_time.strftime(‘%Y-%m-%dT%H:%M:%S.%fZ’) aqdata[‘ts’] = now aqdata[‘pm1’] = aqdata.pop(‘pm10 standard’) aqdata[‘pm25’] = aqdata.pop(‘pm25 standard’) aqdata[‘pm10’] = aqdata.pop(‘pm100 standard’) aqdata[‘pm1e’] = aqdata.pop(‘pm10 env’) aqdata[‘pm25e’] = aqdata.pop(‘pm25 env’) aqdata[‘pm10e’] = aqdata.pop(‘pm100 env’) aqdata[‘particles03’] = aqdata.pop(‘particles 03um’) aqdata[‘particles05’] = aqdata.pop(‘particles 05um’) aqdata[‘particles10’] = aqdata.pop(‘particles 10um’) aqdata[‘particles25’] = aqdata.pop(‘particles 25um’) aqdata[‘particles50’] = aqdata.pop(‘particles 50um’) aqdata[‘particles100’] = aqdata.pop(‘particles 100um’) #print(aqdata) except RuntimeError: print(“Unable to read from sensor, retrying…”) continue payload = json.dumps(aqdata) try: channel.basic_publish(exchange=”, routing_key=’airQuality’, body=payload, properties=pika.BasicProperties(delivery_mode=pika.DeliveryMode.Persistent), mandatory=True) print(” [x] Sent payload: ” + payload) except pika.exceptions.UnroutableError: # If the message is not confirmed, it means something went wrong print(“Message could not be confirmed”) For this snippet of code, we are reading the sensor data, changing the column names into the ones we want to use on the consumer side, and then pushing the payload into the channel to our queue we made earlier. Some things to note here: we set the mandatory flag to true and set the delivery mode to persistent. These two settings will try to save our messages into disk if they don’t receive positive acknowledgement from our broker that the messages were safely delivered. The exception occurs if the broker ends back to our producer a nack (negative acknowledgement). And so now every 1 second, our script will read sensor values and push it into the queue. Once the data is confirmed by the broker, the producer no longer cares about that data message. Implementation: The Consumer Our consumer will be written in Java and its job is to read from the Queue in our broker (in our case, the same host machine as our consumer), unmarshal the data into a Java Object, and then save the results into GridDB. Consuming the Queue in Java The consumer portion of the code is rather simple: forge the connection and read from the queue. private final static String QUEUE_NAME = “airQuality”; private final static boolean AUTO_ACK = false; ConnectionFactory factory = new ConnectionFactory(); factory.setHost(“localhost”); Connection connection = factory.newConnection(); Channel channel = connection.createChannel(); channel.queueDeclare(QUEUE_NAME, true, false, false, null); System.out.println(” [*] Waiting for messages. To exit press CTRL+C”); Here we are making our connection to our broker (hosted on the same machine as the consumer, hence localhost). We declare the queue we want to read from and set some options; we are using the default values for everything except for the first true which corresponds to durable mode, which we are setting to true, as explained above in the python section, it means that our queue will persist even if the broker goes down. Next, let’s run the actual consume: channel.basicConsume(QUEUE_NAME, AUTO_ACK, deliverCallback, consumerTag -> { }); The only thing I’d like to point out here is that we’ve turned off the AUTO_ACK option (it’s set to FALSE). This means we will need to manually acknowledge either if the message being read from the queue was successful or not. Next, here’s the callback function that is run every-time it reads a new message off of the queue: DeliverCallback deliverCallback = (consumerTag, delivery) -> { byte[] data = delivery.getBody(); try { AirData ad = mapper.readValue(data, AirData.class); String jsonString = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(ad); System.out.println(jsonString); container.put(ad); channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false); } catch (Exception e) { channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true); System.out.println(“Setting nack”); } }; Here is what’s going on: we read the message (type of array of bytes), we use the jackson json library to unmarshal the value from raw bytes into a class we declare called AirData: static public class AirData { @JsonProperty(“ts”) @RowKey Date ts; @JsonProperty(“pm1”) double pm1; @JsonProperty(“pm25”) double pm25; @JsonProperty(“pm10”) double pm10; @JsonProperty(“pm1e”) double pm1e; @JsonProperty(“pm25e”) double pm25e; @JsonProperty(“pm10e”) double pm10e; @JsonProperty(“particles03”) double particles03; @JsonProperty(“particles05”) double particles05; @JsonProperty(“particles10”) double particles10; @JsonProperty(“particles25”) double particles25; @JsonProperty(“particles50”) double particles50; @JsonProperty(“particles100”) double particles100; } Next we save that newly made Java object into GridDB and then finally acknowledge to our broker that we received the message. If something goes wrong, we will send a nack and the message will remain in the queue until it gets an ack. GridDB Lastly, let’s go over how GridDB fits into this. We will do our standard connecting to GridDB and then get our timeseries container. In this case, I created the table/container in the shell as it’s easier than writing a one-time use java code. $ sudo su gsadm $ gs_sh gs> createtimeseries aqdata NO ts timestamp pm1 double pm25 double pm10 double pm1e double pm25e double pm10e double particles03 double particles05 double particles10 double particles25 double particles50 double particles100 double And now we make our connection in our Java code: public static GridStore GridDBNoSQL() throws GSException { GridStore store = null; try { Properties props = new Properties(); props.setProperty(“notificationMember”, “127.0.0.1:10001”); props.setProperty(“clusterName”, “myCluster”); props.setProperty(“user”, “admin”); props.setProperty(“password”, “admin”); store = GridStoreFactory.getInstance().getGridStore(props); } catch (Exception e) { e.printStackTrace(); } return store; } Using our AirData class from earlier we grab our newly made container: TimeSeries<airdata> container = store.getTimeSeries(“aqdata”, AirData.class); System.out.println(“Connected to GridDB!”);</airdata> And then we’ve already seen this above, but as we receive new payloads, we immediately save to GridDB and then send the positive acknowledgement: container.put(ad); channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false); Conclusion In this article, we set up a robust system in which our IoT data will safely transferred from our python producer to an exchange with no name (”), transferred to our broker which houses our queue called airQuality, and then finally will be read by our java
With the rise of AI tools, we can automate many manual workloads, including creating presentation slides. Developers can generate slide content programmatically by leveraging OpenAI’s language models and Node.js. This automation surely will save time. By using OpenAI for content generation and Node.js for orchestration, you can effortlessly streamline the process of creating compelling and informative presentations. In this post, we will use the Assistant API model from OpenAI to automate slide content creation, Node.js to create the slide document, and GridDB to save the slide information. Running the Project Clone the source code from this GitHub repository. git clone https://github.com/griddbnet/Blogs.git –branch slides You also need to install Node.js and GridDB for this project to run. If the software requirements are installed, change the directory to the apps project directory and then install all the dependencies: cd apps npm install Create a .env file and copy all environment variables from the .env.example file. You need an OpenAI key for this project, please look in this section on how to get the key. OPENAI_API_KEY=sk-proj-secret VITE_APP_URL=http://localhost:3000 You can change the VITE_APP_URL to your needs and then run the project by running this command: npm run start:build Then open the browser and go to the app URL. Select the data sample and then click the Create Slide button. If the slide presentation is created successfully, a download link will be provided. Getting Started 1. Installing Node.js This project will run on the Node.js platform. You need to install it from here. For this project, we will use the nvm package manager and Node.js v16.20.2 LTS version. # installs nvm (Node Version Manager) curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash # download and install Node.js nvm install 16 # verifies the right Node.js version is in the environment node -v # should print `v16.20.2` # verifies the right NPM version is in the environment npm -v # should print `8.19.4“ To connect Node.js and GridDB database, you need the gridb-node-api npm package which is a Node.js binding developed using GridDB C Client and Node addon API. 2. Setting Up GridDB We will use the GridDB database to save recipes and it’s nutrition analysis. Please look at the guide for detailed installation. We will use Ubuntu 20.04 LTS here. Run GridDB and check if the service is running. Use this command: sudo systemctl status gridstore If not running try to run the database with this command: sudo systemctl start gridstore 3. Setup OpenAI Keys The OpenAI key is on a project basis, so we need to create a project first in the OpenAI platform. To access any OpenAI services, you need a valid key. Go to this link and create a new OpenAI key, make sure to select the right project. You need also to enable any models that you use on a project. For this project, we will need the gpt-4o model. Go to the project settings and then select which models to be enabled. You should save the OpenAI key on the .env file and make sure not to include it in version control by adding it to the .gitignore. 4. Setup AI Assistant This project needs an AI assistant. You need to set it first, go to the project dashboard, and create a new assistant. You need to pay attention to the Instruction field because it will dictate the behavior of the AI assistant. This is the instruction used for this assistant: You are a data scientist assistant. When given data and a query, write the proper code and create the proper visualization. Another setup is you need to enable Code Interpreter, which means the assistant will be able to execute code in a sandbox environment, enabling your prompt to execute code. For more information on this feature, please click here. After the AI assistant creation, you need to copy the assistant ID. This ID will be used as a reference in the code where you can send messages to the assistant. const dataScienceAssistantId = “asst_FOqRnMVXw0WShTSGw70NZJAX” Data Examples This project will use JSON data samples from car spare parts sales. The data reside in the data directory. This is the spare part sales data for the year 2020 to the year 2024: [ { “Year”: 2020, “Quarter”: “Q1”, “Distribution channel”: “Online Sales”, “Revenue ($M)”: 2.10, “Costs ($M)”: 1.905643, “Customer count”: 190, “Time”: “2020 Q1”, “Product Category”: “Engine Parts”, “Region”: “North America”, “Units Sold”: 900, “Average Sale Price ($)”: 2333, “Discounts Given ($)”: 14000, “Returns ($)”: 4500, “Customer Satisfaction Rating”: 8.2, “Salesperson”: “SP120”, “Marketing Spend ($)”: 18000 }, { “Year”: 2020, “Quarter”: “Q1”, “Distribution channel”: “Direct Sales”, “Revenue ($M)”: 2.15, “Costs ($M)”: 2.004112, “Customer count”: 200, “Time”: “2020 Q1”, “Product Category”: “Brakes”, “Region”: “Europe”, “Units Sold”: 1000, “Average Sale Price ($)”: 2150, “Discounts Given ($)”: 12000, “Returns ($)”: 5000, “Customer Satisfaction Rating”: 8.0, “Salesperson”: “SP121”, “Marketing Spend ($)”: 19000 }, … { “Year”: 2024, “Quarter”: “Q2”, “Distribution channel”: “Direct Sales”, “Revenue ($M)”: 3.15, “Costs ($M)”: 2.525112, “Customer count”: 390, “Time”: “2024 Q2”, “Product Category”: “Brakes”, “Region”: “Europe”, “Units Sold”: 1500, “Average Sale Price ($)”: 2095, “Discounts Given ($)”: 22000, “Returns ($)”: 17000, “Customer Satisfaction Rating”: 9.1, “Salesperson”: “SP144”, “Marketing Spend ($)”: 38000 } ] Ideally, the data should be uploaded via the user interface. However, for simplicity in this project, the data will be directly processed when you choose the data samples from the data samples dropdown. How can OpenAI process the file directly? The answer is, that you need to upload manually the data sample files first. Go to the project dashboard and upload the files. You need to pay attention to the purpose of the uploaded files. In this project, the data sample files are used as assistants files. Later these file IDs will be used to identify which file is used when the user selects the data sample from the dropdown. Generating Content When the user selects the data sample and clicks the Create Slide button. The Assistant API will generate the image and text for the content slide. These are a few important steps in the code to generate the slide content: 1. Analyze Data Samples OpenAI will analyze the selected data sample then it will calculate the profit by quarter and year then visualize the plot. The prompt for this process is: const analyzeDataPrompt = “Calculate the profit (revenue minus cost) by quarter and year, and visualize as a line plot across the distribution channels, where the colors of the lines are green, light red, and light blue” And this code will process the prompt and the selected file (see fileId) const thread = await openai.beta.threads.create({ messages: [ { “role”: “user”, “content”: analyzeDataPrompt, “attachments”: [ { file_id: fileId, tools: [{ type: “code_interpreter” }] } ] } ] }); From this code, you can get the plot image. It will be saved in the public directory and will be used in the slide content. 2. Generate Bullet Points The AI Assistant will give an insight into the data and will generate bullet points. This is the prompt to instruct AI to give two insights about the data: const insightPrompt = `Give me two medium-length sentences (~20-30 words per sentence) of the most important insights from the plot you just created, and save each sentence as an item in one array. Give me a raw array, no formatting, no commentary. These will be used for a slide deck, and they should be about the ‘so what’ behind the data.` 3. Generate Insight Title The last step is generating a title for the insight. This is the prompt that is responsible for that: const titlePrompt = “Given the plot and bullet point you created, come up with a very brief title only for a slide. It should reflect just the main insights you came up with.” The full code for generating slide content is in the libs/ai.js file. Generate Slides This project uses the PptxGenJS package to generate the slides. You can look at the full code in the libs/pptx.js file. This is the code that calls the createPresentation() function when all the AI-generated slide information is ready. //… if (bulletPointsSummary.status === “completed”) { const message = await openai.beta.threads.messages.list(thread.id) const dataVisTitle = message.data[0].content[0].text.value presentationOptions = { title: slideTitle, subtitle: slideSubtitle, dataVisTitle: dataVisTitle, chartImagePath: path.join(__dirname, “public”, `${filename}`), keyInsights: “Key Insights:”, bulletPoints: bulletPoints, outputFilename: path.join(__dirname, ‘public’, pptxFilename) }; try { createPresentation(presentationOptions) } catch (error) { console.log(error) } } //… Just note that the generated presentation file will be saved in the public directory with each a unique name. Slides Information To save the slide information, we will use the GridDB database. These are the database field’s documentation: Field Name Type Description id INTEGER A unique identifier for each record. This is the primary key for the container and must be unique for each entry. title STRING The main title of the slide. It is a short descriptive title summarizing the content of the slide. subtitle STRING A secondary title or subheading providing additional context or a brief description related to the main title. chartImage STRING The URL or path to an image of a chart associated with the slide, used to link visual data representations. bulletPoints STRING A string containing bullet points that summarize key information or highlights of the slide. Each bullet point is typically separated by a special character or newline. pptx STRING The URL or path to the PowerPoint file (.pptx) that contains the slide, used to link the presentation file including the slide. The griddbservices.js and libs/griddb.js files are responsible for saving all the slide information to the database. Server Routes The Node.js server provides a few routes for the client. This is the full documentation for the routes: Method Route Description GET / Serves the index.html file from the ‘dist’ folder GET /create/:fileId Triggers the AI assistant to process a file and create a presentation. Returns the save status and PPTX file name GET /metadata Serves the metadata.json file from the data directory GET /data/files Lists all JSON filenames in the data directory GET /data/files/:filename Serves a specific JSON file from the data directory GET /slides Retrieves all slides data from the database The most important route is /create/:fileId which triggers the AI assistant to analyze data samples, create a presentation, and then save all slide information to the database. app.get(‘/create/:fileId’, async (req, res) => { const fileId = req.params.fileId try { const result = await aiAssistant(fileId) if (result.status === “completed”) { const { title: titlePptx, subtitle: subtitlePptx, dataVisTitle: dataVisTitlePptx, chartImage: chartImagePptx, bulletPoints: bulletPointsPptx, outputFilename: pptxFile } = result.data const saveDataStatus = await saveData({ titlePptx, subtitlePptx, dataVisTitlePptx, chartImagePptx, bulletPointsPptx, pptxFile }) res.json({ save: saveDataStatus, data: result.data, pptx: result.pptx }) } else { res.status(500).json({ error: ‘Task not completed’, status: result.status }) } } catch (error) { console.error(‘Error in AI Assistant:’, error) res.status(500).json({ error: ‘Error in AI Assistant’, details: error.message }) } }) The aiAssistant() function will analyze the data sample, create a presentation return all information about the slide, and then save those slide information to the GridDB database using the saveData() function. To get all the slide data just go to the /slides route and it will respond with all slide data saved in the database. User Interface The main user interface consists of two components: Data Dropdown: To select a data sample. Create Slide Button: To trigge presentation creation. Download Generated Presentation Link: The download link for the presentation .pptx file. Further Enhancements This is a prototype project with static data samples. Ideally, in production, you need to provide a better user interface to upload the data and customize the
With the release of GridDB v5.6, we are taking a look at the new features that come bundled with this new update. To read the entirety of the notes, you can read them directly from GitHub: GridDB CE v5.6 Release Notes. You can also read the detailed GridDB documentation, including the new v5.6 updates here: https://www.toshiba-sol.co.jp/en/pro/griddb/docs-en/v5_6/GridDB_FeaturesReference.html Of the new features, today we are focusing on the new data compression algorithm that is now select-able in the gs_node.json config file and automatic time aggregation from the GridDB CLI tool. Prior to v5.6, there were only two methods of compression that were select-able: NO_COMPRESSION and COMPRESSION_ZLIB. Though the default setting is still no compression for all versions, version 5.6 offers a new compression method called COMPRESSION_ZSTD. This compression method promises to be more efficient at compressing your data regularly, and also at compressing the data itself, meaning we can expect a smaller footprint. So, in this article, we will inserting a consistent amount of data into GridDB, comparing the resulting storage space taken up, and then finally comparing between all three compression methods. As for automatic aggregation, we will show a brief demonstration of how it looks at the end of this article. But first, compression. Methodology As explained above, we will need to easily compare between three instances of GridDB with the same dataset. To accomplish this, it seems docker would be the easiest method because we can easily spin up or down new instances and change the compression method for each instance. If we do this, then we simply use the same dataset or the same data generation script for each of the instances. To get a robust enough dataset to really test the compression algorithm differences, we decided on 100 million rows of data. Specifically, we wanted the dataset to be similar enough in some respects that the compression can do its job so that we in turn can effectively measure its effectiveness. The three docker containers will be griddb-server1, griddb-server2, and griddb-server3. The compression levels are set in the docker-compose file, but we will do it the way that makes the most sense to me: server1 is NO_COMPRESSION, server2 is the old compression system (COMPRESSION_ZLIB), and server3 is the new compression system (COMPRESSION_ZSTD). So when we run our gen-script, we can use command line arguments to specify which container we want to target. More on that in the next section. How to Follow Along If you plan to build and test out these methods yourself while you read along, you can grab the source code from our GitHub page: . Once you have the repo, you can start with spinning up your GridDB servers. We will get into how to run the generation data script to push 100m rows of data into your servers in the next section. To get the three servers running, the instructions are laid out in the docker compose file located in the root of the project repository; you can simply run: $ docker compose build $ docker compose up -d If all goes well, you should have three GridDB containers running: griddb-server1, griddb-server2, & griddb-server3. Implementation To implement, we used a node.js script which generated 100m rows of random data. Because our GridDB containers are spun up using Docker, we made all three docker containers for GridDB separate services inside of a docker compose file. We then grabbed that docker network name and used it when running our nodejs script. This means that our nodejs script was also built into a docker container and then we used that to push data into the GridDB containers with the following commands: $ docker build -t gen-data . $ docker run –network docker-griddb_default gen griddb-server1:10001 $ docker run –network docker-griddb_default gen griddb-server2:10001 $ docker run –network docker-griddb_default gen griddb-server3:10001 Here is the nodejs script in its entirety: const griddb = require(‘griddb-node-api’); const process = require(‘process’); var fs = require(‘fs’); var factory = griddb.StoreFactory.getInstance(); var store = factory.getStore({ “notificationMember”: process.argv[2], “clusterName”: “myCluster”, “username”: “admin”, “password”: “admin” }); const conInfo = new griddb.ContainerInfo({ ‘name’: “compressionBlog”, ‘columnInfoList’: [ [“timestamp”, griddb.Type.TIMESTAMP], [“location”, griddb.Type.STRING], [“data”, griddb.Type.FLOAT], [“temperature”, griddb.Type.FLOAT], ], ‘type’: griddb.ContainerType.COLLECTION, ‘rowKey’: false }); function getRandomFloat(min, max) { return Math.random() * (max – min) + min; } const putCont = async (sensorCount, data, temperature) => { const rows = generateSensors(sensorCount, data, temperature); try { const cont = await store.putContainer(conInfo) await cont.multiPut(rows); } catch (error) { console.log(“error: “, error) } } const generateSensors = (sensorCount, data, temperature) => { const arr = [] let now = new Date(); for (let i = 1; i <= sensorCount; i++) { let tmp = []; let newTime = now.setMilliseconds(now.getMinutes() + i) tmp.push(newTime) tmp.push(“A1”) tmp.push(data) tmp.push(temperature) arr.push(tmp) } return arr; } const AMTROWS = 10000; const AMTPASSES = 10000; (async () => { try { console.log(“attempting to gen data and push to GridDB”) for (let i = 0; i < AMTPASSES; i++) { const data = parseFloat(getRandomFloat(1, 10).toFixed(2)) const temperature = parseFloat(getRandomFloat(60, 130).toFixed(2)) await putCont(AMTROWS, data, temperature); } console.log(“Finished pushing data!”) } catch (error) { console.log(“Error putting to container”, error); } })(); The code itself is simple and self explanatory but please note that if you plan to follow along, inserting this volume of rows into GridDB takes a long time and you should be prepared to let the script work for ~10-20 minutes, depending on your server’s hardware. Compression Method Results Now that we have our rows of data inside of our three GridDB containers, we can let GridDB handle the actual compressing of the data. This process happens automatically and in the background; you can read more about that here: https://www.toshiba-sol.co.jp/en/pro/griddb/docs-en/v5_6/GridDB_FeaturesReference.html#database-compressionrelease-function. To check how much space your 100 million rows of data are taking up, you can run the following command against each Docker container of GridDB: $ docker exec griddb-server1 du -sh /var/lib/gridstore 16G /var/lib/gridstore/ Which checks the storage space used up by GridDB in total, including any swap files and logs. If you just want the data: $ docker exec griddb-server1 du -sh /var/lib/gridstore/data 12G /var/lib/gridstore/data/ This, of course, must be repeated for all three containers. You can also verify the compression method in your GridDB container like so: $ docker exec griddb-server3 cat /var/lib/gridstore/conf/gs_node.json | grep “storeCompressionMode” “storeCompressionMode”: “COMPRESSION_ZSTD”, Beyond testing the storage space used, we tested how long it took to load the data and how long a query takes. You can see the results here in the following table. For every row/cell, a lower value is better and idincated superior user experience and usability. NO_COMPRESSION COMPRESSION_ZLIB COMPRESSION_ZSTD (added v5.6) Search (ms) 32,644 20,666 11,475 Agreggation (ms) 30,261 13,302 8,402 Storage (gridstore) 11,968,312 (17GB) 7,162,824 (6.9GB) 6,519,520 (6.3GB) Storage (/data) 17,568,708 (12GB) 1,141,152 (1.1GB) 1,140,384 (1.1GB) Insert (m:ss.mmm) 14:42.452 15:02.748 15:05.404 To test the query speed, we did both select * and aggregation queries like: select AVG(data) from and then took the average of 3 results and placed them into the table. The results are clear: compression helps a lot more than it hurts. It helps save on storage space but also helps query speeds. Version 5.6’s compression method seems to both save storage space and also help query speed by a meaningful amount. All of this is done of course on consumer level hardware. Automatic Aggregation with CLI This functionality utilizes cron on your linux machine to regularly run the script you create. But essentially, what this addition allows is for you to run an aggregation on one of your containers, and then push all of those values onto another table, allowing for you to periodically run new queries, perhaps in the background when your resources aren’t in use. This way you can have updated/fresh values on hand without needing to conduct your aggregations and wait for possibly long calculation times. The way it works is you can now Insert values from one table into another like so: gs[public]> INSERT OR REPLACE INTO device_output (ts, co) SELECT ts,avg(co) FROM device WHERE ts BETWEEN TIMESTAMP(‘2020-07-12T00:29:38.905Z’) AND TIMESTAMP(‘2020-07-19T23:58:25.634Z’) GROUP BY RANGE(ts) EVERY (20,SECOND); The 34,468 records had been inserted. And so, knowing this, we can do some clever things, like writing a GridDB CLI script file (.gsh), and allowing for that script to get the latest values from a table, run aggregation, and then push them out into your etl_output file. Once you write that script file, you can set up a cron job to regularly schedule the script to run in the background. This process will allow your agg output file to be regularly updated with new, up-to-date values completely automatically! Here is an example script file directly from the docs page: # gs_sh script file (sample.gsh) # If no table exists, create a partitioning table with intervals of 30 days to output data. CREATE TABLE IF NOT EXISTS etl_output (ts TIMESTAMP PRIMARY KEY, value DOUBLE) PARTITION BY RANGE (ts) EVERY (30, DAY); # Retrieve the last run time registered. If it does not exist, retrieve the time one hour before the present. SELECT case when MAX(ts) ISNULL THEN TIMESTAMP_ADD(HOUR,NOW(),-1) else MAX(ts) end AS lasttime FROM etl_output; # Store the retrieved time in a variable. getval LastTime # Set the aggregation range between the time retrieved and the present time and obtain the average value for every 20 seconds. Register or update the results into the output container. INSERT OR REPLACE INTO etl_output (ts, value) SELECT ts,avg(value) FROM etl_input WHERE ts BETWEEN TIMESTAMP(‘$LastTime’) AND NOW() GROUP BY RANGE(ts) EVERY (20, SECOND); In this example, we’re placing aggregated results from etl_input into etl_output. Pretty
GridDB running via Docker containers isn’t a new topic. We have covered it before: https://griddb.net/en/blog/run-a-griddb-server-in-docker-desktop/ & https://griddb.net/en/blog/improve-your-devops-with-griddb-server-and-client-docker-containers/. In this blog, we want to again touch on using GridDB on Docker, but will focus instead on using GridDB on ARM architecture, namely a Mac with Apple silicon (M1, M2, etc). So, in this blog, we will provide a docker image which works with ARM devices, as well as walk through how to spin up application containers to work in conjunction with your docker container service. Running GridDB & GridDB Applications with Docker First, you can read the source code that accompanies this article here: https://github.com/griddbnet/griddb-docker-arm. It contains the docker image itself which you can build to run on your ARM machine. The image itself is also available for pulling from the GridDB.net Dockerhub page. The full image/tag name is: griddbnet/griddb:arm-5.5.0. The nodejs application repo is also available: griddbnet/nodejs-arm:latest Running GridDB Server To pull and run this image: $ docker network create griddb-net $ docker pull griddbnet/griddb:arm-5.5.0 $ docker run –name griddb-server \ –network griddb-net \ -e GRIDDB_CLUSTER_NAME=myCluster \ -e GRIDDB_PASSWORD=admin \ -e NOTIFICATION_MEMBER=1 \ -d -t griddbnet/griddb:arm-5.5.0 These commands will create a network for your GridDB server and any containers you intend to run with it. It will also download the built image and then run the image on your machine. Once you confirm it’s running, you can try running application code, using your GridDB container as the data store. Running Application Containers First, let’s grab the source code and build our nodejs container to run some arbitrary code using GridDB as our connection. $ git clone https://github.com/griddbnet/Blogs.git –branch docker-arm Next, here are the commands to run some node.js GridDB code against your containerized server. First, let’s run the sample code that accompanies the official node.js GridDB repo $ cd Blogs/nodejs/node-api $ docker build -t griddb_node_app . $ docker run –name griddb-node –network griddb-net -e GRIDDB_CLUSTER_NAME=myCluster -e GRIDDB_USERNAME=admin -e GRIDDB_PASSWORD=admin -e IP_NOTIFICATION_MEMBER=griddb-server griddb_node_app First, we need to grab the source code which contains some modified files when compared to the official source code (changes to allow the C_Client to run on macos/ARM, which is required for non java programming language connectors). Then we build the image and run it, setting some options such as cluster name, user/pass combo, and finally the IP_NOTIFICATION_MEMBER which explictly tells the container the ip address of the GridDB server container. Of course here, when running this, you are simply running the sample code provided, not your own. But it also lays out the framework for running your own GridDB nodejs code. The flow is as follows: you write your code, build the docker image, and then run it with explict case of choosing the docker network and pointing to the correct hostname/ip address. To go along with the nodejs application interface, JDBC and Java have also been tested and confirmed to work with an ARM based Mac using an M-series chip. Examples of Creating Application Container To build and run your own application in docker, the process is simple: you write the application in your language of choice, write the Dockerfile for that application, and then finally build & run the container, ensuring the use the same network as used when running the GridDB container. Node.js For example, let’s say you wrote a quick node.js script to generate some ‘fake’ data. To keep the application connection agnostic, you can keep the connection details as command line arguments, meaning when you run your docker container, you can simply enter in the docker container you wish to connect to similar to how it was done above. If you enter in the environment details when running the docker container. These details will then be picked up by our entry point script. Here is the Dockerfile for installing the GridDB Node.js connector, along with the c_client connector on an ARM machine. Most of the file is installing everything necessary, including installing the included c_client rpm file. In this instance, we are simply copying over the one file we want to run (gen-data.js) along with the entrypoint script. FROM rockylinux:9.3 ENV GRIDDB_NODE_API_VERSION=0.8.5 ENV NODE_PATH=/root/node-api-${GRIDDB_NODE_API_VERSION} # Install griddb server RUN set -eux \ && dnf update -y \ # Install nodejs version 16.x and c client for griddb nodejs_client && dnf install -y curl make python3 tar –allowerasing \ && dnf groupinstall -y ‘Development Tools’ RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash RUN source ~/.nvm/nvm.sh && nvm install 20 && nvm use 20 COPY ./lib/griddb-c-client-5.5.0-linux.aarch64.rpm / RUN rpm -Uvh /griddb-c-client-5.5.0-linux.aarch64.rpm SHELL [“/bin/bash”, “–login”, “-c”] # Copy entrypoint script and sample for fixlist RUN mkdir /app COPY run-griddb.sh gen-data.js /app/ WORKDIR /root # Install nodejs client RUN curl -L https://github.com/griddb/node-api/archive/refs/tags/${GRIDDB_NODE_API_VERSION}.tar.gz -o ${GRIDDB_NODE_API_VERSION}.tar.gz -sS \ && tar -xzvf ${GRIDDB_NODE_API_VERSION}.tar.gz \ && cd node-api-${GRIDDB_NODE_API_VERSION} WORKDIR /root/node-api-${GRIDDB_NODE_API_VERSION} RUN npm install RUN rm ../${GRIDDB_NODE_API_VERSION}.tar.gz WORKDIR /app # Set permission executable for script RUN chmod a+x run-griddb.sh # Run sample CMD [“/bin/bash”, “run-griddb.sh”] And here is the simple run-griddb.sh script. All it does is basically run the node command with the proper arg details to connect to our GridDB docker container. #!/bin/bash if [ -z “$GRIDDB_CLUSTER_NAME” ]; then GRIDDB_CLUSTER_NAME=’dockerGridDB’ fi if [ -z “$NOTIFICATION_ADDRESS” ]; then NOTIFICATION_ADDRESS=239.0.0.1 fi if [ -z “$NOTIFICATION_PORT” ]; then NOTIFICATION_PORT=31999 fi if [ -z “$GRIDDB_USERNAME” ]; then GRIDDB_USERNAME=’admin’ fi if [ -z “$GRIDDB_PASSWORD” ]; then GRIDDB_PASSWORD=’admin’ fi if [ -z “$IP_NOTIFICATION_MEMBER” ]; then echo “Run GridDB node_api client with GridDB server mode MULTICAST : $NOTIFICATION_ADDRESS $NOTIFICATION_PORT $GRIDDB_CLUSTER_NAME $GRIDDB_USERNAME $GRIDDB_PASSWORD” source ~/.nvm/nvm.sh && nvm use 20 node sample1.js $NOTIFICATION_ADDRESS $NOTIFICATION_PORT $GRIDDB_CLUSTER_NAME $GRIDDB_USERNAME $GRIDDB_PASSWORD else echo “Run GridDB node_api client with GridDB server mode FixedList : $IP_NOTIFICATION_MEMBER:10001 $GRIDDB_CLUSTER_NAME $GRIDDB_USERNAME $GRIDDB_PASSWORD” source ~/.nvm/nvm.sh && nvm use 20. node gen-data.js $IP_NOTIFICATION_MEMBER:10001 $GRIDDB_CLUSTER_NAME $GRIDDB_USERNAME $GRIDDB_PASSWORD fi $ docker build -t nodejs-gen-griddb . We are building our current Dockerfile with the tag of nodejs-gen-griddb. Then we run it, specifying the connection details: $ docker run –network griddb-net -e GRIDDB_CLUSTER_NAME=myCluster -e GRIDDB_USERNAME=admin -e GRIDDB_PASSWORD=admin -e IP_NOTIFICATION_MEMBER=griddb-server nodejs-gen-griddb JDBC Here is another example, connecting to our GridDB server using Java and JDBC so that we can run SQL commands. First, we create our java program. In this case, we simply want to make a connection and then create a new table. String notificationMember = args[0]; String clusterName = args[1]; String databaseName = args[2]; // String notificationMember = “griddb-server:20001”; // String clusterName = “myCluster”; // String databaseName = “public”; String username = “admin”; String password = “admin”; String encodeClusterName = URLEncoder.encode(clusterName, “UTF-8”); String encodeDatabaseName = URLEncoder.encode(databaseName, “UTF-8”); String jdbcUrl = “jdbc:gs://” + notificationMember + “/” + encodeClusterName + “/” + encodeDatabaseName; System.out.println(jdbcUrl); Properties prop = new Properties(); prop.setProperty(“user”, username); prop.setProperty(“password”, password); con = DriverManager.getConnection(jdbcUrl, prop); System.out.println(“Connected to cluster via SQL Interface”); String SQL = “CREATE TABLE IF NOT EXISTS devices (ts TIMESTAMP PRIMARY KEY, co DOUBLE, humidity DOUBLE,light BOOL,lpg DOUBLE,motion BOOL,smoke DOUBLE,temp DOUBLE) USING TIMESERIES WITH (expiration_type=’PARTITION’,expiration_time=90,expiration_time_unit=’DAY’) PARTITION BY RANGE (ts) EVERY (60, DAY)SUBPARTITION BY HASH (ts) SUBPARTITIONS 64;”; Statement stmt = con.createStatement(); stmt.executeUpdate(SQL); System.out.println(“Successfully created container called: devices”); And now we create the dockerfile to build this java program to be run against the GridDB server. FROM alpine:3.14 WORKDIR /app RUN apk add –no-cache wget RUN apk add openjdk11 RUN wget https://repo1.maven.org/maven2/com/github/griddb/gridstore-jdbc/5.6.0/gridstore-jdbc-5.6.0.jar ENV CLASSPATH /app/gridstore-jdbc-5.6.0.jar COPY ./src ./src WORKDIR /app/src/main/java/ RUN javac net/griddb/jdbc/Jdbc.java CMD [“java”, “net/griddb/jdbc/Jdbc.java”, “griddb-server:20001”, “myCluster”, “public”] For this build process, we install java and wget, download the latest griddb jdbc driver, add it to our class path environment, and then simply compile and run our java code. If all goes well, you should be able to run the docker image and set the network to be equal to where your GridDB server is connected and have it work that way. In this case, we left the command line arguments within the Dockerfile itself, meaning you can simply change how the code is executed to keep it flexible. Conclusion And now you should be able to run both nodejs and JDBC containers on your ARM devices. If you get other programming languages running ony our machines, please let us know in the GridDB forum: