Blog

GridDB Cloud on Azure Marketplace: How To Migrate (3 Ways!)

If you are thinking about switching to the GridDB Cloud Azure Marketplace instance, first, you can read about how to do that here: GridDB Cloud on Microsoft Azure Marketplace. Second, you may be worried about how to transfer your existing data from your GridDB Free Plan, from your local GridDB CE instance, or even from PostgreSQL. Here are the distinct sections: Migrating from GridDB Free Plan, Migrating from PostgreSQL, and Migrating from GridDB CE.

In this blog, we will walk through the migration process of moving your data from a GridDB Free Plan, a third-party database (PostgreSQL in this case), and a local GridDB CE instance. The process is different for each one, so let's go through them one by one.

Migrating from GridDB Free Plan

First of all, if you are unsure what the GridDB Free Plan is, you can look here: GridDB Cloud Quick Start Guide. This is by far the easiest method of conducting a full-scale migration. The high-level overview is that the TDSL (Toshiba Digital Solutions) support team will handle everything for you.

TDSL Support

When you sign up for the GridDB Pay As You Go plan, as part of the onboarding process, you will receive an email with the template you will need to use when contacting support for various functions, including data migration! So, grab your pertinent information (contract ID, GridDB ID, etc.) and the template, and let's send an email. Compose an email to tdsl-ms-support AT toshiba-sol.co.jp with the following template:

Contract ID: [your id]
GridDB ID: [your id]
Name: Israel Imru
E-mail: imru@fixstars.com
Inquiry Details: I would like to migrate from my GridDB Free Plan instance to my GridDB Pay As You Go plan
Occurrence Date: —
Collected Information: —

The team will usually respond within one business day to confirm your operation and with further instructions. For me, they sent the following:

Please perform the following operations in the management GUI of the source system. After completing the operations, inform us of the date and time when the operations were performed.
1. Log in to the management GUI.
2. From the menu on the left side of the screen, click [Query].
3. Enter the following query in the [QUERY EDITOR]: SELECT 2012
4. Click the [Execute] button
Best regards, Toshiba Managed Services Support Desk

Once I ran the query they asked for, I clicked on query history, copied the timestamp, and sent that over to them. That was all they needed. Armed with this information, they told me to wait 1-2 business days while they seamlessly migrated my instance, and they gave me an estimated time slot for when the migration would be completed. Once it was done, all of my data, including the IP whitelist and my portal users, was copied over to my Pay As You Go plan. Cool!

Migrating from PostgreSQL

There is no official way of conducting this sort of migration, so for now, we can simply export our tables into CSV files and then import those files individually into our Cloud instance. Luckily, with the GridDB Cloud CLI tool this process is much easier than ever before. So let's first export our data and go from there.

Exporting PostgreSQL Data

First, the dataset I'm working with here is simply dummy data I ingested using a Python script.
Here's the script:

import psycopg2
import psycopg2.extras
from faker import Faker
import random
import time

# --- YOUR DATABASE CONNECTION DETAILS ---
# Replace with your actual database credentials
DB_NAME = "template1"
DB_USER = "postgres"
DB_PASSWORD = "yourpassword"
DB_HOST = "localhost"  # Or your DB host
DB_PORT = "5432"       # Default PostgreSQL port

# --- DATA GENERATION SETTINGS ---
NUM_RECORDS = 50000

# Initialize Faker
fake = Faker()

# Generate a list of fake records
print(f"Generating {NUM_RECORDS} fake records...")
records_to_insert = []
for _ in range(NUM_RECORDS):
    name = fake.catch_phrase()  # Using a more specific Faker provider
    quantity = random.randint(1, 1000)
    price = round(random.uniform(0.50, 500.00), 2)
    records_to_insert.append((name, quantity, price))
print("Finished generating records.")

# SQL statements
create_table_query = """
CREATE TABLE IF NOT EXISTS sample_data (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    quantity INTEGER,
    price REAL
);
"""

# Using execute_values is much more efficient for large inserts
insert_query = "INSERT INTO sample_data (name, quantity, price) VALUES %s;"

conn = None
try:
    # Establish a connection to the database
    conn = psycopg2.connect(
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASSWORD,
        host=DB_HOST,
        port=DB_PORT
    )
    # Create a cursor
    cur = conn.cursor()

    # Create the table if it doesn't exist
    print("Ensuring 'sample_data' table exists...")
    cur.execute(create_table_query)

    # Optional: Clean the table before inserting new data
    print("Clearing existing data from the table...")
    cur.execute("TRUNCATE TABLE sample_data RESTART IDENTITY;")

    # Start the timer
    start_time = time.time()

    print(f"Executing bulk insert of {len(records_to_insert)} records...")
    psycopg2.extras.execute_values(
        cur,
        insert_query,
        records_to_insert,
        template=None,
        page_size=1000  # The number of rows to send in each batch
    )
    print("Bulk insert complete.")

    # Commit the changes to the database
    conn.commit()

    # Stop the timer
    end_time = time.time()
    duration = end_time - start_time
    print(f"Successfully inserted {cur.rowcount} rows in {duration:.2f} seconds.")

    # Close the cursor
    cur.close()

except (Exception, psycopg2.DatabaseError) as error:
    print(f"Error while connecting to or working with PostgreSQL: {error}")
    if conn:
        conn.rollback()  # Roll back the transaction on error

finally:
    # Close the connection if it was established
    if conn is not None:
        conn.close()
        print("Database connection closed.")

Once you run this script, you will have 50k rows in your PostgreSQL instance. Now let's export this to CSV:

$ psql --host 127.0.0.1 --username postgres --password --dbname template1
psql (14.18 (Ubuntu 14.18-0ubuntu0.22.04.1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

template1=# select COUNT(*) from sample_data;
 count
-------
 50000
(1 row)

template1=# COPY sample_data TO '/tmp/sample.csv' WITH (FORMAT CSV, HEADER);
COPY 50000
template1=# \q

And now that we have our CSV data, let's install the CLI Tool and ingest it.

Ingesting CSV Data into GridDB Cloud

You can download the latest CLI Tool from the GitHub releases page: https://github.com/Imisrael/griddb-cloud-cli/releases.
For me, I installed the .deb file:

$ wget https://github.com/Imisrael/griddb-cloud-cli/releases/download/v0.1.4/griddb-cloud-cli_0.1.4_linux_amd64.deb
$ sudo dpkg -i griddb-cloud-cli_0.1.4_linux_amd64.deb
$ vim ~/.griddb.yaml

And enter your credentials:

cloud_url: "https://cloud97.griddb.com:443/griddb/v2/gs_clustermfclo7/dbs/ZQ8"
cloud_username: "kG-israel"
cloud_pass: "password"

And ingest:

$ griddb-cloud-cli ingest /tmp/sample.csv
✔ Does this container already exist? … NO
Use CSV Header names as your GridDB Container Col names? id,name,quantity,price
✔ Y/n … YES
✔ Container Name: … migrated_data
✔ Choose: … COLLECTION
✔ Row Key? … true
✔ (id) Column Type … INTEGER
✔ Column Index Type1 … TREE
✔ (name) Column Type … STRING
✔ (quantity) Column Type … INTEGER
✔ (price) Column Type … FLOAT
✔ Make Container?
{
  "container_name": "migrated_data",
  "container_type": "COLLECTION",
  "rowkey": true,
  "columns": [
    { "name": "id", "type": "INTEGER", "index": [ "TREE" ] },
    { "name": "name", "type": "STRING", "index": null },
    { "name": "quantity", "type": "INTEGER", "index": null },
    { "name": "price", "type": "FLOAT", "index": null }
  ]
} … YES
{"container_name":"migrated_data","container_type":"COLLECTION","rowkey":true,"columns":[{"name":"id","type":"INTEGER","index":["TREE"]},{"name":"name","type":"STRING","index":null},{"name":"quantity","type":"INTEGER","index":null},{"name":"price","type":"FLOAT","index":null}]}
201 Created
Container Created. Starting Ingest
0 id id
1 name name
2 quantity quantity
3 price price
✔ Is the above mapping correct? … YES
Ingesting. Please wait...
Inserting 1000 rows
200 OK
Inserting 1000 rows
200 OK

And after some time, your data should be ready in your GridDB Cloud instance!

$ griddb-cloud-cli sql query -s "SELECT COUNT(*) from migrated_data"
[{"stmt": "SELECT COUNT(*) from migrated_data" }]
[[{"Name":"","Type":"LONG","Value":50000}]]

And another confirmation:

$ griddb-cloud-cli read migrated_data -p -l 1
[ { "name": "migrated_data", "stmt": "select * limit 1", "columns": null, "hasPartialExecution": true }]
[
  [
    { "Name": "id", "Type": "INTEGER", "Value": 1 },
    { "Name": "name", "Type": "STRING", "Value": "Enterprise-wide multi-state installation" },
    { "Name": "quantity", "Type": "INTEGER", "Value": 479 },
    { "Name": "price", "Type": "FLOAT", "Value": 194.8 }
  ]
]

Migrating from GridDB CE

If you want to move all of your local data from GridDB Community Edition over to your GridDB Pay As You Go cloud database, you can now use the GridDB Cloud CLI Tool for the job! You will also, of course, need to export the CE containers that you wish to migrate.

Prereqs

As explained above, you will need: the GridDB Cloud CLI Tool from GitHub and the GridDB CE Import/Export Tool installed on your machine.

Step by Step Process of Migrating

Let's run through an example of exporting our entire GridDB CE database and then running the migration from the CLI tool. This is going to assume you already have the Import/Export tool set up; you can read more about that here: https://griddb.net/en/blog/using-the-griddb-import-export-tools-to-migrate-from-postgresql-to-griddb/

First, you'd run the export tool like so:

$ cd expimp/bin
$ ./gs_export -u admin/admin -d all --all

This command will export all of your containers into a directory called 'all'.

$ ./gs_export -u admin/admin -d all --all
Export Start.
Directory : /home/israel/development/expimp/bin/all Number of target containers : 7 public.p01 : 2 public.p02 : 0 public.p03 : 0 public.device3 : 1015 public.device2 : 1092 public.device1 : 1944 public.col02 : 10000 Number of target containers:7 ( Success:7 Failure:0 ) Export Completed. Next, ensure your GridDB Cloud CLI Tool is set up, and once it is, you can run the migrate command. Let’s look at how it works: $ griddb-cloud-cli migrate -h Use the export tool on your GridDB CE Instance to create the dir output of csv files and a properties file and then migrate those tables to GridDB Cloud Usage: griddb-cloud-cli migrate [flags] Examples: griddb-cloud-cli migrate <directory> Flags: -f, –force Force create (no prompt) -h, –help help for migrate So in our case, we want to use migrate with the -f flag to not show us prompts because we have 7 containers to create and migrate! $ griddb-cloud-cli migrate -f all And here is an example of some of the output: {“container_name”:”device2″,”container_type”:”TIME_SERIES”,”rowkey”:true,”columns”:[{“name”:”ts”,”type”:”TIMESTAMP”,”index”:null},{“name”:”co”,”type”:”DOUBLE”,”index”:null},{“name”:”humidity”,”type”:”DOUBLE”,”index”:null},{“name”:”light”,”type”:”BOOL”,”index”:null},{“name”:”lpg”,”type”:”DOUBLE”,”index”:null},{“name”:”motion”,”type”:”BOOL”,”index”:null},{“name”:”smoke”,”type”:”DOUBLE”,”index”:null},{“name”:”temp”,”type”:”DOUBLE”,”index”:null}]} 201 Created inserting into (device2). csv: all/public.device2_2020-07-12_2020-07-13.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-13_2020-07-14.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-14_2020-07-15.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-15_2020-07-16.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-16_2020-07-17.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-17_2020-07-18.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-18_2020-07-19.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-19_2020-07-20.csv 200 OK inserting into (device2). csv: all/public.device2_2020-07-20_2020-07-21.csv 200 OK {“container_name”:”device3″,”container_type”:”TIME_SERIES”,”rowkey”:true,”columns”:[{“name”:”ts”,”type”:”TIMESTAMP”,”index”:null},{“name”:”co”,”type”:”DOUBLE”,”index”:null},{“name”:”humidity”,”type”:”DOUBLE”,”index”:null},{“name”:”light”,”type”:”BOOL”,”index”:null},{“name”:”lpg”,”type”:”DOUBLE”,”index”:null},{“name”:”motion”,”type”:”BOOL”,”index”:null},{“name”:”smoke”,”type”:”DOUBLE”,”index”:null},{“name”:”temp”,”type”:”DOUBLE”,”index”:null}]} 201 Created inserting into (device3). csv: all/public.device3_2020-07-12_2020-07-13.csv 200 OK inserting into (device3). csv: all/public.device3_2020-07-13_2020-07-14.csv 200 OK inserting into (device3). csv: all/public.device3_2020-07-14_2020-07-15.csv 200 OK inserting into (device3). csv: all/public.device3_2020-07-15_2020-07-16.csv 200 OK inserting into (device3). csv: all/public.device3_2020-07-16_2020-07-17.csv 200 OK inserting into (device3). csv: all/public.device3_2020-07-17_2020-07-18.csv 200 OK inserting into (device3). csv: all/public.device3_2020-07-18_2020-07-19.csv 200 OK inserting into (device3). 
csv: all/public.device3_2020-07-19_2020-07-20.csv 200 OK {“container_name”:”p01″,”container_type”:”COLLECTION”,”rowkey”:true,”columns”:[{“name”:”name”,”type”:”STRING”,”index”:null},{“name”:”names”,”type”:”STRING_ARRAY”,”index”:null},{“name”:”barr”,”type”:”BOOL_ARRAY”,”index”:null},{“name”:”tsarr”,”type”:”TIMESTAMP_ARRAY”,”index”:null}]} 201 Created {“container_name”:”p02″,”container_type”:”COLLECTION”,”rowkey”:true,”columns”:[{“name”:”id”,”type”:”STRING”,”index”:null},{“name”:”date”,”type”:”STRING”,”index”:null}]} 201 Created {“container_name”:”p03″,”container_type”:”COLLECTION”,”rowkey”:true,”columns”:[{“name”:”id”,”type”:”LONG”,”index”:null},{“name”:”c1″,”type”:”STRING”,”index”:null},{“name”:”c2″,”type”:”BOOL”,”index”:null}]} 201 Created inserting into (p01). csv: all/public.p01.csv 200 OK Lastly we can verify that our containers are in there: $ griddb-cloud-cli show device3 { “container_name”: “device3”, “container_type”: “TIME_SERIES”, “rowkey”: true, “columns”: [ { “name”: “ts”, “type”: “TIMESTAMP”, “timePrecision”: “MILLISECOND”, “index”: [] }, { “name”: “co”, “type”: “DOUBLE”, “index”: [] }, { “name”: “humidity”, “type”: “DOUBLE”, “index”: [] }, { “name”: “light”, “type”: “BOOL”, “index”: [] }, { “name”: “lpg”, “type”: “DOUBLE”, “index”: [] }, { “name”: “motion”, “type”: “BOOL”, “index”: [] }, { “name”: “smoke”, “type”: “DOUBLE”, “index”: [] }, { “name”: “temp”, “type”: “DOUBLE”, “index”: [] } ] } $ griddb-cloud-cli sql query -s “SELECT COUNT(*) FROM device2” [{“stmt”: “SELECT COUNT(*) FROM device2” }] [[{“Name”:””,”Type”:”LONG”,”Value”:1092}]] Looks good to me! Conclusion And with that, we have learned three different methods of migrating from a variation of GridDB to the new Azure Marketplace GridDB

Building Resume Creator with Multi-agent AI

In this blog, we will build an AI-powered resume-creation system that automates the tedious and time-consuming tasks involved in manual resume creation. By leveraging multi-agent AI systems, we will streamline the process of information gathering, and content writing to produce resumes with minimal human intervention. Limitations of Manual Resume Processing Inefficient Information Gathering The manual process of collecting and organizing information is time-consuming and requires significant effort. Inconsistent Formatting Manual resume creation often leads to formatting inconsistencies. The process requires manual adjustments to maintain professional formatting standards, which can be error-prone and time-consuming. Content Writing and Rewriting Challenges The manual process requires significant effort in crafting and editing content. Writing compelling and well-structured content by hand is labor-intensive, requiring multiple revisions and edits. Automating Resume Creation using AI Creating a resume manually involves several steps: Information Gathering: Collecting and organizing your personal details, job history, skills, and education. Formatting: Ensuring the resume looks attractive and professional, often without clear guidelines. Content Writing: Crafting and refining content to make it concise, compelling, and relevant. Proofreading and Editing: Checking for errors and polishing the resume to a professional standard. With the AI system, we can automate these steps using multi-agent systems. Each agent performs a specific task, such as extracting information, generating content, or formatting the resume. By coordinating these agents, we can create a fully automated resume creation system. Running the Resume Creator Before we dive into the technical details, you can run the resume creator system by following these steps: 1) Clone the repository: git clone https://github.com/junwatu/resume-creator-multi-agent-ai.git 2) Install the dependencies: cd resume-creator-multi-agent-ai cd apps npm install 3) Create a .env file in the apps directory and add the following environment variables: OPENAI_API_KEY=api-key-here GRIDDB_WEBAPI_URL= GRIDDB_USERNAME= GRIDDB_PASSWORD= VITE_APP_BASE_URL=http://localhost VITE_PORT=3000 Please refer to the Prerequisites section for more details on obtaining the OpenAI API key and GridDB credentials. 4) Start the server: npm run start 5) Open your browser and go to http://localhost:3000 to access the resume creator system. How it Works? In this blog, we automate the information gathering and content writing for the resume, tasks that are usually manual and time-consuming. This system diagram illustrates the resume creation process discussed in this blog, showcasing the collaboration between two main AI agents: Here’s a brief description: The system starts with User Input and requires an environment setup that includes Team Initialization and OpenAI API Key. Two AI agents work together: Profile Analyst (Agent AI 1): Handles data extraction from user input, breaking down information into categories like Name, Experience, Skills, Education, and Job History. Resume Writer (Agent AI 2): Takes structured information and handles the writing aspect. The workflow follows these key steps: Data Extraction: Organizes raw user input into structured categories. This is the information-gathering step. Structured Information: Stores the organized data into the GridDB Cloud database. Resume Crafting: Combines the structured data with writing capabilities. 
This is the content writing step. Create Resume: Generates the content. Final Resume: Produces the completed document. Prerequisites KaibanJS KaibanJS is the JavaScript framework for building multi-agent AI systems. We will use it to build our resume creation system. OpenAI We will use the o1-mini model from OpenAI. It is a smaller version of the o1 model, suitable for tasks that require complex reasoning and understanding. Create a project, an API key, and enable the o1-mini model in the OpenAI platform. Make sure to save the API key in the .env file. OPENAI_API_KEY=api-key-here GridDB Cloud The GridDB Cloud offers a free plan tier and is officially available worldwide. This database will store the structured information extracted by the Profile Analyst agent and also the final resume generated by the Resume Writer agent. You need these GridDB environment variables in the .env file: GRIDDB_WEBAPI_URL= GRIDDB_USERNAME= GRIDDB_PASSWORD= Check the below section on how to get these values. Sign Up for GridDB Cloud Free Plan If you would like to sign up for a GridDB Cloud Free instance, you can do so in the following link: https://form.ict-toshiba.jp/download_form_griddb_cloud_freeplan_e. After successfully signing up, you will receive a free instance along with the necessary details to access the GridDB Cloud Management GUI, including the GridDB Cloud Portal URL, Contract ID, Login, and Password. GridDB WebAPI URL Go to the GridDB Cloud Portal and copy the WebAPI URL from the Clusters section. It should look like this: GridDB Username and Password Go to the GridDB Users section of the GridDB Cloud portal and create or copy the username for GRIDDB_USERNAME. The password is set when the user is created for the first time, use this as the GRIDDB_PASSWORD. For more details, to get started with GridDB Cloud, please follow this quick start guide. IP Whitelist When running this project, please ensure that the IP address where the project is running is whitelisted. Failure to do so will result in a 403 status code or forbidden access. You can use a website like What Is My IP Address to find your public IP address. To whitelist the IP, go to the GridDB Cloud Admin and navigate to the Network Access menu. Node.js We will use Node.js LTS v22.12.0 to build a server that handles the communication between the user interface, AI agents, and OpenAI API and store data in the GridDB Cloud database. React We will use React to build the user interface for the resume creation system. Where the user can input their details and generate a resume with a click of a button. Building the Resume Creation System Node.js Server We will use Node.js to build the server that handles the communication between the user interface, AI agents, and OpenAI API. The server will also store the structured information in the GridDB Cloud database. This table provides an overview of the API routes defined in the server.js code, including HTTP methods, endpoints, descriptions, and any parameters. HTTP Method Endpoint Description Parameters POST /api/resumes Creates a new resume. Calls the generateResume function to generate content, saves to the database, and returns the response. Body: { content: string } GET /api/resumes Fetches all resumes stored in the database. None GET /api/resumes/:id Fetches a specific resume by its ID. Path: id (Resume ID) DELETE /api/resumes/:id Deletes a specific resume by its ID. 
Path: id (Resume ID) The main route code for the resume creation is as follows: app.post(‘/api/resumes’, async (req, res) => { try { const resumeData = req.body || {}; const result = await generateResume(resumeData.content || undefined); console.log(result); const resume = { id: generateRandomID(), rawContent: resumeData.content, formattedContent: result.result, status: result.status, createdAt: new Date().toISOString(), information: JSON.stringify(result.stats), } // Save resume to database const dbResponse = await dbClient.insertData({ data: resume }); if (result.status === ‘success’) { const all = { message: ‘Resume created successfully’, data: result.result, stats: result.stats, dbStatus: dbResponse } res.status(201).json(all); } else { res.status(400).json({ message: ‘Failed to generate resume’, error: result.error }); } } catch (error) { res.status(500).json({ error: ‘Server error while creating resume’, details: error.message }); } }); When the user submits their data, the server calls the generateResume function to generate the resume content. The result is then saved to the GridDB Cloud database, and the resume content is returned as a response. Multi-agent AI We will use KaibanJS to build the multi-agent AI system for the resume creation process. You can find the agent code in the team.kban.js file and this system consists of two main agents: Profile Analyst (Agent AI 1) The Profile Analyst agent is responsible for extracting structured information from the user input. It categorizes the input into fields such as Name, Experience, Skills, Education, and Job History. The effectiveness of these fields depends on the quality and diversity of the submitted data. const profileAnalyst = new Agent({ name: ‘Carla Smith’, role: ‘Profile Analyst’, goal: ‘Extract structured information from conversational user input.’, background: ‘Data Processor’, tools: [] // Tools are omitted for now }); This profile agent will use this task code to extract user data: const processingTask = new Task({ description: `Extract relevant details such as name, experience, skills, and job history from the user’s ‘aboutMe’ input. aboutMe: {aboutMe}`, expectedOutput: ‘Structured data ready to be used for a resume creation.’, agent: profileAnalyst }); The expectedOutput is the structured data that will be used by the Resume Writer agent to generate the resume content. The description and expectedOutput mimic the prompts if were interact with ChatGPT. However, in this case, this is done by the Profile Analyst agent. Resume Writer (Agent AI 2) The Resume Writer agent is responsible for crafting the resume content based on the structured information provided by the Profile Analyst agent. It generates well-structured, compelling content that effectively showcases the user’s qualifications and achievements. const resumeWriter = new Agent({ name: ‘Alex Morra’, role: ‘Resume Writer’, goal: `Craft compelling, well-structured resumes that effectively showcase job seekers qualifications and achievements.`, background: `Extensive experience in recruiting, copywriting, and human resources, enabling effective resume design that stands out to employers.`, tools: [] }); This resume agent will use this task code to generate the resume content: const resumeCreationTask = new Task({ description: `Utilize the structured data to create a detailed and attractive resume. Enrich the resume content by inferring additional details from the provided information. 
Include sections such as a personal summary, detailed work experience, skills, and educational background.`, expectedOutput: `A professionally formatted resume in raw markdown format, ready for submission to potential employers`, agent: resumeWriter }); The result of this task is markdown-formatted resume content that can be easily converted into a PDF or other formats and it’s easy to process by the user interface. Save Data to GridDB Cloud Database The GridDB Cloud database stores the structured information extracted by the Profile Analyst agent and the final resume generated by the Resume Writer agent. This is the schema data used to store the resume information in the GridDB Cloud database: { “id”: “string”, “rawContent”: “string”, “formattedContent”: “string”, “status”: “string”, “createdAt”: “string”, “information”: “string” } Field Type Description id string A unique identifier for each resume. rawContent string The original user input for the resume. formattedContent string The final formatted resume content. status string Indicates the success or failure of the resume generation process. createdAt string The timestamp of when the resume was created. information string The OpenAI token information. GridDB Cloud provides a RESTful API that allows us to interact with the database. We will use this API to store and retrieve the resume information. The griddb-client.js file contains the code to interact with the GridDB Cloud database. It includes functions to insert, retrieve, and delete resume data. To insert new data, you can use the endpoint /containers/${containerName}/rows. This endpoint allows you to add a new row of data to the database: async function insertData({ data, containerName = ‘resumes’ }) { console.log(data); try { const timestamp = data.createdAt instanceof Date ? data.createdAt.toISOString() : data.createdAt; const row = [ parseInt(data.id), // INTEGER data.rawContent, // STRING data.formattedContent, // STRING data.status, // STRING timestamp, // TIMESTAMP (ISO format) data.information // STRING ]; const path = `/containers/${containerName}/rows`; return await makeRequest(path, [row], ‘PUT’); } catch (error) { throw new Error(`Failed to insert data: ${error.message}`); } } GridDB also supports SQL-like queries to interact with the database. Here’s an example of an SQL query to retrieve all resumes from the database: SELECT * FROM resumes; and to retrieve a specific resume by its ID: SELECT * FROM resumes WHERE id = ‘resume-id’; Let’s take an example how to insert data into the GridDB Cloud database: const sql = “insert into resumes (id, rawContent, formattedContent, status, createdAt, information) values(3, ‘raw contenct here’, ‘ formatted content here’, ‘success’, TIMESTAMP(‘2025-01-02’), ‘{tokens: 300}’)”; const response = await fetch(`${process.env.GRIDDB_WEBAPI_URL}’/sql/dml/update’`, { method: ‘POST’, headers: { ‘Content-Type’: ‘application/json’, ‘Authorization’: `Basic ${authToken}`, }, body: JSON.stringify(payload), }); const responseText = await response.text(); The code above inserts the resume data into the GridDB Cloud database using the /sql/dml/update endpoint and the SQL query. All these data operations will be handled by the Node.js server and exposed as API endpoints for the user interface to interact with. User Interface The ResumeCreator component is built using React and allows users to input their details in a text and generate a resume with the click of a button. The user interface is designed to be simple. 
import { useState } from ‘react’; import { Card, CardContent } from ‘@/components/ui/card’; import { Button } from ‘@/components/ui/button’; import { Textarea } from ‘@/components/ui/textarea’; import { Alert, AlertDescription } from ‘@/components/ui/alert’; import { ResumeMarkdownRenderer } from ‘./ResumeMarkdownRenderer.tsx’; const ResumeCreator = () => { const [isSubmitting, setIsSubmitting] = useState(false); const [submitStatus, setSubmitStatus] = useState(null); const [resumeText, setResumeText] = useState(“default resume text”); const [markdownContent, setMarkdownContent] = useState(null); const BASE_URL = import.meta.env.VITE_APP_BASE_URL + ‘:’ + import.meta.env.VITE_PORT; const handleSubmit = async () => { setIsSubmitting(true); setSubmitStatus(null); try { const response = await fetch(`${BASE_URL}/api/resumes`, { method: ‘POST’, headers: { ‘Content-Type’: ‘application/json’, }, body: JSON.stringify({ content: resumeText }), }); if (!response.ok) { throw new Error(‘Failed to create resume’); } const aiResume = await response.json(); setMarkdownContent(aiResume.data); setSubmitStatus(‘success’); } catch (error) { console.error(‘Error creating resume:’, error); setSubmitStatus(‘error’); } finally { setIsSubmitting(false); setTimeout(() => setSubmitStatus(null), 5000); } }; return ( <div className=”max-w-4xl mx-auto p-8 space-y-6″> <h1 className=”text-3xl font-bold text-center”> Resume Creator <div className=”w-40 h-1 bg-green-500 mx-auto mt-1″></div> </h1> {submitStatus && ( <Alert className={submitStatus === ‘success’ ? ‘bg-green-50’ : ‘bg-red-50’}> <AlertDescription> {submitStatus === ‘success’ ? ‘Resume created successfully!’ : ‘Failed to create resume. Please try again.’} </AlertDescription> </Alert> )} {markdownContent ? ( <ResumeMarkdownRenderer markdown={markdownContent} /> ) : ( <div className=”space-y-6″> <h2 className=”text-2xl font-semibold”>About Me</h2> <Card className=”border-2″> <CardContent className=”p-6″> <p className=”text-sm text-gray-600 mb-4″> Enter your professional experience, skills, and education. Our AI will help format this into a polished resume. </p> <Textarea value={resumeText} onChange={(e: React.ChangeEvent<HTMLTextAreaElement>) => setResumeText(e.target.value)} className=”min-h-[400px] font-mono” placeholder=”Enter your resume content here…” /> </CardContent> </Card> <div className=”flex justify-center”> <Button onClick={handleSubmit} disabled={isSubmitting} className=”bg-green-500 hover:bg-green-600 text-white px-8 py-2 rounded-md” > {isSubmitting ? ‘Creating…’ : ‘Create Resume’} </Button> </div> </div> )} </div> ); }; export default ResumeCreator; The core functionality of the ResumeCreator component is to create a user resume using AI and render the result. It uses useState to manage input (resumeText), generated markdown (markdownContent), submission status (submitStatus), and submission progress (isSubmitting). The handleSubmit function sends a POST request to the /api/resumes route at the backend (${BASE_URL}/api/resumes), passing the user’s input, and updates the state based on the API’s response. Read here for the Node.js API routes. The UI includes a text area for input, a submit button to trigger the API call, and a markdown renderer ResumeMarkdownRenderer component to display the AI-generated resume. Alerts notify the user of the submission status while loading states to ensure a smooth experience. 
Further Improvements

Enhanced Data Extraction: Improve the Profile Analyst agent's ability to extract and categorize information more accurately and efficiently.
Advanced Content Generation: Enhance the Resume Writer agent's content generation capabilities to produce more compelling and personalized resumes.
User Interface Enhancements: Add more features to the user interface, such as resume templates, customization options, and real-time editing.

Conclusion

In this blog, we have built an AI-powered resume-creation system that automates the tedious and time-consuming tasks involved in manual resume creation. By leveraging multi-agent AI systems, we have streamlined the process of information gathering and content writing to produce resumes with minimal human intervention.

Sports Analytics with GridDB

Introduction

In modern sports, data-driven decision-making has become essential for gaining a competitive edge. Every step, shot, or lap generates a stream of events that require high-speed ingestion, efficient storage, and rapid querying: challenges that traditional relational databases struggle to handle at scale. To address this, organizations are increasingly looking into alternatives. GridDB, a highly scalable and efficient time-series database, is designed to manage large volumes of continuously generated data such as this. By leveraging GridDB, teams can analyze critical performance metrics such as player speed, fatigue levels, and tactical positioning over time. These insights enable coaches and analysts to make informed decisions on game tactics, substitutions, and training regimens based on real-time and historical data.

In this article, we explore how GridDB, integrated within a Spring Boot application, can be used for a soccer analytics use case: optimizing player substitutions and refining game strategies with data-driven precision.

Understanding the Use Case

A single soccer match generates hundreds of timestamped events, such as a midfielder's pass at 20:05:32 or a striker's shot at 20:10:15, each enriched with outcomes and metadata. The sequential nature of this data reveals crucial patterns, like player fatigue or shifts in attacking momentum, that static analyses often miss. For engineers, the challenge lies in efficiently managing this high-speed, high-volume data stream. To simulate this type of data, we will use the events/15946.json dataset from StatsBomb, which logs an entire match's events, including passes, shots, and tackles, with millisecond precision. Our Spring Boot application, powered by GridDB, will focus on:

Performance Tracking: Monitoring pass accuracy to detect signs of fatigue.
Strategy Optimization: Analyzing shot frequency to uncover attacking opportunities.

Setting Up GridDB Cluster and Spring Boot Integration

Project Structure

Here's a suggested project structure for this application:

├───my-griddb-app
│   │   pom.xml
│   │
│   ├───src
│   │   ├───main
│   │   │   ├───java
│   │   │   │   └───mycode
│   │   │   │       │   MySpringBootApplication.java
│   │   │   │       │
│   │   │   │       ├───config
│   │   │   │       │       GridDBConfig.txt
│   │   │   │       │
│   │   │   │       ├───controller
│   │   │   │       │       MatchEventsController.java
│   │   │   │       │
│   │   │   │       └───service
│   │   │   │               MatchEventsService.java
│   │   │   │               MetricsCollectionService.java
│   │   │   │               RestTemplateConfig.java
│   │   │   │
│   │   │   └───resources
│   │   │       │   application.properties
│   │   │       │
│   │   │       └───templates
│   │   │               pass-accuracy-graph.html

This structure separates controllers, models, repositories, services, and the application entry point into distinct layers, enhancing modularity and maintainability. It can be further customized based on individual requirements.

Set Up GridDB Cloud

For this exercise, we will be using the GridDB Cloud version. Start by visiting the GridDB Cloud portal and signing up for an account (the GridDB Cloud Free Plan from TOSHIBA DIGITAL SOLUTIONS CORPORATION). Based on requirements, either the free plan or a paid plan can be selected for broader access. After registration, an email will be sent containing essential details, including the Web API URL and login credentials. Once the login details are received, log in to the Management GUI to access the cloud instance.

Create Database Credentials

Before interacting with the database, we must create a database user:

Navigate to Security Settings: In the Management GUI, go to the "GridDB Users" tab.
Create a Database User: Click "Create Database User," enter a username and password, and save the credentials. For example, set the username as soccer_admin and a strong password.
Store Credentials Securely: These will be used in your application to authenticate with GridDB Cloud.

Set Allowed IP Addresses

To restrict access to authorized sources, configure the allowed IP settings:

Navigate to Security Settings: In the Management GUI, go to the "Network Access" tab and locate the "Allowed IP" section.
Add IP Addresses: For development, you can temporarily add your local machine's IP.

Add POM Dependency

Here's an example of how to configure the dependencies in the pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>my-griddb-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>my-griddb-app</name>
  <url>http://maven.apache.org</url>
  <parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>3.2.4</version>
    <relativePath /> <!-- lookup parent from repository -->
  </parent>
  <properties>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-web</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.springframework.boot</groupId>
          <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-test</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-thymeleaf</artifactId>
    </dependency>
    <!-- JSON processing -->
    <dependency>
      <groupId>org.glassfish.jersey.core</groupId>
      <artifactId>jersey-client</artifactId>
      <version>2.35</version>
    </dependency>
    <dependency>
      <groupId>org.json</groupId>
      <artifactId>json</artifactId>
      <version>20210307</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.15.0</version> <!-- or the latest version -->
    </dependency>
    <!-- Lombok -->
    <dependency>
      <groupId>org.projectlombok</groupId>
      <artifactId>lombok</artifactId>
      <optional>true</optional>
    </dependency>
  </dependencies>
</project>

Technical Implementation

Implementing a soccer analytics solution with GridDB and Spring Boot involves three key steps:

Ingesting the StatsBomb events/15946.json dataset into GridDB.
Querying the data to extract time-series metrics.
Visualizing the results to generate actionable insights.

Below, we explore each phase in detail, showcasing GridDB's time-series capabilities and its seamless integration within a Spring Boot architecture.
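One note before diving into the steps: the match_events container referenced throughout Step 1 can be defined either in the GridDB Cloud GUI, as this article does, or programmatically through the Web API's /containers endpoint. The following is a hedged sketch of the programmatic route and is not part of the original project code; the base URL and credentials are placeholders, and the column list mirrors the schema described in Step 1 below.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical one-off helper: creates the match_events time-series container
// via the GridDB Cloud Web API before any rows are ingested.
public class CreateMatchEventsContainer {

    public static void main(String[] args) throws Exception {
        // Placeholder values: substitute your own Web API URL and Basic-auth token.
        String webApiBase = "https://cloudXXXX.griddb.com:443/griddb/v2/gs_clusterXXXX/dbs/XXXX";
        String authHeader = "Basic <base64-encoded user:password>";

        // Schema matching the match_events container described in Step 1.
        String body = "{"
                + "\"container_name\":\"match_events\","
                + "\"container_type\":\"TIME_SERIES\","
                + "\"rowkey\":true,"
                + "\"columns\":["
                + "{\"name\":\"timestamp\",\"type\":\"TIMESTAMP\"},"
                + "{\"name\":\"player_name\",\"type\":\"STRING\"},"
                + "{\"name\":\"event_type\",\"type\":\"STRING\"},"
                + "{\"name\":\"event_outcome\",\"type\":\"STRING\"},"
                + "{\"name\":\"minute\",\"type\":\"INTEGER\"},"
                + "{\"name\":\"second\",\"type\":\"INTEGER\"},"
                + "{\"name\":\"team_name\",\"type\":\"STRING\"}"
                + "]}";

        URL url = new URL(webApiBase + "/containers");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Authorization", authHeader);

        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }

        // 201 Created indicates the container now exists.
        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}

A 201 Created response here corresponds to the same container-creation call the GridDB Cloud CLI reports in the migration walkthrough above.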
Step 1: Data Ingestion The events/15946.json file logs a sequence of match events—passes, shots, tackles—each record containing essential fields such as: Timestamp (e.g., “2021-06-11T20:05:32.456”) Player Name (player.name) Event Type (type.name) Outcome (e.g., pass.outcome.name as “Complete”, shot.outcome.name as “Goal”) To efficiently store and query this data in GridDB, we first need define a time-series container in GridDb cloud as below. Container Setup We define a container name match_events in GridDB Cloud using the time-series type with timestamp as the row key. Next, we will create schema which will includes the following columns: timestamp (TIMESTAMP, NOT NULL, Row Key) player_name (STRING) event_type (STRING) event_outcome (STRING) minute (INTEGER) second (INTEGER) team_name (STRING) Afterwards we implement MetricsCollectionServicewhich fetches data from JSON file and pushing the data in database. Here the implentation of MetricCollectionService.java : package mycode.service; import org.json.JSONArray; import org.json.JSONObject; import org.springframework.stereotype.Service; import org.springframework.beans.factory.annotation.Value; import java.io.OutputStream; import java.net.HttpURLConnection; import java.net.URL; import java.time.LocalDate; import java.time.format.DateTimeFormatter; import java.util.Scanner; @Service public class MetricsCollectionService { private static String gridDBRestUrl; private static String gridDBApiKey; @Value(“${griddb.rest.url}”) public void setgridDBRestUrl(String in) { gridDBRestUrl = in; } @Value(“${griddb.api.key}”) public void setgridDBApiKey(String in) { gridDBApiKey = in; } public void collect() { try { // Fetch JSON Data from GitHub String jsonResponse = fetchJSONFromGitHub( “https://raw.githubusercontent.com/statsbomb/open-data/master/data/events/15946.json”); JSONArray events = new JSONArray(jsonResponse); // Process and Send Data to GridDB Cloud sendBatchToGridDB(events); } catch (Exception e) { e.printStackTrace(); } } private static String fetchJSONFromGitHub(String urlString) throws Exception { URL url = new URL(urlString); HttpURLConnection conn = (HttpURLConnection) url.openConnection(); conn.setRequestMethod(“GET”); conn.setRequestProperty(“Accept”, “application/json”); if (conn.getResponseCode() != 200) { throw new RuntimeException(“Failed to fetch data: HTTP error code : ” + conn.getResponseCode()); } Scanner scanner = new Scanner(url.openStream()); StringBuilder response = new StringBuilder(); while (scanner.hasNext()) { response.append(scanner.nextLine()); } scanner.close(); return response.toString(); } private static void sendBatchToGridDB(JSONArray events) { JSONArray batchData = new JSONArray(); boolean startProcessing = false; for (int i = 0; i < events.length(); i++) { JSONObject event = events.getJSONObject(i); JSONArray row = new JSONArray(); if (event.has(“index”) && event.getInt(“index”) == 10) { startProcessing = true; } if (!startProcessing) { continue; // Skip records until we reach index == 7 } // Extract and format fields String formattedTimestamp = formatTimestamp(event.optString(“timestamp”, null)); row.put(formattedTimestamp); row.put(event.optJSONObject(“player”) != null ? event.getJSONObject(“player”).optString(“name”, null) : null); row.put(event.optJSONObject(“type”) != null ? 
event.getJSONObject(“type”).optString(“name”, null) : null); JSONObject passOutcome = event.optJSONObject(“pass”); JSONObject shotOutcome = event.optJSONObject(“shot”); if (passOutcome == null && shotOutcome == null) { continue; } if (passOutcome != null) { if (passOutcome.has(“outcome”)) { row.put(passOutcome.getJSONObject(“outcome”).optString(“name”, null)); } else { row.put(JSONObject.NULL); } } else if (shotOutcome != null) { if (shotOutcome.has(“outcome”)) { row.put(shotOutcome.getJSONObject(“outcome”).optString(“name”, null)); } else { row.put(JSONObject.NULL); } } else { row.put(JSONObject.NULL); } row.put(event.optInt(“minute”, -1)); row.put(event.optInt(“second”, -1)); row.put(event.optJSONObject(“team”) != null ? event.getJSONObject(“team”).optString(“name”, null) : null); batchData.put(row); } sendPutRequest(batchData); } private static String formatTimestamp(String inputTimestamp) { try { String todayDate = LocalDate.now().format(DateTimeFormatter.ISO_DATE); return todayDate + “T” + inputTimestamp + “Z”; } catch (Exception e) { return “null”; // Default if parsing fails } } private static void sendPutRequest(JSONArray batchData) { try { URL url = new URL(gridDBRestUrl); HttpURLConnection conn = (HttpURLConnection) url.openConnection(); conn.setDoOutput(true); conn.setRequestMethod(“PUT”); conn.setRequestProperty(“Content-Type”, “application/json”); conn.setRequestProperty(“Authorization”, gridDBApiKey); // Encode username and password for Basic Auth // Send JSON Data OutputStream os = conn.getOutputStream(); os.write(batchData.toString().getBytes()); os.flush(); int responseCode = conn.getResponseCode(); if (responseCode == HttpURLConnection.HTTP_OK || responseCode == HttpURLConnection.HTTP_CREATED) { System.out.println(“Batch inserted successfully.”); } else { System.out.println(“Failed to insert batch. Response: ” + responseCode); } conn.disconnect(); } catch (Exception e) { e.printStackTrace(); } } } Ingestion Logic This steps involves fetching data from GridDB using the REST API and grouping it into 5-minute intervals. 
package mycode.service; import com.fasterxml.jackson.databind.JsonNode; import com.fasterxml.jackson.databind.ObjectMapper; import org.springframework.stereotype.Service; import java.net.URI; import java.net.http.HttpClient; import java.net.http.HttpRequest; import java.net.http.HttpResponse; import java.util.HashMap; import java.util.Map; @Service public class MatchEventsService { private static final String GRIDDB_URL = “https://cloud5114.griddb.com:443/griddb/v2/gs_clustermfcloud5114/dbs/9UkMCtv4/containers/match_events/rows”; private static final String AUTH_HEADER = “Basic TTAyY…lhbEAx”; private final HttpClient httpClient = HttpClient.newHttpClient(); private final ObjectMapper objectMapper = new ObjectMapper(); public Map<Integer, Integer> getPassCountByFiveMin(String playerName) { try { // Build the HTTP request based on your curl HttpRequest request = HttpRequest.newBuilder() .uri(URI.create(GRIDDB_URL)) .header(“Content-Type”, “application/json”) .header(“Authorization”, AUTH_HEADER) .POST(HttpRequest.BodyPublishers.ofString(“{\”offset\”: 0, \”limit\”: 55555}”)) .build(); // Fetch the response HttpResponse<string> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString()); JsonNode rootNode = objectMapper.readTree(response.body()); JsonNode rows = rootNode.get(“rows”); // Process data: count passes every 5 minutes Map<Integer, Integer> passCountByFiveMin = new HashMap<>(); for (JsonNode row : rows) { String currentPlayer = row.get(1).asText(); String eventType = row.get(2).asText(); int minute = row.get(4).asInt(); if (playerName.equals(currentPlayer) && “Pass”.equals(eventType)) { // Group by 5-minute intervals (0-4, 5-9, 10-14, etc.) int fiveMinInterval = (minute / 5) * 5; passCountByFiveMin.merge(fiveMinInterval, 1, Integer::sum); } } return passCountByFiveMin; } catch (Exception e) { e.printStackTrace(); return new HashMap<>(); } } }</string> Step 3: Visualization To deliver insights, Spring Boot exposes REST endpoints via a @RestController: Endpoints: /api/pass-accuracy/{player} returns a JSON array of {time, accuracy} pairs; /api/shot-frequency/{team} returns {time, shots}. Implementation: The controller calls the query service, maps GridDB results to DTOs, and serializes them with Spring’s Jackson integration. package mycode.controller; import mycode.service.MatchEventsService; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.stereotype.Controller; import org.springframework.ui.Model; import org.springframework.web.bind.annotation.*; import java.util.Map; @Controller public class MatchEventsController { @Autowired private MatchEventsService matchEventsService; @GetMapping(“/pass-accuracy/{playerName}”) public String getPassCountEveryFiveMin(@PathVariable String playerName, Model model) { Map<Integer, Integer> passCountByFiveMin = matchEventsService.getPassCountByFiveMin(playerName); // Prepare data for the chart model.addAttribute(“playerName”, playerName); model.addAttribute(“timeIntervals”, passCountByFiveMin.keySet().stream().sorted().toList()); model.addAttribute(“passCounts”, passCountByFiveMin.values()); return “pass-accuracy-graph”; // Thymeleaf template name } } Running the Project To run the project, execute the following command to build and run our application: mvn clean install && mvn spring-boot:run   Accessing the Dashboard After launching the application, open a web browser and navigate to: http://localhost:9090/pass-accuracy/{{player name}}. 
For example: http://localhost:9090/pass-accuracy/Lionel%20Andr%C3%A9s%20Messi%20Cuccittini

This visualization displays a chart representing pass accuracy trends over time. It provides insight into the player's fatigue levels over time and their overall activity on the field. Similarly, various other insights can be generated from this saved data, providing valuable analytics for team performance and decision-making. For example:

Player Pass Accuracy Over Time
Data: Count of "Pass" events with outcome.name = "Complete" vs. "Incomplete" per player, bucketed by 5-minute intervals.
Visualization: Line graph with time (x-axis) and pass accuracy percentage (y-axis) for a key player (e.g., a midfielder).
Insight: If pass accuracy drops below 70% late in the game (e.g., after minute 70), the player may be fatigued: time for a substitution.

Goal Proximity Over Time
Data: Count of "Shot" events with shot.outcome.name = "Goal" or near-miss outcomes (e.g., "Off Target"), bucketed by 10-minute intervals.
Visualization: Stacked bar graph with time (x-axis) and shot outcomes (y-axis).
Insight: Periods with frequent near-misses (e.g., minutes 30-40) suggest missed opportunities; adjust tactics to capitalize on pressure.

Conclusion

As demonstrated, GridDB efficiently processes the timestamped complexity of soccer events, delivering structured insights with precision. Integrated within a Spring Boot application, its high-performance ingestion, optimized time-series indexing, and rapid querying capabilities enable the extraction of actionable metrics from StatsBomb data, identifying player fatigue and strategic opportunities with millisecond accuracy. As sports technology continues to evolve, such implementations highlight the critical role of specialized databases in unlocking the full potential of temporal data.

GridDB for Environmental Monitoring in Smart Cities

Introduction Smart cities are transforming urban landscapes by leveraging technology to improve efficiency and sustainability. A key component of smart cities is environmental monitoring, which involves the collection, aggregation, and analysis of real-time data to address challenges like air pollution, traffic congestion, and resource management. By leveraging innovative database technologies, smart cities can unlock actionable insights that drive sustainability and improve urban living standards. This article delves into how GridDB, a high-performance database tailored for time-series and IoT data, supports environmental monitoring in smart cities. Using real-time data streams like pollen levels, illness risk metrics, and air quality indices, we showcase how GridDB effectively manages large data volumes with speed and precision, empowering informed decision-making for a smarter, more sustainable urban future. Understanding the Use Case To demonstrate a practical application, we integrate environmental data from Ambee, a trusted provider of real-time environmental information. Using GridDB’s robust capabilities, we aggregate and analyze this data, uncovering patterns and insights that can guide policymakers and stakeholders in enhancing urban sustainability. In this article, we focus on three specific datasets crucial for smart city management: Pollen Data: This dataset tracks allergen levels to predict periods of heightened allergy risks, enabling authorities to issue timely health advisories. Illness Risk Data: By analyzing environmental conditions, this dataset assesses the probability of disease outbreaks, aiding public health planning. Air Quality Data: Monitoring pollutants such as PM2.5, PM10, and NO2 ensures compliance with health standards and helps mitigate pollution’s impact on urban life. Together, these datasets, sourced from Ambee, serve as the basis for our study, highlighting the potential of GridDB for real-time environmental monitoring and data-driven decision-making. Setting Up GridDB Cluster and Spring Boot Integration: For Environmental Monitoring The first step is to set up a GridDB cluster and integrate it with our Spring Boot application as follows. Setting up GridDB Cluster GridDB provides flexible options to meet different requirements. For development, a single-node cluster on our local machine may be sufficient. However, in production, distributed clusters across multiple machines are typically preferred for improved fault tolerance and scalability. For detailed guidance on setting up clusters based on our deployment strategy, refer to the GridDB documentation. To set up a GridDB cluster, follow the steps mentioned here. Setting up Spring Boot Application Once our GridDB cluster is operational, the next step is connecting it to ourSpring Boot application. The GridDB Java Client API provides the necessary tools to establish this connection. To simplify the process, you can include the griddb-spring-boot-starter library as a dependency in our project, which offers pre-configured beans for a streamlined connection setup.    Setting Up API Access To begin, visit www.getambee.com and create an account. After registering, you’ll be provided with an API key, which is required for authentication when making requests to their endpoints. This key grants access to various environmental data services offered by the platform.   
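As a quick sanity check that the key works before wiring anything into Spring, you can call one of Ambee's REST endpoints directly. The snippet below is only a hedged sketch: the endpoint path, query parameter, and header name are illustrative assumptions and should be verified against Ambee's API reference. The MetricsCollectionService described later fetches data with RestTemplate in the same spirit.

import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

// Hypothetical smoke test for the Ambee API key.
// The URL and header below are assumptions; confirm them in Ambee's documentation.
public class AmbeeSmokeTest {

    public static void main(String[] args) {
        String apiKey = System.getenv("AMBEE_API_KEY"); // keep the key out of source code

        HttpHeaders headers = new HttpHeaders();
        headers.set("x-api-key", apiKey);            // assumed API-key header name
        headers.set("Content-type", "application/json");

        // Illustrative URL: latest air-quality reading for a given city.
        String url = "https://api.ambeedata.com/latest/by-city?city=London";

        RestTemplate restTemplate = new RestTemplate();
        ResponseEntity<String> response =
                restTemplate.exchange(url, HttpMethod.GET, new HttpEntity<>(headers), String.class);

        // A 200 response with a JSON body confirms the key is active.
        System.out.println(response.getStatusCode());
        System.out.println(response.getBody());
    }
}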
Pricing Plans Ambee offers flexible pricing plans to suit different needs: Free Tier: Ideal for developers or small-scale projects, with limited API calls per month. Paid Plans: Designed for larger-scale applications, these plans provide higher API limits and additional features. Project Structure Here’s a suggested project structure for this application: └───my-griddb-app │ pom.xml │ ├───src │ ├───main │ │ ├───java │ │ │ └───mycode │ │ │ │ MySpringBootApplication.java │ │ │ │ │ │ │ ├───config │ │ │ │ GridDBConfig.java │ │ │ │ │ │ │ ├───controller │ │ │ │ ChartController.java │ │ │ │ │ │ │ ├───dto │ │ │ │ AirQualityDTO.java │ │ │ │ IllnessRiskDTO.java │ │ │ │ PollenDataDTO.java │ │ │ │ │ │ │ └───service │ │ │ ChartService.java │ │ │ MetricsCollectionService.java │ │ │ RestTemplateConfig.java │ │ │ │ │ └───resources │ │ │ application.properties │ │ │ │ │ └───templates │ │ charts.html │ │ │ └───test │ └───java │ └───com │ └───example │ AppTest.java This structure separates controllers, models, repositories, services, and the application entry point into distinct layers, enhancing modularity and maintainability. Add GridDB Dependency To enable interaction with GridDB in our Spring Boot project, we must include the GridDB Java Client API dependency. This can be accomplished by adding the appropriate configuration to the project build file, such as pom.xml for Maven or the equivalent file for Gradle. Here’s an example of how to configure the dependency in thepom.xml file: <project xmlns=”http://maven.apache.org/POM/4.0.0″ xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance” xsi:schemaLocation=”http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd”> <modelVersion>4.0.0</modelVersion> <groupId>com.example</groupId> <artifactId>my-griddb-app</artifactId> <version>1.0-SNAPSHOT</version> <name>my-griddb-app</name> <url>http://maven.apache.org</url> <parent> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-parent</artifactId> <version>3.2.4</version> <relativePath /> <!– lookup parent from repository –> </parent> <properties> <maven.compiler.source>17</maven.compiler.source> <maven.compiler.target>17</maven.compiler.target> </properties> <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-actuator</artifactId> </dependency> <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>3.8.1</version> <scope>test</scope> </dependency> <!– GridDB dependencies –> <dependency> <groupId>com.github.griddb</groupId> <artifactId>gridstore</artifactId> <version>5.6.0</version> </dependency> <!– Spring Boot dependencies –> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> <exclusions> <exclusion> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-logging</artifactId> </exclusion> </exclusions> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-test</artifactId> <scope>test</scope> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-thymeleaf</artifactId> </dependency> <!– JSON processing –> <dependency> <groupId>com.fasterxml.jackson.core</groupId> <artifactId>jackson-databind</artifactId> <version>2.15.0</version> <!– or the latest version –> </dependency> <!– Lombok –> <dependency> <groupId>org.projectlombok</groupId> <artifactId>lombok</artifactId> <optional>true</optional> </dependency> </dependencies> 
</project> Configure GridDB Connection After adding the GridDB dependency, the next step is configuring the connection details for our GridDB cluster in our Spring Boot application. This is usually done in the application.properties file, where you can specify various settings for the application. Here's a quick example of how to set up those connection details: GRIDDB_NOTIFICATION_MEMBER=127.0.0.1:10001 GRIDDB_CLUSTER_NAME=myCluster GRIDDB_USER=admin GRIDDB_PASSWORD=admin management.endpoints.web.exposure.include=* server.port=9090 #API token api.token=<enter your API key> GRIDDB_NOTIFICATION_MEMBER: The address and port of the GridDB cluster member to connect to. GRIDDB_CLUSTER_NAME: The name of the GridDB cluster. GRIDDB_USER: The username for accessing the GridDB cluster. GRIDDB_PASSWORD: The password for the specified GridDB user (replace with your actual password). server.port=9090: Sets the port on which our Spring Boot application will run. api.token: The Ambee API token used for authentication. Create GridDB Client Bean To interact effectively with GridDB in our Spring Boot application, we need to create a dedicated Spring bean to manage the GridDB connection. This bean will establish the connection using the parameters defined in the application.properties file and will act as the central interface for interacting with the GridDB cluster across the application. Here's an example of how to define this bean in a Java class named GridDBConfig.java: package mycode.config; import java.util.Properties; import org.springframework.beans.factory.annotation.Value; import org.springframework.context.annotation.Bean; import org.springframework.context.annotation.Configuration; import org.springframework.context.annotation.PropertySource; import com.toshiba.mwcloud.gs.GSException; import com.toshiba.mwcloud.gs.GridStore; import com.toshiba.mwcloud.gs.GridStoreFactory; @Configuration @PropertySource("classpath:application.properties") public class GridDBConfig { @Value("${GRIDDB_NOTIFICATION_MEMBER}") private String notificationMember; @Value("${GRIDDB_CLUSTER_NAME}") private String clusterName; @Value("${GRIDDB_USER}") private String user; @Value("${GRIDDB_PASSWORD}") private String password; @Bean public GridStore gridStore() throws GSException { // Acquiring a GridStore instance Properties properties = new Properties(); properties.setProperty("notificationMember", notificationMember); properties.setProperty("clusterName", clusterName); properties.setProperty("user", user); properties.setProperty("password", password); return GridStoreFactory.getInstance().getGridStore(properties); } } Metric Collection Our primary focus is on the MetricCollection.java class, where API calls to external services are made, and the collected data is processed and stored. This class serves as a bridge between the external APIs and the backend system, ensuring seamless integration with GridDB for real-time analytics and decision-making. Ambee provides a rich set of endpoints to fetch real-time environmental data. For this project, we'll focus on the following key endpoints: Pollen Data API: This endpoint provides allergen levels for specific locations, helping to monitor pollen concentrations and predict allergy outbreaks. It is essential for public health, as it allows authorities to issue timely advisories for people with respiratory conditions or allergies. Air Quality API: This API provides data on various pollutants such as PM2.5, PM10, CO, and NO2.
Monitoring air quality is crucial for ensuring compliance with health standards and mitigating pollution’s impact on public health. It helps cities take proactive steps in reducing pollution and protecting residents’ well-being. Illness Risk API: This endpoint returns calculated risk scores based on environmental conditions such as pollution levels, temperature, and humidity. It plays a critical role in public health, enabling early detection of potential health risks and informing decisions on preventive measures to reduce illness outbreaks in urban areas. All above APIs use RESTful architecture and return responses in JSON format. Key Components of the MetricCollection.java Class API Key Management: The API key is injected into the service via the @Value annotation. This allows the project to securely access the external Ambee API services without hardcoding sensitive credentials directly into the codebase. RestTemplate for API Calls: The RestTemplate object is used for making HTTP requests. It simplifies the process of invoking REST APIs and handling the response data. We are using restTemplate.getForObject() to fetch JSON data from the APIs and return it as a string. This data can then be processed or converted into a more structured format, like objects or entities, for further analysis. Data Storage in GridDB: The environmental data fetched from the APIs is stored in GridDB’s time-series containers for efficient management and querying. Here’s how the data is persisted: Pollen Data: A time-series container named “PollenData” is created using the PollenDataDTO class. Each data entry is appended to this container. Illness Risk Data: A similar container, “IllnessRiskData”, is created for storing data related to contamination risks using the IllnessRiskDTO class. Air Quality Data: Another time-series container, “AirQualityData”, is set up for storing air quality metrics using the AirQualityDTO class. Fetching and Processing the Data Each method in the MetricCollection.java class is designed to collect specific environmental data: Air Quality: The getAirQualityData method is responsible for fetching real-time air quality data for a given city. It contacts the Ambee air quality API and retrieves the data in JSON format. Pollen Data: Similarly, the getPollenData method makes a request to the Ambee pollen API to gather pollen data, which is vital for assessing allergens in the air, particularly for individuals with respiratory conditions. Illness Risk: The getIllnessRiskData method provides critical insights into potential health risks caused by environmental factors such as pollution levels or seasonal changes, allowing for proactive health management. Data Transformation After retrieving the raw JSON data, it can be parsed and converted into Java objects for streamlined processing. By mapping the JSON response to custom Java classes, such as AirQualityDTO, IllnessRiskDTO, and PollenDataDTO, the data becomes easier to manage and transform within the system for further analysis and visualization. Below is the implementation of these DTO classes. 
@Data @NoArgsConstructor @AllArgsConstructor public class AirQualityDTO { @RowKey public Date updatedAt; private double lat; private double lng; private String city; private String state; private double pm10; private double pm25; private double no2; private double so2; private double co; private double ozone; private int aqi; private String pollutant; private double concentration; private String category; } @Data @NoArgsConstructor @AllArgsConstructor public class IllnessRiskDTO { @RowKey public Date createdAt; private double lat; private double lng; private String iliRisk; } @Data @NoArgsConstructor @AllArgsConstructor public class PollenDataDTO { @RowKey public Date updatedAt; private double lat; private double lng; private String grassPollenRisk; private String treePollenRisk; private String weedPollenRisk; private int grassPollenCount; private int treePollenCount; private int weedPollenCount; } Storing Data in GridDB After transforming the data into Java objects, it is stored in GridDB for real-time querying and analysis. The insertion itself happens inside the scheduled collection method shown below. Scheduling Data Collection To ensure the data is regularly collected, you can use Spring Boot's @Scheduled annotation to trigger API calls at fixed intervals. This keeps the data updated regularly to support real-time monitoring and analytics. @Scheduled(fixedRate = 4000) public void collectMetrics() throws GSException, JsonMappingException, JsonProcessingException, ParseException { List<PollenDataDTO> pollenData = fetchPollenData(); List<IllnessRiskDTO> illnessRiskData = fetchIllnessRiskData(); List<AirQualityDTO> airQualityData = fetchAirQualityData(); // Store Pollen Data TimeSeries<PollenDataDTO> pollenSeries = store.putTimeSeries("PollenData", PollenDataDTO.class); for (PollenDataDTO data : pollenData) { pollenSeries.append(data); } // Store Illness Risk Data TimeSeries<IllnessRiskDTO> illnessRiskSeries = store.putTimeSeries("IllnessRiskData", IllnessRiskDTO.class); for (IllnessRiskDTO data : illnessRiskData) { illnessRiskSeries.append(data); } // Store Air Quality Data TimeSeries<AirQualityDTO> airQualitySeries = store.putTimeSeries("AirQualityData", AirQualityDTO.class); for (AirQualityDTO data : airQualityData) { airQualitySeries.append(data); } } By following the above steps, we can effectively extract data from Ambee and load it into the GridDB database. Data Querying in GridDB and Visualization with Thymeleaf Once the data is stored and available in GridDB, the next step is to visualize this data in a way that provides actionable insights. In this section, we'll explore how to build a dashboard using Spring Boot, Thymeleaf, and Chart.js to render charts that display real-time environmental data. Here are the steps to achieve this: Building the Chart Controller The ChartController acts as the intermediary between backend data in GridDB and the frontend visualizations displayed on the dashboard. Its responsibilities include handling HTTP requests, interacting with the service layer to fetch data, and passing that data to Thymeleaf templates for rendering.
Here's how the ChartController is implemented: package mycode.controller; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.stereotype.Controller; import org.springframework.ui.Model; import org.springframework.web.bind.annotation.GetMapping; import mycode.service.ChartService; import mycode.dto.AirQualityDTO; import mycode.dto.IllnessRiskDTO; import mycode.dto.PollenDataDTO; import java.util.List; @Controller public class ChartController { @Autowired private ChartService chartService; @GetMapping("/charts") public String showCharts(Model model) { try { // Fetch data for charts List<PollenDataDTO> pollenData = chartService.getPollenData(); List<IllnessRiskDTO> illnessRiskData = chartService.getIllnessRiskData(); List<AirQualityDTO> airQualityData = chartService.getAirQualityData(); // Add data to the model for Thymeleaf model.addAttribute("pollenData", pollenData); model.addAttribute("illnessRiskData", illnessRiskData); model.addAttribute("airQualityData", airQualityData); } catch (Exception e) { model.addAttribute("error", "Unable to fetch data: " + e.getMessage()); } return "charts"; } } Implementing the Chart Service The ChartService acts as the business logic layer, encapsulating the operations needed to query GridDB and process the results. In this context, the ChartService class accesses a GridStore database container to retrieve environmental monitoring data. The service then compiles these objects into a list representing the environmental metrics, ready for use in analysis or visualization. Here's how the ChartService is implemented: package mycode.service; import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Date; import java.util.List; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.stereotype.Service; import com.toshiba.mwcloud.gs.Container; import com.toshiba.mwcloud.gs.GridStore; import com.toshiba.mwcloud.gs.Query; import com.toshiba.mwcloud.gs.Row; import com.toshiba.mwcloud.gs.RowSet; import com.toshiba.mwcloud.gs.TimeSeries; import mycode.dto.AirQualityDTO; import mycode.dto.IllnessRiskDTO; import mycode.dto.PollenDataDTO; @Service public class ChartService { @Autowired GridStore store; public List<PollenDataDTO> getPollenData() throws Exception {

Python Client v5.8 Changes & Usage (Client is now based on Java!)

The GridDB Python client has been updated and now uses the native GridDB Java interface with JPype and Apache Arrow. Prior to this release, the Python client relied on the c_client and translated some of those commands with SWIG and other tools. The main benefit of this change in underlying technology is being able to query GridDB and get back an Apache Arrow RecordBatch object. We will go over how this change can directly affect your Python workflows with a concrete example later on in this article. Another benefit is how SQL is now handled. With the c_client as the base, you could query the database only using TQL, but not SQL, meaning that certain partitioned tables were simply not accessible to your Python client. There were workarounds, for example: Pandas with Python GridDB SQL Queries, but now with this new client, this sort of thing works out of the box. So with that out of the way, let's see how we can install the new GridDB Python Client and explore some of the changes in this new version. Installation And Prereqs To install, as explained briefly above, you will need to have Java installed and the JAVA_HOME environment variable set. You will also need Maven and Python 3.12. Here is how I installed these packages and set the Java home on Ubuntu 22.04: $ sudo apt install maven python3.12 default-jdk $ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 Installation To grab the source code of the python client, navigate to its github page and clone the repo. Once cloned, we can run a maven install to build our apache library and then install the python client. git clone https://github.com/griddb/python_client.git cd python_client/java mvn install cd .. cd python python3.12 -m pip install . cd .. #this puts you back in the root directory of python_client Jar Files and Your Environment Variables On top of having JAVA_HOME set and having Java installed, there are a few .jar files that need to be in your CLASSPATH as well. Specifically, we need gridstore, the modified gridstore-arrow build, and in some cases, arrow-memory-netty. Two of these three we can download as generic versions from the Maven repository, but the Arrow integration relies on the modified gridstore-arrow jar which we built in the previous step. So let's download and set these jar files. $ mkdir lib && cd lib $ curl -L -o gridstore.jar https://repo1.maven.org/maven2/com/github/griddb/gridstore/5.8.0/gridstore-5.8.0.jar $ curl -L -o arrow-memory-netty.jar https://repo1.maven.org/maven2/org/apache/arrow/arrow-memory-netty/18.3.0/arrow-memory-netty-18.3.0.jar $ cp ../java/target/gridstore-arrow-5.8.0.jar gridstore-arrow.jar $ export CLASSPATH=$CLASSPATH:./gridstore.jar:./gridstore-arrow.jar:./arrow-memory-netty.jar If you are unsure if your CLASSPATH is set, you can always run an echo: $ echo $CLASSPATH :./gridstore.jar:./gridstore-arrow.jar:./arrow-memory-netty.jar With this set in your CLASSPATH, you can start your python3 griddb scripts without explicitly setting the classpath options when starting the JVM at the top of the file.
If you don't set the CLASSPATH, you can use the option like the sample code does: import jpype # If no CLASSPATH set in the environment, you can force the JVM to start with these jars explicitly jpype.startJVM(classpath=["./gridstore.jar", "./gridstore-arrow.jar", "./arrow-memory-netty.jar"]) import griddb_python as griddb import sys or if you set the CLASSPATH and make this permanent (for example, by editing your .bashrc file), you can get away with not importing jpype at all, as the modified pyarrow will do it for you. # No jpype set here; still works import griddb_python as griddb import sys Running Samples To ensure everything is working properly, we should try running the sample code. Navigate to the sample dir, make some changes, and then run! # from the root of this python client repo $ cd sample We will need to change the connection details as GridDB CE runs mostly in FIXED_LIST mode now, meaning we need a notification member, not a host/port combo: try: #Get GridStore object # Changed here to notification_member vs port & address gridstore = factory.get_store(notification_member=argv[1], cluster_name=argv[2], username=argv[3], password=argv[4]) And depending on how you have your CLASSPATH variables set, you can either add arrow-memory-netty to your jpype.startJVM() call, or set your CLASSPATH as explained above. The code should now run just fine: python_client/sample$ python3.12 sample1.py 127.0.0.1:10001 myCluster admin admin Person: name=name02 status=False count=2 lob=[65, 66, 67, 68, 69, 70, 71, 72, 73, 74] API Differences & Usage The README for the Python client page explains what features are still currently missing (when compared to the previous v0.8.5): - Array type for GridDB - Timeseries-specific functions - Implicit data type conversion But there is also some functionality that is gained with this move: - Composite RowKey, Composite Index - GEOMETRY type and TIMESTAMP (micro/nano-second) type - Put/Get/Fetch with Apache Arrow - Operations on partitioned tables using SQL These new features basically come from the ability to use Java and, by extension, JDBC and SQL. To use SQL, you can simply add the GridDB JDBC jar to your CLASSPATH (or JVM start options). From there, you can run SQL statements, including against partitioned tables. Taken from the samples, here's what SQL can look like: import jpype import jpype.dbapi2 jpype.startJVM(classpath=["./gridstore.jar", "./gridstore-arrow.jar", "./gridstore-jdbc.jar"]) import griddb_python as griddb import sys ### SQL create table/insert url = "jdbc:gs://127.0.0.1:20001/myCluster/public" conn = jpype.dbapi2.connect(url, driver="com.toshiba.mwcloud.gs.sql.Driver", driver_args={"user":"admin", "password":"admin"}) curs = conn.cursor() curs.execute("DROP TABLE IF EXISTS Sample") curs.execute("CREATE TABLE IF NOT EXISTS Sample ( id integer PRIMARY KEY, value string )") print('SQL Create Table name=Sample') curs.execute("INSERT INTO Sample values (0, 'test0'),(1, 'test1'),(2, 'test2'),(3, 'test3'),(4, 'test4')") print('SQL Insert') For the most part, this is the same as before, since we could always start the JPype JVM from Python and run JDBC. What is truly new is using Apache Arrow. Using Apache Arrow with GridDB/Python/Nodejs Part of what makes Arrow so useful in the modern era is its ability to "[allow] for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems." (https://en.wikipedia.org/wiki/Apache_Arrow).
To showcase this, we will create a python script to generate 10000 rows of ‘random’ data, then we will query the result directly in an Arrow RecordBatch object, and once that obj exists, we will stream it over tcp (without serialzing/deserializing) directly over to nodejs (so called Zero-Copying). First, let’s generate 10000 rows of data. We will create a timeseries container with ‘random’ data in all of the columns. from datetime import datetime, timezone, timedelta import griddb_python as griddb import sys import pandas as pd import pyarrow as pa import uuid import random import socket import warnings warnings.filterwarnings(‘ignore’) def generate_random_timestamps(start_date_str, num_timestamps, min_interval_minutes=5, max_interval_minutes=30): date_format = “%Y-%m-%dT%H:%M:%S” current_time = datetime.fromisoformat(start_date_str.replace(“Z”, “”)).replace(tzinfo=timezone.utc) timestamp_list = [] for _ in range(num_timestamps): timestamp_str = current_time.strftime(date_format) + “.000Z” timestamp_list.append(timestamp_str) random_minutes = random.randint(min_interval_minutes, max_interval_minutes) current_time += timedelta(minutes=random_minutes) return timestamp_list start_point = “2024-12-01T10:00:00.000Z” number_of_stamps = 10000 min_interval = 5 max_interval = 20 generated_datelist = generate_random_timestamps( start_point, number_of_stamps, min_interval, max_interval ) factory = griddb.StoreFactory.get_instance() gridstore = factory.get_store( notification_member=”127.0.0.1:10001″, cluster_name=”myCluster”, username=”admin”, password=”admin” ) col = gridstore.get_container(“col01”) ra = griddb.RootAllocator(sys.maxsize) blob = bytearray([65, 66, 67, 68, 69, 70, 71, 72, 73, 74]) conInfo = griddb.ContainerInfo(“col01”, [[“ts”, griddb.Type.TIMESTAMP], [“name”, griddb.Type.STRING], [“status”, griddb.Type.BOOL], [“count”, griddb.Type.LONG], [“lob”, griddb.Type.BLOB]], griddb.ContainerType.TIME_SERIES, True) i=0 rows=[] while i < 10000: rows.append([datetime.strptime(generated_datelist[i], "%Y-%m-%dT%H:%M:%S.%f%z"),str(uuid.uuid1()), False, random.randint(0, 1048576), blob]) i=i+1 Next let's insert with multiput. First we'll format the list of rows into a dataframe. Then we'll convert the dataframe into a recordbatch and then use multiput to insert the batch into GridDB: df = pd.DataFrame(rows, columns=["ts", "name", "status", "count", "lob"]) col = gridstore.put_container(conInfo) rb = pa.record_batch(df) col.multi_put(rb, ra) Now that our data is inside GridDB, let's query it (this is for the sake of education, obviously this makes no sense in the 'real world'). col = gridstore.get_container("col01") q = col.query("select *") q.set_fetch_options(root_allocator=ra) rs = q.fetch() result = [] rb = rs.next_record_batch() #gets all of the rows as a recordbatch obj And now finally, let's stream our record batch over to some other programming environment to showcase Apache Arrow's supreme flexibility. We will use nodejs as the consumer here. 
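A quick aside before the streaming part: if all you wanted was the query result back in Python, the RecordBatch converts directly into a pandas DataFrame. A minimal sketch, assuming the rb object and imports from the script above are still in scope:

# Convert the fetched RecordBatch into a pandas DataFrame for local analysis
df_out = rb.to_pandas()
print(df_out.head())
print(f"Fetched {df_out.shape[0]} rows and {df_out.shape[1]} columns")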
#stream our rows through socket HOST = '127.0.0.1' PORT = 2828 with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket: server_socket.bind((HOST, PORT)) server_socket.listen(1) print(f"Python producer listening on {HOST}:{PORT}") conn, addr = server_socket.accept() print(f"Connected by {addr}") with conn: with conn.makefile(mode='wb') as f: # Use the file-like object as the sink for the stream writer with pa.ipc.new_stream(f, rb.schema) as writer: writer.write_batch(rb) Run the python script and it will create the container, add the rows of data, query those rows of data, and then start a server which is listening for a consumer to connect to it, which will then send all of the rows to the consumer. Nodejs will now connect to our producer and print out the records const net = require('net'); const { RecordBatchReader } = require('apache-arrow'); const HOST = '127.0.0.1'; // Ensure this port matches your Python producer's port const PORT = 2828; const client = new net.Socket(); client.connect(PORT, HOST, async () => { console.log(`Connected to Python producer at ${HOST}:${PORT}`); try { const reader = await RecordBatchReader.from(client); let schemaPrinted = false; for await (const recordBatch of reader) { if (!schemaPrinted) { console.log(“Successfully parsed schema from stream.”); console.log(`Schema:`, reader.schema.fields.map(f => `${f.name}: ${f.type}`).join(‘, ‘)); console.log(“— Processing data batches —“); schemaPrinted = true; } // Convert the record batch to a more familiar JavaScript object format const data = recordBatch.toArray().map(row => row.toJSON()); console.log(“Received data batch:”, data); } console.log(“——————————-“); console.log(“Stream finished.”); } catch (error) { console.error(“Error processing Arrow stream:”, error); } }); client.on(‘close’, () => { console.log(‘Connection closed’); }); client.on(‘error’, (err) => { console.error(‘Connection error:’, err.message); }); Run this nodejs script like so: $ npm install $ node consumer.js Connected to Python producer at 127.0.0.1:2828 Successfully parsed schema from stream. Schema: ts: Timestamp, name: Utf8, status: Bool, count: Int64, lob: Binary — Processing data batches — Received data batch: [ { ts: 1733047200000, name: ’65d16ce6-55d3-11f0-8070-8bbd0177d9e6′, status: false, count: 820633n, lob: Uint8Array(10) [ 65, 66, 67, 68, 69, 70, 71, 72, 73, 74 ] }, { ts: 1733047500000, name: ’65d16ce7-55d3-11f0-8070-8bbd0177d9e6′, status: false, count: 931837n, lob: Uint8Array(10) [ 65, 66, 67, 68, 69, 70, 71, 72, 73, 74 ] }, ….cutoff Conclusion And with that, we have successfully showcased the new GridDB Python

Automated Speech Dubbing Using GPT-4o Audio and Node.js

What This Blog is About Easy communication across languages is crucial in today's interconnected world. Traditional translation and dubbing methods often fall short: they're too slow, prone to errors, and struggle to scale effectively. For instance, human-based translation can introduce subjective inaccuracies, while manual dubbing processes frequently fail to keep pace with real-time demands or large-scale projects. However, advancements in AI have revolutionized audio translation, making it faster and more accurate. This blog provides a step-by-step guide to building an automated dubbing system. Using GPT-4o Audio for context-aware audio translations, Node.js for data handling, and GridDB for scalable storage, you'll learn how to process speech, translate it, and deliver dubbed audio instantly. This guide will explain how to automate speech dubbing, ensuring seamless communication across languages; throughout, the term "speech" is used interchangeably with "audio." Prerequisites You should have access to the GPT-4o Audio models. Also, you should give the app permission to use the microphone in the browser. How to Run the App The source code for this project is available in this repository. You don't need to clone it to run the app, as the working application is already dockerized. However, to run the project you need Docker installed. Please note that this app was tested on ARM machines such as the Apple MacBook M1 or M2. While it is optimized for the ARM architecture, it is possible to run it on non-ARM machines with minor modifications, such as using a different GridDB Docker image for x86 systems. 1. .env Setup Create an empty directory, for example, speech-dubbing, and change to that directory: mkdir speech-dubbing cd speech-dubbing Create a .env file with these keys: OPENAI_API_KEY= GRIDDB_CLUSTER_NAME=myCluster GRIDDB_USERNAME=admin GRIDDB_PASSWORD=admin IP_NOTIFICATION_MEMBER=griddb-server:10001 VITE_APP_BASE_URL=http://localhost VITE_PORT=3000 To get the OPENAI_API_KEY, please read this section. 2. Docker Compose Configuration Before running the app, create a docker-compose.yml file with these configuration settings: networks: griddb-net: driver: bridge services: griddb-server: image: griddbnet/griddb:arm-5.5.0 container_name: griddb-server environment: - GRIDDB_CLUSTER_NAME=${GRIDDB_CLUSTER_NAME} - GRIDDB_PASSWORD=${GRIDDB_PASSWORD} - GRIDDB_USERNAME=${GRIDDB_USERNAME} - NOTIFICATION_MEMBER=1 - IP_NOTIFICATION_MEMBER=${IP_NOTIFICATION_MEMBER} networks: - griddb-net ports: - "10001:10001" clothes-rag: image: junwatu/speech-dubber:1.2 container_name: speech-dubber-griddb env_file: .env networks: - griddb-net ports: - "3000:3000" 3. Run When steps 1 and 2 are finished, run the app with this command: docker-compose up -d If everything is running, you will get a response similar to this: [+] Running 3/3 ✔ Network tmp_griddb-net Created 0.0s ✔ Container speech-dubber-griddb Started 0.2s ✔ Container griddb-server Started 0.2s 4. Test the Speech Dubber App These are the steps to use the app: Open the App: Open your browser and navigate to http://localhost:3000. Start Recording: Click the record button. Allow Microphone Access: When prompted by the browser, click "Allow this time." Speak: Record your message in English. Stop Recording: Click the stop button when done. Wait while the app processes and translates your audio. Play the Translation: Use the playback controls to listen to the translated Japanese audio.
The demo below summarizes all the steps: Environment Setup OpenAI API Key You can create a new OpenAI project or use the existing one and then create and get the OpenAI API key here. Later, you need to save this key in the .env file. By default, OpenAI will restrict the models from public access even if you have a valid key. You also need to enable these models in the OpenAI project settings: Docker For easy development and distribution, this project uses a docker container to “package” the application. For easy Docker installation, use the Docker Desktop tool. GridDB Docker This app needs a GridDB server and it should be running before the app. In this project, we will use the GridDB docker for ARM machines. To test the GridDB on your local machine, you can run these docker commands: docker network create griddb-net docker pull griddbnet/griddb:arm-5.5.0 docker run –name griddb-server \ –network griddb-net \ -e GRIDDB_CLUSTER_NAME=myCluster \ -e GRIDDB_PASSWORD=admin \ -e NOTIFICATION_MEMBER=1 \ -d -t griddbnet/griddb:arm-5.5.0 By using the Docker Desktop, you can easily check if the GridDB docker is running. For more about GridDB docker for ARM, please check out this blog. Development If you are a curious developer or need further development, you can clone and examine the project source code. Primarily, you must have Node.js, FFmpeg, and GridDB installed on your system. System Architecture The flow of the speech dubbing process is pretty simple: The process begins with the user speaking into the browser, which captures the audio. This recorded audio is then sent to the Node.js server, where it undergoes processing. The server calls the GPT-4o Audio model to translate the audio content into another language. Once the audio is translated, the server saves the original and translated audio, along with relevant metadata, to the GridDB database for storage. Finally, the translated audio is sent back to the browser, where the user can play it through an HTML5 audio player. Capturing Speech Input Accessing the Microphone To record audio, the first step is to access the user’s microphone. This is achieved using the navigator.mediaDevices.getUserMedia API. const stream = await navigator.mediaDevices.getUserMedia({ audio: true }); The code above will prompt the user for permission to access the microphone. Recording Audio Once microphone access is granted, the MediaRecorder API is used to handle the actual recording process. The audio stream is passed to MediaRecorder to create a recorder instance: mediaRecorderRef.current = new MediaRecorder(stream); As the recording progresses, audio chunks are collected through the ondataavailable event: mediaRecorderRef.current.ondataavailable = (event: BlobEvent) => { audioChunksRef.current.push(event.data); }; When the recording stops (onstop event), the chunks are combined into a single audio file (a Blob) and made available for upload: mediaRecorderRef.current.onstop = () => { const audioBlob = new Blob(audioChunksRef.current, { type: ‘audio/wav’ }); const audioUrl = URL.createObjectURL(audioBlob); setAudioURL(audioUrl); audioChunksRef.current = []; uploadAudio(audioBlob); }; The uploadAudio function will upload the audio blob into the Node.js server. Node.js Server This Node.js server processes audio files by converting them to MP3, translating the audio content using OpenAI, and storing the data in a GridDB database. It provides endpoints for uploading audio files, querying data from the database, and serving static files. 
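The routes table in the next section summarizes these endpoints. If you would rather exercise them from a script than from Postman, a minimal sketch in Python is shown below; the multipart field name "audio", the sample file name, and the record ID are assumptions for illustration, so adjust them to match the upload handler in the repository.

import requests

BASE_URL = "http://localhost:3000"  # port taken from the docker-compose file above

# Upload a recording for dubbing (the multipart field name "audio" is an assumption)
with open("recording.wav", "rb") as audio_file:
    upload = requests.post(f"{BASE_URL}/upload-audio", files={"audio": audio_file})
print(upload.status_code)
print(upload.text)

# List every stored dubbing record, then fetch one by ID (replace 1 with a real ID)
print(requests.get(f"{BASE_URL}/query").json())
print(requests.get(f"{BASE_URL}/query/1").json())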
Routes Table Here’s a summary of the endpoints or API available in this server: Method Endpoint Description GET / Serves the main HTML file (index.html). POST /upload-audio Accepts an audio file upload, converts it to MP3, processes it using OpenAI, and saves data to GridDB. GET /query Retrieves all records from the GridDB database. GET /query/:id Retrieves a specific record by ID from the GridDB database. Audio Conversion The default recording file format sent by the client is WAV. However, in the Node.js server, this file is converted into the MP3 format for better processing. The audio conversion is done by fluent-ffmpeg npm package: const convertToMp3 = () => { return new Promise((resolve, reject) => { ffmpeg(originalFilePath) .toFormat(‘mp3’) .on(‘error’, (err) => { console.error(‘Conversion error:’, err); reject(err); }) .on(‘end’, () => { fs.unlinkSync(originalFilePath); resolve(mp3FilePath); }) .save(mp3FilePath); }); }; If you want to develop this project for further enhancements, you need to install ffmpeg in your system. Speech Dubbing Target Language The gpt-4o-audio-preview model from OpenAI will translate the recorded audio content into another language. const audioBuffer = fs.readFileSync(mp3FilePath); Note that this model requires audio in base64-encoded format, so you have to encode the audio content into the base 64: const base64str = Buffer.from(audioBuffer).toString(‘base64’); The default language for the audio translation is “Japanese”. However, you can change it in the source code or add UI for language selector for further enhancement. const language = “Japanese”; // Process audio using OpenAI const result = await processAudio(base64str, language); The response result of the processAudio function is in JSON format that contains this data: { “language”: “Japanese”, “filename”: “translation-Japanese.mp3”, “result”: { “index”: 0, “message”: { “role”: “assistant”, “content”: null, “refusal”: null, “audio”: { “id”: “audio_6758f02de0b48190ba109885b931122c”, “data”: “base64-encoded_audio”, “expires_at”: 1733885501, “transcript”: “こんにちは。今朝はとても晴天です。” } }, “finish_reason”: “stop” } } This JSON data is sent to the client, and with React, we can use it to render components, such as the HTML5 audio element, to play the translated audio. GPT-4o Audio The gpt-4o-audio model is capable of generating audio and text response based on the audio input. The model response is controlled by the system and user prompts. However, this project only uses the system prompt: { role: “system”, content: `The user will provide an English audio. Dub the complete audio, word for word in ${language}. Keep certain words in original language for which a direct translation in ${language} does not exist.` }, The response type, text or audio is set by the modalities parameter, and the audio voice is set by the audio parameter: export async function processAudio(base64Str, language) { try { const response = await openai.chat.completions.create({ model: “gpt-4o-audio-preview”, modalities: [“text”, “audio”], audio: { voice: “alloy”, format: “mp3” }, messages: [ { role: “system”, content: `The user will provide an English audio. Dub the complete audio, word for word in ${language}. 
Keep certain words in original language for which a direct translation in ${language} does not exist.` }, { role: "user", content: [ { type: "input_audio", input_audio: { data: base64Str, format: "mp3" } } ] } ], }); return response.choices[0]; } catch (error) { throw new Error(`OpenAI audio processing failed: ${error.message}`); } } Save Audio Data Data Schema To save audio data in the GridDB database, we must define the schema columns. The schema includes fields such as id, originalAudio, targetAudio, and targetTranscription. The container name can be arbitrary; however, it is best practice to choose one that reflects the context. For this project, the container name is SpeechDubbingContainer: const containerName = 'SpeechDubbingContainer'; const columnInfoList = [ ['id', griddb.Type.INTEGER], ['originalAudio', griddb.Type.STRING], ['targetAudio', griddb.Type.STRING], ['targetTranscription', griddb.Type.STRING], ]; const container = await getOrCreateContainer(containerName, columnInfoList); This table explains the schema defined above: Column Name Type Description id griddb.Type.INTEGER A unique identifier for each entry in the container. originalAudio griddb.Type.STRING The file path or name of the original audio file that was uploaded and processed. targetAudio griddb.Type.STRING The file path or name of the generated audio file containing the translated or dubbed speech. targetTranscription griddb.Type.STRING The text transcription of the translated audio, as provided by the speech processing API. Save Operation If the audio translation is successful, the insertData function will save the audio data into the database. try { const container = await getOrCreateContainer(containerName, columnInfoList); await insertData(container, [generateRandomID(), mp3FilePath, targetAudio, result.message.audio.transcript]); } catch (error) { console.log(error) } The GridDB data operation code is located in the griddbOperations.js file. This file provides the detailed implementation for inserting data, querying data, and retrieving data by its ID in the GridDB database. Read Operation To read all data or a specific record by ID, you can use code or tools like Postman: query all data in the GridDB database using the /query endpoint, and read a specific record by ID using the /query/:id endpoint. User Interface The user interface in this project is built using React. The AudioRecorder.tsx is a React component for a speech dubbing interface featuring a header with a title and description, a recording alert, a toggleable recording button, and an audio player for playback if a translated audio URL is available: <Card className="w-full"> <CardHeader className='text-center'> <CardTitle>Speech Dubber</CardTitle> <CardDescription>Push to dub your voice</CardDescription> </CardHeader> <CardContent className="space-y-4"> {isRecording && ( <Alert variant="destructive"> <AlertDescription>Recording in progress…</AlertDescription> </Alert> )} <div className="flex justify-center"> <Button onClick={toggleRecording} variant={isRecording ? "destructive" : "default"} className="w-24 h-24 rounded-full" > {isRecording ?
<StopCircle size={36} /> : <Mic size={36} />} </Button> </div> {translatedAudioURL && ( <div className="space-y-4"> <audio src={translatedAudioURL} controls className="w-full" /> </div> )} </CardContent> </Card> This is the screenshot when the translated audio is available: Further Improvements This blog teaches you how to build a simple web application that translates audio from one language to another. However, please note that this is just a prototype. There are several improvements that you can make. Here are some suggestions: Enhance the user interface. Add a real-time feature. Include a language selector. Implement user

Stress Detection using Machine Learning & GridDB

Stress significantly affects individuals' well-being, productivity, and overall quality of life. Understanding and predicting stress levels can help take proactive measures to mitigate its adverse effects. This article demonstrates how to develop a stress detection system using machine learning and deep learning techniques with the GridDB database. We will begin by retrieving a stress detection dataset from Kaggle, storing it in a GridDB container, and utilizing this data to train predictive models capable of estimating users' perceived stress scores. GridDB, a high-performance NoSQL database, is particularly suited for managing complex and dynamic datasets. Its efficient in-memory processing and flexible data storage capabilities make it an ideal choice for real-time applications. Note: The code for this tutorial is in my GridDB Blogs GitHub repository. Prerequisites You need to install the following libraries to run the code in this article. GridDB C Client GridDB Python client To install these libraries, follow the installation instructions on the GridDB Python Package Index (PyPI). The code is executed in Google Colab, so you do not need to install other libraries. Run the following script to import the required libraries into your Python application. import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import mean_absolute_error, mean_squared_error from sklearn.ensemble import RandomForestRegressor import tensorflow as tf from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Dropout, BatchNormalization from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint from tensorflow.keras.optimizers import Adam from tensorflow.keras.models import load_model import griddb_python as griddb Inserting Stress Detection Dataset into GridDB We will begin by inserting the stress detection dataset from Kaggle into GridDB. In a later section, we will retrieve data from GridDB and train our machine-learning algorithms for user stress prediction. Downloading and Importing the Stress Detection Dataset from Kaggle You can download the stress detection dataset from Kaggle and import it into your Python application. # Dataset download link # https://www.kaggle.com/datasets/swadeshi/stress-detection-dataset?resource=download dataset = pd.read_csv("stress_detection.csv") print(f"The dataset consists of {dataset.shape[0]} rows and {dataset.shape[1]} columns") dataset.head() Output: The dataset consists of 3000 records belonging to 100 users. For each user, 30 days of data are recorded for various attributes such as openness, sleep duration, screen time, mobility distance, and number of calls. The PSS_score column contains the perceived stress score, which ranges from 10 to 40. A higher score corresponds to a higher stress level. The following script displays various statistics for the PSS_score column. dataset["PSS_score"].describe() Output: count 3000.000000 mean 24.701000 std 8.615781 min 10.000000 25% 17.000000 50% 25.000000 75% 32.000000 max 39.000000 Name: PSS_score, dtype: float64 Next, we will insert the user stress dataset into GridDB. Connect to GridDB You need to connect to a GridDB instance before inserting data into it. To do so, you must create a GridDB factory instance using the griddb.StoreFactory.get_instance() method.
Next, you have to call the get_store method on the factory instance and pass the database host URL, cluster name, and user name and password. The get_store() method returns a grid store object that you can use to create containers in GridDB. To test whether the connection is successful, we retrieve a dummy container, container1, as shown in the script below. # GridDB connection details DB_HOST = “127.0.0.1:10001” DB_CLUSTER = “myCluster” DB_USER = “admin” DB_PASS = “admin” # creating a connection factory = griddb.StoreFactory.get_instance() try: gridstore = factory.get_store( notification_member = DB_HOST, cluster_name = DB_CLUSTER, username = DB_USER, password = DB_PASS ) container1 = gridstore.get_container(“container1”) if container1 == None: print(“Container does not exist”) print(“Successfully connected to GridDB”) except griddb.GSException as e: for i in range(e.get_error_stack_size()): print(“[“, i, “]”) print(e.get_error_code(i)) print(e.get_location(i)) print(e.get_message(i)) Output: Container does not exist Successfully connected to GridDB You should see the above message if the connection is successful. Create Container for User Stress Data in GridDB GridDB stores data in containers, which are specialized data structures for efficient data structure. The following script creates a GridDB container to store our stress detection dataset. We first remove any existing container with the name user_stress_data as we will use this name to create a new container. Next, we replace empty spaces in column names with an underscore since GridDB does not expect column names to have spaces. We will then map Pandas DataFrame data type to GridDB data types and create a column info list containing column names and corresponding GridDB data types, Next, we create a container info object and pass it the container name, the column info list, and the container type, which is COLLECTION for tabular data. Finally, we call the grid store’s put_container method and pass the container info object we created to it as a parameter. # drop container if already exists gridstore.drop_container(“user_stress_data”) # Clean column names to remove spaces or forbidden characters in the GridDB container dataset.columns = [col.strip().replace(” “, “_”) for col in dataset.columns] # Mapping from pandas data types to GridDB data types type_mapping = { ‘float64’: griddb.Type.DOUBLE, ‘int64’: griddb.Type.INTEGER, ‘object’: griddb.Type.STRING, ‘category’: griddb.Type.STRING # Map category to STRING for GridDB } # Generate column_info dynamically column_info = [[col, type_mapping[str(dtype)]] for col, dtype in dataset.dtypes.items()] # Define the container info container_name = “user_stress_data” container_info = griddb.ContainerInfo( container_name, column_info, griddb.ContainerType.COLLECTION, row_key=True ) # Connecting to GridDB and creating the container try: gridstore.put_container(container_info) container = gridstore.get_container(container_name) if container is None: print(f”Failed to create container: {container_name}”) else: print(f”Successfully created container: {container_name}”) except griddb.GSException as e: print(f”Error creating container {container_name}:”) for i in range(e.get_error_stack_size()): print(f”[{i}] Error code: {e.get_error_code(i)}, Message: {e.get_message(i)}”) Output: Successfully created container: user_stress_data The above message shows that the container creation is successful. Insert User Stress Data into GridDB We can now store data in the container we created. 
To do so, we iterate through the rows in our dataset, convert the column data into a GridDB data type, and store each row in the container using the put() method. The following script inserts our stress detection dataset into the user_stress_data container we created. try: for _, row in dataset.iterrows(): # Prepare each row's data in the exact order as defined in `column_info` row_data = [ int(row[col]) if dtype == griddb.Type.INTEGER else float(row[col]) if dtype == griddb.Type.DOUBLE else str(row[col]) for col, dtype in column_info ] # Insert the row data into the container container.put(row_data) print(f"Successfully inserted {len(dataset)} rows of data into {container_name}") except griddb.GSException as e: print(f"Error inserting data into container {container_name}:") for i in range(e.get_error_stack_size()): print(f"[{i}] Error code: {e.get_error_code(i)}, Message: {e.get_message(i)}") Output: Successfully inserted 3000 rows of data into user_stress_data The above output shows that data insertion is successful. Stress Detection Using Machine and Deep Learning In this section, we will retrieve the stress detection dataset from the user_stress_data GridDB container we created earlier. Subsequently, we will train machine learning and deep learning models for stress prediction. Retrieving Data from GridDB The following script defines the retrieve_data_from_griddb() function that accepts the container name as a parameter and calls the get_container() function on the grid store to retrieve the data container. Next, we create a SELECT query object and call its fetch() method to retrieve all records from the user_stress_data container. Finally, we call the fetch_rows() function to convert the records into a Pandas DataFrame. def retrieve_data_from_griddb(container_name): try: data_container = gridstore.get_container(container_name) # Query all data from the container query = data_container.query("select *") rs = query.fetch() data = rs.fetch_rows() return data except griddb.GSException as e: print(f"Error retrieving data from GridDB: {e.get_message()}") return None stress_dataset = retrieve_data_from_griddb("user_stress_data") stress_dataset.head() Output: The above output shows the stress detection dataset we retrieved from the GridDB container. Predicting User Stress with Machine Learning We will first try to predict the PSS_score using a traditional machine learning algorithm such as the Random Forest Regressor. The following script divides the dataset into features and labels, splits it into training and test sets, and normalizes it using the standard scaling approach. X = stress_dataset.drop(columns=['PSS_score']) y = stress_dataset['PSS_score'] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test) Next, we create an object of the RandomForestRegressor class from the Scikit-learn library and pass the training features and labels to the fit() method. rf_model = RandomForestRegressor(random_state=42, n_estimators=1000) rf_model.fit(X_train, y_train) Finally, we evaluate the model performance by predicting the PSS_score on the test set. rf_predictions = rf_model.predict(X_test) # Evaluate the regression model mae = mean_absolute_error(y_test, rf_predictions) print(f"Mean Absolute Error: {mae:.4f}") Output: Mean Absolute Error: 7.8973 The output shows that, on average, our model's predicted PSS_score is off by roughly 7.9 points.
This is not so bad, but it is not very good either. Next, we will develop a deep neural network for stress detection prediction. Predicting User Stress with Deep Learning We will use the TensorFlow and Keras libraries to create a sequential deep learning model with three dense layers. We will also add batch normalization and dropout to reduce model overfitting. We will also use an adaptive learning rate so the gradient doesn’t overshoot while training. Finally, we compile the model using the mean squared error loss and mean absolute error metric. We use this loss and metric since we are dealing with a regression problem. model = Sequential([ Dense(128, input_dim=X_train.shape[1], activation=’relu’), BatchNormalization(), Dropout(0.2), Dense(64, activation=’relu’), BatchNormalization(), Dropout(0.2), Dense(32, activation=’relu’), BatchNormalization(), Dropout(0.1), Dense(1) ]) # Adaptive learning rate scheduler with exponential decay initial_learning_rate = 0.001 lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay( initial_learning_rate=initial_learning_rate, decay_steps=10000, decay_rate=0.9 ) # Compile the model with Adam optimizer and a regression loss model.compile(optimizer=Adam(learning_rate=lr_schedule), loss=’mean_squared_error’, metrics=[‘mean_absolute_error’]) # Summary of the model model.summary() Output: The above output shows the model summary. Next, we train the model using the fit() method. We use an early stopping approach that stops model training if the loss doesn’t decrease for 100 consecutive epochs. Finally, we save the best model at the end of model training. # Define callbacks for training early_stopping = EarlyStopping(monitor=’val_loss’, patience=100, restore_best_weights=True) model_checkpoint = ModelCheckpoint(‘best_model.keras’, monitor=’val_loss’, save_best_only=True) # Train the model with the callbacks history = model.fit( X_train, y_train, epochs=1000, batch_size=4, validation_data=(X_test, y_test), callbacks=[early_stopping, model_checkpoint], verbose=1 ) Output: We load the best model to evaluate the performance and use it to make predictions on the test set. # Load the best model best_model = load_model(‘best_model.keras’) # Make predictions on the test set y_pred = best_model.predict(X_test) # Calculate Mean Absolute Error mae = mean_absolute_error(y_test, y_pred) print(f”Mean Absolute Error: {mae:.4f}”) # Plot training history to show MAE over epochs plt.plot(history.history[‘mean_absolute_error’], label=’Train MAE’) plt.plot(history.history[‘val_mean_absolute_error’], label=’Validation MAE’) plt.title(‘Mean Absolute Error over Epochs’) plt.xlabel(‘Epochs’) plt.ylabel(‘MAE’) plt.legend() plt.show() Output: On the test set, we achieved a mean absolute error value of 7.89, similar to what we achieved using the Random Forest Regressor. The results also show that our model is slightly overfitting since the training loss is lower compared to validation loss across the epochs. Conclusion This article is a comprehensive guide to developing a stress detection system using machine learning, deep learning regression models, and the GridDB database. In this article, you explored the process of connecting to GridDB, inserting a stress detection dataset, and utilizing Random Forest and deep neural networks to predict perceived stress scores. The Random Forest and deep learning models performed decently with a manageable mean absolute error. 
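As a closing usage note, once the network has been saved you can score a new record in just a few lines. The sketch below assumes the fitted scaler and the feature DataFrame X from the training steps above are still in memory; in a real deployment you would persist and reload the scaler as well.

# Minimal inference sketch: predict the PSS_score for one (stand-in) user record
from tensorflow.keras.models import load_model

best_model = load_model('best_model.keras')
new_record = X.iloc[[0]]                      # stand-in for a fresh row of features
new_record_scaled = scaler.transform(new_record)
predicted_pss = best_model.predict(new_record_scaled)[0][0]
print(f"Predicted perceived stress score: {predicted_pss:.1f}")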
If you have any questions or need assistance with GridDB or machine learning techniques, please ask on Stack Overflow using the griddb tag. Our team is always happy to help. For the complete code, visit my GridDB Blogs GitHub

IoT Intrusion Detection

Intrusions refer to unauthorized activities that exploit vulnerabilities in IoT devices, which can compromise sensitive data or disrupt essential services. Detecting these intrusions is crucial to maintaining the security and integrity of IoT networks. This article demonstrates how to develop a robust intrusion detection system for IoT environments using a machine learning and deep learning approach with the GridDB database. We will begin by retrieving IoT intrusion detection data from Kaggle, storing it in a GridDB container, and using this data to train machine learning and deep learning models to identify different types of intrusions. GridDB, a high-performance NoSQL database, is particularly suited for handling the large-scale, real-time data generated by IoT systems due to its efficient in-memory processing and time series capabilities. Using GridDB's powerful IoT data management features along with advanced machine learning and deep learning, we will build a predictive model that identifies potential threats to IoT devices. Note: You can find the code for this tutorial in my GridDB Blogs GitHub repository. Prerequisites You need to install the following libraries to run the code in this article. GridDB C Client GridDB Python client To install these libraries, follow the installation instructions on the GridDB Python Package Index (PyPI). Since the code is executed in Google Colab, you do not need to install any other libraries. Run the following script to import the required libraries into your Python application. import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score, classification_report from sklearn.preprocessing import LabelEncoder import tensorflow as tf from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Dropout, BatchNormalization from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint from tensorflow.keras.optimizers import Adam from tensorflow.keras.models import load_model import griddb_python as griddb Inserting IoT Data into GridDB In this section, you will see how to download IoT data from Kaggle, import it into your Python application, and store it in the GridDB database. Along the way, you will learn to connect your Python application to GridDB, create a GridDB container, and insert data into it. Downloading and Importing the IoT Dataset From Kaggle We will insert the IoT Intrusion Detection dataset from Kaggle into GridDB. The dataset consists of different types of IoT intrusions. The following script imports the dataset into a Pandas DataFrame. # Dataset download link #https://www.kaggle.com/datasets/subhajournal/iotintrusion/data dataset = pd.read_csv("IoT_Intrusion.csv") print(f"The dataset consists of {dataset.shape[0]} rows and {dataset.shape[1]} columns") dataset.head() Output: The above output shows that the dataset consists of 1048575 rows and 47 columns. For the sake of simplicity, we will train our machine learning models on 200k records. The following script randomly selects 200k records from the original dataset. The label column contains the intrusion types. We will also plot the count for each intrusion type.
dataset = dataset.sample(n=200000, random_state=42) print(f”The dataset consists of {dataset.shape[0]} rows and {dataset.shape[1]} columns”) print(f”The total number of output labels are {dataset[‘label’].nunique()}”) dataset[‘label’].value_counts() Output: The above output shows all the 34 intrusion types in our dataset with DDoS-ICMP_FLOOD being the most frequently occurring intrusion while Uploading_Attack is the least frequently occurring intrusion type. For simplification’s sake we will group the 34 categories into 9 major categories using the following script. category_map = { ‘DDoS’: [ ‘DDoS-ICMP_Flood’, ‘DDoS-UDP_Flood’, ‘DDoS-TCP_Flood’, ‘DDoS-PSHACK_Flood’, ‘DDoS-SYN_Flood’, ‘DDoS-RSTFINFlood’, ‘DDoS-SynonymousIP_Flood’, ‘DDoS-ICMP_Fragmentation’, ‘DDoS-ACK_Fragmentation’, ‘DDoS-UDP_Fragmentation’, ‘DDoS-HTTP_Flood’, ‘DDoS-SlowLoris’ ], ‘DoS’: [ ‘DoS-UDP_Flood’, ‘DoS-TCP_Flood’, ‘DoS-SYN_Flood’, ‘DoS-HTTP_Flood’ ], ‘Brute Force’: [ ‘DictionaryBruteForce’ ], ‘Spoofing’: [ ‘MITM-ArpSpoofing’, ‘DNS_Spoofing’ ], ‘Recon’: [ ‘Recon-HostDiscovery’, ‘Recon-OSScan’, ‘Recon-PortScan’, ‘Recon-PingSweep’ ], ‘Web-based’: [ ‘SqlInjection’, ‘CommandInjection’, ‘XSS’, ‘BrowserHijacking’, ‘Uploading_Attack’ ], ‘Mirai’: [ ‘Mirai-greeth_flood’, ‘Mirai-udpplain’, ‘Mirai-greip_flood’ ], ‘Other’: [ ‘VulnerabilityScan’, ‘Backdoor_Malware’ ], ‘Benign-trafic’: [ ‘BenignTraffic’ ] } # Reverse the mapping to allow lookup by subcategory subcategory_to_parent = {subcat: parent for parent, subcats in category_map.items() for subcat in subcats} # Add the ‘class’ column using the mapping dataset[‘class’] = dataset[‘label’].map(subcategory_to_parent) dataset[‘class’].value_counts() Output: You can now see that DDoS intrusion is the most frequently occuring intrusion followed by DoS and Mirai. Let’s plot a bar plot for the class distribution. class_counts = dataset[‘class’].value_counts() sns.set(style=”darkgrid”) plt.figure(figsize=(10, 6)) sns.barplot(x=class_counts.index, y=class_counts.values) plt.title(“Class Distribution”) plt.xlabel(“Class”) plt.ylabel(“Count”) plt.xticks(rotation=45) plt.show() Output: The above output shows that our dataset is highly imbalanced. In the next section, we will insert this data into GridDB. Connect to GridDB You need to perform the following steps to connect your Python application to a GridDB instance. Create an instance of the griddb.StoreFactory object using the get_instance() method. Create a GridDB store factory object by calling the get_store() method on the StoreFactory object. You need to pass the GridDB host and cluster name, and the user and password that you use to connect to the GridDB instance. This should establish a connection to your GridDB instance. To test the connection create a container by calling the get_container() method and pass to it a dummy container name. This step is optional and only tests the connection. The following script shows how to connect to GridDB. 
In the next section, we will insert this data into GridDB.

Connect to GridDB

You need to perform the following steps to connect your Python application to a GridDB instance:

1. Create an instance of the griddb.StoreFactory object using the get_instance() method.
2. Create a GridDB store object by calling the get_store() method on the factory object. You need to pass the GridDB host, the cluster name, and the user name and password you use to connect to the GridDB instance. This establishes a connection to your GridDB instance.
3. To test the connection, call the get_container() method and pass it a dummy container name. This step is optional and only tests the connection.

The following script shows how to connect to GridDB.

# GridDB connection details
DB_HOST = "127.0.0.1:10001"
DB_CLUSTER = "myCluster"
DB_USER = "admin"
DB_PASS = "admin"

# creating a connection
factory = griddb.StoreFactory.get_instance()

try:
    gridstore = factory.get_store(
        notification_member = DB_HOST,
        cluster_name = DB_CLUSTER,
        username = DB_USER,
        password = DB_PASS
    )

    container1 = gridstore.get_container("container1")
    if container1 == None:
        print("Container does not exist")
    print("Successfully connected to GridDB")

except griddb.GSException as e:
    for i in range(e.get_error_stack_size()):
        print("[", i, "]")
        print(e.get_error_code(i))
        print(e.get_location(i))
        print(e.get_message(i))

Output:

Container does not exist
Successfully connected to GridDB

If you see the above output, you have successfully connected to GridDB.
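If you connect from several scripts or notebooks, you may prefer not to hard-code credentials. The helper below is a small optional sketch that reuses exactly the calls shown above; the function name connect_to_griddb and the environment variable names are ours, not part of the client API:

import os
import griddb_python as griddb

def connect_to_griddb(host="127.0.0.1:10001", cluster="myCluster"):
    """Return a GridDB store object, reading credentials from the environment."""
    factory = griddb.StoreFactory.get_instance()
    return factory.get_store(
        notification_member=host,
        cluster_name=cluster,
        username=os.environ.get("GRIDDB_USER", "admin"),
        password=os.environ.get("GRIDDB_PASS", "admin")
    )

gridstore = connect_to_griddb()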
Create Container for IoT Data in GridDB

GridDB stores data in containers, so you need to create a container to store your IoT data. You will need to perform the following steps to create a GridDB container capable of storing it:

1. Check whether a container with the name you want to use already exists. If it does, either delete the existing container or choose a different name for your new container.
2. Convert your dataset into the format GridDB expects. For example, GridDB expects each row to have a unique ID, so add a unique ID column if one doesn't already exist. Similarly, GridDB doesn't accept spaces in column names, so you will need to preprocess the column names too.
3. GridDB data types are different from Pandas dataframe types, so define a mapping from Pandas column types to GridDB data types and build a column_info list containing the column names and their corresponding mapped data types.
4. Call the put_container() method on the store object and pass it the container name, the column_info list, and the container type (griddb.ContainerType.COLLECTION in this case). The row_key is set to True since each row has a unique ID.
5. To test whether the container was successfully created, call the get_container() method to retrieve the container you created.

The following script creates the IoT_Data container in our GridDB instance.

# drop container if already exists
gridstore.drop_container("IoT_Data")

# Add a primary key column
dataset.insert(0, 'ID', range(1, len(dataset) + 1))

# Clean column names to remove spaces or forbidden characters in the GridDB container
dataset.columns = [col.strip().replace(" ", "_") for col in dataset.columns]

# Mapping from pandas data types to GridDB data types
type_mapping = {
    'float64': griddb.Type.DOUBLE,
    'int64': griddb.Type.INTEGER,
    'object': griddb.Type.STRING
}

# Generate column_info dynamically, adding ID as the first entry
column_info = [["ID", griddb.Type.INTEGER]] + [
    [col, type_mapping[str(dtype)]] for col, dtype in dataset.dtypes.items() if col != "ID"
]

# Define the container info with ID as the primary key and as a collection container
container_name = "IoT_Data"
container_info = griddb.ContainerInfo(
    container_name,
    column_info,
    griddb.ContainerType.COLLECTION,
    row_key=True
)

# Connecting to GridDB and creating the container
try:
    gridstore.put_container(container_info)
    container = gridstore.get_container(container_name)
    if container is None:
        print(f"Failed to create container: {container_name}")
    else:
        print(f"Successfully created container: {container_name}")

except griddb.GSException as e:
    print(f"Error creating container {container_name}:")
    for i in range(e.get_error_stack_size()):
        print(f"[{i}] Error code: {e.get_error_code(i)}, Message: {e.get_message(i)}")

Insert IoT Data into GridDB

We are now ready to insert the dataframe into the GridDB container we just created. To do so, we iterate through all the rows in our dataset, fetch the column data and column type for each row, and insert the data using the container.put() method. The following script inserts our IoT data from the Pandas dataframe into the GridDB IoT_Data container.

try:
    for _, row in dataset.iterrows():
        # Prepare each row's data in the exact order as defined in `column_info`
        row_data = [
            int(row[col]) if dtype == griddb.Type.INTEGER else
            float(row[col]) if dtype == griddb.Type.DOUBLE else
            str(row[col])
            for col, dtype in column_info
        ]
        # Insert the row data into the container
        container.put(row_data)

    print(f"Successfully inserted {len(dataset)} rows of data into {container_name}")

except griddb.GSException as e:
    print(f"Error inserting data into container {container_name}:")
    for i in range(e.get_error_stack_size()):
        print(f"[{i}] Error code: {e.get_error_code(i)}, Message: {e.get_message(i)}")

Output:

Successfully inserted 200000 rows of data into IoT_Data
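Inserting 200,000 rows with one put() call per row works, but it can be slow. The GridDB Python client also provides a multi_put() method on containers that accepts a list of rows, which you could use to batch the writes; a minimal sketch, assuming the same dataset, column_info, and container objects as above (the batch size of 1,000 is an arbitrary choice):

try:
    batch, batch_size = [], 1000
    for _, row in dataset.iterrows():
        row_data = [
            int(row[col]) if dtype == griddb.Type.INTEGER else
            float(row[col]) if dtype == griddb.Type.DOUBLE else
            str(row[col])
            for col, dtype in column_info
        ]
        batch.append(row_data)
        if len(batch) == batch_size:
            container.multi_put(batch)  # write the whole batch in one call
            batch = []
    if batch:
        container.multi_put(batch)      # flush any remaining rows
    print("Batched insert complete")
except griddb.GSException as e:
    for i in range(e.get_error_stack_size()):
        print(f"[{i}] {e.get_message(i)}")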
In the next sections, we will retrieve the IoT data from GridDB and train machine learning and deep learning models to predict the intrusion type.

Forecasting IoT Intrusion Using Machine Learning and Deep Learning

We will try both machine learning and deep learning approaches for predicting the intrusion type. But first, we will see how to retrieve data from GridDB.

Retrieving Data From GridDB

To retrieve the GridDB data, fetch the container, call its query() method, and pass a SELECT * query to it, which retrieves all the records from the container. Next, call the fetch() method to execute the query, and finally call the fetch_rows() method to store the result set in a Pandas dataframe. The following script defines the retrieve_data_from_griddb() method, which retrieves data from a GridDB container into a Pandas dataframe.

def retrieve_data_from_griddb(container_name):
    try:
        data_container = gridstore.get_container(container_name)

        # Query all data from the container
        query = data_container.query("select *")
        rs = query.fetch()

        data = rs.fetch_rows()
        data.set_index("ID", inplace=True)
        return data

    except griddb.GSException as e:
        print(f"Error retrieving data from GridDB: {e.get_message()}")
        return None

iot_data = retrieve_data_from_griddb("IoT_Data")
iot_data.head()

Output:

IoT Data Classification Using Machine Learning

We will first use the Random Forest algorithm, a tree-based machine learning classifier, to predict the intrusion type. To do so, we divide our dataset into feature and label sets and then into training and test sets. We also standardize the features.

# Separate the features (X) and the output class (y)
X = iot_data.drop(columns=['label', 'class'])  # Dropping both `label` and `class` columns as `class` is the target
y = iot_data['class']  # Output target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Next, we use the RandomForestClassifier() class from the sklearn module and call its fit() method to train the algorithm on the training data.

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

Finally, we can make predictions using the predict() method and compare the predictions with the actual labels in the test set to calculate the model accuracy.

rf_predictions = rf_model.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_predictions)
classification_rep = classification_report(y_test, rf_predictions, zero_division=1)
print("Classification Report:\n", classification_rep)

Output:

The above output shows that our model achieves an accuracy of 99% on the test set. It is important to note that since the Brute Force class had only 47 instances in the training set, the model is not able to learn much about this category.
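Since Random Forests expose impurity-based feature importance scores, a quick look at which columns drive the predictions can help put the 99% figure in context; a minimal optional sketch, assuming the rf_model and the unscaled feature dataframe X from above:

# Rank features by the Random Forest's impurity-based importance scores
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))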
IoT Data Classification Using Deep Learning

Let's now predict intrusions using a deep neural network implemented with the TensorFlow Keras library. We first convert the output labels to integers since deep learning algorithms work with numbers only. Next, we standardize the training and test sets as we did before.

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(iot_data['class'])  # Integer encoding

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

The following script defines our model with four hidden layers followed by an output layer. Since we have a multiclass classification problem, we use the softmax function in the output layer. We also use an exponentially decaying learning rate to stabilize training.

# Define the model
model = Sequential([
    Dense(256, input_dim=X_train.shape[1], activation='relu'),
    BatchNormalization(),
    Dropout(0.4),

    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.4),

    Dense(64, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),

    Dense(32, activation='relu'),
    BatchNormalization(),
    Dropout(0.3),

    Dense(len(pd.unique(y)), activation='softmax')  # Softmax for multiclass classification
])

# Adaptive learning rate scheduler with exponential decay
initial_learning_rate = 0.001
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=initial_learning_rate,
    decay_steps=10000,
    decay_rate=0.9
)

# Compile the model with the Adam optimizer and the decayed learning rate
model.compile(optimizer=Adam(learning_rate=lr_schedule),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

The following script trains our deep learning model on the training set, saving the best model (based on validation accuracy) after each epoch.

# Define callbacks without ReduceLROnPlateau
early_stopping = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_model.keras', monitor='val_accuracy', save_best_only=True)

# Train the model with the callbacks
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_data=(X_test, y_test),
    callbacks=[early_stopping, model_checkpoint],
    verbose=1
)

Output:

Once the training is complete, we load the best model and make predictions on the test set. We compare the predictions with the actual target labels to calculate model performance.

# Load the best model
best_model = load_model('best_model.keras')

y_pred = best_model.predict(X_test).argmax(axis=-1)

print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))

Output:

The above output shows that we achieve an accuracy of 89% on the test set, which is considerably lower than what we achieved with the Random Forest algorithm. One likely reason is that tree-based algorithms such as Random Forest are known to perform better on tabular datasets. Furthermore, the performance of deep learning models is significantly affected by class imbalance.

Conclusion

This article demonstrated a complete workflow for building an IoT intrusion detection system using GridDB for data management and machine learning and deep learning models for classification. We covered how to connect to GridDB, store IoT intrusion data, and use Random Forest and deep neural network models to predict intrusion types. The Random Forest model achieved high accuracy, showing its suitability for tabular datasets, while the deep learning model highlighted the challenges of data imbalance.

If you have any questions or need help with GridDB, feel free to reach out on Stack Overflow with the griddb tag, and our engineers will be glad to help.
