Blog

Migrate from PostgreSQL to GridDB Cloud

In a previous article, we talked about some new features in the GridDB Cloud CLI Tool, which showcased the ability to make tables based on schemas in the form of JSON files. Included in that was also the ability to migrate from GridDB Community Edition (on-prem) to the Cloud in just a few commands; you can read that article here: New Features of the GridDB Cloud CLI Tool. In this article, we will again feature the GridDB Cloud CLI Tool, but this time we want to showcase the ability to migrate from your on-prem PostgreSQL database directly to your GridDB Cloud instance.

We will get into the details later in the article, but here is a basic rundown of the steps required to make this work:

1. Export PostgreSQL tables as CSV row data
2. Export the schemas of all PostgreSQL tables as a JSON file
3. Use the GridDB Cloud CLI Tool to transform the data, create the corresponding tables, and then load the data into GridDB Cloud

Also please note: this tool is intended for developers and is not of production-level quality. It is not made by the official GridDB development team, so if you face any issues, please leave a comment on this article or in the GridDB Cloud CLI Tool's repo and we will look into it.

Downloading the Tool and Examples

The source code for this tool and the binaries for the latest version are found on GitHub: https://github.com/Imisrael/griddb-cloud-cli. The example directories with CSV data that you can test with are found in the migrate-psql branch: https://github.com/Imisrael/griddb-cloud-cli/tree/migrate-psql/migration_dirs.

Scripts to Export PostgreSQL

Before we showcase running an example of this process, let's first take a look at the scripts needed to export the data from your PostgreSQL instance into a more neutral and workable data format (CSV). We will use two different scripts in this phase: one to extract row data, and the other to grab 'meta'/schema data. Both of these scripts rely on the psql command to communicate directly with the database and grab the data we need.

Script for Row Data

Here is the script to extract your row data; it will generate a unique CSV file for each table in your database, with the filename corresponding directly to the name of the table (handy!).

#!/bin/bash

# Check if a database name was provided.
if [ -z "$1" ]; then
  echo "Usage: $0 <database_name>"
  exit 1
fi

DB_NAME=$1
EXPORT_DIR="." # Export to the current directory.

# Get a list of all tables in the 'public' schema.
TABLES=$(psql -d $DB_NAME -t -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' AND table_type = 'BASE TABLE';")

# Loop through the tables and export each one to a CSV file.
for TBL in $TABLES; do
  echo "Exporting table: $TBL"
  psql -d $DB_NAME -c "\copy (SELECT * FROM $TBL) TO '$EXPORT_DIR/$TBL.csv' WITH (FORMAT CSV, HEADER);"
done

echo "Export complete."

The script itself is self-explanatory: it exports all tables into the current working directory, with the header as the top row and every subsequent row being the data from the table.
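If you would rather do the export step from Python instead of bash, the same result can be produced with pandas and SQLAlchemy. This is only a rough sketch, assuming SQLAlchemy and a PostgreSQL driver such as psycopg2 are installed and that the connection string is adjusted for your own database; the psql script above remains the reference approach.

```python
# Optional alternative to the bash script: export every 'public' table to CSV
# from Python. Assumes SQLAlchemy and a PostgreSQL driver (e.g. psycopg2) are
# installed; adjust the connection string for your own database.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/your_db")

tables_sql = text(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'public' AND table_type = 'BASE TABLE'"
)

with engine.connect() as conn:
    tables = [row[0] for row in conn.execute(tables_sql)]

for tbl in tables:
    # One CSV per table, named after the table, header row included.
    df = pd.read_sql_query(f'SELECT * FROM "{tbl}"', engine)
    df.to_csv(f"{tbl}.csv", index=False)
    print(f"Exported {tbl}: {len(df)} rows")
```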
Script for Schema Data

This script is a bit more clever and relies on a feature specific to PostgreSQL: its json_agg() function, which "is an aggregate function that collects values from multiple rows and returns them as a single JSON array" (source). Here is what the script looks like:

#!/bin/bash

psql -d postgres -X -A -t -c "
WITH column_details AS (
    -- First, get all the column info for each table
    SELECT
        table_name,
        json_agg(
            json_build_object(
                'column_name', column_name,
                'data_type', udt_name,
                'is_nullable', is_nullable,
                'column_default', column_default
            ) ORDER BY ordinal_position
        ) AS columns
    FROM information_schema.columns
    WHERE table_schema = 'public'
    GROUP BY table_name
),
primary_key_details AS (
    -- Next, find the primary key columns for each table
    SELECT
        kcu.table_name,
        json_agg(kcu.column_name) AS pk_columns
    FROM information_schema.table_constraints AS tc
    JOIN information_schema.key_column_usage AS kcu
        ON tc.constraint_name = kcu.constraint_name
        AND tc.table_schema = kcu.table_schema
    WHERE tc.constraint_type = 'PRIMARY KEY' AND tc.table_schema = 'public'
    GROUP BY kcu.table_name
)
-- Finally, join them together
SELECT json_object_agg(
    cd.table_name,
    json_build_object(
        'primary_key', COALESCE(pkd.pk_columns, '[]'::json),
        'columns', cd.columns
    )
)
FROM column_details cd
LEFT JOIN primary_key_details pkd ON cd.table_name = pkd.table_name;
" > schema.json

If we were only grabbing the schema, this script would be about half the length, but we also grab the primary key, which requires a JOIN operation. We grab this data in order to ensure we migrate time series tables into TIME_SERIES containers on the GridDB side (more on that later).

Changes to the GridDB Cloud CLI Tool

To get this process to work, we first needed to separate out the migrate tool. Prior to this release, the migrate tool only expected to work with GridDB CE, and so the command worked like this: griddb-cloud-cli migrate [griddb-out-directory]. Obviously this no longer works, as we need to indicate to the tool what sort of database we are importing into GridDB Cloud. So what we did was separate out the commands as subcommands. This meant keeping the more generic functions that work for both in the root of the command (things like reading a CSV file or parsing a JSON file) and keeping the DB-specific functions inside the respective subcommand files. So now our migrate/ directory has three files instead of one: migrateCmd.go, migrateGridDB.go, and migratePSQL.go.

Here's a brief example of a function which is unique to the PostgreSQL import process:

func typeSwitcher(s string) string {
	switch s {
	case "bool":
		return "BOOL"
	case "char", "varchar", "text":
		return "STRING"
	case "int", "int2", "int4":
		return "INTEGER"
	case "int8":
		return "LONG"
	case "decimal", "real", "numeric":
		return "FLOAT"
	case "float", "float8":
		return "DOUBLE"
	case "timetz", "timestamptz":
		return "TIMESTAMP"
	default:
		return strings.ToUpper(s)
	}
}

It maps the PostgreSQL column types found in schema.json to their GridDB equivalents.
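If you want to sanity-check or pre-process a schema.json yourself before running the tool, the same mapping is easy to express in Python. The sketch below is purely illustrative (it is not part of the CLI tool), and the function name is made up for the example.

```python
# Purely illustrative: a Python version of the tool's typeSwitcher, mapping the
# PostgreSQL udt_name values found in schema.json to GridDB column types.
PSQL_TO_GRIDDB = {
    "bool": "BOOL",
    "char": "STRING", "varchar": "STRING", "text": "STRING",
    "int": "INTEGER", "int2": "INTEGER", "int4": "INTEGER",
    "int8": "LONG",
    "decimal": "FLOAT", "real": "FLOAT", "numeric": "FLOAT",
    "float": "DOUBLE", "float8": "DOUBLE",
    "timetz": "TIMESTAMP", "timestamptz": "TIMESTAMP",
}

def type_switcher(psql_type: str) -> str:
    # Unknown types fall through as upper-cased strings, like the Go default case.
    return PSQL_TO_GRIDDB.get(psql_type, psql_type.upper())

print(type_switcher("int8"))         # LONG
print(type_switcher("timestamptz"))  # TIMESTAMP
```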
Migrating from PostgreSQL

Now that we have our data ready for use, let's run the migration tool and see the results. First, place the contents of your exporting efforts into their own directory:

$ mkdir psql_exported
$ mv *.csv psql_exported/
$ mv schema.json psql_exported/

And now we can run the tool:

$ griddb-cloud-cli migrate psql psql_exported/
{"container_name":"customers","container_type":"COLLECTION","rowkey":false,"columns":[{"name":"customer_id","type":"INTEGER","index":null},{"name":"first_name","type":"STRING","index":null},{"name":"last_name","type":"STRING","index":null},{"name":"email","type":"STRING","index":null},{"name":"phone_number","type":"STRING","index":null},{"name":"address","type":"STRING","index":null},{"name":"created_at","type":"TIMESTAMP","index":null}]}
✔ Make Container?
{
  "container_name": "customers",
  "container_type": "COLLECTION",
  "rowkey": false,
  "columns": [
    { "name": "customer_id", "type": "INTEGER", "index": null },
    { "name": "first_name", "type": "STRING", "index": null },
    { "name": "last_name", "type": "STRING", "index": null },
    { "name": "email", "type": "STRING", "index": null },
    { "name": "phone_number", "type": "STRING", "index": null },
    { "name": "address", "type": "STRING", "index": null },
    { "name": "created_at", "type": "TIMESTAMP", "index": null }
  ]
}
… YES
201 Created
inserting into (customers). csv: psql_exported/customers.csv
200 OK
{"container_name":"products","container_type":"COLLECTION","rowkey":false,"columns":[{"name":"id","type":"INTEGER","index":null},{"name":"name","type":"STRING","index":null},{"name":"category","type":"STRING","index":null},{"name":"price","type":"FLOAT","index":null},{"name":"stock_quantity","type":"INTEGER","index":null}]}
✔ Make Container?
{
  "container_name": "products",
  "container_type": "COLLECTION",
  "rowkey": false,
  "columns": [
    { "name": "id", "type": "INTEGER", "index": null },
    { "name": "name", "type": "STRING", "index": null },
    { "name": "category", "type": "STRING", "index": null },
    { "name": "price", "type": "FLOAT", "index": null },
    { "name": "stock_quantity", "type": "INTEGER", "index": null }
  ]
}
… YES
201 Created
inserting into (products). csv: psql_exported/products.csv
200 OK
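If you would like an extra sanity check outside of the CLI tool, you can also count the rows that landed in a container directly through the GridDB Cloud Web API. The sketch below is only illustrative: the base URL and credentials are placeholders, and the rows endpoint and request body follow the pattern used in the other GridDB Cloud tutorials on this blog.

```python
# Optional sanity check: count the rows that landed in the 'customers' container.
# The base URL and credentials below are placeholders; replace them with your own.
import base64
import json
import requests

base_url = "https://cloud.griddb.com/yourpath/griddb/v2"   # placeholder Web API URL
token = base64.b64encode(b"username:password").decode()    # placeholder credentials
headers = {"Content-Type": "application/json", "Authorization": f"Basic {token}"}

payload = json.dumps({"offset": 0, "limit": 10000, "condition": "", "sort": ""})
resp = requests.post(f"{base_url}/containers/customers/rows", headers=headers, data=payload)
resp.raise_for_status()

rows = resp.json().get("rows", [])
print(f"customers: {len(rows)} rows migrated")
```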
And now, of course, the last step: checking with the CLI tool itself that our data was successfully migrated.

$ griddb-cloud-cli show customers
{
  "container_name": "customers",
  "container_type": "COLLECTION",
  "rowkey": false,
  "columns": [
    { "name": "customer_id", "type": "INTEGER", "index": [] },
    { "name": "first_name", "type": "STRING", "index": [] },
    { "name": "last_name", "type": "STRING", "index": [] },
    { "name": "email", "type": "STRING", "index": [] },
    { "name": "phone_number", "type": "STRING", "index": [] },
    { "name": "address", "type": "STRING", "index": [] },
    { "name": "created_at", "type": "TIMESTAMP", "timePrecision": "MILLISECOND", "index": [] }
  ]
}

And a quick read:

$ griddb-cloud-cli read customers -r
[ { "name": "customers", "stmt": "select * limit 50", "columns": null, "hasPartialExecution": true }]
customer_id,first_name,last_name,email,phone_number,address,created_at,
[1 Alice Johnson alice.j@email.com 555-0101 123 Maple St, Springfield 2025-09-04T06:32:05.700Z]
[2 Bob Smith bob.smith@email.com 555-0102 456 Oak Ave, Shelbyville 2025-09-04T06:32:05.700Z]
[3 Charlie Brown charlie@email.com 555-0103 789 Pine Ln, Capital City 2025-09-04T06:32:05.700Z]
[4 Diana Prince diana.p@email.com 901 Birch Rd, Themyscira 2025-09-04T06:32:05.700Z]
[5 Ethan Hunt ethan.hunt@email.com 555-0105 1122 Mission St, Los Angeles 2025-09-04T06:32:05.700Z]

And for fun:

$ griddb-cloud-cli read graph products --columns 'stock_quantity'
["stock_quantity"]
[ { "name": "products", "stmt": "select * limit 50", "columns": ["stock_quantity"], "hasPartialExecution": true }]
[ASCII line chart of the stock_quantity values, rendered in the terminal]
Col names from container products
■ stock_quantity

Conclusion

Never has it been easier to migrate from an on-prem SQL database to GridDB Cloud, and we hope these tools make the process much simpler for anyone looking to try out GridDB Cloud.

More
Sentiment Analysis on Product Reviews using OpenAI

Overview

This article explores a practical implementation of sentiment analysis on product reviews using OpenAI's large language models (LLMs) and GridDB. The project involves extracting review data, analyzing sentiment using AI, and efficiently storing and querying the results. It's an example of how modern AI tools and database systems can be combined to process and understand massive amounts of customer feedback.

Why Use AI for Sentiment Analysis?

Manual analysis of customer sentiment is impossible at scale due to the sheer volume of user-generated content. Artificial intelligence, especially LLMs like OpenAI's GPT, enables automated analysis with human-level language understanding. Key advantages include:

Scalability: AI can process millions of reviews quickly and consistently.
Context Awareness: LLMs are capable of identifying sarcasm, subtle opinions, and emotional tone.
Language Versatility: These models handle slang, informal language, and multilingual text more effectively than rule-based systems.

AI-driven sentiment analysis helps businesses track public opinion, improve customer service, and guide product development based on real user feedback.

Running the Project

Clone the Repository

git clone https://github.com/junwatu/sentiment-analysis-griddb.git
cd app
npm install

Setup Environment Variables

Create a .env file in the app directory of the project and add the following environment variables:

OPENAI_API_KEY=
GRIDDB_WEBAPI_URL=
GRIDDB_USERNAME=
GRIDDB_PASSWORD=
VITE_BASE_URL=http://localhost:3000

You can copy the .env.example file to .env and fill in the values. To get the OpenAI API key, read this section, and for GridDB Cloud, read this section. By default, VITE_BASE_URL is set to http://localhost:3000, but you can change it to your own domain.

Run the Project

npm run start

Access the Project

If you don't change the VITE_BASE_URL, open your browser and navigate to http://localhost:3000 to access the project.

Technologies Used

Node.js
You need Node.js installed because this project is built with Vite. Install the Node LTS version from here.

OpenAI
AI models are used for natural language processing and sentiment analysis. Create the OpenAI API key here. You may need to create a project and enable a few models. We will use the gpt-4o model.

GridDB Cloud Setup

GridDB Cloud offers a free plan tier and is officially available worldwide. You need these GridDB environment variables in the .env file:

GRIDDB_WEBAPI_URL=
GRIDDB_USERNAME=
GRIDDB_PASSWORD=

Sign Up for GridDB Cloud Free Plan

If you would like to sign up for a GridDB Cloud Free instance, you can do so at the following link: https://form.ict-toshiba.jp/download_form_griddb_cloud_freeplan_e. After successfully signing up, you will receive a free instance along with the necessary details to access the GridDB Cloud Management GUI, including the GridDB Cloud Portal URL, Contract ID, Login, and Password.

GridDB WebAPI URL

Go to the GridDB Cloud Portal and copy the WebAPI URL from the Clusters section; this value becomes GRIDDB_WEBAPI_URL.

GridDB Username and Password

Go to the GridDB Users section of the GridDB Cloud portal and create or copy the username for GRIDDB_USERNAME. The password is set when the user is created for the first time; use this as the GRIDDB_PASSWORD. For more details on getting started with GridDB Cloud, please follow this quick start guide.
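Before wiring these values into the app, it can be worth confirming that the Web API URL and credentials actually work. The following is a minimal Python sketch of the checkConnection call used elsewhere on this blog; the values shown are placeholders for the ones in your .env file.

```python
# Minimal credential check against the GridDB Cloud Web API; expects HTTP 200.
# Replace the placeholders with GRIDDB_WEBAPI_URL, GRIDDB_USERNAME and GRIDDB_PASSWORD.
import base64
import requests

base_url = "https://cloud.griddb.com/yourpath/griddb/v2"  # placeholder
username, password = "your_user", "your_password"         # placeholders

token = base64.b64encode(f"{username}:{password}".encode()).decode()
headers = {"Authorization": f"Basic {token}"}

response = requests.get(f"{base_url}/checkConnection", headers=headers)
print(response.status_code)  # 200 means the URL and credentials are valid
```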
IP Whitelist

When running this project, please ensure that the IP address where the project is running is whitelisted. Failure to do so will result in a 403 status code or forbidden access. You can use a website like What Is My IP Address to find your public IP address. To whitelist the IP, go to the GridDB Cloud Admin and navigate to the Network Access menu.

Project Architecture

The diagram illustrates the flow of data in the sentiment analysis system: product review data is fed into the system, the OpenAI API analyzes this data for sentiment, the sentiment results and associated metadata are then stored in GridDB, and finally the sentiment analysis results are presented through a user interface. This represents a pipeline where raw review data is processed, analyzed for sentiment, stored in GridDB, and then made available for viewing.

Dataset Preparation

You can use any review data for this project. For example, you can use the Amazon Review Dataset or just copy and paste a review from one of the product reviews on Amazon.

Server

The server is built using Node.js and Express.js. It is a simple server that provides a few API routes. You can see the server code in the server.ts file.

API Routes

Method   Endpoint               Description
POST     /api/sentiment         Run sentiment analysis for a review and persist the result in GridDB
GET      /api/sentiments        Retrieve all stored review records
GET      * (any non-API path)   Serve the main UI

Running Sentiment Analysis with OpenAI

We use the OpenAI API (e.g., GPT-4o) to evaluate the sentiment of each review. The input is a text prompt that asks the model to categorize the sentiment.

System Prompt

We will use a system prompt to guide the model to understand the task. The system prompt is as follows:

You are a sentiment-analysis classifier for Amazon user-review records. You will receive one JSON object that contains the fields "title" and "text" (and sometimes "rating" which you must ignore). Your task:\n1. Read the natural-language content (title + text).\n2. Predict the sentiment label and an estimated star rating without looking at any numeric "rating" field.\n3. Respond ONLY with a JSON object in this schema:\n{\n "label": "positive | neutral | negative",\n "predicted_rating": 1 | 2 | 3 | 4 | 5,\n "confidence": 0-1\n}\nMapping rule (aligned to the Amazon Reviews dataset):\n• 1–2 stars ⇒ negative\n• 3 stars ⇒ neutral\n• 4–5 stars ⇒ positive\nIf the review text is empty, off-topic, or nonsense, return:\n{"label":"neutral","predicted_rating":3,"confidence":0.0}\nNever add commentary or extra keys.

Basically, it tells the model to behave like a sentiment analysis classifier for Amazon user-review records.

Few Shots

A few shots are used to guide the model in understanding the task. The few shots used in this project are as follows:

const FEW_SHOTS = [ { role: ‘user’, content: JSON.stringify({ title: ‘Rock-solid mount’, text: “I’ve tried dozens of phone mounts—this one finally stays put on bumpy roads. Five minutes to install and rock-solid!”, }), }, { role: ‘assistant’, content: ‘{“label”:”positive”,”predicted_rating”:5,”confidence”:0.96}’, }, { role: ‘user’, content: JSON.stringify({ title: ‘Broke in a week’, text: “Looks nice, but the zipper broke after one week and Amazon wouldn’t replace it.”, }), }, { role: ‘assistant’, content: ‘{“label”:”negative”,”predicted_rating”:1,”confidence”:0.93}’, }, { role: ‘user’, content: JSON.stringify({ title: ‘Meh’, text: ‘These were lightweight and soft but much too small for my liking.
I would have preferred two of these together to make one loc.’, }), }, { role: ‘assistant’, content: ‘{“label”:”neutral”,”predicted_rating”:3,”confidence”:0.55}’, }, ] as const; Few-shot examples are used to help a model quickly learn a specific task or format by demonstrating correct input-output pairs. In this context, these few shots serve as demonstrations to guide the sentiment analysis model, illustrating: Expected input format: Reviews consisting of a title and text. Expected output structure: A JSON response containing: label: sentiment classification (“positive”, “negative”, “neutral”). predicted_rating: numerical rating estimation (1–5). confidence: confidence score for the prediction. The model leverages these examples to understand precisely how to classify and score new reviews in a similar structured manner. Call OpenAI API The sentiment analysis was done by OpenAI using the gpt-4o model. The code is available in the server.ts file. Here is the important code snippet: const completion = await openai.chat.completions.create({ model: ‘gpt-4o’, temperature: 0, messages: [ { role: ‘system’, content: SYSTEM_PROMPT }, …FEW_SHOTS, { role: ‘user’, content: JSON.stringify({ title, text }), }, ], }); The text is the review text and the title is the review title. Storing Results in GridDB GridDB is used to persist the processed review data. We use a collection container to store the data. Container Schema The container schema is as follows: id: INTEGER title: STRING review: STRING sentiment: STRING Saving Data to GridDB This is the example data format that is saved to GridDB: { id: 4495, title: ‘US Version’, review: “US version of the game so DLC can’t be downloaded. Returned for a refund”, sentiment: ‘{“label”:”negative”,”predicted_rating”:2,”confidence”:0.88}’ } The reviewData is inserted into GridDB using the insertData function. const reviewData: GridDBData = { id: generateRandomID(), title, review: text, sentiment: JSON.stringify(parsed), }; await dbClient.insertData({ data: reviewData }); The code to insert data is available in the griddb.ts file. Basically, this function takes the data and inserts it into GridDB using the HTTP method. async function insertData({ data, containerName = ‘sentiments’, }: { data: GridDBData; containerName?: string; }): Promise { try { const row = [ parseInt(data.id.toString()), data.title, data.review, data.sentiment, ]; const path = `/containers/${containerName}/rows`; return await makeRequest(path, [row], ‘PUT’); } catch (error) { if (error instanceof GridDBError) { throw error; } const errorMessage = error instanceof Error ? error.message : ‘Unknown error’; throw new GridDBError(`Failed to insert data: ${errorMessage}`, undefined, undefined, error); } } The API route used to insert data in GridDB Cloud is: `${GRIDDB_WEB_API_URL}/containers/sentiments/rows` and by using the HTTP PUT method, the data can be easily inserted into the database. 
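If you want to test the container outside of the Node.js app, the same insert can be reproduced with a few lines of Python. This is a rough sketch only; the base URL and credentials are placeholders, and the row layout mirrors the id, title, review, and sentiment columns described above (the sample values are taken from the example record shown earlier).

```python
# Rough Python equivalent of the insertData call: PUT one row into the
# 'sentiments' collection via the GridDB Cloud Web API. All values are
# placeholders except the sample record, which is the example shown above.
import base64
import json
import requests

base_url = "https://cloud.griddb.com/yourpath/griddb/v2"  # GRIDDB_WEBAPI_URL
token = base64.b64encode(b"username:password").decode()   # GRIDDB_USERNAME:GRIDDB_PASSWORD
headers = {"Content-Type": "application/json", "Authorization": f"Basic {token}"}

row = [
    4495,
    "US Version",
    "US version of the game so DLC can't be downloaded. Returned for a refund",
    json.dumps({"label": "negative", "predicted_rating": 2, "confidence": 0.88}),
]

resp = requests.put(f"{base_url}/containers/sentiments/rows",
                    headers=headers, data=json.dumps([row]))
print(resp.status_code, resp.text)
```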
Query All Data

To query all data from the sentiments container, you can use the following API route:

app.get('/api/sentiments', async (req: express.Request, res: express.Response) => {
  try {
    // Search all data
    const results = await dbClient.searchData([{ type: 'sql', stmt: 'SELECT * FROM sentiments' }]);
    res.json({ data: results });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

The code above uses a SQL SELECT statement to query all data from the sentiments container. If you want to query the data directly from the browser, you can use the following URL: http://localhost:3000/api/sentiments.

User Interface

The UI is built using React and Vite. The <SentimentAnalysis /> component is the main component of the UI. It is a simple form that allows users to input a review title and text, and then submit the form to the /api/sentiment endpoint. After submitting the form and processing the sentiment analysis, the UI receives the response data in JSON format, for example:

{ "label": "positive", "predicted_rating": 4, "confidence": 0.85 }

The response data is then displayed in the UI. If the review is positive, the UI displays a green color; if it is negative, the UI displays a red color; and if it is neutral, the UI displays a yellow color. The UI also displays the predicted rating and confidence level.

Conclusion

This project demonstrates a real-world use case of combining AI and databases for sentiment analysis. Using OpenAI for intelligent sentiment analysis and GridDB for data storage enables fast, efficient processing of customer reviews. The same framework can be extended to other datasets and domains, including social media, customer support logs, or live feedback systems.

References

OpenAI API

More
Business Intelligence Case Study with GridDB and Python

How do you know your business is going in the right direction? With so much data in hand, it is very easy for the problem to get lost while cleaning and analysing it. To be honest, I was in the same place. As a Data Analyst/Data Scientist, I would try to look at the data first and make sense of it. But that not only consumed more time, it also made it difficult to explain why I took certain decisions. A recent project on Business Intelligence taught me the power of problem-solving.

For this tutorial, we will be exploring a dataset of a global store. To keep it general, I am going to assume our problem statement here is: How to grow our business. Then, we will work through the problem statement in a tree-like framework. This will help in structuring our analysis and also lead us to specific KPIs.

Framework

The framework breaks this down into two major strategic approaches:

1. Increase Our Profit

There are a few ways to boost profits. This framework highlights three core strategies:

➤ Increase the Quantity Sold
More units sold usually means more revenue. To do this effectively:
– Ensure demand is met – Make sure your supply chain and stock levels can support customer demand. We need more data to do this.
– Focus on region-specific performance – Some regions may outperform others, and identifying where sales are strongest helps target efforts better.

➤ Sell More Products with High Profit Margins
Not all products are created equal. Shifting focus toward high-margin items can improve profitability.
– Use product-level data to identify top performers.
– Understand regional trends to see where high-margin products sell best.

➤ Reduce the Shipping Cost
Shipping and logistics can quietly erode margins, so optimizing this is crucial.
– Analyze which products incur the highest shipping costs.
– Check which regions are driving those costs and explore smarter routing or fulfillment solutions.

2. Decrease Our Discount

While discounts can drive volume, excessive discounting eats into profits. Two key tactics can help manage this:

➤ Identify Best-Selling Products
Best-sellers may not need heavy discounting to move.
– Reassess discount policies at the product level.
– Consider performance variations across regions.

➤ Uncover Bottlenecks
Inefficiencies or slow-moving inventory often trigger unnecessary discounts.
– Identifying and addressing bottlenecks can reduce dependency on markdowns.

Now that we have a basic framework in place, let us look at each aspect and decide on the metrics.

What metrics are we tracking?

Key Performance Indicators (KPIs): total sales, total profit, average shipping cost, average discount
Profitability: products with the most profit, regions with the most profit
Discount: best-selling products (do not need discounts), bottlenecks due to discounts (low or negative profit margin)
Market segmentation (identifying customer patterns): quantity sold by region, profit by region, shipping cost by region, discount by region

Pre-requisites

The following libraries are required for the code below to execute:

Pandas
GridDB Python Client
Seaborn
Matplotlib

GridDB's GitHub Page covers the installation in detail. Please go through it to be able to interact with the GridDB Server in Python. Great!
Let’s go ahead and import the libraries Importing Libraries #this is optional – in case of errors during execution even though the python client is installed, it is likely that this could be a path issue and can be resolved as follows import sys sys.path.append(“/home/shripriya/python_client”) #sys.path.append(<path_to_python_client>)</path_to_python_client> import griddb_python as griddb import pandas as pd import seaborn as sns import matplotlib.pyplot as plt Dataset We will be using an open-source SuperStore dataset that can be downloaded from here. Once we download it and put it in the same folder as the code, we can then use pandas to read it and clean it. Reading the Dataframe superstore_df = pd.read_excel(“global_superstore_2016.xlsx”, sheet_name=”Orders”) superstore_df.head() Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Postal Code City Product ID Category Sub-Category Product Name Sales Quantity Discount Profit Shipping Cost Order Priority 0 40098 CA-2014-AB10015140-41954 2014-11-11 2014-11-13 First Class AB-100151402 Aaron Bergman Consumer 73120.0 Oklahoma City TEC-PH-5816 Technology Phones Samsung Convoy 3 221.980 2 0.0 62.1544 40.77 High 1 26341 IN-2014-JR162107-41675 2014-02-05 2014-02-07 Second Class JR-162107 Justin Ritter Corporate NaN Wollongong FUR-CH-5379 Furniture Chairs Novimex Executive Leather Armchair, Black 3709.395 9 0.1 -288.7650 923.63 Critical 2 25330 IN-2014-CR127307-41929 2014-10-17 2014-10-18 First Class CR-127307 Craig Reiter Consumer NaN Brisbane TEC-PH-5356 Technology Phones Nokia Smart Phone, with Caller ID 5175.171 9 0.1 919.9710 915.49 Medium 3 13524 ES-2014-KM1637548-41667 2014-01-28 2014-01-30 First Class KM-1637548 Katherine Murray Home Office NaN Berlin TEC-PH-5267 Technology Phones Motorola Smart Phone, Cordless 2892.510 5 0.1 -96.5400 910.16 Medium 4 47221 SG-2014-RH9495111-41948 2014-11-05 2014-11-06 Same Day RH-9495111 Rick Hansen Consumer NaN Dakar TEC-CO-6011 Technology Copiers Sharp Wireless Fax, High-Speed 2832.960 8 0.0 311.5200 903.04 Critical 5 rows × 23 columns We can see that our dataset has a lot of information on the orders being placed. Not all information is useful here. Remember our framework comes in handy here. So, we will go ahead and drop some columns that are not needed. Exploratory Data Analysis (EDA) #do not need Order ID, Postal Code, Order Priority cols = [“Order ID”, “Postal Code”, “Order Priority”] superstore_df.drop(columns=cols, inplace=True) superstore_df.info() <class ‘pandas.core.frame.DataFrame’> RangeIndex: 51290 entries, 0 to 51289 Data columns (total 21 columns): # Column Non-Null Count Dtype — —— ————– —– 0 Row ID 51290 non-null int64 1 Order Date 51290 non-null datetime64[ns] 2 Ship Date 51290 non-null datetime64[ns] 3 Ship Mode 51290 non-null object 4 Customer ID 51290 non-null object 5 Customer Name 51290 non-null object 6 Segment 51290 non-null object 7 City 51290 non-null object 8 State 51290 non-null object 9 Country 51290 non-null object 10 Region 51290 non-null object 11 Market 51290 non-null object 12 Product ID 51290 non-null object 13 Category 51290 non-null object 14 Sub-Category 51290 non-null object 15 Product Name 51290 non-null object 16 Sales 51290 non-null float64 17 Quantity 51290 non-null int64 18 Discount 51290 non-null float64 19 Profit 51290 non-null float64 20 Shipping Cost 51290 non-null float64 dtypes: datetime64[ns](2), float64(4), int64(2), object(13) memory usage: 8.2+ MB Since there are no null-values, we do not need to worry about that. 
However, the date columns here are not of the correct datatype. So we will quickly change that before inserting the data into GridDB. # Prepare the DataFrame superstore_df[“Order Date”] = pd.to_datetime(superstore_df[“Order Date”]).dt.to_pydatetime() superstore_df[“Ship Date”] = pd.to_datetime(superstore_df[“Ship Date”]).dt.to_pydatetime() # Rename columns to match GridDB naming (no spaces or dashes) superstore_df.columns = [col.replace(” “, “_”).replace(“-“, “_”) for col in superstore_df.columns] /tmp/ipykernel_1024/114528052.py:2: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result superstore_df[“Order Date”] = pd.to_datetime(superstore_df[“Order Date”]).dt.to_pydatetime() /tmp/ipykernel_1024/114528052.py:3: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result superstore_df[“Ship Date”] = pd.to_datetime(superstore_df[“Ship Date”]).dt.to_pydatetime() Let us verify that the datatypes are indeed changed. superstore_df.info() <class ‘pandas.core.frame.DataFrame’> RangeIndex: 51290 entries, 0 to 51289 Data columns (total 21 columns): # Column Non-Null Count Dtype — —— ————– —– 0 Row_ID 51290 non-null int64 1 Order_Date 51290 non-null datetime64[ns] 2 Ship_Date 51290 non-null datetime64[ns] 3 Ship_Mode 51290 non-null object 4 Customer_ID 51290 non-null object 5 Customer_Name 51290 non-null object 6 Segment 51290 non-null object 7 City 51290 non-null object 8 State 51290 non-null object 9 Country 51290 non-null object 10 Region 51290 non-null object 11 Market 51290 non-null object 12 Product_ID 51290 non-null object 13 Category 51290 non-null object 14 Sub_Category 51290 non-null object 15 Product_Name 51290 non-null object 16 Sales 51290 non-null float64 17 Quantity 51290 non-null int64 18 Discount 51290 non-null float64 19 Profit 51290 non-null float64 20 Shipping_Cost 51290 non-null float64 dtypes: datetime64[ns](2), float64(4), int64(2), object(13) memory usage: 8.2+ MB Okay, everything looks great! We can now put our clean data into GridDB. Putting the data into GridDB Before putting the data into GridDB, we need to create something called containers. Containers are like tables in any database. We define our schema that includes the name and data type of the column. Once we create the containers, it is very easy to put the data using the GridDB python client. 
Container Creation factory = griddb.StoreFactory.get_instance() host = “127.0.0.1:10001” cluster = “myCluster” db_user = “admin” db_password = “admin” try: gridstore = factory.get_store(notification_member=host, cluster_name=cluster, username=db_user, password=db_password) print(“Connection successful”) conInfo = griddb.ContainerInfo( name=”sales_data”, column_info_list=[ [“Row_ID”, griddb.Type.LONG], [“Order_Date”, griddb.Type.TIMESTAMP], [“Ship_Date”, griddb.Type.TIMESTAMP], [“Ship_Mode”, griddb.Type.STRING], [“Customer_ID”, griddb.Type.STRING], [“Customer_Name”, griddb.Type.STRING], [“Segment”, griddb.Type.STRING], [“City”, griddb.Type.STRING], [“State”, griddb.Type.STRING], [“Country”, griddb.Type.STRING], [“Region”, griddb.Type.STRING], [“Market”, griddb.Type.STRING], [“Product_ID”, griddb.Type.STRING], [“Category”, griddb.Type.STRING], [“Sub_Category”, griddb.Type.STRING], [“Product_Name”, griddb.Type.STRING], [“Sales”, griddb.Type.DOUBLE], [“Quantity”, griddb.Type.INTEGER], [“Discount”, griddb.Type.DOUBLE], [“Profit”, griddb.Type.DOUBLE], [“Shipping_Cost”, griddb.Type.DOUBLE] ], type=griddb.ContainerType.COLLECTION ) container = gridstore.put_container(conInfo) print(“Container creation successful”) except griddb.GSException as e: for i in range(e.get_error_stack_size()): print(“[“, i, “]”) print(“Error Code:”, e.get_error_code(i)) print(“Location:”, e.get_location(i)) print(“Message:”, e.get_message(i)) Connection successful Container creation successful If executed correctly, the above code should print out Connection successful followed by Container creation successful. If not, the exception message will help you troubleshoot. More information on the error codes are available here. Once our container is in place, let us put the data into it using the following code. Inserting the data container.put_rows(superstore_df) print(“Data insertion successful”) Data insertion successful It is very important to verify if our data has been inserted correctly before moving onto the data analysis. Let’s go ahead and do that. 
Reading the data try: gridstore = factory.get_store(notification_member=host, cluster_name=cluster, username=db_user, password=db_password) cont = gridstore.get_container(“sales_data”) if cont is None: print(“Does not exist”) print(“Connection successful”) query_str=”SELECT *” query = cont.query(query_str) rs = query.fetch() df = rs.fetch_rows() print(df.info()) except griddb.GSException as e: for i in range(e.get_error_stack_size()): print(“[“,i,”]”) print(e.get_error_code(i)) print(e.get_location(i)) print(e.get_message(i)) Connection successful <class ‘pandas.core.frame.DataFrame’> RangeIndex: 51290 entries, 0 to 51289 Data columns (total 21 columns): # Column Non-Null Count Dtype — —— ————– —– 0 Row_ID 51290 non-null int64 1 Order_Date 51290 non-null datetime64[ns] 2 Ship_Date 51290 non-null datetime64[ns] 3 Ship_Mode 51290 non-null object 4 Customer_ID 51290 non-null object 5 Customer_Name 51290 non-null object 6 Segment 51290 non-null object 7 City 51290 non-null object 8 State 51290 non-null object 9 Country 51290 non-null object 10 Region 51290 non-null object 11 Market 51290 non-null object 12 Product_ID 51290 non-null object 13 Category 51290 non-null object 14 Sub_Category 51290 non-null object 15 Product_Name 51290 non-null object 16 Sales 51290 non-null float64 17 Quantity 51290 non-null int64 18 Discount 51290 non-null float64 19 Profit 51290 non-null float64 20 Shipping_Cost 51290 non-null float64 dtypes: datetime64[ns](2), float64(4), int64(2), object(13) memory usage: 8.2+ MB None Great! Everything looks great. We can now go ahead with our analysis. Analysis I will define a function that takes the container, query, and output data type as an input and returns the result of the query. This helps me avoid writing the same code over and over again. def fetch_data(container, query_str, data_type): rs = container.query(query_str) row_set = rs.fetch() if row_set.has_next(): result = row_set.next() result = result.get(type=data_type) return result 1. Key Performance Indicators (KPI) Referencing the framework, we are tracking 4 KPIs – Total Sales Total Profit Avg Shipping Cost Avg Discount total_sales = fetch_data(cont, ‘SELECT SUM(Sales) FROM sales_data’, griddb.Type.DOUBLE) total_profit = fetch_data(cont, ‘SELECT SUM(Profit) FROM sales_data’, griddb.Type.DOUBLE) avg_shipping = fetch_data(cont, ‘SELECT AVG(Shipping_Cost) FROM sales_data’, griddb.Type.DOUBLE) avg_discount = fetch_data(cont, ‘SELECT AVG(Discount) FROM sales_data’, griddb.Type.DOUBLE) print(f”🔢 KPIs”) print(f”Total Sales: ${total_sales:,.2f}”) print(f”Total Profit: ${total_profit:,.2f}”) print(f”Avg Shipping Cost: ${avg_shipping:.2f}”) print(f”Avg Discount: {avg_discount:.2%}”) 🔢 KPIs Total Sales: $12,642,501.91 Total Profit: $1,467,457.29 Avg Shipping Cost: $26.48 Avg Discount: 14.29% Let’s go ahead and plot these. 
fig, axes = plt.subplots(2, 2, figsize=(10, 6)) fig.suptitle(“Key Performance Indicators (KPIs)”, fontsize=20, fontweight=’bold’) axes = axes.flatten() for ax in axes: ax.axis(‘off’) # KPIs kpi_labels = [“Total Sales”, “Total Profit”, “Avg Shipping Cost”, “Avg Discount”] kpi_values = [ f”${total_sales:,.2f}”, f”${total_profit:,.2f}”, f”${avg_shipping:.2f}”, f”{avg_discount:.2%}” ] kpi_colors = [“#4CAF50”, “#2196F3”, “#FF9800”, “#9C27B0”] # Fill KPI cards for i, ax in enumerate(axes): ax.text(0.5, 0.6, kpi_labels[i], fontsize=16, ha=’center’, va=’center’, weight=’bold’, color=’gray’) ax.text(0.5, 0.4, kpi_values[i], fontsize=24, ha=’center’, va=’center’, weight=’bold’, color=kpi_colors[i]) ax.set_facecolor(“#f7f7f7”) ax.set_frame_on(True) ax.patch.set_linewidth(1) ax.patch.set_edgecolor(“#ddd”) # Add vertical and horizontal dividers fig.subplots_adjust(hspace=0.3, wspace=0.3) fig_width, fig_height = fig.get_size_inches() # Vertical line in the center fig.lines.append(plt.Line2D([0.5, 0.5], [0.05, 0.95], color=”lightgray”, linewidth=1, linestyle=”–“, transform=fig.transFigure)) # Horizontal line in the center fig.lines.append(plt.Line2D([0.05, 0.95], [0.5, 0.5], color=”lightgray”, linewidth=1, linestyle=”–“, transform=fig.transFigure)) plt.tight_layout(rect=[0, 0, 1, 0.95]) plt.show() For the next set of metrics, we will be using multiple columns of the database. Let’s go ahead and retrieve it as a dataframe so that we can easily leverage pandas groupby and aggregation. query_str=”SELECT *” query = cont.query(query_str) rs = query.fetch() df = rs.fetch_rows() 2. Profitability We are tracking 2 key things here: products with most profit regions with most profit top_products_profit = df.groupby(‘Product_Name’)[‘Profit’].sum().sort_values(ascending=False).head(10) region_profit = df.groupby(‘Region’)[‘Profit’].sum().sort_values(ascending=False) fig, axes = plt.subplots(1, 2, figsize=(16, 6)) sns.barplot(x=top_products_profit.values, y=top_products_profit.index, hue=top_products_profit.index, palette=”viridis”, legend=False, ax=axes[0]) axes[0].set_title(“Top 10 Products by Profit”) axes[0].set_xlabel(“Profit”) axes[0].set_ylabel(“Product”) sns.barplot(x=region_profit.values, y=region_profit.index, hue=region_profit.index, palette=”magma”, legend=False, ax=axes[1]) axes[1].set_title(“Profit by Region”) axes[1].set_xlabel(“Profit”) axes[1].set_ylabel(“Region”) plt.tight_layout() plt.show() 3. 
Discount Similarly, for the second branch in our tree framework – Discount, we will be tracking 2 metrics: Best selling products (products that do not need discounts) Bottlenecks due to discounts (products with low or negative profit margin) top_sales_no_discount = df[df[‘Discount’] == 0].groupby(‘Product_Name’)[‘Sales’].sum().sort_values(ascending=False).head(10) df[‘Profit_Margin’] = df[‘Profit’] / df[‘Sales’] low_margin = df[(df[‘Discount’] > 0.3) & (df[‘Profit_Margin’] < 0)].groupby(‘Product_Name’)[‘Profit’].sum().sort_values().head(10) fig, axes = plt.subplots(1, 2, figsize=(16, 6)) sns.barplot(x=top_sales_no_discount.values, y=top_sales_no_discount.index, hue=top_sales_no_discount.index, palette=”cubehelix”, legend=False, ax=axes[0]) axes[0].set_title(“Top Selling Products Without Discount”) axes[0].set_xlabel(“Sales”) axes[0].set_ylabel(“Product”) sns.barplot(x=low_margin.values, y=low_margin.index, hue=low_margin.index, palette=”coolwarm”, legend=False, ax=axes[1]) axes[1].set_title(“High Discount, Low/Negative Profit”) axes[1].set_xlabel(“Profit”) axes[1].set_ylabel(“Product”) plt.tight_layout() plt.show() 4. Market Segmentation Lastly, we also want to look at the market segmentation and we can segment it by 4 parts: Quantity sold by region Profit by region Shipping cost by region Discount by region quantity_by_region = df.groupby(‘Region’)[‘Quantity’].sum().sort_values(ascending=False) shipping_by_region = df.groupby(‘Region’)[‘Shipping_Cost’].mean().sort_values(ascending=False) discount_by_region = df.groupby(‘Region’)[‘Discount’].mean().sort_values(ascending=False) fig, axes = plt.subplots(2, 2, figsize=(18, 12)) # Plot 1: Quantity Sold by Region sns.barplot(x=quantity_by_region.values, y=quantity_by_region.index, hue=quantity_by_region.index, palette=”crest”, legend=False, ax=axes[0, 0]) axes[0, 0].set_title(“Quantity Sold by Region”) axes[0, 0].set_xlabel(“Quantity”) axes[0, 0].set_ylabel(“Region”) # Plot 2: Profit by Region sns.barplot(x=region_profit.values, y=region_profit.index, hue=region_profit.index, palette=”flare”, legend=False, ax=axes[0, 1]) axes[0, 1].set_title(“Profit by Region”) axes[0, 1].set_xlabel(“Profit”) axes[0, 1].set_ylabel(“Region”) # Plot 3: Shipping Cost by Region sns.barplot(x=shipping_by_region.values, y=shipping_by_region.index, hue=shipping_by_region.index, palette=”rocket”, legend=False, ax=axes[1, 0]) axes[1, 0].set_title(“Shipping Cost by Region”) axes[1, 0].set_xlabel(“Shipping Cost”) axes[1, 0].set_ylabel(“Region”) # Plot 4: Discount by Region sns.barplot(x=discount_by_region.values, y=discount_by_region.index, hue=discount_by_region.index, palette=”ch:start=.2,rot=-.3″, legend=False, ax=axes[1, 1]) axes[1, 1].set_title(“Average Discount by Region”) axes[1, 1].set_xlabel(“Discount”) axes[1, 1].set_ylabel(“Region”) plt.tight_layout() plt.show() Insights The problem statement was how can we grow our business. We focused on increasing our profits and reducing our discounts. So what insights can we draw from the visualizations above? Some specific smartphones are one of the most profitable products for our business, so I would focus on if the demand is met and how can we improve the customer experience for that domain. Africa and Central Asia stores are facing overall losses. Therefore, we need to consider reducing the scale of the store. More information is needed in this regard. The second point is validated when we also look at the average discount in Africa and Asia – the discount is one of the highest while profit margins are negative. 
These regions, therefore, are contributing negatively to our overall business. Western Europe seems to be our biggest market, with higher demand and higher profit margins, so we should be doubling down here to grow our business.

Conclusion

In this tutorial, we developed an end-to-end BI solution for a superstore. The tutorial dived into problem-solving and critical thinking, followed by detailed analysis and actionable insights. This is close to a real-world application of how a Data/Business Analyst would approach a business problem. We also saw that we can use GridDB as a centralized data repository; with GridDB's Python client, the integration with pandas for analysis seemed seamless.

More
Building a Stock Market Chatbot Application Using LangGraph and GridDB

With the advent of Large Language Models (LLMs), chatbot applications have become increasingly common, enabling more natural and intelligent interactions with data. In this article, you will see how to build a stock market chatbot using LangGraph, OpenAI API, and GridDB cloud. We will retrieve historical Apple stock price data from Yahoo Finance using the yfinance library, insert it into a GridDB container, and then query it using a chatbot built with LangGraph that utilizes the OpenAI GPT -4 model. GridDB is a high-performance time-series database designed for massive real-time workloads. Its support for structured containers, built-in compression, and lightning-fast reads and writes makes it ideal for storing and querying time series data such as stock market prices. Installing and Importing Required Libraries !pip install -q yfinance !pip install langchain !pip install langchain-core !pip install langchain-community !pip install langgraph !pip install langchain_huggingface !pip install tabulate !pip uninstall -y pydantic !pip install –no-cache-dir “pydantic>=2.11,<3” import yfinance as yf import pandas as pd import json import datetime as dt import base64 import requests import numpy as np from langchain_core.prompts import ChatPromptTemplate from langchain_openai import ChatOpenAI from langchain_core.output_parsers import StrOutputParser from langgraph.graph import START, END, StateGraph from langchain_core.messages import HumanMessage from langgraph.checkpoint.memory import MemorySaver from langchain_experimental.agents import create_pandas_dataframe_agent from langchain_openai import OpenAI from langchain.agents.agent_types import AgentType from typing_extensions import List, TypedDict from pydantic import BaseModel, Field from IPython.display import Image, display Inserting and Retrieving Stock Market Data From GridDB We will first import data from Yahoo Finance into a Python application. Next, we will insert this data into a GridDB container and will retrieve it. Importing Data from Yahoo Finance The yfinance Python library allows you to import data from Yahoo Finance. You need to pass the ticker name, as well as the start and end dates, for the data you want to download. The following script downloads the Apple stock price data for the year 2024. import yfinance as yf import pandas as pd ticker = “AAPL” start_date = “2024-01-01” end_date = “2024-12-31” dataset = yf.download(ticker, start=start_date, end=end_date, auto_adjust=False) # ─────────────────────────────────────────────────────────────── # 1. FLATTEN: keep the level that holds ‘Close’, ‘High’, … # ─────────────────────────────────────────────────────────────── if isinstance(dataset.columns, pd.MultiIndex): # find the level index where ‘Close’ lives for lvl in range(dataset.columns.nlevels): level_vals = dataset.columns.get_level_values(lvl) if ‘Close’ in level_vals: dataset.columns = level_vals # keep that level break else: # already flat – nothing to do pass # ─────────────────────────────────────────────────────────────── # 2. 
Select OHLCV, move index to ‘Date’ # ─────────────────────────────────────────────────────────────── dataset = dataset[[‘Close’, ‘High’, ‘Low’, ‘Open’, ‘Volume’]] dataset = dataset.reset_index().rename(columns={‘index’: ‘Date’}) dataset[‘Date’] = pd.to_datetime(dataset[‘Date’]) # optional: reorder columns dataset = dataset[[‘Date’, ‘Close’, ‘High’, ‘Low’, ‘Open’, ‘Volume’]] dataset.columns.name = None dataset.head() Output: The above output indicates that the dataset comprises the daily closing, opening, high, low, and volume prices for Apple stock. In the section, you will see how to insert this data into a GridDB cloud container. Establishing a Connection with GridDB Cloud After you create your GridDB cloud account and complete configuration settings, you can run the following script to see if you can access your database within a Python application. username = “your_user_name” password = “your_password” base_url = “your_griddb_host_url” url = f”{base_url}/checkConnection” credentials = f”{username}:{password}” encoded_credentials = base64.b64encode(credentials.encode()).decode() headers = { ‘Content-Type’: ‘application/json’, # Added this header to specify JSON content ‘Authorization’: f’Basic {encoded_credentials}’, ‘User-Agent’: ‘PostmanRuntime/7.29.0’ } response = requests.get(url, headers=headers) print(response.status_code) print(response.text) Output: 200 The above output indicates that you have successfully connected with your GridDB cloud host. Creating a Container for Inserting Stock Market Data in GridDB Cloud Next, we will insert the Yahoo Finance into GridDB. To do so, we will add another column, SerialNo which contains unique keys for each data row, as GridDB expects a unique key column in the dataset. Next, we will map Pandas dataframe column types to Gridb data types. dataset.insert(0, “SerialNo”, dataset.index + 1) dataset[‘Date’] = pd.to_datetime(dataset[‘Date’]).dt.strftime(‘%Y-%m-%d’) # “2024-01-02” dataset.columns.name = None # Mapping pandas dtypes to GridDB types type_mapping = { “int64”: “LONG”, “float64”: “DOUBLE”, “bool”: “BOOL”, ‘datetime64’: “TIMESTAMP”, “object”: “STRING”, “category”: “STRING”, } # Generate the columns part of the payload dynamically columns = [] for col, dtype in dataset.dtypes.items(): griddb_type = type_mapping.get(str(dtype), “STRING”) # Default to STRING if unknown columns.append({ “name”: col, “type”: griddb_type }) columns Output: [{‘name’: ‘SerialNo’, ‘type’: ‘LONG’}, {‘name’: ‘Date’, ‘type’: ‘STRING’}, {‘name’: ‘Close’, ‘type’: ‘DOUBLE’}, {‘name’: ‘High’, ‘type’: ‘DOUBLE’}, {‘name’: ‘Low’, ‘type’: ‘DOUBLE’}, {‘name’: ‘Open’, ‘type’: ‘DOUBLE’}, {‘name’: ‘Volume’, ‘type’: ‘LONG’}] The above output displays the dataset column names and their corresponding GridDB-compliant data types. The next step is to create a GridDB container. To do so, you need to pass the container name, container type, and a list of column names and their data types. 
url = f”{base_url}/containers” container_name = “stock_db” # Create the payload for the POST request payload = json.dumps({ “container_name”: container_name, “container_type”: “COLLECTION”, “rowkey”: True, # Assuming the first column as rowkey “columns”: columns }) # Make the POST request to create the container response = requests.post(url, headers=headers, data=payload) # Print the response print(f”Status Code: {response.status_code}”) Adding Stock Data to GridbDB Cloud Container Once you have created a container, you must convert the data from your Pandas dataframe into the JSON format and call a put request to insert data into GridDB. url = f”{base_url}/containers/{container_name}/rows” # Convert dataset to list of lists (row-wise) with proper formatting def format_row(row): formatted = [] for item in row: if pd.isna(item): formatted.append(None) # Convert NaN to None elif isinstance(item, bool): formatted.append(str(item).lower()) # Convert True/False to true/false elif isinstance(item, (int, float)): formatted.append(item) # Keep integers and floats as they are else: formatted.append(str(item)) # Convert other types to string return formatted # Prepare rows with correct formatting rows = [format_row(row) for row in dataset.values.tolist()] # Create payload as a JSON string payload = json.dumps(rows) # Make the PUT request to add the rows to the container response = requests.put(url, headers=headers, data=payload) # Print the response print(f”Status Code: {response.status_code}”) print(f”Response Text: {response.text}”) Output: Status Code: 200 Response Text: {“count”:251} If you see the above response, you have successfully inserted the data. Retrieving Data from GridDB After inserting the data, you can perform a variety of operations on the dataset. Let’s see how to import data from a GridDB container into a Pandas dataframe. container_name = “stock_db” url = f”{base_url}/containers/{container_name}/rows” # Define the payload for the query payload = json.dumps({ “offset”: 0, # Start from the first row “limit”: 10000, # Limit the number of rows returned “condition”: “”, # No filtering condition (you can customize it) “sort”: “” # No sorting (you can customize it) }) # Make the POST request to read data from the container response = requests.post(url, headers=headers, data=payload) # Check response status and print output print(f”Status Code: {response.status_code}”) if response.status_code == 200: try: data = response.json() print(“Data retrieved successfully!”) # Convert the response to a DataFrame rows = data.get(“rows”, []) stock_dataset = pd.DataFrame(rows, columns=[col for col in dataset.columns]) except json.JSONDecodeError: print(“Error: Failed to decode JSON response.”) else: print(f”Error: Failed to query data from the container. Response: {response.text}”) print(stock_dataset.shape) stock_dataset.head() Output: The above output shows the data retrieved from the GridDB container. We store the data in a Pandas dataframe. You can store the data in any other format if you want. Once you have the data, you can create a variety of AI and data science applications. Creating a Stock Market Chatbot Using GridDB Data In this next section, you will see how to create a simple chatbot in LangGraph framework, which calls the OpenAI API to answer your questions about the Apple stock price you just retrieved from the GridDB. Creating a Graph in LangGraph To create a Graph in LangGraph, you need to define its state. 
A graph's state contains attributes that are shared between multiple graph nodes. Since we only need to store questions and answers, we create the following graph state.

class State(TypedDict):
    question: str
    answer: str

We will use the create_pandas_dataframe_agent from LangChain to answer our questions, since we retrieved the data from GridDB into a Pandas dataframe. We will create the agent object and call it inside the run_llm() function we define. We will use this function in our LangGraph node.

api_key = "YOUR_OPENAI_API_KEY"
llm = ChatOpenAI(model = 'gpt-4o', api_key = api_key)

agent = create_pandas_dataframe_agent(llm,
                                      stock_dataset,
                                      verbose=True,
                                      agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                                      allow_dangerous_code=True)

def run_llm(state: State):
    question = state['question']
    response = agent.invoke(question)
    return {'answer': response['output']}

Finally, we define the graph for our chatbot. The graph consists of only one node, ask_question, which calls the run_llm() function. Inside the function, we call the create_pandas_dataframe_agent(), which answers questions about the dataset.

graph_builder = StateGraph(State)

graph_builder.add_node("ask_question", run_llm)
graph_builder.add_edge(START, "ask_question")
graph_builder.add_edge("ask_question", END)

graph = graph_builder.compile()
display(Image(graph.get_graph().draw_mermaid_png()))

Output:

The above output shows the flow of our graph.

Asking Questions

Let's test our chatbot by asking some questions. We will first ask our chatbot about the month that had the highest average opening price, as well as the month where people made the most profit in day trading.

question = [HumanMessage(content="Which month had the highest average opening stock price? And what is the month where people made most profit in day trading?")]
result = graph.invoke({"question": question})
print(result['answer'])

Output:

The output above shows the chatbot's response. That is correct; I verified it manually using a Python script. Let's ask it to be more creative and see if it finds any interesting patterns in the dataset.

question = [HumanMessage(content="Do you find any interesting patterns in the dataset?")]
result = graph.invoke({"question": question})
print(result['answer'])

Output:

The above output shows the first part of the reply. You can see that the chatbot is intelligent enough to draw a plot for the closing prices to identify interesting patterns. The following output shows some interesting observations from the dataset.

Output:

Conclusion

The article demonstrates how to create an OpenAI API-based chatbot that answers questions related to data retrieved from GridDB. If you have any questions or need help with GridDB Cloud, you can leave your query on Stack Overflow using the griddb tag. Our team will be happy to answer it. For the complete code of this article, visit my GridDB Blogs GitHub repository.

More
New Features for the GridDB Cloud CLI Tool

The GridDB Cloud CLI Tool aims to make routine maintenance of checking of your GridDB Cloud instance a much easier endeavor. When we first introduced it, it was able to do your basic CRUD commands, but it lacked the ability to read from the filesystem to create containers/tables and to push data to those containers. Although maybe a subtle difference, essentially before, whenever the tool was run to CREATE, you needed manual intervention (known as interactive mode) to use the tool. I’m sure you know where I am going with this — in this latest release, we have added some functionality revolved around reading JSON files from the filesystem which can help certain workflow, such as to help automate some testing on your GridDB Cloud, for instance. Being able to read from JSON files to create tables also means we are able to migrate from GridDB CE to GridDB Cloud as discussed in this blog: https://griddb.net/en/blog/griddb-cloud-on-azure-marketplace-how-to-migrate-3-ways/. So for this article, since there’s not much in the way of showcasing features, I will walk through the technical updates made to the tool, as well as a simple demo of how this tool can be useful. Creating Container from JSON File First off, I changed the way GridDB Cloud CLI Tool handles the create command as explained above. From now on, when you use Create, if you want to create a new container by filling out CLI prompts, you can use the -i flag to indicate –interactive mode. So if you run: griddb-cloud-cli create -i, it will begin the process of asking a series of questions until you get the container you want. And now if you run griddb-cloud-cli create the tool will expect exactly one argument, a json file. The JSON file format is modeled to be the same as exported by the GridDB CE tool, here’s an example: { “version”:”5.6.0″, “database”:”public”, “container”:”device”, “containerType”:”TIME_SERIES”, “containerFileType”:”csv”, “containerFile”:[ “public.device_2020-07-11_2020-07-12.csv”, “public.device_2020-07-12_2020-07-13.csv”, “public.device_2020-07-13_2020-07-14.csv”, “public.device_2020-07-14_2020-07-15.csv”, “public.device_2020-07-15_2020-07-16.csv”, “public.device_2020-07-16_2020-07-17.csv”, “public.device_2020-07-17_2020-07-18.csv”, “public.device_2020-07-18_2020-07-19.csv”, “public.device_2020-07-19_2020-07-20.csv” ], “partitionNo”:14, “columnSet”:[ { “columnName”:”ts”, “type”:”timestamp”, “notNull”:true }, { “columnName”:”co”, “type”:”double”, “notNull”:false }, { “columnName”:”humidity”, “type”:”double”, “notNull”:false }, { “columnName”:”light”, “type”:”boolean”, “notNull”:false }, { “columnName”:”lpg”, “type”:”double”, “notNull”:false }, { “columnName”:”motion”, “type”:”boolean”, “notNull”:false }, { “columnName”:”smoke”, “type”:”double”, “notNull”:false }, { “columnName”:”temp”, “type”:”double”, “notNull”:false } ], “rowKeySet”:[ “ts” ], “timeSeriesProperties”:{ “compressionMethod”:”NO”, “compressionWindowSize”:-1, “compressionWindowSizeUnit”:”null”, “expirationDivisionCount”:-1, “rowExpirationElapsedTime”:-1, “rowExpirationTimeUnit”:”null” }, “compressionInfoSet”:[ ], “timeIntervalInfo”:[ { “containerFile”:”public.device_2020-07-11_2020-07-12.csv”, “boundaryValue”:”2020-07-11T17:00:00.000-0700″ }, { “containerFile”:”public.device_2020-07-12_2020-07-13.csv”, “boundaryValue”:”2020-07-12T00:00:00.000-0700″ }, { “containerFile”:”public.device_2020-07-13_2020-07-14.csv”, “boundaryValue”:”2020-07-13T00:00:00.000-0700″ }, { “containerFile”:”public.device_2020-07-14_2020-07-15.csv”, 
Technical Details of the Implementation

Let's take a quick look at the underlying code that takes a JSON file from the filesystem and pushes it to GridDB Cloud to be made into a new container. First, let's look at how the file is interpreted after being read from the filesystem:

func ParseJson(jsonName string) (cmd.ContainerInfo, []string) {
    filename := jsonName
    properties, err := os.ReadFile(filename)
    if err != nil {
        log.Fatal(err)
    }

    var exportProperties ExportProperties
    err = json.Unmarshal(properties, &exportProperties)
    if err != nil {
        log.Fatal(err)
    }
    //fmt.Println(exportProperties)

    var conInfo cmd.ContainerInfo
    conInfo.ContainerName = exportProperties.Container
    conInfo.ContainerType = exportProperties.ContainerType
    conInfo.RowKey = len(exportProperties.RowKeySet) > 0

    cols := transformToConInfoCols(exportProperties.ColumnSet)
    conInfo.Columns = cols

    return conInfo, exportProperties.ContainerFile
}

Here we read the file contents and create a data structure of type ExportProperties, which was declared earlier (and is shown below). We unmarshal (think of it as mapping) the JSON file onto our struct so that the program knows which values correspond to which keys. From there, we do some further processing to create the data object which will be sent to GridDB Cloud via an HTTP web request.

// data struct to unmarshal user json file
type ExportProperties struct {
    Version           string        `json:"version,omitempty"`
    Database          string        `json:"database,omitempty"`
    Container         string        `json:"container"`
    ContainerType     string        `json:"containerType,omitempty"`
    ContainerFileType string        `json:"containerFileType,omitempty"`
    ContainerFile     ContainerFile `json:"containerFile"`
    ColumnSet         []ColumnSet   `json:"columnSet"`
    RowKeySet         []string      `json:"rowKeySet"`
}

Here we define the entire struct, which may be populated from either a user-made JSON file or the JSON generated by the GridDB CE migration tool (notice that the keys which may be omitted in user JSONs carry the special omitempty designation).
One last thing I'll point out is that we needed a custom unmarshaller for one specific type (ContainerFile), because when the GridDB export tool outputs the meta JSON file, the containerFile value is sometimes a single string and sometimes a slice of strings. Here's the custom unmarshaller:

// custom JSON unmarshaller for the case where sometimes the value is a slice
// and sometimes it's just a singular string
func (c *ContainerFile) UnmarshalJSON(data []byte) error {
    var nums any
    err := json.Unmarshal(data, &nums)
    if err != nil {
        return err
    }

    items := reflect.ValueOf(nums)
    switch items.Kind() {
    case reflect.String:
        *c = append(*c, items.String())
    case reflect.Slice:
        *c = make(ContainerFile, 0, items.Len())
        for i := 0; i < items.Len(); i++ {
            item := items.Index(i)
            switch item.Kind() {
            case reflect.String:
                *c = append(*c, item.String())
            case reflect.Interface:
                *c = append(*c, item.Interface().(string))
            }
        }
    }
    return nil
}

Examples of Using the New Features

As explained above, you can now use the tool as part of your workflows and bash scripting. You can, for example, run a script with cron to track your Downloads folder's file count and overall size. Let's take a look at a working example.

Bash Script to Track Downloads Directory

As a silly example of how you can use the tool in your bash scripts, let's create a TIME_SERIES container which will keep track of the status of your Downloads directory. We will create a JSON file with the container's schema, then use a simple bash script to push the current datetime and a couple of data points to our table. We then schedule a cron job to run the script every hour so that we always have up-to-date data. First, the JSON file:

{
  "database": "public",
  "container": "download_data",
  "containerType": "TIME_SERIES",
  "columnSet": [
    { "columnName": "ts", "type": "timestamp", "notNull": true },
    { "columnName": "file_count", "type": "integer", "notNull": false },
    { "columnName": "total_size", "type": "string", "notNull": false }
  ]
}

And then our script, which will create the table (I realize it's not the best showcase since we are using just one table, but you could also build a setup where you push logs to a table and create new daily tables, kind of like fluentd) and then push up-to-date data about your directory:

#!/bin/bash

DOWNLOADS_DIR="/Users/israelimru/Downloads"
FILE_COUNT=$(ls -1A "$DOWNLOADS_DIR" | wc -l | xargs)
TOTAL_SIZE=$(du -sh "$DOWNLOADS_DIR" | awk '{print $1}')

LOG_DATA="NOW(),$FILE_COUNT,$TOTAL_SIZE"

griddb-cloud-cli create /Users/israelimru/download_table.json -f
griddb-cloud-cli put -n download_data -v $LOG_DATA

echo "Log complete."

NOTE: It's crucial to use the -f flag when calling the create command because it won't ask for a prompt; it will simply create the table for you (and simply fail if the table already exists, which is fine!).

Here you can see our LOG_DATA uses the NOW() variable; this is automatically translated by the tool into a current timestamp in the proper format. We then grab the relevant data points using simple Unix commands and push that data to the Cloud. Then set up the cron job:

0 * * * * /your/script/testing.sh

Conclusion

Being able to use the GridDB Cloud CLI Tool without user intervention opens up some new possibilities, and we hope these changes can be useful for you and your team. To grab a copy of the cloud tool, you can check here: https://github.com/Imisrael/griddb-cloud-cli.
The source code for this blog is simply the bash script in the section above, so feel free to copy and use it as you see fit.

Generate Podcast from PDF using AI

Why This Project?

Turn your complex PDFs into clear, natural-sounding, AI-powered podcasts instantly. This is ideal for busy professionals, auditory learners, and passive-learning workflows. Core benefits:

- Hands-free learning, ideal during commutes or chores.
- Easily transforms technical PDFs into engaging audio.
- Drastically reduces content-creation time.

Problem & Why This Matters

Converting PDFs manually is tedious: extracting text from PDFs is difficult, manual summarization and audio recording add overhead, and PDF management/storage gets messy. An AI-powered podcast solution automates every one of these steps.

Introducing the AI-Powered PDF-to-Podcast Generation System

The simplified workflow of the AI-powered PDF-to-podcast system looks like this:

- User Uploads PDF: the user submits a PDF document to the platform.
- OCR Text Extraction (Mistral AI OCR): Mistral AI OCR accurately extracts the text content from the uploaded PDF.
- Summarize Content (OpenAI): the extracted text is summarized by OpenAI to simplify complex content for easier listening.
- Convert Text to Podcast (OpenAI TTS): OpenAI's Text-to-Speech converts the summarized text into natural, engaging audio.
- Store Data (GridDB Cloud): the summarized text and associated data are stored in GridDB Cloud for future retrieval.
- Podcast Playback: users access the simplified, engaging podcast directly for convenient listening.

Run the Project

To run this project, clone the apps directory from this repository:

git clone https://github.com/junwatu/griddb-podcast-creator.git
cd griddb-podcast-creator
cd apps
npm install

Create a .env file for the project credentials with these keys:

MISTRAL_API_KEY=
OPENAI_API_KEY=
GRIDDB_WEBAPI_URL=
GRIDDB_PASSWORD=
GRIDDB_USERNAME=

You need the Mistral API key for the OCR functionality, the OpenAI key for the text-to-speech (TTS) conversion, and the GridDB credentials for data storage. Please look at the Prerequisites section below for how to get those keys. Run the project using this command:

npm run dev

Go to the default app URL http://localhost:3000, browse for a PDF file, and click the Convert to Podcast button to generate the podcast.

Prerequisites

Node.js: You need Node.js installed because this project uses Next.js. Install the Node LTS version from here.

Mistral OCR API Setup: A Mistral API key is needed to use the OCR functionality. Create the API key here.

OpenAI API Setup: Create the OpenAI API key here. You may need to create a project and enable a few models. In this project, we will use two AI models from OpenAI: gpt-4o to create the audio script, and gpt-4o-mini-tts to generate audio from the script.

GridDB Cloud Setup: GridDB Cloud offers a free plan tier and is officially available worldwide. You need these GridDB environment variables in the .env file:

GRIDDB_WEBAPI_URL=
GRIDDB_USERNAME=
GRIDDB_PASSWORD=

Sign Up for the GridDB Cloud Free Plan: If you would like to sign up for a GridDB Cloud Free instance, you can do so at the following link: https://form.ict-toshiba.jp/download_form_griddb_cloud_freeplan_e. After successfully signing up, you will receive a free instance along with the necessary details to access the GridDB Cloud Management GUI, including the GridDB Cloud portal URL, contract ID, login, and password.

GridDB WebAPI URL: Go to the GridDB Cloud portal and copy the WebAPI URL from the Clusters section.

GridDB Username and Password: Go to the GridDB Users section of the GridDB Cloud portal and create or copy the username for GRIDDB_USERNAME. The password is set when the user is created for the first time; use this as the GRIDDB_PASSWORD. For more details on getting started with GridDB Cloud, please follow this quick start guide.

IP Whitelist: When running this project, please ensure that the IP address where the project is running is whitelisted; failure to do so will result in a 403 (forbidden) response. You can use a website like What Is My IP Address to find your public IP address. To whitelist the IP, go to the GridDB Cloud admin GUI and navigate to the Network Access menu.
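Before wiring up the app, it can help to confirm that your credentials and whitelist entry actually work. Below is a minimal curl sketch; it assumes GRIDDB_WEBAPI_URL points at the database-level Web API base URL from the portal and that the Web API's checkConnection endpoint is available on your instance.

# quick sanity check of the GridDB Cloud Web API credentials and IP whitelist
# (assumes GRIDDB_WEBAPI_URL, GRIDDB_USERNAME, and GRIDDB_PASSWORD are exported)
curl -i -u "$GRIDDB_USERNAME:$GRIDDB_PASSWORD" \
  "$GRIDDB_WEBAPI_URL/checkConnection"

# HTTP 200 means you are good to go; 403 usually means the IP is not whitelisted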
Building the Podcast Generator

Developing the Next.js Web Interface & API

The most important piece of code is the API route that handles the PDF upload and then processes it. This is a snippet from the route.ts file in the apps/app/api/upload directory:

export async function POST(request: NextRequest) {
  try {
    const formData = await request.formData();
    const file = formData.get('file') as File;

    if (!file) {
      return NextResponse.json(
        { error: 'No file uploaded' },
        { status: 400 }
      );
    }

    if (file.type !== 'application/pdf') {
      return NextResponse.json(
        { error: 'Invalid file type. Please upload a PDF file' },
        { status: 400 }
      );
    }

    const maxSize = 10 * 1024 * 1024; // 10MB in bytes
    if (file.size > maxSize) {
      return NextResponse.json(
        { error: 'File size too large. Maximum size is 10MB' },
        { status: 400 }
      );
    }

    const bytes = await file.arrayBuffer();
    const buffer = Buffer.from(bytes);

    const timestamp = Date.now();
    const randomString = Math.random().toString(36).substring(7);
    const tempFilename = `upload_${timestamp}_${randomString}.pdf`;
    const tempFilePath = join(os.tmpdir(), tempFilename);

    await writeFile(tempFilePath, buffer);

    // Extract data from PDF
    const { content: pdfContent, response: ocrResponse } =
      await ocrService.processFile(tempFilePath, file.name);

    // Generate script for the audio from the extracted data
    const audioScript = await openaiService.generatePodcastScript(pdfContent);

    // Generate audio from the script
    const audioFiles = await generatePodcastAudio(audioScript, process.env.OPENAI_API_KEY || '', {
      voice: 'verse',
      outputDir: audioDir,
      instructions: instructions,
      outputFormat: 'mp3',
    });

    const cleanedAudioFiles = cleanAudioPaths(audioFiles);

    // Save the data into GridDB database
    const podcastData: GridDBData = {
      id: generateRandomID(),
      audioFiles: JSON.stringify(cleanedAudioFiles),
      audioScript: JSON.stringify(audioScript),
      // @ts-ignore
      ocrResponse: JSON.stringify(ocrResponse),
    };

    const result = await dbClient.insertData({ data: podcastData });

    return NextResponse.json({
      message: 'File uploaded successfully',
      fileName: file.name,
      fileSize: file.size,
      tempFilePath: tempFilePath,
      ocrResponse: ocrResponse,
      audioFiles: audioFiles,
      audioScript: audioScript,
    });
  } catch (error) {
    console.error('Error uploading file:', error);
    return NextResponse.json(
      { error: 'Failed to upload file' },
      { status: 500 }
    );
  }
}
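If you want to exercise this route without the UI, you can post a PDF to it directly. The sketch below is a hypothetical example, assuming the dev server is running on localhost:3000 and that the route is mounted at /api/upload (as its location under apps/app/api/upload suggests); the file path is a placeholder for any small PDF on your machine.

# send a sample PDF to the upload route as multipart form data
curl -X POST \
  -F "file=@sample.pdf;type=application/pdf" \
  http://localhost:3000/api/upload

On success, the JSON response includes the generated audioScript and the list of audioFiles.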
Let's look at the code in detail:

1. OCR Extraction of PDF Content

// Extract data from PDF
const { content: pdfContent, response: ocrResponse } =
  await ocrService.processFile(tempFilePath, file.name);

Function: the uploaded PDF file is sent to an OCR (Optical Character Recognition) service for processing. Outcome: Mistral OCR extracts the textual content from the PDF and provides the full response (ocrResponse) along with the PDF content (pdfContent).

2. Generating the Podcast Script

// Generate script for the audio from the extracted data
const audioScript = await openaiService.generatePodcastScript(pdfContent);

Function: pass the extracted textual content (pdfContent) to an OpenAI-powered service. Outcome: a structured podcast script (audioScript) suitable for text-to-speech conversion.

3. Creating Audio from the Generated Script

// Generate audio from the script
const audioFiles = await generatePodcastAudio(audioScript, process.env.OPENAI_API_KEY || '', {
  voice: 'verse',
  outputDir: audioDir,
  instructions: instructions,
  outputFormat: 'mp3',
});

Function: convert the generated podcast script into audio using an OpenAI text-to-speech API. Parameters: voice determines the vocal style (e.g., 'verse'); outputDir is the destination directory for the audio files (in this project, the public/audio directory); instructions carries extra refinement or instructional parameters for audio quality; outputFormat is set to 'mp3'. Outcome: the locations of the generated audio segments (audioFiles).

4. Saving Data to GridDB Database

// Save the data into GridDB database
const podcastData: GridDBData = {
  id: generateRandomID(),
  audioFiles: JSON.stringify(cleanedAudioFiles),
  audioScript: JSON.stringify(audioScript),
  // @ts-ignore
  ocrResponse: JSON.stringify(ocrResponse),
};

const result = await dbClient.insertData({ data: podcastData });

Function: collect and organize the processed data (ID, audio files, podcast script, OCR response) into a structured object (podcastData), with the appropriate type conversions (JSON serialization) to match the GridDB data schema. Outcome: the generated audio metadata, associated scripts, and OCR extraction data are stored persistently in the GridDB database for future retrieval and management.

In summary, the main functionality follows the clear sequence laid out in the system workflow described earlier.

Integrating OCR using Mistral AI

Mistral OCR is an Optical Character Recognition API that sets a new standard in document understanding. Unlike other models, Mistral OCR comprehends each element of a document (media, text, tables, equations) with unprecedented accuracy and cognition. It takes images and PDFs as input and extracts their content as ordered, interleaved text and images. In this project, we use Mistral OCR to extract text from PDFs. The process involves:

1. Uploading the PDF file to Mistral.

const uploaded_pdf = await this.client.files.upload({
  file: {
    fileName: fileName,
    content: file,
  },
  purpose: "ocr",
});

2. Retrieving the signed URL for the uploaded PDF.

const signedUrl = await this.client.files.getSignedUrl({
  fileId: uploaded_pdf.id,
});

3. Sending the signed URL to Mistral OCR for text extraction.

const ocrResponse = await this.client.ocr.process({
  model: "mistral-ocr-latest",
  document: {
    type: "document_url",
    documentUrl: signedUrl.url,
  }
});

This final step yields a complete OCR data response from Mistral.
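The SDK calls above sit on top of Mistral's REST API. As a rough illustration only, and assuming you already have a reachable (or signed) URL for the document, the final OCR call looks roughly like the sketch below; the endpoint and payload shape are assumptions inferred from the SDK call, so check Mistral's API reference before relying on them.

# rough sketch of the OCR request behind client.ocr.process()
# (endpoint and payload shape are assumptions, not taken from the project)
curl https://api.mistral.ai/v1/ocr \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-ocr-latest",
        "document": { "type": "document_url", "document_url": "https://example.com/sample.pdf" }
      }'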
Get PDF Key Points and Summarization using OpenAI

We won't convert all of the extracted PDF text, because it would be too long. The better approach is to summarize it and pull out the key points of the extracted data. For this task, we will use the gpt-4o model. This is the system prompt used to extract meaningful content from the PDF's extracted data:

Create a 5-minute podcast episode script in a conversational style, using the content provided.

Include the following elements:

- **Introduction**: Engage your audience with an intriguing opening statement related to the topic. Capture their attention immediately.

- **Main Talking Points**: Develop 3-4 main sections discussing the central ideas or arguments. Use relatable examples and personal stories for better understanding. Maintain a conversational tone, as if you are speaking directly to the listener. Ensure natural transitions between sections to keep the flow.

- **Conclusion**: Summarize the key takeaways in a concise manner, making sure to leave a lasting impression.

- **Call to Action**: End with a clear and compelling call to action encouraging listeners to engage further or reflect on the topic.

# Output Format

Write the script in a conversational and engaging narrative suitable for a podcast. Each section should integrate seamlessly with transitions, emulate a direct speaking style to engage the listener, and reinforce the message.

# Examples

**Introduction**: "Welcome to [Podcast Name]. Today, we're diving into [Topic]. Have you ever wondered…?"

**Main Talking Points**:

1. "Let's start with [Main Idea]. It's like when…"
2. "Moving on to [Next Idea], consider how…"
3. "Finally, when we talk about [Final Idea], there's a story about…"

**Conclusion**: "So, as we've learned today, [Key Takeaway 1], [Key Takeaway 2]…"

**Call to Action**: "Think about how you can [Action]. Join us next time when we explore…"

# Notes

- The script should be written to cater both to novices and those with some prior knowledge.
- Ensure it resonates intellectually and stimulates curiosity among listeners.
- Use transition words to guide listeners smoothly from one idea to the next.

To keep the response consistent, we can use the schema (structured output) feature; basically, we can force the AI model's response to match a predefined data structure:

{
  "introduction": "Welcome to our podcast! Today, we're exploring how AI can revolutionize the way we consume content by transforming PDFs into engaging audio podcasts. Have you ever wished you could listen to your documents instead of reading them? Let's dive in!",
  "main_talking_points": [
    {
      "title": "The Challenges of Manual PDF-to-Podcast Conversion",
      "content": "Manually converting PDFs into podcasts is a tedious process. It involves extracting text, summarizing complex content, and recording audio, all of which take significant time and effort. AI simplifies this by automating these steps, saving you hours of work."
    },
    {
      "title": "How AI Simplifies the Process",
      "content": "AI tools like Mistral OCR and OpenAI TTS streamline the workflow. Mistral OCR extracts text from PDFs with high accuracy, while OpenAI's models summarize and convert the text into natural-sounding audio. This ensures a seamless and efficient process."
    },
    {
      "title": "The Role of GridDB in Managing Data",
      "content": "GridDB Cloud acts as a robust storage solution for parsed text and audio files. It ensures that your data is organized, easily retrievable, and ready for future use, making the entire system scalable and efficient."
    }
  ],
  "conclusion": "In summary, AI-powered tools are transforming the way we interact with content. By automating the conversion of PDFs into podcasts, we save time, enhance accessibility, and create a more engaging learning experience.",
  "call_to_action": "Think about how you can leverage this technology in your own projects. Visit our GitHub repository to get started, and don't forget to share your feedback!"
}

The gpt-4o model responds with data under these keys: introduction, main_talking_points, conclusion, and call_to_action. With this format it becomes much easier to convert the text to audio for our podcast application. For more information about the data schema code, please look into the openai.ts file in the apps/app/lib folder.
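Under the hood, this relies on OpenAI's structured output (JSON schema) support. Below is a rough, hypothetical curl sketch of what such a request can look like; the project itself uses the OpenAI Node SDK in openai.ts, and the two-field schema here is trimmed for brevity rather than being the project's actual schema.

# hypothetical sketch of a structured-output request to gpt-4o
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-4o",
        "messages": [
          { "role": "system", "content": "Create a 5-minute podcast episode script..." },
          { "role": "user", "content": "<text extracted by Mistral OCR>" }
        ],
        "response_format": {
          "type": "json_schema",
          "json_schema": {
            "name": "podcast_script",
            "schema": {
              "type": "object",
              "properties": {
                "introduction": { "type": "string" },
                "conclusion": { "type": "string" }
              },
              "required": ["introduction", "conclusion"],
              "additionalProperties": false
            }
          }
        }
      }'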
Generating the Podcast using OpenAI TTS

We will use the gpt-4o-mini-tts model from OpenAI to generate speech from text. This model is capable of controlling the voice of the generated audio via additional instructions. We process the audio in two passes, based on the schema of the response from the OpenAI model.

1. Process introduction, conclusion, and call_to_action. This code converts the introduction, conclusion, and call to action into audio:

const simpleKeys = ['introduction', 'conclusion', 'call_to_action'] as const;

for (const key of simpleKeys) {
  try {
    const text = podcastData[key];
    const fileName = `${key}.${outputFormat}`;
    const speechFile = path.join(outputDir, fileName);

    const response = await openai.audio.speech.create({
      model,
      voice,
      instructions,
      input: text
    });

    const buffer = Buffer.from(await response.arrayBuffer());
    fs.writeFileSync(speechFile, buffer);
    audioFiles[key] = speechFile;
  } catch (error) {
    console.error(`Error processing ${key}:`, error);
    throw error;
  }
}

2. Process main_talking_points. This code converts the main content, the talking points, into audio:

// Process main talking points separately
if (Array.isArray(podcastData.main_talking_points)) {
  for (let i = 0; i < podcastData.main_talking_points.length; i++) {
    try {
      const point = podcastData.main_talking_points[i];
      const text = point.content;
      const fileName = `talking_point_${i}.${outputFormat}`;
      const speechFile = path.join(outputDir, fileName);

      const response = await openai.audio.speech.create({
        model,
        voice,
        instructions,
        input: text
      });

      const buffer = Buffer.from(await response.arrayBuffer());
      fs.writeFileSync(speechFile, buffer);
      audioFiles[`talking_point_${i}`] = speechFile;
    } catch (error) {
      console.error(`Error processing talking point ${i}:`, error);
      throw error;
    }
  }
}

Storing Data to GridDB Cloud

The column schema for the GridDB container is simple:

export interface GridDBData {
  id: string | number;
  ocrResponse: Blob;
  audioScript: string;
  audioFiles: string;
}

Saving data to GridDB is handled by the insertData function:

async function insertData({
  data,
  containerName = 'podcasts',
}: {
  data: GridDBData;
  containerName?: string;
}): Promise<unknown> {
  console.log(data);
  try {
    const row = [
      parseInt(data.id.toString(), 10),
      data.ocrResponse,
      data.audioScript,
      data.audioFiles,
    ];

    const path = `/containers/${containerName}/rows`;
    return await makeRequest(path, [row], 'PUT');
  } catch (error) {
    if (error instanceof GridDBError) {
      throw error;
    }
    const errorMessage = error instanceof Error ? error.message : 'Unknown error';
    throw new GridDBError(`Failed to insert data: ${errorMessage}`, undefined, undefined, error);
  }
}

At its core, this is just a PUT request to the REST path /containers/podcasts/rows; using GridDB in the cloud really is that easy. The full source code for saving data into GridDB is in the apps/app/lib/griddb.ts file, which is then used by the route.ts file as part of the API.
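For reference, the same operation can be reproduced with plain curl against the GridDB Cloud Web API. The sketch below mirrors the PUT that makeRequest issues, plus the one-time container creation; the container name, column types, and rowkey setting are assumptions based on the GridDBData interface above, while the project's actual container definition lives in griddb.ts.

# one-time: create the podcasts collection (columns mirror the GridDBData interface)
curl -X POST -u "$GRIDDB_USERNAME:$GRIDDB_PASSWORD" \
  -H "Content-Type: application/json" \
  "$GRIDDB_WEBAPI_URL/containers" \
  -d '{
        "container_name": "podcasts",
        "container_type": "COLLECTION",
        "rowkey": true,
        "columns": [
          { "name": "id", "type": "INTEGER" },
          { "name": "ocrResponse", "type": "STRING" },
          { "name": "audioScript", "type": "STRING" },
          { "name": "audioFiles", "type": "STRING" }
        ]
      }'

# insert one row, the same call makeRequest() makes under the hood
curl -X PUT -u "$GRIDDB_USERNAME:$GRIDDB_PASSWORD" \
  -H "Content-Type: application/json" \
  "$GRIDDB_WEBAPI_URL/containers/podcasts/rows" \
  -d '[[1, "{...ocr response...}", "{...audio script...}", "{...audio files...}"]]'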
User Interface

The code for the user interface resides in a single file, page.tsx. It is developed using React and Shadcn components, and it features two primary tabs:

1. Upload & Convert PDF. The user browses for and uploads a PDF file. Once the file is selected, the user can click the Convert to Podcast button. This makes the application process the PDF with OCR to extract the text and then generate an audio podcast using AI-generated voices. You can test the application using the sample PDF files available in the pdfs directory; these files are sourced from arXiv and contain highly technical content, making them ideal for evaluating the application's capabilities.

2. Podcast Result. After a successful conversion, the user can listen to the podcast in the Your Podcasts tab, using the audio player and section list to navigate through the content easily.

Possible Enhancements

- Custom voice options: give users a choice of different voices or accents.
- Podcast hosting integration: connect your podcasts directly to platforms like Spotify, Apple Podcasts, RSS feeds, and so on.
- Improved UI/UX: give users better controls over file management and audio playback.

Resources

- Project source code
- GridDB Cloud Web API
- Mistral OCR
- OpenAI TTS
