Blog

Inserting Structured Information from PDF documents into GridDB using LLMs

Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMs), it is now possible to automate the process of structured data extraction from text documents. In this article, you will learn how to extract structured data from PDF documents using LLMs in LangChain and store it in GridDB. GridDB is a high-performance NoSQL database suited for managing complex and dynamic datasets. Its high-throughput NoSQL capabilities make it ideal for storing large structured datasets containing text insights.

We will begin by downloading a PDF document dataset from Kaggle and extracting structured information from the documents using LangChain. We will then store the structured data in a GridDB container. Finally, we will retrieve the data from the GridDB container and analyze the structured metadata for the PDF documents.

Note: See the GridDB Blogs GitHub repository for the complete code.

Prerequisites

You need to install the following libraries to run the code in this article:

GridDB C Client
GridDB Python client

You can install these libraries by following the instructions on the GridDB Python Package Index (PyPI). In addition, you need to install the langchain, openai, pydantic, pandas, pypdf, tiktoken, and tqdm libraries. The following script installs these libraries.

!pip install --upgrade -q langchain
!pip install --upgrade -q pydantic
!pip install --upgrade -q langchain-community
!pip install --upgrade -q langchain-core
!pip install --upgrade -q langchain-openai
!pip install --upgrade -q pydantic pandas pypdf openai tiktoken tqdm

Finally, run the script below to import the required libraries and modules into your Python application.

from pathlib import Path
import re
import pandas as pd
from tqdm import tqdm
from itertools import islice
from typing import Literal, Optional
import matplotlib.pyplot as plt

from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain.agents import create_openai_functions_agent, AgentExecutor
from langchain_experimental.tools import PythonREPLTool
from langchain_community.document_loaders import PyPDFDirectoryLoader

import griddb_python as griddb

Extracting Structured Data from PDF Documents

We will extract structured information from the PDF files dataset from Kaggle. Download the dataset into your local directory and run the following script. The dataset contains over a thousand files; however, for the sake of testing, we will extract structured information from 100 documents. The following script extracts data from the first 100 documents and stores it in a Python list.

# https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files
pdf_dir = Path("/home/mani/Datasets/Pdf")

loader = PyPDFDirectoryLoader(
    pdf_dir,
    recursive=True,
    silent_errors=True
)  # raises a warning if a PDF document doesn't contain valid text

# first 100 that load cleanly
docs_iter = loader.lazy_load()  # generator → 1 Document per good PDF
docs = list(islice(docs_iter, 100))

docs[0]

Output:

The above output shows the contents of the first document. In this article, we will use a large language model (LLM) with a structured response in LangChain to extract the title, summary, document type, topic category, and sentiment from each PDF document. To retrieve structured data, we have to define the schema of the data we want to retrieve.
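Before defining the schema, it is worth sanity-checking what PyPDFDirectoryLoader actually returned. The short sketch below (assuming the docs list built above) prints where each of the first few documents came from and how much text was extracted from it.

# Quick sanity check on the loaded documents (uses the `docs` list from above).
for doc in docs[:5]:
    source = doc.metadata.get("source", "unknown")
    print(f"{source}: {len(doc.page_content)} characters of extracted text")

print(f"Total documents loaded: {len(docs)}")

With the inputs confirmed, we can define the schema for the structured fields we want the LLM to return.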
For example, we will predefine some categories for document type, topic category, and sentiment, as shown in the following script. DOC_TYPES = ( “report”, “article”, “manual”, “white_paper”, “thesis”, “presentation”, “policy_brief”, “email”, “letter”, “other”, ) TOPIC_CATS = ( “science”, “technology”, “history”, “business”, “literature”, “health”, “education”, “art”, “politics”, “other”, ) Sentiment = Literal[“positive”, “neutral”, “negative”] Next, we will define a Pydantic BaseModel class object, which contains fields for the structured information we want to extract from the PDF documents. The descriptions of the fields tell LLMs what information to store in them. class PDFRecord(BaseModel): “””Validated metadata for a single PDF.””” title: str = Field( …, description=”Document title. If the text contains no clear title, ” “generate a concise 6–12-word title that reflects the content.” ) summary: str = Field( …, description=”Two- to three-sentence synopsis of the document.” ) doc_type: Literal[DOC_TYPES] = Field( default=”other”, description=”Document genre; choose one from: ” + “, “.join(DOC_TYPES) ) topic_category: Literal[TOPIC_CATS] = Field( default=”other”, description=”Primary subject domain; choose one from: ” + “, “.join(TOPIC_CATS) ) sentiment: Sentiment = Field( default=”neutral”, description=”Overall tone of the document: positive, neutral, or negative.” ) # — fallback helpers so bad labels never crash validation — @validator(“doc_type”, pre=True, always=True) def _doc_fallback(cls, v): return v if v in DOC_TYPES else “other” @validator(“topic_category”, pre=True, always=True) def _topic_fallback(cls, v): return v if v in TOPIC_CATS else “other” The next step is to define a prompt for an LLM that guides the LLM in extracting structured data from PDF documents and converting it to JSON format. The BaseModel class we defined before can extract JSON data from a structured LLM response. Notice that the prompt contains the pdf_text placeholder. This placeholder will store the text of the PDF document. prompt = ChatPromptTemplate.from_messages([ (“system”, “You are a meticulous analyst. ” “Extract only what is explicitly present in the text, ” “but you MAY generate a succinct title if none exists.”), (“human”, f””” **Task** Fill the JSON schema fields shown below. **Fields** • title – exact title if present; otherwise invent a 6-12-word title • summary – 2–3 sentence synopsis • doc_type – one of: {“, “.join(DOC_TYPES)} • topic_category – one of: {“, “.join(TOPIC_CATS)} • sentiment – positive, neutral, or negative overall tone **Rules** – If a category is uncertain, use “other”. – Respond ONLY in the JSON format supplied automatically. **Document begins** {{pdf_text}} “””) ]) The next step is to define an LLM. We will use the OpenAI gpt-4o-mini model and create the ChatOpenAI object that supports chat-like interaction with the LLM. You can use any other supported by the LangChain framework. To extract structured data, we call the with_structured_output() function using the ChatOpenAI object and pass it the PDFRecord base model class we defined earlier. Finally, we combine the prompt and LLM to create a LangChain runnable object. llm = ChatOpenAI(model_name=”gpt-4o-mini”, openai_api_key = “YOUR_OPENAI_API_KEY”, temperature=0) structured_llm = llm.with_structured_output(PDFRecord) chain = prompt | structured_llm We will extract the text of each document from the list of PDF documents and invoke the chain we defined. 
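Before processing all one hundred documents, it is a good idea to invoke the chain on a single document and confirm that the structured output parses into a PDFRecord. This is a minimal sketch that assumes the chain and docs objects defined above.

# Test the extraction chain on one document; validation errors surface here
# rather than in the middle of the batch run.
sample = chain.invoke({"pdf_text": docs[0].page_content})
print(sample.title)
print(sample.doc_type, sample.topic_category, sample.sentiment)

Once the single-document call looks correct, the full loop below behaves the same way.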
Notice that we are passing the PDF text (doc.page_content) as a value for the pdf_text key since the prompt contains a placeholder with the same name. The response from the LLM chain is appended to the rows list. rows = [] for doc in tqdm(docs, desc=”Processing PDFs”): record = chain.invoke({“pdf_text”: doc.page_content}) # → PDFRecord row = record.dict() # plain dict row[“path”] = doc.metadata[“source”] rows.append(row) The rows list now contains Python dictionaries containing structured information extracted from the PDF documents. We convert this list into a Pandas DataFrame and store it as a CSV file for later use. dataset = pd.DataFrame(rows) dataset.to_csv(“pdf_catalog.csv”, index=False) print(“✓ Saved pdf_catalog.csv with”, len(rows), “rows”) dataset.head(10) Output: The above output shows the data extracted from PDF documents. Each row corresponds to a single PDF document. Next, we will insert this data in GridDB. Inserting Structured Data from PDF into GridDB Inserting data into GridDB is a three-step process. You establish a connection with a GridDB host, create a container, and insert data into it. Creating a Connection with GridDB To create a GridDB connection, call the griddb.StoreFactory.get_instance() function to get a factory object. Next, call the get_store() function on the factory object and pass it the database host, cluster name, and user name and password. The following script creates a connection to the locally hosted GridDB server and tests the connection by retrieving a dummy container. factory = griddb.StoreFactory.get_instance() DB_HOST = “127.0.0.1:10001” DB_CLUSTER = “myCluster” DB_USER = “admin” DB_PASS = “admin” try: gridstore = factory.get_store( notification_member = DB_HOST, cluster_name = DB_CLUSTER, username = DB_USER, password = DB_PASS ) container1 = gridstore.get_container(“container1”) if container1 == None: print(“Container does not exist”) print(“Successfully connected to GridDB”) except griddb.GSException as e: for i in range(e.get_error_stack_size()): print(“[“, i, “]”) print(e.get_error_code(i)) print(e.get_location(i)) print(e.get_message(i)) Output: Container does not exist Successfully connected to GridDB If you see the above output, you successfully established a connection with the GridDB server. Inserting Data into GridDB Next, we will insert the data from our Pandas DataFrame into the GridDB container. To do so, we define the map_pandas_dtype_to_griddb() function, which maps the Pandas column types to GridDB data types. We iterate through all the column names and types and create a list of lists, each nested list containing a column name and GridDB data type for the column. Next, we create a ContainerInfo object using the container name, the container columns, and the types lists. Since we are storing tabular data, we set the container type to COLLECTION. Next, we store the container in GridDB using the gridstore.put_container() function. Finally, we iterate through all the rows in our pdf document dataset and store it in the container we created using the put() function. 
# see all GridDB data types: https://docs.griddb.net/architecture/data-model/#data-type def map_pandas_dtype_to_griddb(dtype): if dtype == ‘int64’: return griddb.Type.LONG elif dtype == ‘float64’: return griddb.Type.FLOAT elif dtype == ‘object’: return griddb.Type.STRING # Add more column types if you want else: raise ValueError(f’Unsupported pandas type: {dtype}’) container_columns = [] for column_name, dtype in dataset.dtypes.items(): griddb_dtype = map_pandas_dtype_to_griddb(str(dtype)) container_columns.append([column_name, griddb_dtype]) container_name = “PDFData” container_info = griddb.ContainerInfo(container_name, container_columns, griddb.ContainerType.COLLECTION, True) try: cont = gridstore.put_container(container_info) for index, row in dataset.iterrows(): cont.put(row.tolist()) print(“All rows have been successfully stored in the GridDB container.”) except griddb.GSException as e: for i in range(e.get_error_stack_size()): print(“[“, i, “]”) print(e.get_error_code(i)) print(e.get_location(i)) print(e.get_message(i)) Finally, we will retrieve data from GridDB and analyze the dataset. Retrieving Data from GridDB and Performing Analysis To retrieve data from a GridDB container, you must first retrieve the container using the get_container() function and then execute an SQL query on the container object using the query() function, as shown in the script below. To execute the select query, you need to call the fetch() function, and to retrieve data as a Pandas dataframe, call the fetch_rows() function. The following script retrieves structured data from our GridDB container and stores it in the pdf_dataset dataframe. def retrieve_data_from_griddb(container_name): try: data_container = gridstore.get_container(container_name) # Query all data from the container query = data_container.query(“select *”) rs = query.fetch() data = rs.fetch_rows() return data except griddb.GSException as e: print(f”Error retrieving data from GridDB: {e.get_message()}”) return None pdf_dataset = retrieve_data_from_griddb(container_name) pdf_dataset.head() Output: The above output shows the data retrieved from our GridDB container. Once we store data from a GridDB container in a Pandas DataFrame, we can perform various analyses on it. Using a Pie chart, Let’s see the topic category distribution in all PDF documents. pdf_dataset[“topic_category”].value_counts().plot.pie(autopct=”%1.1f%%”) plt.title(“Distribution of topic categories”) plt.ylabel(“”) plt.show() Output: The output shows that the majority of documents are related to science, followed by business. Next, we can plot the distribution of document types using a donut chart. df[“doc_type”].value_counts().plot.pie( autopct=”%1.1f%%”, wedgeprops=dict(width=0.50) # makes the “donut” hole ) plt.title(“Document type”) plt.ylabel(“”) plt.gca().set_aspect(“equal”) plt.show() Output: The output shows that the majority of documents are reports. Finally, we can plot the sentiments expressed in documents as a bar plot. pdf_dataset[“sentiment”].value_counts().plot.bar() plt.title(“Distribution of sentiment values”) plt.xlabel(“sentiment”) plt.ylabel(“count”) plt.tight_layout() plt.show() Output: The above output shows that most of the documents have neutral sentiments. Conclusion This article explained how to build a complete pipeline for extracting metadata from unstructured PDF documents using LLMs and storing the result in GridDB. 
You explored using LangChain with OpenAI's gpt-4o-mini model to extract key information such as the document title, summary, type, category, and sentiment, and how to save this structured output into a GridDB container. The combination of LLM-driven data extraction and GridDB's performance-oriented architecture makes this approach suitable for intelligent document processing in real-time applications. If you have questions or need assistance with GridDB, please ask on Stack Overflow using the griddb tag. Our team is always happy to help. For the complete code, visit the GridDB Blogs GitHub repository.

More
Migrate from PostgreSQL to GridDB Cloud

In a previous article, we talked about some new features in the GridDB Cloud CLI Tool which showcased the ability to make tables based on schemas in the form of JSON files. Included in that was also the ability to migrate from GridDB Community Edition (on prem) to the Cloud in just a few commands; you can read that article here: New Features of the GridDB Cloud CLI Tool. In this article, we will again feature the GridDB Cloud CLI Tool, but this time we want to showcase the ability to migrate from your on-prem PostgreSQL database directly to your GridDB Cloud instance. We will get into the details later in the article, but here is a basic rundown of the steps required to make this work: Export PostgreSQL tables as CSV row data Export schemas of all PostgreSQL tables as JSON file Use GridDB Cloud CLI Tool to transform the data, create the corresponding tables, and then load the data into GridDB Cloud Also please note: this tool is intended for developers and is not of production level quality — it is not made by the official GridDB Development team, so please if you face any issues, leave a comment in this article or within the GridDB Cloud CLI Tool’s repo and we will look into it. Downloading the Tool and Examples The source code for this tool and the binaries to get the latest version are found on GitHub: https://github.com/Imisrael/griddb-cloud-cli. The example directories with CSV data that you can test with are found in the migrate-psql branch: https://github.com/Imisrael/griddb-cloud-cli/tree/migrate-psql/migration_dirs. Scripts to Export PostgreSQL Before we showcase running an example of this process, let’s first take a look at the scripts needed to export the data from your PostgreSQL instance into a more neutral and workable data format (CSV). We will use two different scripts in this phase, one for extract row data, and the other to grab ‘meta’/schema data; both of these scripts rely on the psql command to directly communicate with the database to grab the data we need. Script for Row Data Here is the script to extract your row data; it will generate a unique CSV file for each table in your database, with the filename corresponding directly to the name of the table — handy! #!/bin/bash # Check if a database name was provided. if [ -z “$1” ]; then echo “Usage: $0 ” exit 1 fi DB_NAME=$1 EXPORT_DIR=”.” # Export to the current directory. # Get a list of all tables in the ‘public’ schema. TABLES=$(psql -d $DB_NAME -t -c “SELECT table_name FROM information_schema.tables WHERE table_schema = ‘public’ AND table_type = ‘BASE TABLE’;”) # Loop through the tables and export each one to a CSV file. for TBL in $TABLES; do echo “Exporting table: $TBL” psql -d $DB_NAME -c “\copy (SELECT * FROM $TBL) TO ‘$EXPORT_DIR/$TBL.csv’ WITH (FORMAT CSV, HEADER);” done echo “Export complete.” The script itself is self-explanatory: it exports all tables into the current working directory with the header as the top row and every subsequent row being the data from the table. Script for Schema Data This script is a bit more clever and more of a unique feature for PostgreSQL. It utilizes its json_agg() function which “is an aggregate function that collects values from multiple rows and returns them as a single JSON array.” source. 
Here is what the script looks like: #!/bin/bash psql -d postgres -X -A -t -c ” WITH column_details AS ( — First, get all the column info for each table SELECT table_name, json_agg( json_build_object( ‘column_name’, column_name, ‘data_type’, udt_name, ‘is_nullable’, is_nullable, ‘column_default’, column_default ) ORDER BY ordinal_position ) AS columns FROM information_schema.columns WHERE table_schema = ‘public’ GROUP BY table_name ), primary_key_details AS ( — Next, find the primary key columns for each table SELECT kcu.table_name, json_agg(kcu.column_name) AS pk_columns FROM information_schema.table_constraints AS tc JOIN information_schema.key_column_usage AS kcu ON tc.constraint_name = kcu.constraint_name AND tc.table_schema = kcu.table_schema WHERE tc.constraint_type = ‘PRIMARY KEY’ AND tc.table_schema = ‘public’ GROUP BY kcu.table_name ) — Finally, join them together SELECT json_object_agg( cd.table_name, json_build_object( ‘primary_key’, COALESCE(pkd.pk_columns, ‘[]’::json), ‘columns’, cd.columns ) ) FROM column_details cd LEFT JOIN primary_key_details pkd ON cd.table_name = pkd.table_name; ” > schema.json If you were just grabbing the schema, I think this script would be about half of this length, but we actually are grabbing the primary key as well which requires a JOIN operation. We grab this data in order to ensure we migrate time series tables into TIME_SERIES containers on the GridDB side (more on that later). Changes to the GridDB Cloud CLI Tool To get this process to work, we needed to first separate out the migrate tool. Prior to this release, the migrate tool only expected to work with GridDB CE, and so, the command worked like this: griddb-cloud-cli migrate [griddb-out-directory]. Now obviously this no longer works as we need to indicate to the tool what sort of database we are importing to GridDB Cloud. So what we did was separate out the commands as sub commands, and this meant keeping the more generic functions that work for both in the root of the command — things like reading a CSV file, or parsing a JSON file — and keeping the DB-specific functions inside of the respective subcommand files. So now our migrate/ directory has three files instead of one: migrateCmd.go, migrateGridDB.go, and migratePSQL.go Here’s a brief example of a function which is unique to the PostgreSQL import process func typeSwitcher(s string) string { switch s { case “bool”: return “BOOL” case “char”, “varchar”, “text”: return “STRING” case “int”, “int2”, “int4”: return “INTEGER” case “int8”: return “LONG” case “decimal”, “real”, “numeric”: return “FLOAT” case “float”, “float8”: return “DOUBLE” case “timetz”, “timestamptz”: return “TIMESTAMP” default: return strings.ToUpper(s) } } Migrating from PostgreSQL Now that we have our data ready for use, let’s run the migration tool and see the results. First, place the contents of your exporting efforts into their own directory $ mkdir psql_exported $ mv *.csv psql_exported/ $ mv schema.json psql_exprted/ And now we can run the tool: $ griddb-cloud-cli migrate psql psql_exported/ {“container_name”:”customers”,”container_type”:”COLLECTION”,”rowkey”:false,”columns”:[{“name”:”customer_id”,”type”:”INTEGER”,”index”:null},{“name”:”first_name”,”type”:”STRING”,”index”:null},{“name”:”last_name”,”type”:”STRING”,”index”:null},{“name”:”email”,”type”:”STRING”,”index”:null},{“name”:”phone_number”,”type”:”STRING”,”index”:null},{“name”:”address”,”type”:”STRING”,”index”:null},{“name”:”created_at”,”type”:”TIMESTAMP”,”index”:null}]} ✔ Make Container? 
{ “container_name”: “customers”, “container_type”: “COLLECTION”, “rowkey”: false, “columns”: [ { “name”: “customer_id”, “type”: “INTEGER”, “index”: null }, { “name”: “first_name”, “type”: “STRING”, “index”: null }, { “name”: “last_name”, “type”: “STRING”, “index”: null }, { “name”: “email”, “type”: “STRING”, “index”: null }, { “name”: “phone_number”, “type”: “STRING”, “index”: null }, { “name”: “address”, “type”: “STRING”, “index”: null }, { “name”: “created_at”, “type”: “TIMESTAMP”, “index”: null } ] } … YES 201 Created inserting into (customers). csv: psql_exported/customers.csv 200 OK {“container_name”:”products”,”container_type”:”COLLECTION”,”rowkey”:false,”columns”:[{“name”:”id”,”type”:”INTEGER”,”index”:null},{“name”:”name”,”type”:”STRING”,”index”:null},{“name”:”category”,”type”:”STRING”,”index”:null},{“name”:”price”,”type”:”FLOAT”,”index”:null},{“name”:”stock_quantity”,”type”:”INTEGER”,”index”:null}]} ✔ Make Container? { “container_name”: “products”, “container_type”: “COLLECTION”, “rowkey”: false, “columns”: [ { “name”: “id”, “type”: “INTEGER”, “index”: null }, { “name”: “name”, “type”: “STRING”, “index”: null }, { “name”: “category”, “type”: “STRING”, “index”: null }, { “name”: “price”, “type”: “FLOAT”, “index”: null }, { “name”: “stock_quantity”, “type”: “INTEGER”, “index”: null } ] } … YES 201 Created inserting into (products). csv: psql_exported/products.csv 200 OK And now, of course, is the last step: checking to make sure our data was successfully migrated. $ griddb-cloud-cli show customers { “container_name”: “customers”, “container_type”: “COLLECTION”, “rowkey”: false, “columns”: [ { “name”: “customer_id”, “type”: “INTEGER”, “index”: [] }, { “name”: “first_name”, “type”: “STRING”, “index”: [] }, { “name”: “last_name”, “type”: “STRING”, “index”: [] }, { “name”: “email”, “type”: “STRING”, “index”: [] }, { “name”: “phone_number”, “type”: “STRING”, “index”: [] }, { “name”: “address”, “type”: “STRING”, “index”: [] }, { “name”: “created_at”, “type”: “TIMESTAMP”, “timePrecision”: “MILLISECOND”, “index”: [] } ] } And a quick read: $ griddb-cloud-cli read customers -r [ { “name”: “customers”, “stmt”: “select * limit 50”, “columns”: null, “hasPartialExecution”: true }] customer_id,first_name,last_name,email,phone_number,address,created_at, [1 Alice Johnson alice.j@email.com 555-0101 123 Maple St, Springfield 2025-09-04T06:32:05.700Z] [2 Bob Smith bob.smith@email.com 555-0102 456 Oak Ave, Shelbyville 2025-09-04T06:32:05.700Z] [3 Charlie Brown charlie@email.com 555-0103 789 Pine Ln, Capital City 2025-09-04T06:32:05.700Z] [4 Diana Prince diana.p@email.com 901 Birch Rd, Themyscira 2025-09-04T06:32:05.700Z] [5 Ethan Hunt ethan.hunt@email.com 555-0105 1122 Mission St, Los Angeles 2025-09-04T06:32:05.700Z] And for fun: griddb-cloud-cli read graph products –columns ‘stock_quantity’ [“stock_quantity”] [ { “name”: “products”, “stmt”: “select * limit 50”, “columns”: [“stock_quantity”], “hasPartialExecution”: true }] 936 ┤ ╭╮ 906 ┤ ││ 876 ┤ ││ 845 ┤ │╰╮ ╭╮ 815 ┤ │ │ ││ 785 ┤ │ │ ││ 755 ┤ │ │ ╭─╮ ││ 724 ┤ │ │ │ │ ╭╯│ 694 ┤ │ │ │ │ │ ╰╮ ╭╮ 664 ┤ ╭╯ │ │ ╰╮│ │ ││ 633 ┤ │ │ │ ││ │ ││ 603 ┤ │ ╰╮╭╯ ╰╯ │ ││ 573 ┤ │ ││ │ ╭╯│ 543 ┤ │ ││ │ │ │ 512 ┤ │ ││ │ │ ╰╮ ╭╮ 482 ┤ │ ╰╯ │ ╭╮ │ │ │╰╮ 452 ┤ │ │ ││ ╭╯ │ │ │ 421 ┤ ╭─╮ │ │ │╰╮ │ │ │ │ 391 ┤ │ │ ╭╮ ╭╯ │ ╭╯ │ ╭╯ │ │ │ ╭╮ 361 ┤ │ │ ││ │ ╰╮ │ ╰─╯ │ │ │ │╰╮ 331 ┤ │ │ │╰╮ │ │ │ │ │ │ ╭╯ ╰ 300 ┤ ╭╯ │ ╭╯ │ ╭╮ │ │ │ ╰╮ ╭╮ ╭╯ │ │ 270 ┤ │ │ │ ╰╮ ││ │ │ │ │ ╭╯╰╮ │ ╰╮ │ 240 ┤ │ ╰╮ │ │ ╭╯│ ╭╯ │ ╭╯ │ ╭╯ ╰╮ │ │ │ 209 ┤ ╭╯ │ │ ╰╮ │ ╰╮ ╭─╮ │ │ │ │ │ ╰╮ │ │╭╯ 179 ┤ ╭╯ │ ╭╯ │ 
Col names from container products ■ stock_quantity

Conclusion

Never has it been easier to migrate from an on-prem SQL database to GridDB Cloud, and we hope these tools make trying out GridDB Cloud a much simpler process for anyone looking to do so.

More
Sentiment Analysis on Product Reviews using OpenAI

Overview This article explores a practical implementation of sentiment analysis on product reviews using OpenAI’s large language models (LLMs) and GridDB. The project involves extracting review data, analyzing sentiment using AI, and efficiently storing and querying the results. It’s an example of how modern AI tools and database systems can be combined to process and understand massive amounts of customer feedback. Why Use AI for Sentiment Analysis? Manual analysis of customer sentiment is impossible at scale due to the sheer volume of user-generated content. Artificial intelligence, especially LLMs like OpenAI’s GPT, enables automated analysis with human-level language understanding. Key advantages include: Scalability: AI can process millions of reviews quickly and consistently. Context Awareness: LLMs are capable of identifying sarcasm, subtle opinions, and emotional tone. Language Versatility: These models handle slang, informal language, and multilingual text more effectively than rule-based systems. AI-driven sentiment analysis helps businesses track public opinion, improve customer service, and guide product development based on real user feedback. Running the Project Clone the Repository git clone https://github.com/junwatu/sentiment-analysis-griddb.git cd app npm install Setup Environment Variables Create a .env file in the app directory of the project and add the following environment variables: OPENAI_API_KEY= GRIDDB_WEBAPI_URL= GRIDDB_USERNAME= GRIDDB_PASSWORD= VITE_BASE_URL=http://localhost:3000 You can copy the .env.example file to .env and fill in the values. To get the OpenAI API key, read this section and for GridDB Cloud, read this section. By default, the VITE_BASE_URL is set to http://localhost:3000 but you can change it to your own domain. Run the Project npm run start Access the Project If you don’t change the VITE_BASE_URL, open your browser and navigate to http://localhost:3000 to access the project. Technologies Used Node.js You need Node.js installed because this project uses Vite based project. Install the Node LTS version from here. OpenAI AI models are used for natural language processing and sentiment analysis. Create the OpenAI API key here. You may need to create a project and enable a few models. We will use the gpt-4o model. GridDB Cloud Setup The GridDB Cloud offers a free plan tier and is officially available worldwide. You need these GridDB environment variables in the .env file: GRIDDB_WEBAPI_URL= GRIDDB_USERNAME= GRIDDB_PASSWORD= Sign Up for GridDB Cloud Free Plan If you would like to sign up for a GridDB Cloud Free instance, you can do so in the following link: https://form.ict-toshiba.jp/download_form_griddb_cloud_freeplan_e. After successfully signing up, you will receive a free instance along with the necessary details to access the GridDB Cloud Management GUI, including the GridDB Cloud Portal URL, Contract ID, Login, and Password. GridDB WebAPI URL Go to the GridDB Cloud Portal and copy the WebAPI URL from the Clusters section. It should look like this: GridDB Username and Password Go to the GridDB Users section of the GridDB Cloud portal and create or copy the username for GRIDDB_USERNAME. The password is set when the user is created for the first time, use this as the GRIDDB_PASSWORD. For more details, to get started with GridDB Cloud, please follow this quick start guide. IP Whitelist When running this project, please ensure that the IP address where the project is running is whitelisted. 
Failure to do so will result in a 403 status code or forbidden access. You can use a website like What Is My IP Address to find your public IP address. To whitelist the IP, go to the GridDB Cloud Admin and navigate to the Network Access menu. Project Architecture The diagram illustrates the flow of data in the sentiment analysis system: Product Review Data is fed into the system. The OpenAI API analyzes this data for sentiment. The sentiment results and associated metadata are then stored in GridDB. Finally, the sentiment analysis results are presented through a user interface. This represents a pipeline where raw review data is processed, analyzed for sentiment, stored in GridDB, and then made available for viewing. Dataset Preparation You can use any data review for this project. For example, you can use the Amazon Review Dataset or just copy-paste a review from one of the product reviews on Amazon. Server The server is built using Node.js and Express.js. It is a simple server that provides a few API routes. You can see the server code in the server.ts file. API Routes Method Endpoint Description POST /api/sentiment Sentiment analysis for a review and persist the result in GridDB GET /api/sentiments Retrieve all stored review records GET * (any non‑API path) Serve main UI Running Sentiment Analysis with OpenAI We use the OpenAI API (e.g., GPT-4o) to evaluate the sentiment of each review. The input is a text prompt that asks the model to categorize the sentiment. System Prompt We will use a system prompt to guide the model to understand the task. The system prompt is as follows: You are a sentiment‑analysis classifier for Amazon user‑review records. You will receive one JSON object that contains the fields “title” and “text” (and sometimes “rating” which you must ignore). Your task:\n1. Read the natural‑language content (title + text).\n2. Predict the sentiment label and an estimated star rating without looking at any numeric “rating” field.\n3. Respond ONLY with a JSON object in this schema:\n{\n “label”: “positive | neutral | negative”,\n “predicted_rating”: 1 | 2 | 3 | 4 | 5,\n “confidence”: 0-1\n}\nMapping rule (aligned to the Amazon Reviews dataset):\n• 1–2 stars ⇒ negative\n• 3 stars ⇒ neutral\n• 4–5 stars ⇒ positive\nIf the review text is empty, off‑topic, or nonsense, return:\n{“label”:”neutral”,”predicted_rating”:3,”confidence”:0.0}\nNever add commentary or extra keys. Basically, it tells the model to behave like a sentiment analysis classifier for Amazon user-review records. Few Shots A few shots are used to guide the model in understanding the task. The few shots used in this project are as follows: const FEW_SHOTS = [ { role: ‘user’, content: JSON.stringify({ title: ‘Rock‑solid mount’, text: “I’ve tried dozens of phone mounts—this one finally stays put on bumpy roads. Five minutes to install and rock‑solid!”, }), }, { role: ‘assistant’, content: ‘{“label”:”positive”,”predicted_rating”:5,”confidence”:0.96}’, }, { role: ‘user’, content: JSON.stringify({ title: ‘Broke in a week’, text: “Looks nice, but the zipper broke after one week and Amazon wouldn’t replace it.”, }), }, { role: ‘assistant’, content: ‘{“label”:”negative”,”predicted_rating”:1,”confidence”:0.93}’, }, { role: ‘user’, content: JSON.stringify({ title: ‘Meh’, text: ‘These were lightweight and soft but much too small for my liking. 
I would have preferred two of these together to make one loc.’, }), }, { role: ‘assistant’, content: ‘{“label”:”neutral”,”predicted_rating”:3,”confidence”:0.55}’, }, ] as const; Few-shot examples are used to help a model quickly learn a specific task or format by demonstrating correct input-output pairs. In this context, these few shots serve as demonstrations to guide the sentiment analysis model, illustrating: Expected input format: Reviews consisting of a title and text. Expected output structure: A JSON response containing: label: sentiment classification (“positive”, “negative”, “neutral”). predicted_rating: numerical rating estimation (1–5). confidence: confidence score for the prediction. The model leverages these examples to understand precisely how to classify and score new reviews in a similar structured manner. Call OpenAI API The sentiment analysis was done by OpenAI using the gpt-4o model. The code is available in the server.ts file. Here is the important code snippet: const completion = await openai.chat.completions.create({ model: ‘gpt-4o’, temperature: 0, messages: [ { role: ‘system’, content: SYSTEM_PROMPT }, …FEW_SHOTS, { role: ‘user’, content: JSON.stringify({ title, text }), }, ], }); The text is the review text and the title is the review title. Storing Results in GridDB GridDB is used to persist the processed review data. We use a collection container to store the data. Container Schema The container schema is as follows: id: INTEGER title: STRING review: STRING sentiment: STRING Saving Data to GridDB This is the example data format that is saved to GridDB: { id: 4495, title: ‘US Version’, review: “US version of the game so DLC can’t be downloaded. Returned for a refund”, sentiment: ‘{“label”:”negative”,”predicted_rating”:2,”confidence”:0.88}’ } The reviewData is inserted into GridDB using the insertData function. const reviewData: GridDBData = { id: generateRandomID(), title, review: text, sentiment: JSON.stringify(parsed), }; await dbClient.insertData({ data: reviewData }); The code to insert data is available in the griddb.ts file. Basically, this function takes the data and inserts it into GridDB using the HTTP method. async function insertData({ data, containerName = ‘sentiments’, }: { data: GridDBData; containerName?: string; }): Promise { try { const row = [ parseInt(data.id.toString()), data.title, data.review, data.sentiment, ]; const path = `/containers/${containerName}/rows`; return await makeRequest(path, [row], ‘PUT’); } catch (error) { if (error instanceof GridDBError) { throw error; } const errorMessage = error instanceof Error ? error.message : ‘Unknown error’; throw new GridDBError(`Failed to insert data: ${errorMessage}`, undefined, undefined, error); } } The API route used to insert data in GridDB Cloud is: `${GRIDDB_WEB_API_URL}/containers/sentiments/rows` and by using the HTTP PUT method, the data can be easily inserted into the database. 
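For reference, the same row insert can be reproduced outside Node.js with a few lines of Python. This is a minimal sketch, assuming the same GRIDDB_WEBAPI_URL, GRIDDB_USERNAME, and GRIDDB_PASSWORD environment variables and an existing sentiments container with the schema shown above.

import json
import os
import requests

# Row order must match the container schema: id, title, review, sentiment.
row = [
    4495,
    "US Version",
    "US version of the game so DLC can't be downloaded. Returned for a refund",
    json.dumps({"label": "negative", "predicted_rating": 2, "confidence": 0.88}),
]

url = f"{os.environ['GRIDDB_WEBAPI_URL']}/containers/sentiments/rows"
response = requests.put(
    url,
    json=[row],  # the Web API accepts a list of rows
    auth=(os.environ["GRIDDB_USERNAME"], os.environ["GRIDDB_PASSWORD"]),
)
print(response.status_code, response.text)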
Query All Data To query all data from the sentiments container, you can use the following API route: app.get(‘/api/sentiments’, async (req: express.Request, res: express.Response) => { try { // Search all data const results = await dbClient.searchData([{ type: ‘sql’, stmt: ‘SELECT * FROM sentiments’ }]); res.json({ data: results }); } catch (error) { res.status(500).json({ error: error.message }); } }); The code above uses SQL SELECT statement to query all data from the sentiments container and if you query the data directly from the browser, you can use the following URL https://localhost:3000/api/sentiments User Interface The UI is built using React and Vite. The <SentimentAnalysis /> component is the main component of the UI. It is a simple form that allows users to input a review title and text, and then submit the form to the /api/sentiment endpoint. After submitting the form and processing the sentiment analysis, the UI receives the response data that has JSON format, for example: { “label”: “positive”, “predicted_rating”: 4, “confidence”: 0.85 } The response data is then displayed in the UI. If the review is positive, the UI will display a green color, if it is negative, the UI will display a red color, and if it is neutral, the UI will display a yellow color. Also the UI will display the predicted rating and confidence level. Conclusion This project demonstrates a real-world use case of combining AI and databases for sentiment analysis. Using OpenAI for intelligent sentiment analysis and GridDB for data storage enables fast, efficient processing of customer reviews. The same framework can be extended to other datasets and domains, including social media, customer support logs, or live feedback systems. References OpenAI API

More
Business Intelligence Case Study with GridDB and Python

How do you know your business is going in the right direction? With so much data we have in hand, it is very easy for the problem to get lost while cleaning and analysing it. To be honest, I was in the same place. As a Data Analyst/Data Scientist, I would try to look at the data first and make sense of it. But it not only consumed more time but also made it difficult to explain why I took certain decisions. A recent project on Business Intelligence taught me the power of problem-solving. For this tutorial, we will be exploring a dataset of a global store. To keep it general, I am going to assume our problem statment here is How to grow our business. Then, we will work through the problem statement in a tree-like framework. This will help in structuring our analysis and also get specific KPIs. Framework The framework breaks this down into two major strategic approaches: 1. Increase Our Profit There are a few ways to boost profits. This framework highlights three core strategies: ➤ Increase the Quantity Sold More units sold usually means more revenue. To do this effectively: – Ensure demand is met – Make sure your supply chain and stock levels can support customer demand. We need more data to do this. – Focus on region-specific performance – Some regions may outperform others, and identifying where sales are strongest helps target efforts better. ➤ Sell More Products with High Profit Margins Not all products are created equal. Shifting focus toward high-margin items can improve profitability. – Use product-level data to identify top performers. – Understand regional trends to see where high-margin products sell best. ➤ Reduce the Shipping Cost Shipping and logistics can quietly erode margins, so optimizing this is crucial. – Analyze which products incur the highest shipping costs. – Check which regions are driving those costs and explore smarter routing or fulfillment solutions. 2. Decrease Our Discount While discounts can drive volume, excessive discounting eats into profits. Two key tactics can help manage this: ➤ Identify Best-Selling Products Best-sellers may not need heavy discounting to move. – Reassess discount policies at the product level. – Consider performance variations across regions. ➤ Uncover Bottlenecks Inefficiencies or slow-moving inventory often trigger unnecessary discounts. – Identifying and addressing bottlenecks can reduce dependency on markdowns. Now that we have a basic framework in place, let us look at each aspect and decide on the metrics. What metrics are we tracking? Key Performance Indicators (KPIs): total sales total profit average shipping cost average discount Profitability: products with most profit regions with most profit Discount: best selling products (do not need discounts) bottlenecks due to discounts (low or negative profit margin) Market segmentation: Identifying customer patterns quantity sold by region profit by region shipping cost by region discount by region Pre-requisites The following libraries are required for the below code to execute: Pandas GridDB Python Client Seaborn Matplotlib GridDB’s GitHub Page covers the installation in detail. Please go through it to be able to interact with the GridDB Server in Python. Great! 
Let’s go ahead and import the libraries Importing Libraries #this is optional – in case of errors during execution even though the python client is installed, it is likely that this could be a path issue and can be resolved as follows import sys sys.path.append(“/home/shripriya/python_client”) #sys.path.append(<path_to_python_client>)</path_to_python_client> import griddb_python as griddb import pandas as pd import seaborn as sns import matplotlib.pyplot as plt Dataset We will be using an open-source SuperStore dataset that can be downloaded from here. Once we download it and put it in the same folder as the code, we can then use pandas to read it and clean it. Reading the Dataframe superstore_df = pd.read_excel(“global_superstore_2016.xlsx”, sheet_name=”Orders”) superstore_df.head() Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Postal Code City Product ID Category Sub-Category Product Name Sales Quantity Discount Profit Shipping Cost Order Priority 0 40098 CA-2014-AB10015140-41954 2014-11-11 2014-11-13 First Class AB-100151402 Aaron Bergman Consumer 73120.0 Oklahoma City TEC-PH-5816 Technology Phones Samsung Convoy 3 221.980 2 0.0 62.1544 40.77 High 1 26341 IN-2014-JR162107-41675 2014-02-05 2014-02-07 Second Class JR-162107 Justin Ritter Corporate NaN Wollongong FUR-CH-5379 Furniture Chairs Novimex Executive Leather Armchair, Black 3709.395 9 0.1 -288.7650 923.63 Critical 2 25330 IN-2014-CR127307-41929 2014-10-17 2014-10-18 First Class CR-127307 Craig Reiter Consumer NaN Brisbane TEC-PH-5356 Technology Phones Nokia Smart Phone, with Caller ID 5175.171 9 0.1 919.9710 915.49 Medium 3 13524 ES-2014-KM1637548-41667 2014-01-28 2014-01-30 First Class KM-1637548 Katherine Murray Home Office NaN Berlin TEC-PH-5267 Technology Phones Motorola Smart Phone, Cordless 2892.510 5 0.1 -96.5400 910.16 Medium 4 47221 SG-2014-RH9495111-41948 2014-11-05 2014-11-06 Same Day RH-9495111 Rick Hansen Consumer NaN Dakar TEC-CO-6011 Technology Copiers Sharp Wireless Fax, High-Speed 2832.960 8 0.0 311.5200 903.04 Critical 5 rows × 23 columns We can see that our dataset has a lot of information on the orders being placed. Not all information is useful here. Remember our framework comes in handy here. So, we will go ahead and drop some columns that are not needed. Exploratory Data Analysis (EDA) #do not need Order ID, Postal Code, Order Priority cols = [“Order ID”, “Postal Code”, “Order Priority”] superstore_df.drop(columns=cols, inplace=True) superstore_df.info() <class ‘pandas.core.frame.DataFrame’> RangeIndex: 51290 entries, 0 to 51289 Data columns (total 21 columns): # Column Non-Null Count Dtype — —— ————– —– 0 Row ID 51290 non-null int64 1 Order Date 51290 non-null datetime64[ns] 2 Ship Date 51290 non-null datetime64[ns] 3 Ship Mode 51290 non-null object 4 Customer ID 51290 non-null object 5 Customer Name 51290 non-null object 6 Segment 51290 non-null object 7 City 51290 non-null object 8 State 51290 non-null object 9 Country 51290 non-null object 10 Region 51290 non-null object 11 Market 51290 non-null object 12 Product ID 51290 non-null object 13 Category 51290 non-null object 14 Sub-Category 51290 non-null object 15 Product Name 51290 non-null object 16 Sales 51290 non-null float64 17 Quantity 51290 non-null int64 18 Discount 51290 non-null float64 19 Profit 51290 non-null float64 20 Shipping Cost 51290 non-null float64 dtypes: datetime64[ns](2), float64(4), int64(2), object(13) memory usage: 8.2+ MB Since there are no null-values, we do not need to worry about that. 
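If you want an extra sanity check before loading anything into the database, a couple of quick pandas calls go a long way. Here is a small sketch, assuming the superstore_df DataFrame from above (column names still contain spaces at this point).

# Confirm that Row ID is unique and summarise the numeric columns.
print("Duplicate Row IDs:", superstore_df["Row ID"].duplicated().sum())
print(superstore_df[["Sales", "Quantity", "Discount", "Profit", "Shipping Cost"]].describe())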
However, the date columns here are not of the correct datatype. So we will quickly change that before inserting the data into GridDB. # Prepare the DataFrame superstore_df[“Order Date”] = pd.to_datetime(superstore_df[“Order Date”]).dt.to_pydatetime() superstore_df[“Ship Date”] = pd.to_datetime(superstore_df[“Ship Date”]).dt.to_pydatetime() # Rename columns to match GridDB naming (no spaces or dashes) superstore_df.columns = [col.replace(” “, “_”).replace(“-“, “_”) for col in superstore_df.columns] /tmp/ipykernel_1024/114528052.py:2: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result superstore_df[“Order Date”] = pd.to_datetime(superstore_df[“Order Date”]).dt.to_pydatetime() /tmp/ipykernel_1024/114528052.py:3: FutureWarning: The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result superstore_df[“Ship Date”] = pd.to_datetime(superstore_df[“Ship Date”]).dt.to_pydatetime() Let us verify that the datatypes are indeed changed. superstore_df.info() <class ‘pandas.core.frame.DataFrame’> RangeIndex: 51290 entries, 0 to 51289 Data columns (total 21 columns): # Column Non-Null Count Dtype — —— ————– —– 0 Row_ID 51290 non-null int64 1 Order_Date 51290 non-null datetime64[ns] 2 Ship_Date 51290 non-null datetime64[ns] 3 Ship_Mode 51290 non-null object 4 Customer_ID 51290 non-null object 5 Customer_Name 51290 non-null object 6 Segment 51290 non-null object 7 City 51290 non-null object 8 State 51290 non-null object 9 Country 51290 non-null object 10 Region 51290 non-null object 11 Market 51290 non-null object 12 Product_ID 51290 non-null object 13 Category 51290 non-null object 14 Sub_Category 51290 non-null object 15 Product_Name 51290 non-null object 16 Sales 51290 non-null float64 17 Quantity 51290 non-null int64 18 Discount 51290 non-null float64 19 Profit 51290 non-null float64 20 Shipping_Cost 51290 non-null float64 dtypes: datetime64[ns](2), float64(4), int64(2), object(13) memory usage: 8.2+ MB Okay, everything looks great! We can now put our clean data into GridDB. Putting the data into GridDB Before putting the data into GridDB, we need to create something called containers. Containers are like tables in any database. We define our schema that includes the name and data type of the column. Once we create the containers, it is very easy to put the data using the GridDB python client. 
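As a side note, the column list does not have to be written out by hand; it can also be derived from the DataFrame dtypes. The sketch below shows one hedged way to do that (the explicit schema defined in the next step is what the rest of this tutorial uses).

# Derive a GridDB column list from pandas dtypes; assumes superstore_df has
# already been cleaned and its columns renamed as above.
def griddb_type(dtype: str):
    if dtype == "int64":
        return griddb.Type.LONG
    if dtype == "float64":
        return griddb.Type.DOUBLE
    if dtype.startswith("datetime64"):
        return griddb.Type.TIMESTAMP
    return griddb.Type.STRING

derived_columns = [[col, griddb_type(str(dt))] for col, dt in superstore_df.dtypes.items()]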
Container Creation factory = griddb.StoreFactory.get_instance() host = “127.0.0.1:10001” cluster = “myCluster” db_user = “admin” db_password = “admin” try: gridstore = factory.get_store(notification_member=host, cluster_name=cluster, username=db_user, password=db_password) print(“Connection successful”) conInfo = griddb.ContainerInfo( name=”sales_data”, column_info_list=[ [“Row_ID”, griddb.Type.LONG], [“Order_Date”, griddb.Type.TIMESTAMP], [“Ship_Date”, griddb.Type.TIMESTAMP], [“Ship_Mode”, griddb.Type.STRING], [“Customer_ID”, griddb.Type.STRING], [“Customer_Name”, griddb.Type.STRING], [“Segment”, griddb.Type.STRING], [“City”, griddb.Type.STRING], [“State”, griddb.Type.STRING], [“Country”, griddb.Type.STRING], [“Region”, griddb.Type.STRING], [“Market”, griddb.Type.STRING], [“Product_ID”, griddb.Type.STRING], [“Category”, griddb.Type.STRING], [“Sub_Category”, griddb.Type.STRING], [“Product_Name”, griddb.Type.STRING], [“Sales”, griddb.Type.DOUBLE], [“Quantity”, griddb.Type.INTEGER], [“Discount”, griddb.Type.DOUBLE], [“Profit”, griddb.Type.DOUBLE], [“Shipping_Cost”, griddb.Type.DOUBLE] ], type=griddb.ContainerType.COLLECTION ) container = gridstore.put_container(conInfo) print(“Container creation successful”) except griddb.GSException as e: for i in range(e.get_error_stack_size()): print(“[“, i, “]”) print(“Error Code:”, e.get_error_code(i)) print(“Location:”, e.get_location(i)) print(“Message:”, e.get_message(i)) Connection successful Container creation successful If executed correctly, the above code should print out Connection successful followed by Container creation successful. If not, the exception message will help you troubleshoot. More information on the error codes are available here. Once our container is in place, let us put the data into it using the following code. Inserting the data container.put_rows(superstore_df) print(“Data insertion successful”) Data insertion successful It is very important to verify if our data has been inserted correctly before moving onto the data analysis. Let’s go ahead and do that. 
Reading the data try: gridstore = factory.get_store(notification_member=host, cluster_name=cluster, username=db_user, password=db_password) cont = gridstore.get_container(“sales_data”) if cont is None: print(“Does not exist”) print(“Connection successful”) query_str=”SELECT *” query = cont.query(query_str) rs = query.fetch() df = rs.fetch_rows() print(df.info()) except griddb.GSException as e: for i in range(e.get_error_stack_size()): print(“[“,i,”]”) print(e.get_error_code(i)) print(e.get_location(i)) print(e.get_message(i)) Connection successful <class ‘pandas.core.frame.DataFrame’> RangeIndex: 51290 entries, 0 to 51289 Data columns (total 21 columns): # Column Non-Null Count Dtype — —— ————– —– 0 Row_ID 51290 non-null int64 1 Order_Date 51290 non-null datetime64[ns] 2 Ship_Date 51290 non-null datetime64[ns] 3 Ship_Mode 51290 non-null object 4 Customer_ID 51290 non-null object 5 Customer_Name 51290 non-null object 6 Segment 51290 non-null object 7 City 51290 non-null object 8 State 51290 non-null object 9 Country 51290 non-null object 10 Region 51290 non-null object 11 Market 51290 non-null object 12 Product_ID 51290 non-null object 13 Category 51290 non-null object 14 Sub_Category 51290 non-null object 15 Product_Name 51290 non-null object 16 Sales 51290 non-null float64 17 Quantity 51290 non-null int64 18 Discount 51290 non-null float64 19 Profit 51290 non-null float64 20 Shipping_Cost 51290 non-null float64 dtypes: datetime64[ns](2), float64(4), int64(2), object(13) memory usage: 8.2+ MB None Great! Everything looks great. We can now go ahead with our analysis. Analysis I will define a function that takes the container, query, and output data type as an input and returns the result of the query. This helps me avoid writing the same code over and over again. def fetch_data(container, query_str, data_type): rs = container.query(query_str) row_set = rs.fetch() if row_set.has_next(): result = row_set.next() result = result.get(type=data_type) return result 1. Key Performance Indicators (KPI) Referencing the framework, we are tracking 4 KPIs – Total Sales Total Profit Avg Shipping Cost Avg Discount total_sales = fetch_data(cont, ‘SELECT SUM(Sales) FROM sales_data’, griddb.Type.DOUBLE) total_profit = fetch_data(cont, ‘SELECT SUM(Profit) FROM sales_data’, griddb.Type.DOUBLE) avg_shipping = fetch_data(cont, ‘SELECT AVG(Shipping_Cost) FROM sales_data’, griddb.Type.DOUBLE) avg_discount = fetch_data(cont, ‘SELECT AVG(Discount) FROM sales_data’, griddb.Type.DOUBLE) print(f”🔢 KPIs”) print(f”Total Sales: ${total_sales:,.2f}”) print(f”Total Profit: ${total_profit:,.2f}”) print(f”Avg Shipping Cost: ${avg_shipping:.2f}”) print(f”Avg Discount: {avg_discount:.2%}”) 🔢 KPIs Total Sales: $12,642,501.91 Total Profit: $1,467,457.29 Avg Shipping Cost: $26.48 Avg Discount: 14.29% Let’s go ahead and plot these. 
fig, axes = plt.subplots(2, 2, figsize=(10, 6)) fig.suptitle(“Key Performance Indicators (KPIs)”, fontsize=20, fontweight=’bold’) axes = axes.flatten() for ax in axes: ax.axis(‘off’) # KPIs kpi_labels = [“Total Sales”, “Total Profit”, “Avg Shipping Cost”, “Avg Discount”] kpi_values = [ f”${total_sales:,.2f}”, f”${total_profit:,.2f}”, f”${avg_shipping:.2f}”, f”{avg_discount:.2%}” ] kpi_colors = [“#4CAF50”, “#2196F3”, “#FF9800”, “#9C27B0”] # Fill KPI cards for i, ax in enumerate(axes): ax.text(0.5, 0.6, kpi_labels[i], fontsize=16, ha=’center’, va=’center’, weight=’bold’, color=’gray’) ax.text(0.5, 0.4, kpi_values[i], fontsize=24, ha=’center’, va=’center’, weight=’bold’, color=kpi_colors[i]) ax.set_facecolor(“#f7f7f7”) ax.set_frame_on(True) ax.patch.set_linewidth(1) ax.patch.set_edgecolor(“#ddd”) # Add vertical and horizontal dividers fig.subplots_adjust(hspace=0.3, wspace=0.3) fig_width, fig_height = fig.get_size_inches() # Vertical line in the center fig.lines.append(plt.Line2D([0.5, 0.5], [0.05, 0.95], color=”lightgray”, linewidth=1, linestyle=”–“, transform=fig.transFigure)) # Horizontal line in the center fig.lines.append(plt.Line2D([0.05, 0.95], [0.5, 0.5], color=”lightgray”, linewidth=1, linestyle=”–“, transform=fig.transFigure)) plt.tight_layout(rect=[0, 0, 1, 0.95]) plt.show() For the next set of metrics, we will be using multiple columns of the database. Let’s go ahead and retrieve it as a dataframe so that we can easily leverage pandas groupby and aggregation. query_str=”SELECT *” query = cont.query(query_str) rs = query.fetch() df = rs.fetch_rows() 2. Profitability We are tracking 2 key things here: products with most profit regions with most profit top_products_profit = df.groupby(‘Product_Name’)[‘Profit’].sum().sort_values(ascending=False).head(10) region_profit = df.groupby(‘Region’)[‘Profit’].sum().sort_values(ascending=False) fig, axes = plt.subplots(1, 2, figsize=(16, 6)) sns.barplot(x=top_products_profit.values, y=top_products_profit.index, hue=top_products_profit.index, palette=”viridis”, legend=False, ax=axes[0]) axes[0].set_title(“Top 10 Products by Profit”) axes[0].set_xlabel(“Profit”) axes[0].set_ylabel(“Product”) sns.barplot(x=region_profit.values, y=region_profit.index, hue=region_profit.index, palette=”magma”, legend=False, ax=axes[1]) axes[1].set_title(“Profit by Region”) axes[1].set_xlabel(“Profit”) axes[1].set_ylabel(“Region”) plt.tight_layout() plt.show() 3. 
Discount Similarly, for the second branch in our tree framework – Discount, we will be tracking 2 metrics: Best selling products (products that do not need discounts) Bottlenecks due to discounts (products with low or negative profit margin) top_sales_no_discount = df[df[‘Discount’] == 0].groupby(‘Product_Name’)[‘Sales’].sum().sort_values(ascending=False).head(10) df[‘Profit_Margin’] = df[‘Profit’] / df[‘Sales’] low_margin = df[(df[‘Discount’] > 0.3) & (df[‘Profit_Margin’] < 0)].groupby(‘Product_Name’)[‘Profit’].sum().sort_values().head(10) fig, axes = plt.subplots(1, 2, figsize=(16, 6)) sns.barplot(x=top_sales_no_discount.values, y=top_sales_no_discount.index, hue=top_sales_no_discount.index, palette=”cubehelix”, legend=False, ax=axes[0]) axes[0].set_title(“Top Selling Products Without Discount”) axes[0].set_xlabel(“Sales”) axes[0].set_ylabel(“Product”) sns.barplot(x=low_margin.values, y=low_margin.index, hue=low_margin.index, palette=”coolwarm”, legend=False, ax=axes[1]) axes[1].set_title(“High Discount, Low/Negative Profit”) axes[1].set_xlabel(“Profit”) axes[1].set_ylabel(“Product”) plt.tight_layout() plt.show() 4. Market Segmentation Lastly, we also want to look at the market segmentation and we can segment it by 4 parts: Quantity sold by region Profit by region Shipping cost by region Discount by region quantity_by_region = df.groupby(‘Region’)[‘Quantity’].sum().sort_values(ascending=False) shipping_by_region = df.groupby(‘Region’)[‘Shipping_Cost’].mean().sort_values(ascending=False) discount_by_region = df.groupby(‘Region’)[‘Discount’].mean().sort_values(ascending=False) fig, axes = plt.subplots(2, 2, figsize=(18, 12)) # Plot 1: Quantity Sold by Region sns.barplot(x=quantity_by_region.values, y=quantity_by_region.index, hue=quantity_by_region.index, palette=”crest”, legend=False, ax=axes[0, 0]) axes[0, 0].set_title(“Quantity Sold by Region”) axes[0, 0].set_xlabel(“Quantity”) axes[0, 0].set_ylabel(“Region”) # Plot 2: Profit by Region sns.barplot(x=region_profit.values, y=region_profit.index, hue=region_profit.index, palette=”flare”, legend=False, ax=axes[0, 1]) axes[0, 1].set_title(“Profit by Region”) axes[0, 1].set_xlabel(“Profit”) axes[0, 1].set_ylabel(“Region”) # Plot 3: Shipping Cost by Region sns.barplot(x=shipping_by_region.values, y=shipping_by_region.index, hue=shipping_by_region.index, palette=”rocket”, legend=False, ax=axes[1, 0]) axes[1, 0].set_title(“Shipping Cost by Region”) axes[1, 0].set_xlabel(“Shipping Cost”) axes[1, 0].set_ylabel(“Region”) # Plot 4: Discount by Region sns.barplot(x=discount_by_region.values, y=discount_by_region.index, hue=discount_by_region.index, palette=”ch:start=.2,rot=-.3″, legend=False, ax=axes[1, 1]) axes[1, 1].set_title(“Average Discount by Region”) axes[1, 1].set_xlabel(“Discount”) axes[1, 1].set_ylabel(“Region”) plt.tight_layout() plt.show() Insights The problem statement was how can we grow our business. We focused on increasing our profits and reducing our discounts. So what insights can we draw from the visualizations above? Some specific smartphones are one of the most profitable products for our business, so I would focus on if the demand is met and how can we improve the customer experience for that domain. Africa and Central Asia stores are facing overall losses. Therefore, we need to consider reducing the scale of the store. More information is needed in this regard. The second point is validated when we also look at the average discount in Africa and Asia – the discount is one of the highest while profit margins are negative. 
Western Europe appears to be our biggest market, with higher demand and higher profit margins, so we should be doubling down there to grow our business.

Conclusion

In this tutorial, we developed an end-to-end BI solution for a superstore. The tutorial walked through problem framing and critical thinking, followed by detailed analysis and actionable insights. This is close to how a Data/Business Analyst would approach a business problem in the real world. We also saw that we can use GridDB as a centralized data repository, and with GridDB's Python client, the integration with pandas for analysis was seamless.

Building a Stock Market Chatbot Application Using LangGraph and GridDB

With the advent of Large Language Models (LLMs), chatbot applications have become increasingly common, enabling more natural and intelligent interactions with data. In this article, you will see how to build a stock market chatbot using LangGraph, the OpenAI API, and GridDB Cloud. We will retrieve historical Apple stock price data from Yahoo Finance using the yfinance library, insert it into a GridDB container, and then query it using a chatbot built with LangGraph that uses the OpenAI GPT-4 model.

GridDB is a high-performance time-series database designed for massive real-time workloads. Its support for structured containers, built-in compression, and lightning-fast reads and writes makes it ideal for storing and querying time-series data such as stock market prices.

Installing and Importing Required Libraries

!pip install -q yfinance
!pip install langchain
!pip install langchain-core
!pip install langchain-community
!pip install langgraph
!pip install langchain_huggingface
!pip install tabulate
!pip uninstall -y pydantic
!pip install --no-cache-dir "pydantic>=2.11,<3"

import yfinance as yf
import pandas as pd
import json
import datetime as dt
import base64
import requests
import numpy as np

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langgraph.graph import START, END, StateGraph
from langchain_core.messages import HumanMessage
from langgraph.checkpoint.memory import MemorySaver
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain_openai import OpenAI
from langchain.agents.agent_types import AgentType
from typing_extensions import List, TypedDict
from pydantic import BaseModel, Field
from IPython.display import Image, display

Inserting and Retrieving Stock Market Data From GridDB

We will first import data from Yahoo Finance into a Python application. Next, we will insert this data into a GridDB container and then retrieve it.

Importing Data from Yahoo Finance

The yfinance Python library allows you to import data from Yahoo Finance. You need to pass the ticker name, as well as the start and end dates, for the data you want to download. The following script downloads the Apple stock price data for the year 2024.

import yfinance as yf
import pandas as pd

ticker = "AAPL"
start_date = "2024-01-01"
end_date = "2024-12-31"

dataset = yf.download(ticker, start=start_date, end=end_date, auto_adjust=False)

# ───────────────────────────────────────────────
# 1. FLATTEN: keep the level that holds 'Close', 'High', …
# ───────────────────────────────────────────────
if isinstance(dataset.columns, pd.MultiIndex):
    # find the level index where 'Close' lives
    for lvl in range(dataset.columns.nlevels):
        level_vals = dataset.columns.get_level_values(lvl)
        if 'Close' in level_vals:
            dataset.columns = level_vals  # keep that level
            break
else:
    # already flat - nothing to do
    pass
# ───────────────────────────────────────────────
# 2. Select OHLCV, move index to 'Date'
# ───────────────────────────────────────────────
dataset = dataset[['Close', 'High', 'Low', 'Open', 'Volume']]
dataset = dataset.reset_index().rename(columns={'index': 'Date'})
dataset['Date'] = pd.to_datetime(dataset['Date'])

# optional: reorder columns
dataset = dataset[['Date', 'Close', 'High', 'Low', 'Open', 'Volume']]
dataset.columns.name = None

dataset.head()

Output:

The above output shows that the dataset comprises the daily closing, opening, high, and low prices for Apple stock, along with the trading volume. In the next section, you will see how to insert this data into a GridDB Cloud container.

Establishing a Connection with GridDB Cloud

After you create your GridDB Cloud account and complete the configuration settings, you can run the following script to check whether you can access your database from a Python application.

username = "your_user_name"
password = "your_password"
base_url = "your_griddb_host_url"

url = f"{base_url}/checkConnection"

credentials = f"{username}:{password}"
encoded_credentials = base64.b64encode(credentials.encode()).decode()

headers = {
    'Content-Type': 'application/json',  # Added this header to specify JSON content
    'Authorization': f'Basic {encoded_credentials}',
    'User-Agent': 'PostmanRuntime/7.29.0'
}

response = requests.get(url, headers=headers)

print(response.status_code)
print(response.text)

Output:

200

The above output indicates that you have successfully connected to your GridDB Cloud host.

Creating a Container for Inserting Stock Market Data in GridDB Cloud

Next, we will insert the Yahoo Finance data into GridDB. To do so, we will add another column, SerialNo, which contains a unique key for each data row, since GridDB expects a unique row-key column in the dataset. Next, we will map the Pandas dataframe column types to GridDB data types.

dataset.insert(0, "SerialNo", dataset.index + 1)
dataset['Date'] = pd.to_datetime(dataset['Date']).dt.strftime('%Y-%m-%d')  # "2024-01-02"
dataset.columns.name = None

# Mapping pandas dtypes to GridDB types
type_mapping = {
    "int64": "LONG",
    "float64": "DOUBLE",
    "bool": "BOOL",
    "datetime64": "TIMESTAMP",
    "object": "STRING",
    "category": "STRING",
}

# Generate the columns part of the payload dynamically
columns = []
for col, dtype in dataset.dtypes.items():
    griddb_type = type_mapping.get(str(dtype), "STRING")  # Default to STRING if unknown
    columns.append({
        "name": col,
        "type": griddb_type
    })

columns

Output:

[{'name': 'SerialNo', 'type': 'LONG'},
 {'name': 'Date', 'type': 'STRING'},
 {'name': 'Close', 'type': 'DOUBLE'},
 {'name': 'High', 'type': 'DOUBLE'},
 {'name': 'Low', 'type': 'DOUBLE'},
 {'name': 'Open', 'type': 'DOUBLE'},
 {'name': 'Volume', 'type': 'LONG'}]

The above output displays the dataset column names and their corresponding GridDB-compliant data types. The next step is to create a GridDB container. To do so, you need to pass the container name, the container type, and a list of column names and their data types.
url = f"{base_url}/containers"

container_name = "stock_db"

# Create the payload for the POST request
payload = json.dumps({
    "container_name": container_name,
    "container_type": "COLLECTION",
    "rowkey": True,  # Assuming the first column as rowkey
    "columns": columns
})

# Make the POST request to create the container
response = requests.post(url, headers=headers, data=payload)

# Print the response
print(f"Status Code: {response.status_code}")

Adding Stock Data to the GridDB Cloud Container

Once you have created a container, you must convert the data from your Pandas dataframe into JSON format and make a PUT request to insert the data into GridDB.

url = f"{base_url}/containers/{container_name}/rows"

# Convert dataset to list of lists (row-wise) with proper formatting
def format_row(row):
    formatted = []
    for item in row:
        if pd.isna(item):
            formatted.append(None)  # Convert NaN to None
        elif isinstance(item, bool):
            formatted.append(str(item).lower())  # Convert True/False to true/false
        elif isinstance(item, (int, float)):
            formatted.append(item)  # Keep integers and floats as they are
        else:
            formatted.append(str(item))  # Convert other types to string
    return formatted

# Prepare rows with correct formatting
rows = [format_row(row) for row in dataset.values.tolist()]

# Create payload as a JSON string
payload = json.dumps(rows)

# Make the PUT request to add the rows to the container
response = requests.put(url, headers=headers, data=payload)

# Print the response
print(f"Status Code: {response.status_code}")
print(f"Response Text: {response.text}")

Output:

Status Code: 200
Response Text: {"count":251}

If you see the above response, you have successfully inserted the data.

Retrieving Data from GridDB

After inserting the data, you can perform a variety of operations on the dataset. Let's see how to import data from a GridDB container into a Pandas dataframe.

container_name = "stock_db"
url = f"{base_url}/containers/{container_name}/rows"

# Define the payload for the query
payload = json.dumps({
    "offset": 0,       # Start from the first row
    "limit": 10000,    # Limit the number of rows returned
    "condition": "",   # No filtering condition (you can customize it)
    "sort": ""         # No sorting (you can customize it)
})

# Make the POST request to read data from the container
response = requests.post(url, headers=headers, data=payload)

# Check response status and print output
print(f"Status Code: {response.status_code}")

if response.status_code == 200:
    try:
        data = response.json()
        print("Data retrieved successfully!")

        # Convert the response to a DataFrame
        rows = data.get("rows", [])
        stock_dataset = pd.DataFrame(rows, columns=[col for col in dataset.columns])

    except json.JSONDecodeError:
        print("Error: Failed to decode JSON response.")
else:
    print(f"Error: Failed to query data from the container. Response: {response.text}")

print(stock_dataset.shape)
stock_dataset.head()

Output:

The above output shows the data retrieved from the GridDB container. We store the data in a Pandas dataframe, but you can store it in any other format you want. Once you have the data, you can build a variety of AI and data science applications.

Creating a Stock Market Chatbot Using GridDB Data

In this section, you will see how to create a simple chatbot with the LangGraph framework that calls the OpenAI API to answer questions about the Apple stock price data you just retrieved from GridDB.

Creating a Graph in LangGraph

To create a graph in LangGraph, you need to define its state.
A graph's state contains the attributes that are shared between multiple graph nodes. Since we only need to store questions and answers, we create the following graph state.

class State(TypedDict):
    question: str
    answer: str

Since we retrieved the data from GridDB into a Pandas dataframe, we will use the create_pandas_dataframe_agent from LangChain to answer our questions. We create the agent object and call it inside the run_llm() function, which we will use in our LangGraph node.

api_key = "YOUR_OPENAI_API_KEY"

llm = ChatOpenAI(model='gpt-4o', api_key=api_key)

agent = create_pandas_dataframe_agent(llm,
                                      stock_dataset,
                                      verbose=True,
                                      agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                                      allow_dangerous_code=True)

def run_llm(state: State):
    question = state['question']
    response = agent.invoke(question)
    return {'answer': response['output']}

Finally, we define the graph for our chatbot. The graph consists of only one node, ask_question, which calls the run_llm() function. Inside the function, we call the create_pandas_dataframe_agent(), which answers questions about the dataset.

graph_builder = StateGraph(State)
graph_builder.add_node("ask_question", run_llm)
graph_builder.add_edge(START, "ask_question")
graph_builder.add_edge("ask_question", END)
graph = graph_builder.compile()

display(Image(graph.get_graph().draw_mermaid_png()))

Output:

The above output shows the flow of our graph.

Asking Questions

Let's test our chatbot by asking some questions. We will first ask it which month had the highest average opening price, and in which month day traders made the most profit.

question = [HumanMessage(content="Which month had the highest average opening stock price? And what is the month where people made most profit in day trading?")]

result = graph.invoke({"question": question})
print(result['answer'])

Output:

The output above shows the chatbot's response. The answer is correct; I verified it manually using a short Python script (a sketch of such a check is included at the end of this article). Let's ask the chatbot to be more creative and see if it finds any interesting patterns in the dataset.

question = [HumanMessage(content="Do you find any interesting patterns in the dataset?")]

result = graph.invoke({"question": question})
print(result['answer'])

Output:

The above output shows the first part of the reply. You can see that the chatbot is intelligent enough to draw a plot of the closing prices to identify interesting patterns. The following output shows some interesting observations from the dataset.

Output:

Conclusion

This article demonstrates how to create an OpenAI API-based chatbot that answers questions about data retrieved from GridDB. If you have any questions or need help with GridDB Cloud, you can leave your query on Stack Overflow using the griddb tag. Our team will be happy to answer it. For the complete code of this article, visit my GridDB Blogs GitHub repository.
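As noted in the Asking Questions section, the chatbot's first answer was checked manually. The following is a minimal sketch of how such a check might look; it assumes the stock_dataset dataframe retrieved from GridDB above and uses the cumulative Close minus Open difference as a rough proxy for day-trading profit.

# A minimal sketch of the manual check mentioned above (assumes stock_dataset from GridDB)
check = stock_dataset.copy()
check['Date'] = pd.to_datetime(check['Date'])
check['Month'] = check['Date'].dt.month_name()

# Month with the highest average opening price
avg_open = check.groupby('Month')['Open'].mean()
print("Highest average open:", avg_open.idxmax(), round(avg_open.max(), 2))

# Month with the largest cumulative intraday (Close - Open) gain,
# used here as a rough proxy for day-trading profit
intraday = (check['Close'] - check['Open']).groupby(check['Month']).sum()
print("Best month for intraday gains:", intraday.idxmax(), round(intraday.max(), 2))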
