{"id":52249,"date":"2025-10-10T00:00:00","date_gmt":"2025-10-10T07:00:00","guid":{"rendered":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/"},"modified":"2025-10-10T00:00:00","modified_gmt":"2025-10-10T07:00:00","slug":"inserting-structured-information-from-pdf-documents-into-griddb-using-llms","status":"publish","type":"post","link":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/","title":{"rendered":"Inserting Structured Information from PDF documents into GridDB using LLMs"},"content":{"rendered":"<p>Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMs), it is now possible to automate the process of structured data extraction from text documents. In this article, you will learn how to extract structured data from PDF documents using LLMs in <a href=\"https:\/\/www.langchain.com\/\">LangChain<\/a> and store it in <a href=\"https:\/\/griddb.net\/en\/\">GridDB<\/a>.<\/p>\n<p>GridDB is a high-performance NoSQL database suited for managing complex and dynamic datasets. Its high-throughput NoSQL capabilities make it ideal for storing large structured datasets containing text insights.<\/p>\n<p>We will begin by downloading a PDF document dataset from Kaggle and extracting structured information from the documents using LangChain. We will then store the structured data in a GridDB container. 
Finally, we will retrieve the data from the GridDB container and analyze the structured metadata for the PDF documents.<\/p>\n<p><strong>Note:<\/strong> See the <a href=\"https:\/\/github.com\/griddbnet\/Blogs\/blob\/pdf-llm-griddb\/Jupyter_Notebook_Codes.ipynb\">GridDB Blogs GitHub repository<\/a> for the complete code.<\/p>\n<h2>Prerequisites<\/h2>\n<p>You need to install the following libraries to run the code in this article.<\/p>\n<ol>\n<li>GridDB C Client<\/li>\n<li>GridDB Python client<\/li>\n<\/ol>\n<p>You can install these libraries by following the instructions on the <a href=\"https:\/\/pypi.org\/project\/griddb-python\/\">GridDB Python Package Index (PyPI)<\/a> page.<\/p>\n<p>In addition, you need to install the langchain, openai, pydantic, pandas, pypdf, tiktoken, and tqdm libraries to run the code in this article. The following script installs these libraries.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-sh\">!pip install --upgrade -q langchain\n!pip install --upgrade -q pydantic\n!pip install --upgrade -q langchain-community\n!pip install --upgrade -q langchain-core\n!pip install --upgrade -q langchain-openai\n!pip install --upgrade -q pydantic pandas pypdf openai tiktoken tqdm\n<\/code><\/pre>\n<\/div>\n<p>Finally, run the script below to import the required libraries and modules into your Python application.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">from pathlib import Path\nimport re\nimport pandas as pd\nfrom tqdm import tqdm\nfrom itertools import islice\nfrom typing import Literal, Optional\nimport matplotlib.pyplot as plt\nfrom langchain_core.pydantic_v1 import BaseModel, Field, validator\nfrom langchain.prompts import ChatPromptTemplate, MessagesPlaceholder\nfrom langchain_openai import ChatOpenAI\nfrom langchain.agents import create_openai_functions_agent, AgentExecutor\nfrom langchain_experimental.tools import PythonREPLTool\nfrom langchain_community.document_loaders import PyPDFDirectoryLoader\nimport 
griddb_python as griddb\n<\/code><\/pre>\n<\/div>\n<h2>Extracting Structured Data from PDF Documents<\/h2>\n<p>We will extract structured information from the <a href=\"https:\/\/www.kaggle.com\/datasets\/manisha717\/dataset-of-pdf-files\">PDF files dataset from Kaggle<\/a>. Download the dataset into your local directory and run the following script.<\/p>\n<p>The dataset contains over a thousand files; however, for the sake of testing, we will extract structured information from only 100 documents.<\/p>\n<p>The following script extracts data from the first 100 documents and stores it in a Python list.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\n# https:\/\/www.kaggle.com\/datasets\/manisha717\/dataset-of-pdf-files\n\npdf_dir = Path(\"\/home\/mani\/Datasets\/Pdf\")\n\nloader = PyPDFDirectoryLoader(\n    pdf_dir,\n    recursive=True,\n    silent_errors=True\n)  # silent_errors=True logs a warning instead of failing when a PDF doesn't contain valid text\n\n# first 100 documents that load cleanly\ndocs_iter = loader.lazy_load()              # generator \u2192 1 Document per good PDF\ndocs      = list(islice(docs_iter, 100))\n\ndocs[0]<\/code><\/pre>\n<\/div>\n<p><strong>Output:<\/strong><\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img1-sample-document-text.png\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img1-sample-document-text.png\" alt=\"\" width=\"1770\" height=\"200\" class=\"aligncenter size-full wp-image-32475\" \/><\/a><\/p>\n<p>The above output shows the contents of the first document.<\/p>\n<p>In this article, we will use a large language model (LLM) with structured output in LangChain to extract the title, summary, document type, topic category, and sentiment from each PDF document.<\/p>\n<p>To retrieve structured data, we first have to define the schema of the data we want to extract. 
For example, we will predefine some categories for document type, topic category, and sentiment, as shown in the following script.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\nDOC_TYPES   = (\n    \"report\", \"article\", \"manual\", \"white_paper\",\n    \"thesis\", \"presentation\", \"policy_brief\", \"email\", \"letter\", \"other\",\n)\nTOPIC_CATS  = (\n    \"science\", \"technology\", \"history\", \"business\",\n    \"literature\", \"health\", \"education\", \"art\",\n    \"politics\", \"other\",\n)\nSentiment   = Literal[\"positive\", \"neutral\", \"negative\"]<\/code><\/pre>\n<\/div>\n<p>Next, we will define a Pydantic <code>BaseModel<\/code> subclass, <code>PDFRecord<\/code>, with fields for the structured information we want to extract from the PDF documents. The description of each field tells the LLM what information to store in it.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\nclass PDFRecord(BaseModel):\n    \"\"\"Validated metadata for a single PDF.\"\"\"\n    title: str = Field(\n        ...,\n        description=\"Document title. 
If the text contains no clear title, \"\n                    \"generate a concise 6\u201312-word title that reflects the content.\"\n    )\n    summary: str = Field(\n        ...,\n        description=\"Two- to three-sentence synopsis of the document.\"\n    )\n    doc_type: Literal[DOC_TYPES] = Field(\n        default=\"other\",\n        description=\"Document genre; choose one from: \" + \", \".join(DOC_TYPES)\n    )\n    topic_category: Literal[TOPIC_CATS] = Field(\n        default=\"other\",\n        description=\"Primary subject domain; choose one from: \" + \", \".join(TOPIC_CATS)\n    )\n    sentiment: Sentiment = Field(\n        default=\"neutral\",\n        description=\"Overall tone of the document: positive, neutral, or negative.\"\n    )\n\n    # --- fallback helpers so bad labels never crash validation ---\n    @validator(\"doc_type\", pre=True, always=True)\n    def _doc_fallback(cls, v):\n        return v if v in DOC_TYPES else \"other\"\n\n    @validator(\"topic_category\", pre=True, always=True)\n    def _topic_fallback(cls, v):\n        return v if v in TOPIC_CATS else \"other\"<\/code><\/pre>\n<\/div>\n<p>The next step is to define a prompt that guides the LLM in extracting structured data from each PDF document and returning it in JSON format. LangChain uses the <code>PDFRecord<\/code> class we defined above to parse the LLM&#8217;s JSON response into a validated object.<\/p>\n<p>Notice that the prompt contains the <code>pdf_text<\/code> placeholder. This placeholder will be replaced with the text of the PDF document.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\nprompt = ChatPromptTemplate.from_messages([\n    (\"system\",\n     \"You are a meticulous analyst. 
\"\n     \"Extract only what is explicitly present in the text, \"\n     \"but you MAY generate a succinct title if none exists.\"),\n    (\"human\",\n     f\"\"\"\n**Task**\n  Fill the JSON schema fields shown below.\n\n**Fields**\n  \u2022 title \u2013 exact title if present; otherwise invent a 6-12-word title  \n  \u2022 summary \u2013 2\u20133 sentence synopsis  \n  \u2022 doc_type \u2013 one of: {\", \".join(DOC_TYPES)}  \n  \u2022 topic_category \u2013 one of: {\", \".join(TOPIC_CATS)}  \n  \u2022 sentiment \u2013 positive, neutral, or negative overall tone  \n\n**Rules**\n  \u2013 If a category is uncertain, use \"other\".  \n  \u2013 Respond ONLY in the JSON format supplied automatically.\n\n**Document begins**\n{{pdf_text}}\n\"\"\")\n])<\/code><\/pre>\n<\/div>\n<p>The next step is to define an LLM. We will use the OpenAI <code>gpt-4o-mini<\/code> model and create a <code>ChatOpenAI<\/code> object, which supports chat-like interaction with the LLM. You can use any other model supported by the LangChain framework.<\/p>\n<p>To extract structured data, we call the <code>with_structured_output()<\/code> function on the <code>ChatOpenAI<\/code> object and pass it the <code>PDFRecord<\/code> base model class we defined earlier.<\/p>\n<p>Finally, we combine the prompt and the LLM to create a LangChain runnable object.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\nllm   = ChatOpenAI(model_name=\"gpt-4o-mini\",\n                   openai_api_key = \"YOUR_OPENAI_API_KEY\",\n                   temperature=0)\n\nstructured_llm = llm.with_structured_output(PDFRecord)\nchain = prompt | structured_llm<\/code><\/pre>\n<\/div>\n<p>We will extract the text of each document from the list of PDF documents and invoke the chain we defined. 
Notice that we are passing the PDF text (<code>doc.page_content<\/code>) as the value for the <code>pdf_text<\/code> key, since the prompt contains a placeholder with the same name.<\/p>\n<p>The response from the LLM chain is appended to the <code>rows<\/code> list.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\nrows = []\nfor doc in tqdm(docs, desc=\"Processing PDFs\"):\n    record     = chain.invoke({\"pdf_text\": doc.page_content})  # \u2192 PDFRecord\n    row        = record.dict()              # plain dict\n    row[\"path\"] = doc.metadata[\"source\"]\n    rows.append(row)<\/code><\/pre>\n<\/div>\n<p>The <code>rows<\/code> list now contains one Python dictionary of structured information per PDF document. We convert this list into a Pandas DataFrame and store it as a CSV file for later use.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\ndataset = pd.DataFrame(rows)\ndataset.to_csv(\"pdf_catalog.csv\", index=False)\nprint(\"\u2713 Saved pdf_catalog.csv with\", len(rows), \"rows\")\ndataset.head(10)<\/code><\/pre>\n<\/div>\n<p><strong>Output:<\/strong><\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img2-structured-data-extracted-from-pdfs.png\"><img decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img2-structured-data-extracted-from-pdfs.png\" alt=\"\" width=\"1747\" height=\"460\" class=\"aligncenter size-full wp-image-32476\" \/><\/a><\/p>\n<p>The above output shows the data extracted from the PDF documents. Each row corresponds to a single PDF document. Next, we will insert this data into GridDB.<\/p>\n<h2>Inserting Structured Data from PDF into GridDB<\/h2>\n<p>Inserting data into GridDB is a three-step process. 
You establish a connection with a GridDB host, create a container, and insert data into it.<\/p>\n<h3>Creating a Connection with GridDB<\/h3>\n<p>To create a GridDB connection, call the <code>griddb.StoreFactory.get_instance()<\/code> function to get a factory object. Next, call the <code>get_store()<\/code> function on the factory object and pass it the database host, cluster name, username, and password.<\/p>\n<p>The following script creates a connection to the locally hosted GridDB server and tests the connection by retrieving a dummy container.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\nfactory = griddb.StoreFactory.get_instance()\n\nDB_HOST = \"127.0.0.1:10001\"\nDB_CLUSTER = \"myCluster\"\nDB_USER = \"admin\"\nDB_PASS = \"admin\"\n\ntry:\n    gridstore = factory.get_store(\n        notification_member = DB_HOST,\n        cluster_name = DB_CLUSTER,\n        username = DB_USER,\n        password = DB_PASS\n    )\n\n    container1 = gridstore.get_container(\"container1\")\n    if container1 is None:\n        print(\"Container does not exist\")\n    print(\"Successfully connected to GridDB\")\n\nexcept griddb.GSException as e:\n    for i in range(e.get_error_stack_size()):\n        print(\"[\", i, \"]\")\n        print(e.get_error_code(i))\n        print(e.get_location(i))\n        print(e.get_message(i))<\/code><\/pre>\n<\/div>\n<p><strong>Output:<\/strong><\/p>\n<pre><code>Container does not exist\nSuccessfully connected to GridDB\n<\/code><\/pre>\n<p>If you see the above output, you successfully established a connection with the GridDB server.<\/p>\n<h3>Inserting Data into GridDB<\/h3>\n<p>Next, we will insert the data from our Pandas DataFrame into a GridDB container.<\/p>\n<p>To do so, we define the <code>map_pandas_dtype_to_griddb()<\/code> function, which maps the Pandas column types to GridDB data types.<\/p>\n<p>We iterate through all the column names and types and create a list of lists, each nested list containing a 
column name and the GridDB data type for the column.<\/p>\n<p>Next, we create a <code>ContainerInfo<\/code> object using the container name and the list of column names and types. Since we are storing tabular data, we set the container type to <code>COLLECTION<\/code>.<\/p>\n<p>We then create the container in GridDB using the <code>gridstore.put_container()<\/code> function.<\/p>\n<p>Finally, we iterate through all the rows in our PDF dataset and store them in the container using the <code>put()<\/code> function.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\n# see all GridDB data types: https:\/\/docs.griddb.net\/architecture\/data-model\/#data-type\n\ndef map_pandas_dtype_to_griddb(dtype):\n    if dtype == 'int64':\n        return griddb.Type.LONG\n    elif dtype == 'float64':\n        return griddb.Type.DOUBLE  # pandas float64 is double precision\n    elif dtype == 'object':\n        return griddb.Type.STRING\n    # Add more column types if you need them\n    else:\n        raise ValueError(f'Unsupported pandas type: {dtype}')\n\ncontainer_columns = []\nfor column_name, dtype in dataset.dtypes.items():\n    griddb_dtype = map_pandas_dtype_to_griddb(str(dtype))\n    container_columns.append([column_name, griddb_dtype])\n\ncontainer_name = \"PDFData\"\ncontainer_info = griddb.ContainerInfo(container_name,\n                                      container_columns,\n                                      griddb.ContainerType.COLLECTION, True)\n\ntry:\n    cont = gridstore.put_container(container_info)\n    for index, row in dataset.iterrows():\n        cont.put(row.tolist())\n    print(\"All rows have been successfully stored in the GridDB container.\")\n\nexcept griddb.GSException as e:\n    for i in range(e.get_error_stack_size()):\n        print(\"[\", i, \"]\")\n        print(e.get_error_code(i))\n        print(e.get_location(i))\n        print(e.get_message(i))\n<\/code><\/pre>\n<\/div>\n<p>Next, we will retrieve the data from GridDB and analyze the 
dataset.<\/p>\n<h2>Retrieving Data from GridDB and Performing Analysis<\/h2>\n<p>To retrieve data from a GridDB container, you must first retrieve the container using the <code>get_container()<\/code> function and then execute a TQL query on the container object using the <code>query()<\/code> function, as shown in the script below.<\/p>\n<p>To execute the select query, you need to call the <code>fetch()<\/code> function, and to retrieve the data as a Pandas DataFrame, call the <code>fetch_rows()<\/code> function.<\/p>\n<p>The following script retrieves the structured data from our GridDB container and stores it in the <code>pdf_dataset<\/code> DataFrame.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\ndef retrieve_data_from_griddb(container_name):\n\n    try:\n        data_container = gridstore.get_container(container_name)\n\n        # Query all data from the container\n        query = data_container.query(\"select *\")\n        rs = query.fetch()\n\n        data = rs.fetch_rows()\n        return data\n\n    except griddb.GSException as e:\n        print(f\"Error retrieving data from GridDB: {e.get_message(0)}\")\n        return None\n\n\npdf_dataset = retrieve_data_from_griddb(container_name)\npdf_dataset.head()\n<\/code><\/pre>\n<\/div>\n<p><strong>Output:<\/strong><\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img3-data-retrieved-from-griddb.png\"><img decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img3-data-retrieved-from-griddb.png\" alt=\"\" width=\"1728\" height=\"251\" class=\"aligncenter size-full wp-image-32478\" \/><\/a><\/p>\n<p>The above output shows the data retrieved from our GridDB container.<\/p>\n<p>Once we store data from a GridDB container in a Pandas DataFrame, we can perform various analyses on it.<\/p>\n<p>Let&#8217;s use a pie chart to see the distribution of topic categories across the PDF documents.<\/p>\n<div class=\"clipboard\">\n<pre><code 
class=\"language-python\">\npdf_dataset[\"topic_category\"].value_counts().plot.pie(autopct=\"%1.1f%%\")\nplt.title(\"Distribution of topic categories\")\nplt.ylabel(\"\")\nplt.show()<\/code><\/pre>\n<\/div>\n<p><strong>Output:<\/strong><\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img4-distribution-of-topic-categories.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img4-distribution-of-topic-categories.png\" alt=\"\" width=\"560\" height=\"498\" class=\"aligncenter size-full wp-image-32479\" \/><\/a><\/p>\n<p>The output shows that the majority of documents are related to science, followed by business.<\/p>\n<p>Next, we can plot the distribution of document types using a donut chart.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\npdf_dataset[\"doc_type\"].value_counts().plot.pie(\n    autopct=\"%1.1f%%\",\n    wedgeprops=dict(width=0.50)   # makes the \u201cdonut\u201d hole\n)\nplt.title(\"Document type\")\nplt.ylabel(\"\")\nplt.gca().set_aspect(\"equal\")\nplt.show()<\/code><\/pre>\n<\/div>\n<p><strong>Output:<\/strong><\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img5-distribution-of-document-type.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img5-distribution-of-document-type.png\" alt=\"\" width=\"531\" height=\"511\" class=\"aligncenter size-full wp-image-32480\" \/><\/a><\/p>\n<p>The output shows that the majority of documents are reports.<\/p>\n<p>Finally, we can plot the sentiments expressed in the documents as a bar plot.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-python\">\npdf_dataset[\"sentiment\"].value_counts().plot.bar()\nplt.title(\"Distribution of sentiment values\")\nplt.xlabel(\"sentiment\")\nplt.ylabel(\"count\")\nplt.tight_layout()\nplt.show()<\/code><\/pre>\n<\/div>\n<p><strong>Output:<\/strong><\/p>\n<p><a 
href=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img6-distribution-of-document-sentiment-values.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/10\/img6-distribution-of-document-sentiment-values.png\" alt=\"\" width=\"783\" height=\"587\" class=\"aligncenter size-full wp-image-32481\" \/><\/a><\/p>\n<p>The above output shows that most of the documents express a neutral sentiment.<\/p>\n<h2>Conclusion<\/h2>\n<p>This article explained how to build a complete pipeline for extracting metadata from unstructured PDF documents using LLMs and storing the result in GridDB. You explored using LangChain with OpenAI&#8217;s GPT-4o-mini model to extract key information such as document title, summary, type, category, and sentiment, and how to save this structured output into a GridDB container.<\/p>\n<p>The combination of LLM-driven data extraction and GridDB&#8217;s performance-oriented architecture makes this approach suitable for intelligent document processing in real-time applications.<\/p>\n<p>If you have questions or need assistance with GridDB, please ask on Stack Overflow using the <code>griddb<\/code> tag. Our team is always happy to help.<\/p>\n<p>For the complete code, visit my <a href=\"https:\/\/github.com\/usmanmalik57\/GridDB-Blogs\/tree\/main\/Inserting%20Structured%20Information%20from%20PDF%20documents%20into%20GridDB%20using%20LLMs\">GridDB Blogs GitHub repository<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMs), it is now possible to automate the process of structured data extraction from text documents. 
In this article, you will learn how to extract structured data from PDF documents using LLMs in LangChain and store [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":52250,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[121],"tags":[],"class_list":["post-52249","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Inserting Structured Information from PDF documents into GridDB using LLMs | GridDB: Open Source Time Series Database for IoT<\/title>\n<meta name=\"description\" content=\"Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMS), it is now\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Inserting Structured Information from PDF documents into GridDB using LLMs | GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"og:description\" content=\"Extracting meaningful insights from a large corpus of documents is a challenging task. 
With advancements in Large Language Models (LLMS), it is now\" \/>\n<meta property=\"og:url\" content=\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/\" \/>\n<meta property=\"og:site_name\" content=\"GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/griddbcommunity\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-10-10T07:00:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/griddb.net\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png\" \/>\n\t<meta property=\"og:image:width\" content=\"783\" \/>\n\t<meta property=\"og:image:height\" content=\"587\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"griddb-admin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:site\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"griddb-admin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/\"},\"author\":{\"name\":\"griddb-admin\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233\"},\"headline\":\"Inserting Structured Information from PDF documents into GridDB using LLMs\",\"datePublished\":\"2025-10-10T07:00:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/\"},\"wordCount\":1212,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/griddb.net\/en\/#organization\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/\",\"url\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/\",\"name\":\"Inserting Structured Information from PDF documents into GridDB using LLMs | GridDB: Open Source Time Series Database for 
IoT\",\"isPartOf\":{\"@id\":\"https:\/\/griddb.net\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png\",\"datePublished\":\"2025-10-10T07:00:00+00:00\",\"description\":\"Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMS), it is now\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage\",\"url\":\"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png\",\"contentUrl\":\"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png\",\"width\":783,\"height\":587},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/griddb.net\/en\/#website\",\"url\":\"https:\/\/griddb.net\/en\/\",\"name\":\"GridDB: Open Source Time Series Database for IoT\",\"description\":\"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL\",\"publisher\":{\"@id\":\"https:\/\/griddb.net\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/griddb.net\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/griddb.net\/en\/#organization\",\"name\":\"Fixstars\",\"url\":\"https:\/\/griddb.net\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"contentUrl\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"width\":200,\"height\":83,\"caption\":\"Fixstars\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/griddbcommunity\/\",\"https:\/\/x.com\/GridDBCommunity\",\"https:\/\/www.linkedin.com\/company\/griddb-by-toshiba\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233\",\"name\":\"griddb-admin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g\",\"caption\":\"griddb-admin\"},\"url\":\"https:\/\/griddb.net\/en\/author\/griddb-admin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Inserting Structured Information from PDF documents into GridDB using LLMs | GridDB: Open Source Time Series Database for IoT","description":"Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMS), it is now","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/","og_locale":"en_US","og_type":"article","og_title":"Inserting Structured Information from PDF documents into GridDB using LLMs | GridDB: Open Source Time Series Database for IoT","og_description":"Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMS), it is now","og_url":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/","og_site_name":"GridDB: Open Source Time Series Database for IoT","article_publisher":"https:\/\/www.facebook.com\/griddbcommunity\/","article_published_time":"2025-10-10T07:00:00+00:00","og_image":[{"width":783,"height":587,"url":"https:\/\/griddb.net\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png","type":"image\/png"}],"author":"griddb-admin","twitter_card":"summary_large_image","twitter_creator":"@GridDBCommunity","twitter_site":"@GridDBCommunity","twitter_misc":{"Written by":"griddb-admin","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#article","isPartOf":{"@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/"},"author":{"name":"griddb-admin","@id":"https:\/\/griddb.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233"},"headline":"Inserting Structured Information from PDF documents into GridDB using LLMs","datePublished":"2025-10-10T07:00:00+00:00","mainEntityOfPage":{"@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/"},"wordCount":1212,"commentCount":0,"publisher":{"@id":"https:\/\/griddb.net\/en\/#organization"},"image":{"@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png","articleSection":["Blog"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/","url":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/","name":"Inserting Structured Information from PDF documents into GridDB using LLMs | GridDB: Open Source Time Series Database for 
IoT","isPartOf":{"@id":"https:\/\/griddb.net\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage"},"image":{"@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png","datePublished":"2025-10-10T07:00:00+00:00","description":"Extracting meaningful insights from a large corpus of documents is a challenging task. With advancements in Large Language Models (LLMS), it is now","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb.net\/en\/blog\/inserting-structured-information-from-pdf-documents-into-griddb-using-llms\/#primaryimage","url":"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png","contentUrl":"\/wp-content\/uploads\/2025\/12\/img6-distribution-of-document-sentiment-values.png","width":783,"height":587},{"@type":"WebSite","@id":"https:\/\/griddb.net\/en\/#website","url":"https:\/\/griddb.net\/en\/","name":"GridDB: Open Source Time Series Database for IoT","description":"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL","publisher":{"@id":"https:\/\/griddb.net\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/griddb.net\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/griddb.net\/en\/#organization","name":"Fixstars","url":"https:\/\/griddb.net\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb.net\/en\/#\/schema\/logo\/image\/","url":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","contentUrl":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","width":200,"height":83,"caption":"Fixstars"},"image":{"@id":"https:\/\/griddb.net\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/griddbcommunity\/","https:\/\/x.com\/GridDBCommunity","https:\/\/www.linkedin.com\/company\/griddb-by-toshiba"]},{"@type":"Person","@id":"https:\/\/griddb.net\/en\/#\/schema\/person\/4fe914ca9576878e82f5e8dd3ba52233","name":"griddb-admin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb.net\/en\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5bceca1cafc06886a7ba873e2f0a28011a1176c4dea59709f735b63ae30d0342?s=96&d=mm&r=g","caption":"griddb-admin"},"url":"https:\/\/griddb.net\/en\/author\/griddb-admin\/"}]}},"_links":{"self":[{"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/posts\/52249","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable
":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/comments?post=52249"}],"version-history":[{"count":0,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/posts\/52249\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/media\/52250"}],"wp:attachment":[{"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/media?parent=52249"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/categories?post=52249"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/tags?post=52249"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}