{"id":46654,"date":"2021-07-21T00:00:00","date_gmt":"2021-07-21T07:00:00","guid":{"rendered":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/"},"modified":"2025-11-13T12:55:26","modified_gmt":"2025-11-13T20:55:26","slug":"web-scraping-with-jsoup-and-griddb-in-java","status":"publish","type":"post","link":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/","title":{"rendered":"Web Scraping with Jsoup and GridDB in Java"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. To access data from such sites, we use <em>web scraping<\/em>.<\/p>\n<p>Web scraping is a technique used to extract data from website content. The data is normally extracted from the HTML elements of the respective website.<\/p>\n<p>Suppose you&#8217;re looking for a job as a Java programmer in Washington, DC. Searching for openings manually is tedious and forces you to invest a lot of time in the hunt.<\/p>\n<p>To make the process easier, you can automate it by building a web scraper with Jsoup. The scraper&#8217;s job will be to extract data about openings from the job listing websites of your choice and store it in a database such as GridDB.<\/p>\n<p>Web scraping can speed up the data collection process and save you time. 
In this article, I will show you how to scrape data from websites using Jsoup in Java and store the data in GridDB.<\/p>\n<h3>Prerequisites<\/h3>\n<ul>\n<li>\n<p><a href=\"https:\/\/docs.griddb.net\/gettingstarted\/using-source-code\/\">GridDB<\/a><\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/www.java.com\/en\/download\/help\/download_options.html\">Java<\/a><\/p>\n<\/li>\n<\/ul>\n<h2>What is Jsoup?<\/h2>\n<p>Jsoup is a Java library that provides methods for extracting and manipulating HTML document content. Jsoup is open source and was developed by Jonathan Hedley in 2009. If you are comfortable with jQuery, working with Jsoup should be a walk in the park. This is what you can do with Jsoup:<\/p>\n<ul>\n<li>\n<p>Scraping and parsing HTML from a file, URL, or string<\/p>\n<\/li>\n<li>\n<p>Finding and extracting data using CSS selectors or DOM traversal<\/p>\n<\/li>\n<li>\n<p>Manipulating HTML elements, text, and attributes<\/p>\n<\/li>\n<li>\n<p>Outputting tidy HTML<\/p>\n<\/li>\n<\/ul>\n<h2>How to Add Jsoup to your Project<\/h2>\n<p>To use the Jsoup library, you must first add it to your Java project. Download its jar file from the Jsoup site and then reference it in your project. Below is the URL of the Jsoup download page:<\/p>\n<p>https:\/\/jsoup.org\/download<\/p>\n<p>In Eclipse, follow the steps given below:<\/p>\n<p><strong>Step 1:<\/strong> Right-click the project name in the Project Explorer and choose &#8220;Properties&#8221; from the menu that pops up.<\/p>\n<p><strong>Step 2:<\/strong> Do the following in the Properties dialog:<\/p>\n<ol>\n<li>\n<p>Select &#8220;Java Build Path&#8221; from the list on the left<\/p>\n<\/li>\n<li>\n<p>Click the &#8220;Libraries&#8221; tab<\/p>\n<\/li>\n<li>\n<p>Click the &#8220;Add External JARs&#8230;&#8221; button, then navigate to where you have stored the Jsoup jar file. 
Click the &#8220;Open&#8221; button.<\/p>\n<\/li>\n<\/ol>\n<p><strong>Step 3:<\/strong> Click &#8220;OK&#8221; to close the dialog box.<\/p>\n<p>Alternatively, you can compile and run against the Jsoup jar directly from the terminal of your operating system, as you will see later in this article.<\/p>\n<h2>Import the Packages<\/h2>\n<p>We should first import all the libraries that the project needs. Note that we import GridDB&#8217;s <code>Collection<\/code> class rather than <code>java.util.Collection<\/code>; importing both would cause a compile error because the two simple names clash. This is shown below:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">import java.io.IOException;\nimport java.util.Properties;\n\nimport org.jsoup.Jsoup;\nimport org.jsoup.nodes.Document;\nimport org.jsoup.nodes.Element;\nimport org.jsoup.select.Elements;\n\nimport com.toshiba.mwcloud.gs.Collection;\nimport com.toshiba.mwcloud.gs.GSException;\nimport com.toshiba.mwcloud.gs.GridStore;\nimport com.toshiba.mwcloud.gs.GridStoreFactory;\nimport com.toshiba.mwcloud.gs.Query;\nimport com.toshiba.mwcloud.gs.RowKey;\nimport com.toshiba.mwcloud.gs.RowSet;<\/code><\/pre>\n<\/div>\n<p>Next, I will show you how to fetch content from a web page using Jsoup.<\/p>\n<h2>How to Fetch the Page<\/h2>\n<p>To work with the DOM, you need parsable document markup. Jsoup uses the <code>org.jsoup.nodes.Document<\/code> object to represent web pages.<\/p>\n<p>The first step towards fetching a web page is establishing a connection to the resource. Next, you call the <code>get()<\/code> method to retrieve the contents of the web page.<\/p>\n<p>In this article, we will be scraping the Java subreddit at the following URL:<\/p>\n<p>https:\/\/www.reddit.com\/r\/java\/<\/p>\n<p>Let us fetch the page:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Document subreddit = Jsoup.connect(\"https:\/\/www.reddit.com\/r\/java\/\").get();<\/code><\/pre>\n<\/div>\n<p>Some websites don&#8217;t allow crawling by unknown user agents. 
To prevent the web scraper from being blocked, let us identify ourselves with the Mozilla user agent:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Document subreddit = Jsoup.connect(\"https:\/\/www.reddit.com\/r\/java\/\").userAgent(\"Mozilla\").get();<\/code><\/pre>\n<\/div>\n<p>We have created a Document container and given it the name <code>subreddit<\/code> to store the HTML contents of the web page.<\/p>\n<p>If you print out the contents of the <code>subreddit<\/code> container, you should get the HTML contents of the web page, that is, the Java subreddit. Simply run the following statement:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">System.out.println(subreddit);<\/code><\/pre>\n<\/div>\n<p>In our case, we are only interested in fetching the titles of posts and the time each post was created. We will locate them by their HTML tags and class names. Each title is stored within an <code>&lt;h3&gt;<\/code> tag with a class name, as shown below:<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2021\/07\/titles.png\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2021\/07\/titles.png\" alt=\"\" width=\"494\" height=\"310\" class=\"aligncenter size-full wp-image-27625\" srcset=\"\/wp-content\/uploads\/2021\/07\/titles.png 494w, \/wp-content\/uploads\/2021\/07\/titles-300x188.png 300w\" sizes=\"(max-width: 494px) 100vw, 494px\" \/><\/a><\/p>\n<p>Each creation time is stored within an <code>&lt;a&gt;<\/code> tag with a class name, as shown below:<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2021\/07\/time.png\"><img decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2021\/07\/time.png\" alt=\"\" width=\"501\" height=\"336\" class=\"aligncenter size-full wp-image-27624\" srcset=\"\/wp-content\/uploads\/2021\/07\/time.png 501w, \/wp-content\/uploads\/2021\/07\/time-300x201.png 300w\" sizes=\"(max-width: 501px) 
100vw, 501px\" \/><\/a><\/p>\n<p>We will use the above tags and class names to scrape the post titles and creation times from the website. The following is the Java code for scraping the titles:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Elements titles = subreddit.select(\"h3[class]._eYtD2XCVieq6emjKBH3m\");\nfor (Element title : titles) {\n    System.out.println(\"Title: \" + title.text());\n}<\/code><\/pre>\n<\/div>\n<p>We have created an <code>Elements<\/code> object named <code>titles<\/code> in which all the matching titles are stored. We have then created a <code>for<\/code> loop to iterate over the items stored in it and print them out. This gives us access to the individual post titles.<\/p>\n<p>The following is the Java code for scraping the times the posts were created:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Elements times = subreddit.select(\"a[class]._3jOxDPIQ0KaOWpzvSQo-1s\");\nfor (Element time : times) {\n    System.out.println(\"Time Posted: \" + time.text());\n}<\/code><\/pre>\n<\/div>\n<p>We have created an <code>Elements<\/code> object named <code>times<\/code> to store all the times extracted from the site, and again used a <code>for<\/code> loop to iterate over it and print each item.<\/p>\n<p>Let us now put the code together. The two lists line up one-to-one, so we iterate over them in parallel by index; nesting one loop inside the other would wrongly pair every title with every time. We also surround the code with a <code>try<\/code> and <code>catch<\/code> block, because <code>get()<\/code> can throw an <code>IOException<\/code>. This is shown below:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">try {\n    Document subreddit = Jsoup.connect(\"https:\/\/www.reddit.com\/r\/java\/\").userAgent(\"Mozilla\").get();\n\n    Elements titles = subreddit.select(\"h3[class]._eYtD2XCVieq6emjKBH3m\");\n    Elements times = subreddit.select(\"a[class]._3jOxDPIQ0KaOWpzvSQo-1s\");\n\n    int count = Math.min(titles.size(), times.size());\n    for (int i = 0; i &lt; count; i++) {\n        Element title = titles.get(i);\n        Element time = times.get(i);\n\n        System.out.println(\"Title: \" + title.text());\n        System.out.println(\"Time Posted: \" + time.text());\n    }\n} catch (IOException e) {\n    e.printStackTrace();\n}<\/code><\/pre>\n<\/div>\n<p>If you run the code at this point, you should see the text that has been scraped from the site, including post titles and the times they were created.<\/p>\n<h2>Store Scraped Data in GridDB<\/h2>\n<p>Now that we have successfully scraped data from the site, it&#8217;s time to store it in GridDB.<\/p>\n<p>First, let&#8217;s create the container schema as a static class:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">public static class Post {\n\n    @RowKey String post_title;\n    String when;\n\n}<\/code><\/pre>\n<\/div>\n<p>The above class represents a container in our cluster. You can think of it as a SQL table.<\/p>\n<p>To establish a connection to GridDB, we have to create a Properties instance with the particulars of our GridDB installation: the name of the cluster to connect to, the name of the user who needs to connect, and the password for that user. The following code demonstrates this:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Properties props = new Properties();\nprops.setProperty(\"notificationAddress\", \"239.0.0.1\");\nprops.setProperty(\"notificationPort\", \"31999\");\nprops.setProperty(\"clusterName\", \"defaultCluster\");\nprops.setProperty(\"user\", \"admin\");\nprops.setProperty(\"password\", \"admin\");\nGridStore store = GridStoreFactory.getInstance().getGridStore(props);<\/code><\/pre>\n<\/div>\n<p>For us to start running queries, we have to get the respective container. 
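<\/p>\n<p>As an aside, the <code>Query<\/code> and <code>RowSet<\/code> classes we imported earlier let you read the stored posts back out with a TQL query. The snippet below is a minimal sketch, assuming the <code>coll<\/code> collection instance obtained below:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Query&lt;Post> query = coll.query(\"select *\");\nRowSet&lt;Post> rs = query.fetch();\nwhile (rs.hasNext()) {\n    Post post = rs.next();\n    System.out.println(post.post_title + \" | \" + post.when);\n}<\/code><\/pre>\n<\/div>\n<p>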
Remember that our schema class is <code>Post<\/code>. Let us use it to create (or fetch) a collection named <code>col01<\/code>:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Collection&lt;String, Post> coll = store.putCollection(\"col01\", Post.class);<\/code><\/pre>\n<\/div>\n<p>We have created an instance of the container and given it the name <code>coll<\/code>. This is what we will be using to refer to the container.<\/p>\n<p>Next, let us create indexes for each of the two columns of the container:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">coll.createIndex(\"post_title\");\ncoll.createIndex(\"when\");\n\ncoll.setAutoCommit(false);<\/code><\/pre>\n<\/div>\n<p>Note that we&#8217;ve also set autocommit to <code>false<\/code>. This means that we will have to commit changes manually.<\/p>\n<p>Next, we will create an instance of our schema class <code>Post<\/code>, fill it with scraped values, and insert it into the container. This code belongs inside the scraping loop, where <code>title<\/code> and <code>time<\/code> are in scope:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">Post post = new Post();\npost.post_title = title.text();\npost.when = time.text();\n\ncoll.put(post);\ncoll.commit();<\/code><\/pre>\n<\/div>\n<p>Because autocommit is disabled, the row is only persisted once <code>commit()<\/code> is called.<\/p>\n<h2>Compile the Code<\/h2>\n<p>First, log in as the <code>gsadm<\/code> user. 
Move your Java source file and the Jsoup jar file to the <code>bin<\/code> folder of your GridDB installation, located at the following path:<\/p>\n<p><code>\/griddb_4.6.0-1_amd64\/usr\/griddb-4.6.0\/bin<\/code><\/p>\n<p>Next, run the following command in your Linux terminal to add the gridstore.jar file to your classpath:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">export CLASSPATH=$CLASSPATH:\/home\/osboxes\/Downloads\/griddb_4.6.0-1_amd64\/usr\/griddb-4.6.0\/bin\/gridstore.jar<\/code><\/pre>\n<\/div>\n<p>Next, navigate to the above directory and run the following command to compile your <code>WebScraping.java<\/code> file. Because the <code>-cp<\/code> option overrides the <code>CLASSPATH<\/code> environment variable, we include <code>$CLASSPATH<\/code> explicitly alongside the Jsoup jar:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">javac -cp $CLASSPATH:jsoup-1.13.1.jar WebScraping.java<\/code><\/pre>\n<\/div>\n<p>Run the generated .class file with the same classpath, plus the current directory:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-java\">java -cp $CLASSPATH:jsoup-1.13.1.jar:. WebScraping<\/code><\/pre>\n<\/div>\n<p>Congratulations!<\/p>\n<p>That&#8217;s how to scrape data from a website using Jsoup in Java and insert the data into GridDB.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. To access data from such sites, we use web scraping. Web scraping is a technique used to extract data from website content. 
The data is normally extracted from the HTML elements of the respective [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":27682,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[121],"tags":[],"class_list":["post-46654","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Web Scraping with Jsoup and GridDB in Java | GridDB: Open Source Time Series Database for IoT<\/title>\n<meta name=\"description\" content=\"Introduction Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. To access data from\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Web Scraping with Jsoup and GridDB in Java | GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"og:description\" content=\"Introduction Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. 
To access data from\" \/>\n<meta property=\"og:url\" content=\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/\" \/>\n<meta property=\"og:site_name\" content=\"GridDB: Open Source Time Series Database for IoT\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/griddbcommunity\/\" \/>\n<meta property=\"article:published_time\" content=\"2021-07-21T07:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-13T20:55:26+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/griddb.net\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1723\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Israel\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:site\" content=\"@GridDBCommunity\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Israel\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/\"},\"author\":{\"name\":\"Israel\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/c8a430e7156a9e10af73b1fbb46c2740\"},\"headline\":\"Web Scraping with Jsoup and GridDB in Java\",\"datePublished\":\"2021-07-21T07:00:00+00:00\",\"dateModified\":\"2025-11-13T20:55:26+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/\"},\"wordCount\":1190,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/\",\"url\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/\",\"name\":\"Web Scraping with Jsoup and GridDB in Java | GridDB: Open Source Time Series Database for 
IoT\",\"isPartOf\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage\"},\"thumbnailUrl\":\"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg\",\"datePublished\":\"2021-07-21T07:00:00+00:00\",\"dateModified\":\"2025-11-13T20:55:26+00:00\",\"description\":\"Introduction Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. To access data from\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage\",\"url\":\"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg\",\"contentUrl\":\"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg\",\"width\":2560,\"height\":1723},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website\",\"url\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/\",\"name\":\"GridDB: Open Source Time Series Database for IoT\",\"description\":\"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL\",\"publisher\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization\",\"name\":\"Fixstars\",\"url\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"contentUrl\":\"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png\",\"width\":200,\"height\":83,\"caption\":\"Fixstars\"},\"image\":{\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/griddbcommunity\/\",\"https:\/\/x.com\/GridDBCommunity\",\"https:\/\/www.linkedin.com\/company\/griddb-by-toshiba\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/c8a430e7156a9e10af73b1fbb46c2740\",\"name\":\"Israel\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/4df8cfc155402a2928d11f80b0220037b8bd26c4f1b19c4598d826e0306e6307?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/4df8cfc155402a2928d11f80b0220037b8bd26c4f1b19c4598
d826e0306e6307?s=96&d=mm&r=g\",\"caption\":\"Israel\"},\"url\":\"https:\/\/griddb.net\/en\/author\/israel\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Web Scraping with Jsoup and GridDB in Java | GridDB: Open Source Time Series Database for IoT","description":"Introduction Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. To access data from","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/","og_locale":"en_US","og_type":"article","og_title":"Web Scraping with Jsoup and GridDB in Java | GridDB: Open Source Time Series Database for IoT","og_description":"Introduction Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. To access data from","og_url":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/","og_site_name":"GridDB: Open Source Time Series Database for IoT","article_publisher":"https:\/\/www.facebook.com\/griddbcommunity\/","article_published_time":"2021-07-21T07:00:00+00:00","article_modified_time":"2025-11-13T20:55:26+00:00","og_image":[{"width":2560,"height":1723,"url":"https:\/\/griddb.net\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg","type":"image\/jpeg"}],"author":"Israel","twitter_card":"summary_large_image","twitter_creator":"@GridDBCommunity","twitter_site":"@GridDBCommunity","twitter_misc":{"Written by":"Israel","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#article","isPartOf":{"@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/"},"author":{"name":"Israel","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/c8a430e7156a9e10af73b1fbb46c2740"},"headline":"Web Scraping with Jsoup and GridDB in Java","datePublished":"2021-07-21T07:00:00+00:00","dateModified":"2025-11-13T20:55:26+00:00","mainEntityOfPage":{"@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/"},"wordCount":1190,"commentCount":0,"publisher":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization"},"image":{"@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg","articleSection":["Blog"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/","url":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/","name":"Web Scraping with Jsoup and GridDB in Java | GridDB: Open Source Time Series Database for 
IoT","isPartOf":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage"},"image":{"@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage"},"thumbnailUrl":"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg","datePublished":"2021-07-21T07:00:00+00:00","dateModified":"2025-11-13T20:55:26+00:00","description":"Introduction Most websites make their data available to users via APIs. However, there are websites that have not developed such APIs. To access data from","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb.net\/en\/blog\/web-scraping-with-jsoup-and-griddb-in-java\/#primaryimage","url":"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg","contentUrl":"\/wp-content\/uploads\/2021\/07\/57720-bowl_2560x1723.jpeg","width":2560,"height":1723},{"@type":"WebSite","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#website","url":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/","name":"GridDB: Open Source Time Series Database for IoT","description":"GridDB is an open source time-series database with the performance of NoSQL and convenience of 
SQL","publisher":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#organization","name":"Fixstars","url":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/","url":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","contentUrl":"https:\/\/griddb.net\/wp-content\/uploads\/2019\/04\/fixstars_logo_web_tagline.png","width":200,"height":83,"caption":"Fixstars"},"image":{"@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/griddbcommunity\/","https:\/\/x.com\/GridDBCommunity","https:\/\/www.linkedin.com\/company\/griddb-by-toshiba"]},{"@type":"Person","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/c8a430e7156a9e10af73b1fbb46c2740","name":"Israel","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/en\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/4df8cfc155402a2928d11f80b0220037b8bd26c4f1b19c4598d826e0306e6307?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/4df8cfc155402a2928d11f80b0220037b8bd26c4f1b19c4598d826e0306e6307?s=96&d=mm&r=g","caption":"Israel"},"url":"https:\/\/griddb.net\/en\/author\/israel\/"}]}},"_links":{"self":[{"href":"
https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/posts\/46654","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/comments?post=46654"}],"version-history":[{"count":1,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/posts\/46654\/revisions"}],"predecessor-version":[{"id":51329,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/posts\/46654\/revisions\/51329"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/media\/27682"}],"wp:attachment":[{"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/media?parent=46654"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/categories?post=46654"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/griddb.net\/en\/wp-json\/wp\/v2\/tags?post=46654"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}