{"id":46650,"date":"2021-06-05T00:00:00","date_gmt":"2021-06-05T07:00:00","guid":{"rendered":"https:\/\/griddb-linux-hte8hndjf8cka8ht.westus-01.azurewebsites.net\/blog\/collecting-data-using-scrapy-and-griddb\/"},"modified":"2025-11-13T12:55:24","modified_gmt":"2025-11-13T20:55:24","slug":"collecting-data-using-scrapy-and-griddb","status":"publish","type":"post","link":"https:\/\/griddb.net\/en\/blog\/collecting-data-using-scrapy-and-griddb\/","title":{"rendered":"Collecting Data using Scrapy and GridDB"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>Today, we will cover how to scrape data from any website using Python\u2019s Scrapy library. We will then save the data in a JSON file. Finally, we will see how we can also store this data in GridDB for long-term and efficient use.<\/p>\n<h2>Pre-requisites<\/h2>\n<p>This post requires the prior installation of the following:<\/p>\n<ol>\n<li><a href=\"https:\/\/www.python.org\/downloads\/\">Python 3.6+<\/a><\/li>\n<li><a href=\"https:\/\/docs.scrapy.org\/en\/latest\/intro\/install.html#installation-guide\">Scrapy<\/a><\/li>\n<li><a href=\"https:\/\/griddb.net\/en\/blog\/griddb-quickstart\/\">GridDB<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/griddb\/c_client\">GridDB C-client<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/griddb\/python_client\">GridDB python-client<\/a><\/li>\n<\/ol>\n<p>We also recommend installing <a href=\"https:\/\/www.anaconda.com\/products\/individual\/get-started\">Anaconda Navigator<\/a>, if not already installed. Anaconda provides a wide range of tools for data scientists to experiment with. In addition, a virtual environment lets you meet an application\u2019s specific version requirements without interfering with the system paths.<\/p>\n<h2>Creating a new project using Scrapy<\/h2>\n<p>For this tutorial, we will be using Anaconda\u2019s Command Line Interface and Jupyter Notebooks. 
Both of these tools can be found in the Anaconda Dashboard.<\/p>\n<p><a href=\"https:\/\/griddb.net\/wp-content\/uploads\/2021\/06\/unnamed.png\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/griddb.net\/wp-content\/uploads\/2021\/06\/unnamed.png\" alt=\"\" width=\"1366\" height=\"728\" class=\"aligncenter size-full wp-image-27518\" srcset=\"\/wp-content\/uploads\/2021\/06\/unnamed.png 1366w, \/wp-content\/uploads\/2021\/06\/unnamed-300x160.png 300w, \/wp-content\/uploads\/2021\/06\/unnamed-1024x546.png 1024w, \/wp-content\/uploads\/2021\/06\/unnamed-768x409.png 768w, \/wp-content\/uploads\/2021\/06\/unnamed-600x320.png 600w\" sizes=\"(max-width: 1366px) 100vw, 1366px\" \/><\/a><\/p>\n<p>Creating a new project with scrapy is simple. Just type the following command within the directory you wish to create a new project folder in:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">scrapy startproject griddb_tutorial\n<\/code><\/pre>\n<\/div>\n<p>A new folder with the name <code>griddb_tutorial<\/code> is now created in the current directory. Let us look at the contents of this folder:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">tree directory_path \/F\n<\/code><\/pre>\n<\/div>\n<h2>Extracting data from a URL<\/h2>\n<p>Scrapy uses a class called <code>Spider<\/code> to crawl websites and extract information. We can write our custom code and mention initial requests inside this <code>Spider<\/code> class. 
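<\/p>\n<p>A spider\u2019s <code>parse()<\/code> method is a generator: every <code>yield<\/code> produces one scraped item, which Scrapy collects into its output feed. Stripped of Scrapy itself, the pattern looks like this (the sample data here is invented purely for illustration):<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">def parse(quotes):\n    # 'quotes' stands in for the elements a CSS selector would match\n    for text, author in quotes:\n        yield {'text': text, 'author': author}\n\nitems = list(parse([('All you need is love.', 'Lennon')]))\nprint(items)\n# [{'text': 'All you need is love.', 'author': 'Lennon'}]<\/code><\/pre>\n<\/div>\n<p>Scrapy calls <code>parse()<\/code> with a <code>response<\/code> object instead of a plain list, but the item-yielding shape is the same.<\/p>\n<p>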
For this tutorial, we will be scraping funny quotes from the website <a href=\"http:\/\/quotes.toscrape.com\/\">quotes.toscrape.com<\/a> and storing the information in JSON format.<\/p>\n<p>The following lines of code collect the text, author, and tags associated with each quote.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">import scrapy\n\n\nclass QuotesSpider(scrapy.Spider):\n    name = \"quotes_funny\"\n    start_urls = [\n        'http:\/\/quotes.toscrape.com\/tag\/humor',\n    ]\n\n    def parse(self, response):\n        for quote in response.css('div.quote'):\n            yield {\n                'text': quote.css('span.text::text').get(),\n                'author': quote.css('small.author::text').get(),\n                'tags': quote.css('div.tags a.tag::text').getall(),\n            }<\/code><\/pre>\n<\/div>\n<p>We will now save this Python file in the <code>griddb_tutorial\/griddb_tutorial\/spiders<\/code> directory. We execute a spider by passing its name to Scrapy on the command line, so it is important to give each spider a unique name.<\/p>\n<p>Coming back to the project\u2019s root directory, let\u2019s run this spider and see what we get:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">scrapy crawl quotes_funny<\/code><\/pre>\n<\/div>\n<p>It takes some time to extract the data. 
Once the execution is completed, our output looks like this:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">DEBUG: Scraped from &lt;200 http:\/\/quotes.toscrape.com\/tag\/humor\/>\n{'text': '\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d', 'author': 'George Carlin', 'tags': ['humor', 'insanity', 'lies', 'lying', 'self-indulgence', 'truth']}\n2021-05-29 21:29:44 [scrapy.core.engine] INFO: Closing spider (finished)<\/code><\/pre>\n<\/div>\n<h2>Storing data into JSON<\/h2>\n<p>To store the crawled data, we modify the above command by passing an additional parameter:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">scrapy crawl quotes_funny -O quotes_funny.json<\/code><\/pre>\n<\/div>\n<p>This will create a new file named <code>quotes_funny.json<\/code> in the home directory. Note that the <code>-O<\/code> flag overwrites any existing file with the same name. In case you want to append new content to an existing <code>JSON<\/code> file, use <code>-o<\/code> instead.<\/p>\n<p>The content of the <code>JSON<\/code> file will look like this:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">[\n{\"text\": \"\\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\\u201d\", \"author\": \"Jane Austen\", \"tags\": [\"aliteracy\", \"books\", \"classic\", \"humor\"]},\n{\"text\": \"\\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\\u201d\", \"author\": \"Charles M. Schulz\", \"tags\": [\"chocolate\", \"food\", \"humor\"]},\n{\"text\": \"\\u201cThe reason I talk to myself is because I\\u2019m the only one whose answers I accept.\\u201d\", \"author\": \"George Carlin\", \"tags\": [\"humor\", \"insanity\", \"lies\", \"lying\", \"self-indulgence\", \"truth\"]}\n]<\/code><\/pre>\n<\/div>\n<h2>Storing data into GridDB<\/h2>\n<p>If you\u2019re collecting continuous data over time, it is wise to store it in a database. <a href=\"https:\/\/griddb.net\/en\/\">GridDB<\/a> allows you to store time-series data and is specially optimized for IoT and Big Data. It is highly scalable and offers both SQL and NoSQL interfaces. Follow the <a href=\"https:\/\/github.com\/griddb\/griddb\">tutorial<\/a> to get started.<\/p>\n<p>We have our data collected by Scrapy in a JSON file. Moreover, most websites allow you to export data in JSON format. Therefore, we will write a Python script to load the data into our environment.<\/p>\n<h3>Reading data from a JSON file<\/h3>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">import json\n\nwith open('quotes_funny.json') as f:\n    data = json.load(f)\n\nprint(data[0])<\/code><\/pre>\n<\/div>\n<p>We will get an output similar to<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">{'text': '\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d',\n 'author': 'Albert Einstein',\n 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}<\/code><\/pre>\n<\/div>\n<p>To extract all the key-value pairs:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">for d in data:\n    for key, value in d.items():\n        print(key, value)<\/code><\/pre>\n<\/div>\n<p>Now that we have seen each key-value pair, let us initialize a GridDB instance.<\/p>\n<h3>GridDB container initialization<\/h3>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">import griddb_python as griddb\n\nfactory = griddb.StoreFactory.get_instance()\n\n# Container Initialization\ntry:\n    gridstore = factory.get_store(host=your_host, port=your_port, \n            cluster_name=your_cluster_name, username=your_username, \n            password=your_password)\n\n    conInfo = griddb.ContainerInfo(\"Dataset_Name\",\n                    [[\"attribute1\", griddb.Type.STRING],[\"attribute2\",griddb.Type.FLOAT],\n                    ....],\n                    griddb.ContainerType.COLLECTION, True)\n    \n    cont = gridstore.put_container(conInfo)   \n    cont.create_index(\"id\", griddb.IndexType.DEFAULT)\n\nexcept griddb.GSException as e:\n    for i in range(e.get_error_stack_size()):\n        print(\"[\", i, \"]\")\n        print(e.get_error_code(i))\n        print(e.get_location(i))\n        print(e.get_message(i))<\/code><\/pre>\n<\/div>\n<p>Fill in your custom details in the above code. Note that in our case, all three columns are of type <code>STRING<\/code>. More information on data types supported by GridDB can be found <a href=\"https:\/\/docs.griddb.net\/architecture\/data-model\/#data-type\">here<\/a>.<\/p>\n<h3>Insert data into the GridDB container<\/h3>\n<p>Our JSON file is a list of dictionaries. Each dictionary contains three attributes: <code>text<\/code>, <code>author<\/code>, and <code>tags<\/code>. 
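<\/p>\n<p>Before writing to the container, it helps to flatten each dictionary into a plain row. This normalization step is our own suggestion rather than part of the GridDB API: the <code>tags<\/code> list is joined into one string so that it fits a <code>STRING<\/code> column (the sample data here is invented purely for illustration):<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">data_sample = [\n    {'text': 'All you need is love.', 'author': 'Lennon', 'tags': ['love', 'music']},\n]\n\n# One flat [text, author, tags] row per quote; tags joined with commas\nrows = [[d['text'], d['author'], ','.join(d['tags'])] for d in data_sample]\nprint(rows[0])\n# ['All you need is love.', 'Lennon', 'love,music']<\/code><\/pre>\n<\/div>\n<p>Each such row can then be handed to the container in one call.<\/p>\n<p>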
Each quote corresponds to one row in the container, so we loop over the list and put the three values together as a row, joining the <code>tags<\/code> list into a single string so that it fits a <code>STRING<\/code> column.<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">for d in data:\n    ret = cont.put([d['text'], d['author'], ','.join(d['tags'])])<\/code><\/pre>\n<\/div>\n<p>The final insertion script looks like this:<\/p>\n<div class=\"clipboard\">\n<pre><code class=\"language-py\">import griddb_python as griddb\n\nfactory = griddb.StoreFactory.get_instance()\n\n# Container Initialization\ntry:\n    gridstore = factory.get_store(host=your_host, port=your_port, \n            cluster_name=your_cluster_name, username=your_username, \n            password=your_password)\n\n    conInfo = griddb.ContainerInfo(\"Dataset_Name\",\n                    [[\"attribute1\", griddb.Type.STRING],[\"attribute2\",griddb.Type.FLOAT],\n                    ....],\n                    griddb.ContainerType.COLLECTION, True)\n    \n    cont = gridstore.put_container(conInfo)   \n    cont.create_index(\"id\", griddb.IndexType.DEFAULT)\n    \n    # Adding data to the container, one row per quote\n    for d in data:\n        ret = cont.put([d['text'], d['author'], ','.join(d['tags'])])\n\nexcept griddb.GSException as e:\n    for i in range(e.get_error_stack_size()):\n        print(\"[\", i, \"]\")\n        print(e.get_error_code(i))\n        print(e.get_location(i))\n        print(e.get_message(i))<\/code><\/pre>\n<\/div>\n<p>Check out the default cluster values on the official GitHub page of <a href=\"https:\/\/github.com\/griddb\/python_client\">GridDB\u2019s python-client<\/a>.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this tutorial, we saw how to create a spider to crawl data from a website. We stored the collected data in JSON format so that it is easy to share across platforms. We then developed an insertion script for storing this data in GridDB.<\/p>\n<p>Storing data in a database is crucial if you\u2019re working with continuous data. It can be hard to manage multiple JSON files in such a case. 
<a href=\"https:\/\/griddb.net\/en\/\">GridDB<\/a> makes it easier to store every bit of information in one place. This saves time and helps teams integrate without any hassle. Get started with <a href=\"https:\/\/griddb.net\/en\/\">GridDB<\/a> today!<\/p>\n","protected":false},"author":41,"featured_media":27517,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","categories":[121],"tags":[]}