Collecting Data using Scrapy and GridDB

Introduction

Today, we will cover how to scrape data from any website using Python’s Scrapy library. We will then save the data to a JSON file. Finally, we will see how to store this data in GridDB for long-term, efficient use.

Prerequisites

This post requires the prior installation of the following:

  1. Python 3.6+
  2. Scrapy
  3. GridDB
  4. GridDB C-client
  5. GridDB python-client

We also recommend installing Anaconda Navigator, if it is not already installed. Anaconda provides a wide range of tools for data scientists to experiment with. A virtual environment also helps you meet an application’s specific version requirements without interfering with the actual system paths.
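For example, a dedicated environment for this tutorial could be set up as follows (the environment name and Python version here are our own choices):

conda create -n griddb_scrapy python=3.8
conda activate griddb_scrapy
pip install scrapy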

Creating a new project using Scrapy

For this tutorial, we will be using Anaconda’s Command Line Interface and Jupyter Notebooks. Both of these tools can be found in the Anaconda Dashboard.

Creating a new project with Scrapy is simple. Just type the following command inside the directory in which you wish to create the new project folder:

scrapy startproject griddb_tutorial

A new folder with the name griddb_tutorial is now created in the current directory. Let us look at the contents of this folder:

tree directory_path /F
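For a fresh project, the output should resemble Scrapy’s standard layout:

griddb_tutorial
│   scrapy.cfg
│
└───griddb_tutorial
    │   items.py
    │   middlewares.py
    │   pipelines.py
    │   settings.py
    │   __init__.py
    │
    └───spiders
            __init__.py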

Extracting data from a URL

Scrapy uses a class called Spider to crawl websites and extract information. We define the initial requests and our custom extraction logic inside a Spider subclass. For this tutorial, we will scrape funny quotes from the website quotes.toscrape.com and store the information in JSON format.

The following lines of code collect the text, author, and tags associated with each quote.

import scrapy


class QuotesSpider(scrapy.Spider):
    # unique spider name, used to run it via `scrapy crawl quotes_funny`
    name = "quotes_funny"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor',
    ]

    def parse(self, response):
        # each quote on the page sits inside a <div class="quote"> block
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

We now save this Python file in the /griddb_tutorial/spiders directory. We execute a spider by passing its name to Scrapy on the command line, so it is important to give each spider a unique name.
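Note that the humor tag spans more than one page, while the spider above only scrapes the first. As a hedged extension, the following lines could be appended to the end of parse() to follow the site’s “Next” link (the li.next selector is taken from the page’s markup):

        # follow the pagination link, if the page has one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)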

Back in the project’s root directory, let’s run the spider and see what we get:

scrapy crawl quotes_funny

Extracting the data takes a few moments. Once execution completes, the output looks like this:

DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'text': '"The reason I talk to myself is because I'm the only one whose answers I accept."', 'author': 'George Carlin', 'tags': ['humor', 'insanity', 'lies', 'lying', 'self-indulgence', 'truth']}
2021-05-29 21:29:44 [scrapy.core.engine] INFO: Closing spider (finished)

Storing data into JSON

To store the crawled data, we simply pass an additional parameter to the above command:

scrapy crawl quotes_funny -O quotes_funny.json

This creates a new file named quotes_funny.json in the project’s root directory. Note that the -O flag overwrites any existing file with the same name. If you want to append new content to an existing JSON file, use -o instead.
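Scrapy infers the export format from the file extension, so the same command can just as easily produce, say, a CSV file:

scrapy crawl quotes_funny -O quotes_funny.csv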

The content of the JSON file will look like this:

[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz", "tags": ["chocolate", "food", "humor"]},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin", "tags": ["humor", "insanity", "lies", "lying", "self-indulgence", "truth"]}
]

Storing data into GridDB

If you’re collecting continuous data over time, it is wise to store it in a database. GridDB lets you store time-series data and is specially optimized for IoT and big data workloads. It is highly scalable and offers both SQL and NoSQL interfaces. Follow their tutorial to get started.

We have the data collected by Scrapy in a JSON file. Moreover, most websites let you export data in JSON format. We will therefore write a Python script to load the data into our environment.

Reading data from a JSON file

import json

# read the scraped quotes back in as a list of dictionaries
with open('quotes_funny.json') as f:
    data = json.load(f)
print(data[0])

We will get output similar to this:

{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 'author': 'Jane Austen',
 'tags': ['aliteracy', 'books', 'classic', 'humor']}

To iterate over all the key-value pairs:

for d in data:
    for key, value in d.items():
        print(key, value)

Now that we have extracted each key-value pair, let us initialize a GridDB instance.

GridDB container initialization

import griddb_python as griddb

factory = griddb.StoreFactory.get_instance()

# Container initialization: replace the placeholder values below
# with your cluster's connection details and your own schema
try:
    gridstore = factory.get_store(host=your_host, port=your_port, 
            cluster_name=your_cluster_name, username=your_username, 
            password=your_password)

    conInfo = griddb.ContainerInfo("Dataset_Name",
                    [["attribute1", griddb.Type.STRING],["attribute2", griddb.Type.FLOAT],
                    ....],
                    griddb.ContainerType.COLLECTION, True)

    cont = gridstore.put_container(conInfo)
    # the indexed column ("id" here) must exist in the schema above
    cont.create_index("id", griddb.IndexType.DEFAULT)

except griddb.GSException as e:
    for i in range(e.get_error_stack_size()):
        print("[", i, "]")
        print(e.get_error_code(i))
        print(e.get_location(i))
        print(e.get_message(i))

Fill in your own details in the above code. Note that in our case, every attribute is essentially a STRING. More information on the data types supported by GridDB can be found here.
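As a concrete sketch for our quotes dataset (the container name QuotesFunny and the extra id column are our own choices, and the tags list will be flattened into a single string):

conInfo = griddb.ContainerInfo("QuotesFunny",
                [["id", griddb.Type.INTEGER],
                 ["text", griddb.Type.STRING],
                 ["author", griddb.Type.STRING],
                 ["tags", griddb.Type.STRING]],
                griddb.ContainerType.COLLECTION, True)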

Insert data into the GridDB container

Our JSON file is a list of dictionaries, and each dictionary holds three attributes: text, author, and tags. Each quote becomes one row in the container. Since put expects a complete row matching the container schema, we flatten the tags list into a single comma-separated string.

for i, d in enumerate(data):
    # one row per quote; the row layout must match the container schema
    # (here, the QuotesFunny schema sketched above)
    ret = cont.put([i, d['text'], d['author'], ','.join(d['tags'])])

The final insertion script looks like this:

import griddb_python as griddb

factory = griddb.StoreFactory.get_instance()

# Container Initialization
try:
    gridstore = factory.get_store(host=your_host, port=your_port, 
            cluster_name=your_cluster_name, username=your_username, 
            password=your_password)

    conInfo = griddb.ContainerInfo("Dataset_Name",
                    [["attribute1", griddb.Type.STRING],["attribute2", griddb.Type.FLOAT],
                    ....],
                    griddb.ContainerType.COLLECTION, True)

    cont = gridstore.put_container(conInfo)
    cont.create_index("id", griddb.IndexType.DEFAULT)

    # Adding data to the container: one row per quote,
    # with the row layout matching the container schema
    for i, d in enumerate(data):
        ret = cont.put([i, d['text'], d['author'], ','.join(d['tags'])])

except griddb.GSException as e:
    for i in range(e.get_error_stack_size()):
        print("[", i, "]")
        print(e.get_error_code(i))
        print(e.get_location(i))
        print(e.get_message(i))

Check out the default cluster values on the official GitHub page of GridDB’s python-client.
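To verify that the rows landed, a minimal read-back sketch (assuming the cont handle and the QuotesFunny schema sketched earlier) can query the container with TQL:

# fetch and print every row in the container
query = cont.query("select *")
rs = query.fetch()
while rs.has_next():
    print(rs.next())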

Conclusion

In this tutorial, we saw how to create a spider to crawl data from a website. We stored the collected data in JSON format so that it is easy to share across platforms. We then developed an insertion script to store this data in GridDB.

Storing data in a database is crucial if you’re working with continuous data. Managing many separate JSON files quickly becomes hard in such a case. GridDB makes it easy to store every bit of information in one place. This saves time and helps teams integrate without any hassle. Get started with GridDB today!

If you have any questions about the blog, please create a Stack Overflow post here: https://stackoverflow.com/questions/ask?tags=griddb.
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.