import os
# os.chdir('PLEASE PUT YOUR WORKING DIRECTORY HERE')
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.offline as pyo
import matplotlib.pyplot as plt
pyo.init_notebook_mode()
import seaborn as sns
import requests
import http
http.client.HTTPConnection.debuglevel = 1 #enable verbose HTTP logging so that requests and responses can be inspected
import json
import re
A short note on the Analysis and the Dataset¶
The popular social cataloging website 'Goodreads' once algorithmically analyzed user-generated reading lists and top-voted authors to compile a 'Best Books Ever' list. This dataset is available for public use and can be downloaded from https://zenodo.org/record/4265096#.Y9LGVsnMK5d from the 'Files' section as shown below -
We use this dataset for our analysis.
Below is a snapshot of the downloaded file -
Overall Scheme of the Analysis¶
Below is the overall scheme of the analysis -
Creating the Request and Containers¶
Creating the Request¶
#Construct an object to hold the request headers (ensure that you replace the XXX placeholder with the correct value that matches the credentials for your GridDB instance)
header_obj = {"Authorization":"Basic XXX","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}
#Construct the base URL based on your GridDB cluster you'd like to connect to (ensure that you replace the placeholders in the URL below with the correct values that correspond to your GridDB instance)
base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'
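The XXX placeholder in the Authorization header is the HTTP Basic token, i.e., the base64 encoding of 'user:password'. Below is a minimal sketch of computing it, using hypothetical credentials that you would replace with those of your GridDB instance:

```python
import base64

# Hypothetical credentials; substitute the user and password of your GridDB instance
user, password = "admin", "admin"
# The Basic token is simply base64("user:password")
token = base64.b64encode(f"{user}:{password}".encode()).decode()
header_obj = {
    "Authorization": f"Basic {token}",
    "Content-Type": "application/json; charset=UTF-8",
}
```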
Creating the container 'Best_Books_Ever_Written'¶
#Construct an object to hold the request body (i.e., the container that needs to be created)
data_obj = {
    "container_name": "Best_Books_Ever_Written",
    "container_type": "COLLECTION",
    "rowkey": False,
    "columns": [
        {"name": "title", "type": "STRING"},
        {"name": "series", "type": "STRING"},
        {"name": "author", "type": "STRING"},
        {"name": "rating", "type": "FLOAT"},
        {"name": "language", "type": "STRING"},
        {"name": "genres", "type": "STRING"},
        {"name": "characters", "type": "STRING"},
        {"name": "bookFormat", "type": "STRING"},
        {"name": "edition", "type": "STRING"},
        {"name": "pages", "type": "STRING"},
        {"name": "publisher", "type": "STRING"},
        {"name": "publishDate", "type": "STRING"},
        {"name": "firstPublishDate", "type": "STRING"},
        {"name": "awards", "type": "STRING"},
        {"name": "numRatings", "type": "INTEGER"},
        {"name": "ratingsByStars", "type": "STRING"},
        {"name": "likedPercent", "type": "FLOAT"},
        {"name": "setting", "type": "STRING"},
        {"name": "bbeScore", "type": "LONG"},
        {"name": "bbeVotes", "type": "LONG"},
        {"name": "price", "type": "FLOAT"}
    ]
}
#Set up the GridDB WebAPI URL
url = base_url + 'containers'
#Invoke the GridDB WebAPI with the headers and the request body
x = requests.post(url, json = data_obj, headers = header_obj)
Loading the Container 'Best_Books_Ever_Written' (Row Registration)¶
Below is a diagram showing how row registration can be done using Python and the GridDB WebAPI.
# Loading the data
Best_Books_Ever_Written_df = pd.read_csv('books_1.Best_Books_Ever.csv')
Data Cleaning & Transformation¶
#Replace missing likedPercent values with 0
Best_Books_Ever_Written_df['likedPercent'] = Best_Books_Ever_Written_df['likedPercent'].replace(np.nan, 0)
#In order to have only the necessary columns, we also drop a few columns that are not useful in our analysis
Best_Books_Ever_Written_df.drop(['bookId', 'isbn','description','coverImg'], axis=1,inplace=True)
def convertToCorrectPriceFormat(theValue):
    #Some prices contain more than one dot (e.g. '1.234.56'); keep only the last dot as the decimal point
    final_value = ''
    split_values_list = theValue.split(".") #this can have 1 or more entries
    if len(split_values_list) <= 2:
        final_value = theValue #use the original value as multiple dots don't exist
    else: #more than one dot exists
        #drop all dots except the last one
        lenSplit = len(split_values_list)
        for i in range(lenSplit):
            if i < lenSplit - 1:
                final_value = final_value + split_values_list[i]
            else:
                final_value = final_value + '.' + split_values_list[i]
    return float(final_value)
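As an aside, the same normalization can be sketched more compactly with `str.rpartition`, which splits on the last dot directly (an alternative sketch, not the code used in this analysis):

```python
def convert_price(value: str) -> float:
    # Keep only the last dot as the decimal separator,
    # e.g. '1.234.56' becomes 1234.56; '8.99' is left unchanged
    head, sep, tail = value.rpartition('.')
    if sep:
        return float(head.replace('.', '') + '.' + tail)
    return float(value)
```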
Best_Books_Ever_Written_df['price_float'] = 0.0
for index, row in Best_Books_Ever_Written_df.iterrows():
    Best_Books_Ever_Written_df.at[index, 'price_float'] = convertToCorrectPriceFormat(str(row['price']))
Best_Books_Ever_Written_df.drop(['price'], axis=1,inplace=True)
Best_Books_Ever_Written_df.rename(columns ={'price_float':'price'}, inplace = True)
#Convert the data in the dataframe to the JSON format
Best_Books_Ever_Written_json = Best_Books_Ever_Written_df.to_json(orient='values')
request_body_Best_Books_Ever_Written = Best_Books_Ever_Written_json
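The `orient='values'` option produces a JSON array of row arrays, which is the shape the row-registration endpoint expects. A small illustration on a toy frame (the titles below are just sample values):

```python
import pandas as pd
import json

# A tiny frame standing in for Best_Books_Ever_Written_df
demo_df = pd.DataFrame([["The Hunger Games", 4.3], ["1984", 4.2]],
                       columns=["title", "rating"])
# orient='values' serializes only the cell values, row by row
payload = demo_df.to_json(orient='values')
```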
#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/Best_Books_Ever_Written/rows'
#Invoke the GridDB WebAPI using the request constructed
x = requests.put(url, data=request_body_Best_Books_Ever_Written, headers=header_obj)
reply: 'HTTP/1.1 200 \r\n' header: Date: Sun, 29 Jan 2023 05:54:45 GMT header: Content-Type: application/json;charset=UTF-8 header: Transfer-Encoding: chunked header: Connection: keep-alive header: Server: Apache/2.4.54 (IUS)
Data Analysis & Visualization¶
Below is a high-level flow diagram showing how a request is sent from Python to GridDB. Here, we construct a SQL query and pass it as a request to the GridDB WebAPI, which in turn retrieves the requested records from the GridDB database as a JSON structure.
We then use Python to parse this JSON structure and store it in a dataframe. The results are then visualized using Python's visualization libraries.
What are the Top 10 Titles by Rating?¶
A quick tip - while using aggregations in SQL, use an alias. Otherwise, when the response is received as a JSON object, the aggregated column's name comes back blank, making it difficult to retrieve the records.
sql_query1 = (f"""SELECT title, avg(rating) as Avg_Rating, numRatings FROM Best_Books_Ever_Written GROUP BY 1,3 HAVING avg(rating) >=4 ORDER BY 3 desc LIMIT 10""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query1+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT title, avg(rating) as Avg_Rating, numRatings FROM Best_Books_Ever_Written GROUP BY 1,3 HAVING avg(rating) >=4 ORDER BY 3 desc LIMIT 10"}]'
#Invoke the GridDB WebAPI
data_req1 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req1.json()
Titles_by_Rating = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])
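Since every query in this analysis converts the response the same way, the pattern can be factored into a small helper (a hypothetical convenience function, shown here on a mocked response of the same shape):

```python
import pandas as pd

def response_to_dataframe(resp_json):
    """Build a DataFrame from a GridDB WebAPI sql-select response,
    which arrives as [{"columns": [...], "results": [...]}]."""
    cols = [c["name"] for c in resp_json[0]["columns"]]
    return pd.DataFrame(resp_json[0]["results"], columns=cols)

# A mocked response with the structure described above
mock = [{"columns": [{"name": "title", "type": "STRING"},
                     {"name": "Avg_Rating", "type": "DOUBLE"}],
         "results": [["1984", 4.2]]}]
df = response_to_dataframe(mock)
```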
Titles_by_Rating.rename(columns = {'title':'Title'}, inplace = True)
Titles_by_Rating['Avg_Rating']= Titles_by_Rating['Avg_Rating'].astype(int)
Titles_by_Rating.rename(columns={'Avg_Rating':'Average Rating','numRatings':'Number of Ratings'},inplace=True)
Titles_by_Rating.sort_values(by='Number of Ratings',ascending=False,inplace=True)
fig = ff.create_table(Titles_by_Rating)
fig.show()
Insight(s):
- The top book with over 7 Million Ratings is 'Harry Potter and the Sorcerer's Stone'.
- The second highest is 'The Hunger Games' which has over 6 Million Ratings on GoodReads.
- It is interesting to note that many of the books in the Top 10 list have been made into movies as well.
What are the Top 15 Genres having the maximum book titles?¶
Note that the genres column is a list of values; in other words, each title has one or more genres. If a book title is attributed with multiple genres, it is stored as a list. For example, the genres for the book title 'The Hunger Games' are ['Young Adult', 'Fiction', 'Dystopia', 'Fantasy']. GridDB stores such lists of values in its table without the need for any data manipulation or transformation, and they can be retrieved with the same JSON request as any other column. In Python, we then iterate through each list of genres and store the counts in a dataframe.
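Because the list arrives as a literal string, it can also be parsed back into a Python list with `ast.literal_eval` (an alternative sketch to the manual bracket-stripping used in the loop below):

```python
import ast

# The genres column arrives as the literal string of a Python list
raw = "['Young Adult', 'Fiction', 'Dystopia', 'Fantasy']"
genres = ast.literal_eval(raw)  # safely evaluates the list literal
```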
sql_query2 = (f"""SELECT title, genres FROM Best_Books_Ever_Written WHERE rating >= 4.5 and rating <= 5 and genres is not null""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query2+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT title, genres FROM Best_Books_Ever_Written WHERE rating >= 4.5 and rating <= 5 and genres is not null"}]'
#Invoke the GridDB WebAPI
data_req2 = requests.post(url, data=request_body, headers=header_obj)
A quick tip here - at any time, if you get a Response 400, calling .text on the response will provide more information about the error. For example, if there had been an error in data_req2, typing data_req2.text would give the details. GridDB returns a detailed error code and error message. As an example, below is an error response returned by GridDB -
'{"version":"v2","errorCode":240008,"errorMessage":"[240008:SQL_COMPILE_COLUMN_NOT_FOUND] Column not found (name=titles) on executing query (sql=\"SELECT genres,titles FROM Best_Books_Ever_Written WHERE rating between 4 and 5\") (db='') (user='') (clientId='') (clientNd='{clientId=, address=}') (address=, partitionId=)"}'
It is very easy to understand that the SQL query is referencing a column with the name 'titles' which doesn't exist in the GridDB container.
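The error body is itself JSON, so the code and message can be pulled out programmatically. A sketch on a hypothetical error payload shaped like the one above:

```python
import json

# A hypothetical GridDB error body, shaped like the example shown above
error_text = ('{"version":"v2","errorCode":240008,'
              '"errorMessage":"[240008:SQL_COMPILE_COLUMN_NOT_FOUND] '
              'Column not found (name=titles)"}')
err = json.loads(error_text)
print(err["errorCode"], "-", err["errorMessage"])
```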
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req2.json()
Titles_and_Genres = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
genre_count_dict_obj = {}
genre_count_iterator = 0
for index,row in Titles_and_Genres.iterrows():
    for i in (row['genres']).split(','):
        #strip the list brackets and quotes from each genre entry
        i = i.replace('[','').replace(']','').replace('\'','').strip()
        if i in genre_count_dict_obj: # If the genre already exists in the dict object
            genre_count_iterator = genre_count_dict_obj.get(i) + 1
            genre_count_dict_obj[i] = genre_count_iterator
        else: # If the genre doesn't already exist in the dict object
            genre_count_iterator = 1
            genre_count_dict_obj[i] = genre_count_iterator
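The counting loop above can also be sketched with `collections.Counter`, which does the tallying in one pass (toy values below, not the real result set):

```python
from collections import Counter
import ast

# Toy stand-in for the 'genres' column returned by GridDB
rows = ["['Fiction', 'Fantasy']", "['Fiction', 'Romance']"]
# Parse each list literal and count every genre occurrence
genre_counts = Counter(g for row in rows for g in ast.literal_eval(row))
```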
Genre_Frequency_df = pd.DataFrame.from_dict(genre_count_dict_obj,orient='index')
Genre_Frequency_df.rename(columns = {0:'Number of book titles per Genre'},inplace=True)
Genre_Frequency_df = Genre_Frequency_df.reset_index()
Genre_Frequency_df.rename(columns = {'index':'Genre'},inplace=True)
Genre_Frequency_df = Genre_Frequency_df[Genre_Frequency_df['Genre']!='']
Top15_Genres = Genre_Frequency_df.nlargest(15, 'Number of book titles per Genre', keep='all')
fig = px.bar(Top15_Genres, x="Genre", y="Number of book titles per Genre",title="Top 15 Genres", text='Number of book titles per Genre')
fig.show()
Insight(s):
- As seen in the visual above, 'Fiction' and 'Fantasy' Genres have the most number of book titles.
- Romance, Young Adult and Nonfiction novels have around 2000 book titles per Genre.
- We also see a few interesting genres in the Top 15 Genres namely 'Magic','Paranormal', 'Graphic Novels' and 'Poetry'.
Who are the Top 10 authors who have written the most number of book titles ?¶
sql_query3 = (f"""SELECT author,count(title) as cnt_books FROM Best_Books_Ever_Written WHERE rating >= 4.5 and author!= 'NOT A BOOK' and author!='anonymous' and rating <= 5 GROUP BY 1""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query3+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT author,count(title) as cnt_books FROM Best_Books_Ever_Written WHERE rating >= 4.5 and author!= \'NOT A BOOK\' and author!=\'anonymous\' and rating <= 5 GROUP BY 1"}]'
#Invoke the GridDB WebAPI
data_req3 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req3.json()
Author_and_Books = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
Author_and_Books = Author_and_Books.rename(columns = {'author':'Author','cnt_books':'Number of books written'})
Top10_Authors = Author_and_Books.nlargest(10, 'Number of books written', keep='all')
fig = ff.create_table(Top10_Authors)
fig.show()
Insight(s):
- Bella Forrest is the top writer in terms of the number of books written. Bella has written around 200 books which have been rated by various readers on GoodReads.
- Here is a page about Bella Forrest on Goodreads.
- The next author on the list is Idries Shah who has written 160 books.
What are the top 10 book titles and authors with the most number of Ratings on Goodreads?¶
sql_query4 = (f"""SELECT title, author, numRatings FROM Best_Books_Ever_Written WHERE rating >=4.5 and rating <= 5 and author!= 'NOT A BOOK' and author!='anonymous' LIMIT 10""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query4+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT title, author, numRatings FROM Best_Books_Ever_Written WHERE rating >=4.5 and rating <= 5 and author!= \'NOT A BOOK\' and author!=\'anonymous\' LIMIT 10"}]'
#Invoke the GridDB WebAPI
data_req4 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req4.json()
Books_and_Ratings = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"]])
Books_and_Ratings.rename(columns={'title':'Title','author':'Author','numRatings':'Number of Ratings'},inplace=True)
fig = px.bar(Books_and_Ratings, x="Title", y="Number of Ratings", color="Author", title="Top 15 Book Titles and their Authors",text='Number of Ratings')
fig.update_layout(xaxis={'categoryorder':'total descending'}) #sort the title bars by total in descending order
fig.show()
Insight(s):
- The results of this visual are pretty interesting. The popular 'Harry Potter' series of books occupy the top 5 slots.
- Three of the Harry Potter books have Mary GrandPre as the illustrator.
- Four of the Harry Potter books each have over 2.5 million ratings, while the fifth has slightly fewer than 2.5 million.
- Follow along the color legend to get the name of the author or hover over the bars to access the interactive tooltip.
Given a list of awards for a book given to an author, what is the first award?¶
The 'awards' column in this case has the 'year' appended to it. Below are a couple of examples of what the 'awards' column looks like -
- Carnegie Medal Nominee (2009)
- Carnegie Medal Nominee (2012)
- Goodreads Choice Award Nominee for Fiction (2009)
To get an accurate representation of the number and type of awards given to recipients, we need to remove the year from the name of the award. Similar to the substring expressions available in other SQL products, GridDB offers 'INSTR' to find the position of a substring and 'SUBSTR' to slice the string at that position. We use these two functions here.
To demonstrate this capability, let's find the first award given for each book title written by an author. Currently, 'awards' is a comma-separated list. Below is an example -
['Georgia Peach Book Award (2007)', 'Buxtehude Award']
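The SUBSTR/INSTR combination in the query below maps directly onto a find-and-slice in Python, which makes the intent easy to see (a sketch on a sample award string):

```python
award = "Georgia Peach Book Award (2007)"
# INSTR locates '(' ; SUBSTR takes the text before it --
# the Python equivalent is an index() followed by a slice
pos = award.index('(')
name_only = award[:pos].strip()
```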
sql_query5 = (f"""SELECT DISTINCT author, SUBSTR(awards, 0, INSTR(awards, '(')) as award FROM Best_Books_Ever_Written where (awards != '[]' and awards is not null) and rating >= 4.5""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query5+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT DISTINCT author, SUBSTR(awards, 0, INSTR(awards, \'(\')) as award FROM Best_Books_Ever_Written where (awards != \'[]\' and awards is not null) and rating >= 4.5"}]'
#Invoke the GridDB WebAPI
data_req5 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req5.json()
Awards_Authors = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
for index,row in Awards_Authors.iterrows():
    #strip the list brackets and quotes from the award name
    a = row['award'].replace('[','').replace(']','').replace('\'','').replace('\"','')
    Awards_Authors.at[index,'award'] = a
fig = ff.create_table(Awards_Authors)
fig.show()
Insight(s): Above is a list of the first award received by each author for one of their books.
How many awards has each author received for books rated 4.5 and above?¶
sql_query6 = (f"""SELECT DISTINCT author, awards FROM Best_Books_Ever_Written where (awards != '[]' and awards is not null) and rating >= 4.5""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query6+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT DISTINCT author, awards FROM Best_Books_Ever_Written where (awards != \'[]\' and awards is not null) and rating >= 4.5"}]'
#Invoke the GridDB WebAPI
data_req6 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req6.json()
Awards_list = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"]])
Awards_list_dict_obj = {}
Awards_list_iterator = 0
for index,row in Awards_list.iterrows():
    #count the comma-separated award entries for this row
    count = 0
    for i in (row['awards']).split(','):
        count = count + 1
    if row['author'] in Awards_list_dict_obj: # If the author already exists in the dict object
        Awards_list_iterator = Awards_list_dict_obj.get(row['author']) + count
        Awards_list_dict_obj[row['author']] = Awards_list_iterator
    else:
        Awards_list_iterator = count
        Awards_list_dict_obj[row['author']] = Awards_list_iterator
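As an aside, the same per-author tally can be sketched with pandas vectorized string operations instead of an explicit loop (toy values below, not the real result set):

```python
import pandas as pd

# Toy stand-in for the author/awards result set
demo = pd.DataFrame({
    "author": ["A. Author", "A. Author", "B. Writer"],
    "awards": ["['X (2001)', 'Y (2002)']", "['Z (2003)']", "['W (2004)']"],
})
# Each comma-separated entry is one award; count per row, then sum per author
demo["n_awards"] = demo["awards"].str.split(",").str.len()
per_author = demo.groupby("author")["n_awards"].sum()
```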
Awards_list_df = pd.DataFrame.from_dict(Awards_list_dict_obj,orient='index')
Awards_list_df.rename(columns = {0:'Number of awards'},inplace=True)
Awards_list_df = Awards_list_df.reset_index()
Awards_list_df.rename(columns = {'index':'Author'},inplace=True)
Awards_list_df = Awards_list_df.sort_values(by='Number of awards',ascending=False)
fig = ff.create_table(Awards_list_df, height_constant=60)
fig.show()
Insight(s):
- J.K. Rowling and her illustrator Mary GrandPre have jointly received the most number of awards.
- The next on the list are Brian K. Vaughan and artist Fiona Staples with 19 awards.
- Angie Thomas follows closely with 18 awards so far.
- J.K. Rowling has separately received 12 awards.
What are some correlation patterns in the data?¶
sql_query7 = (f"""SELECT price, rating, numRatings, likedPercent, bbeScore, bbeVotes FROM Best_Books_Ever_Written""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query7+'"}]'
#Validate the constructed request body
request_body
#Invoke the GridDB WebAPI
data_req7 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req7.json()
Correlation_df = pd.DataFrame(myJson[0]["results"], columns=[c["name"] for c in myJson[0]["columns"]])
# Compute the correlation matrix
corr = Correlation_df.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Insight(s):
- There is a strong correlation between the 'Best Books Ever Votes' (bbeVotes) and 'Best Books Ever Score' (bbeScore).
- There is also a strong correlation between the 'LikedPercent' and 'rating' on GoodReads.
- Similarly, the number of ratings is strongly correlated with both bbeVotes and the bbeScore.
Which authors have written multiple books rated 4.5 and above?¶
sql_query8 = (f"""SELECT author,language, bookFormat,count(title) as Num_books FROM Best_Books_Ever_Written where PublishDate != 'None' and bookFormat!='None' and rating>=4.5 GROUP BY 1,2,3 ORDER BY 3 DESC LIMIT 50""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
#Construct the request body
request_body = '[{"type":"sql-select", "stmt":"'+sql_query8+'"}]'
#Validate the constructed request body
request_body
'[{"type":"sql-select", "stmt":"SELECT author,language, bookFormat,count(title) as Num_books FROM Best_Books_Ever_Written where PublishDate != \'None\' and bookFormat!=\'None\' and rating>=4.5 GROUP BY 1,2,3 ORDER BY 3 DESC LIMIT 50"}]'
#Invoke the GridDB WebAPI
data_req8 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req8.json()
lang_liked_percent = pd.DataFrame(myJson[0]["results"], columns=[myJson[0]["columns"][0]["name"], myJson[0]["columns"][1]["name"],myJson[0]["columns"][2]["name"],myJson[0]["columns"][3]["name"]])
lang_liked_percent.rename(columns={'author':'Author','Num_books':'Number of books'},inplace=True)
fig = px.bar(lang_liked_percent, x="Author", y='Number of books', color="bookFormat", pattern_shape="language",title="Authors & Number of books; Language and Bookformats")
fig.show()
Insight(s):
- As we see here, the above authors have written 8 books each that have a rating of 4.5 and above.
- Of these, books in three non-English languages are represented too, namely Japanese, Arabic and Spanish.
Concluding Remarks¶
The 'Best Books Ever Written' dataset is an interesting dataset that helps one derive insights into the books that have been eagerly lapped up by audiences. GridDB and Python, coupled with this dataset, make for a very informative exploratory analysis. This is merely a sample of the capabilities that GridDB offers. Check out the official documentation to learn more.
If you have any questions about the blog, please create a Stack Overflow post at https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.