#Libraries for basic data manipulation
import os
import pandas as pd
import numpy as np
#Libraries for the WebAPI requests
import http
http.client.HTTPConnection.debuglevel = 1 #Enable verbose HTTP logging while setting up the WebAPI calls
import json
import requests
#Libraries for the JayDeBeApi connection
import jaydebeapi
import urllib.parse
#Library to measure elapsed time
import time
#Library for loading files into dataframes
import dask.dataframe as dd
#Libraries for graphs
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
from IPython.display import Image, display, Markdown
#os.chdir("Your Working Directory") #Put your working directory here
About the Dataset¶
The dataset we use for this analysis can be accessed at https://developer.imdb.com/non-commercial-datasets/.
##Specify the path to your image file
image_path = 'About the Dataset.png'
width = 500
## Display the image
Image(filename=image_path, width=width)
The instructions to download the dataset are -
- Click the link 'https://datasets.imdbws.com/' as shown in the above snapshot.
- Click to download the specific files 'name.basics.tsv.gz' and 'title.akas.tsv.gz'.
- Save the downloaded files from the 'Downloads' folder to your working directory.
##Specify the path to your image file
image_path = 'Datasets_Step2.png'
width = 500
## Display the image
Image(filename=image_path, width=width)
- If using a Windows machine, right-click each tsv.gz file and click 'Extract All' to decompress the file.
- If using macOS, double-clicking the file should automatically decompress it and create the corresponding .tsv file.
- If using a Linux-based system, you can use the gunzip command in the terminal - gunzip filename.tsv.gz. Alternatively, you can decompress the files from Python itself, as shown below.
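If you prefer to stay in Python, the standard gzip and shutil modules can do the decompression - a minimal sketch, assuming the two archives are in your working directory (note that this produces flat .tsv files rather than the folder layout shown in the screenshots):
import gzip
import shutil
#Decompress each downloaded archive into a plain .tsv file
for archive in ['name.basics.tsv.gz', 'title.akas.tsv.gz']:
    with gzip.open(archive, 'rb') as f_in, open(archive[:-3], 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out) #stream the decompressed bytes to disk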
##Specify the path to your image file
image_path = 'Extract_file.png'
width = 500
## Display the image
Image(filename=image_path, width=width)
After extraction, the data is organized in folders as shown below. Each folder contains the corresponding data file in TSV (tab-separated values) format.
##Specify the path to your image file
image_path = 'After_Extraction.png'
width = 200
## Display the image
Image(filename=image_path, width=width)
Within the IMDb (Internet Movie Database) datasets, the files title.akas.tsv and name.basics.tsv provide information about movie titles and about individuals involved in the film industry, respectively.
- title.akas.tsv:
- Purpose: This file contains information about the alternate titles of movie titles in different regions and languages.
- Content:
- titleId: A unique identifier for the movie title.
- ordering: The order of the title in the list of titles associated with the same titleId.
- title: The title itself, which might be an alternate title in a different language or region.
- region: The region or country for which the title is applicable.
- language: The language of the title.
- Other attributes providing additional details.
- name.basics.tsv:
- Purpose: This file contains information about individuals involved in the film industry, such as actors, directors, writers, etc.
- Content:
- nconst: A unique identifier for the person.
- primaryName: The primary name of the person.
- birthYear and deathYear: The birth and death years of the person (if available).
- primaryProfession: The primary profession of the person (e.g., actor, director).
- knownForTitles: Titles for which the person is known, indicating their notable works.
Loading data into Python¶
The '.tsv' files in this analysis are around 1 GB in size. To handle the data load efficiently, we use the dask library, which reads large files in parallel chunks.
tsv_file_path = 'name.basics.tsv/data.tsv'
ddf1 = dd.read_csv(tsv_file_path, sep='\t', assume_missing=True,
                   dtype={'nconst':'object', 'primaryName':'object', 'birthYear':'object',
                          'deathYear':'object', 'primaryProfession':'object', 'knownForTitles':'object'})
Name_Basics_df = ddf1.compute()
tsv_file_path = 'title.akas.tsv/data.tsv'
ddf2 = dd.read_csv(tsv_file_path, sep='\t', assume_missing=True,
                   dtype={'titleId':'object', 'ordering':'object', 'title':'object', 'region':'object',
                          'language':'object', 'types':'object', 'attributes':'object', 'isOriginalTitle':'object'})
titles_df = ddf2.compute()
Subsetting the data¶
We subset Name_Basics_df to only include personalities whose birthYear is available and falls between 1990 and 2005. We also restrict this analysis to actors and actresses.
subset_valid_years = Name_Basics_df[Name_Basics_df['birthYear']!='\\N']
subset_valid_years = subset_valid_years[(subset_valid_years['birthYear'].astype(int) >= 1990) & (subset_valid_years['birthYear'].astype(int) <= 2005)]
Name_Basics_subset = subset_valid_years[(subset_valid_years['primaryProfession']=='actor') | (subset_valid_years['primaryProfession']=='actress')]
Regarding titles_df, we only consider titles in English (language code 'en'). Moreover, we only include titles that were either screened at a film festival or that carried one of the attributes - original, new, complete, literal, alternative or transliterated.
subset_english_titles = titles_df[titles_df['language'] == 'en']
selected_attributes = ['complete title','literal title','literal English title','alternative spelling','transliterated title','new title','original script title','Cannes festival title', 'Berlin film festival title', 'Venice film festival title', 'Locarno film festival title', 'Bilbao festival title']
filtered_titles_df = subset_english_titles[subset_english_titles['attributes'].isin(selected_attributes)]
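A quick sanity check on the sizes of the two subsets before loading them into GridDB:
print(Name_Basics_subset.shape) #rows and columns in the personalities subset
print(filtered_titles_df.shape) #rows and columns in the filtered titles subset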
Now that we have subset our datasets, we can create containers and load the data into GridDB. A container is one of the core data structures in GridDB; we learn more about containers in the section 'A short note on Containers in GridDB'.
Ways to Connect to GridDB¶
We load the data from the Python DataFrames into GridDB. There are three options to connect to GridDB and to load and query the data -
- JayDeBeApi - GridDB provides a JDBC (Java Database Connectivity) driver that allows you to establish a connection, execute SQL queries, and perform database operations programmatically from Java. JayDeBeApi is a Python library that works with Java JDBC drivers, allowing you to connect to and interact with JDBC-accessible databases from your Python applications.
- RESTful Web API - GridDB offers a RESTful Web API that allows you to interact with GridDB using HTTP requests. You can use standard HTTP methods like GET, POST, PUT and DELETE to perform operations on GridDB. This option provides a platform-independent way to connect to GridDB from various programming languages and platforms.
- griddb-python library - This is a Python library that can be used to connect to a GridDB database and perform query operations.
Connection Approaches used in this article¶
In this article, we use a hybrid approach. We use the WebAPI to create the first container and load rows into it. We then use the JayDeBeApi to create the second container and load rows into it. This lets us compare the processes involved in the WebAPI and the JayDeBeApi.
But before we get started, below are diagrams explaining the WebAPI and the JayDeBeApi.
A schema explaining the WebAPI¶
##Specify the path to your image file
image_path = 'Schema_Working_of_WebAPI.png'
width = 600
display(Markdown("## Schema - Interaction between the WebAPI and GridDB"))
## Display the image
display(Image(filename=image_path, width=width))
Schema - Interaction between the WebAPI and GridDB¶
Drivers required for JayDeBeApi¶
##Specify the path to your image file
image_path = 'Pre-requisites_Jaydebeapi.png'
width = 700
display(Markdown("## Pre-requisites - Drivers required for JaydebeAPI"))
## Display the image
display(Image(filename=image_path, width=width))
Pre-requisites - Drivers required for JayDeBeApi¶
A schema explaining the JayDeBeApi¶
##Specify the path to your image file
image_path = 'Schema_Working_of_JaydebeAPI.png'
width = 800
display(Markdown("## Schema - Interaction between JaydebeAPI and GridDB"))
## Display the image
display(Image(filename=image_path, width=width))
Schema - Interaction between JayDeBeApi and GridDB¶
Note that dbn_encode, sm_encode, cn_encode and np_encode are URL-encoded versions of the database name, SSL mode, cluster name and notification provider URL, respectively. URL encoding is accomplished using the Python library urllib, which converts special characters and non-ASCII characters into a format that can be safely transmitted within a URL.
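For example, urllib.parse.quote percent-encodes the characters that are not safe in a URL:
import urllib.parse
print(urllib.parse.quote("&sslMode=PREFERRED")) #prints '%26sslMode%3DPREFERRED'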
Important Points Regarding the WebAPI and the JayDeBeApi¶
Setting up the connection to GridDB using the WebAPI is quick, as it only requires HTTP requests. However, the structure of each container needs to be passed as a JSON object in the HTTP POST request, and constructing this JSON can be time-consuming for large containers with many columns and mixed datatypes. Calls to query and retrieve data from containers, on the other hand, follow standard SQL 'select' syntax.
Setting up the JayDeBeApi takes longer, as drivers need to be downloaded and the classpath must be configured with the location of the driver files. Once that is done, both container creation and data retrieval are simple, because the JayDeBeApi uses SQL 'CREATE TABLE' syntax similar to relational databases.
If you are comfortable with SQL and relational database syntax, the JayDeBeApi is the better approach, whereas if you are comfortable with JSON and HTTP, the WebAPI might be the better option.
A short note on Containers in GridDB¶
In GridDB, a container is a fundamental data structure used to organize and store data. It serves as a logical unit for storing and managing sets of related data within a GridDB database. Containers in GridDB can be compared to tables in a relational database, but they offer some unique features and capabilities tailored to GridDB's architecture.
Key characteristics of containers in GridDB include:
Highly Scalable: GridDB is designed to be a highly scalable and distributed database. Containers can be partitioned across multiple nodes in a GridDB cluster, enabling horizontal scaling as the data volume grows.
Data Distribution: Containers can be distributed across multiple nodes for load balancing and improved performance. GridDB supports sharding, which allows data to be distributed based on a specified sharding key.
Container Types: GridDB supports various types of containers, each optimized for specific use cases. The main container types are:
- Collection : A basic container type for storing data.
- TimeSeries: Optimized for time-series data with a focus on efficient time-based queries.
CRUD Operations: Containers support standard CRUD (Create, Read, Update, Delete) operations. You can insert, retrieve, update, and delete records within a container.
Transaction Support: GridDB provides support for ACID transactions within containers, ensuring the consistency and integrity of data. GridDB has its own query language, TQL, and also supports SQL querying similar to relational databases.
Connecting to GridDB using the WebAPI¶
In this article, we use the WebAPI to connect to GridDB. WebAPI calls are simple HTTP POST requests that use basic authentication. To connect to the GridDB Cloud's WebAPI, the request must carry the username and password encoded to the base64 standard. The basic authentication format in GridDB is 'username:password', where the case-sensitive username and password are used for encoding. A tool such as https://www.base64encode.org/ can be used to compute the encoded credentials. Refer to the resource listed here to learn more about the different entities involved in creating a request - https://griddb.net/en/blog/using-griddb-cloud-version-1-2/.
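Alternatively, the credentials can be encoded directly in Python with the standard base64 module - a minimal sketch with placeholder credentials:
import base64
credentials = 'username:password' #replace with your case-sensitive GridDB credentials
encoded = base64.b64encode(credentials.encode('utf-8')).decode('ascii')
print('Basic ' + encoded) #this value goes into the Authorization header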
Construct the base URL for the GridDB cluster you'd like to connect to. Ensure that you replace the placeholders in the base_url below with the correct values for your GridDB instance.
Setting up the Authorization and Request¶
#Construct an object to hold the request headers
#Ensure that you replace the XXXX placeholder with 'Basic ' followed by the base64-encoded credentials for your GridDB instance
header_obj = {"Authorization":"XXXX","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}
#Construct the base URL for the GridDB cluster you'd like to connect to
#Ensure that you replace the placeholders in the URL below with the correct values for your GridDB instance
base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'
Using the WebAPI to create the 'Name_Basics' Container¶
With the WebAPI, requests to create containers are usually passed as a JSON object. A request to create a container typically contains the following elements - container_name, container_type, rowkey and the columns in the container passed as a list of dictionaries.
For Collection containers, the rowkey is usually set to False, whereas for TimeSeries containers, the rowkey has to be set to True. When we set 'rowkey' to True, we are effectively telling GridDB that the container has a primary key column. TimeSeries containers require the timestamp column to be the primary key, while Collection containers can have a rowkey of either True or False; when the rowkey is set to False, GridDB creates a Collection container by default.
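For contrast, here is a sketch of what a request body for a hypothetical TimeSeries container might look like; the first column is a TIMESTAMP and serves as the row key:
ts_obj = {
    "container_name": "Sensor_Readings", #hypothetical container name for illustration
    "container_type": "TIME_SERIES",
    "rowkey": True, #mandatory for TimeSeries containers
    "columns": [
        {"name": "ts", "type": "TIMESTAMP"},
        {"name": "reading", "type": "DOUBLE"}
    ]
}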
For ease of use and simplicity, we use Python's zip() function. We create a dictionary named 'data_obj', a list named 'input_variables' and another list named 'data_types'. We then iterate through zip(input_variables, data_types), which pairs each column name with its corresponding datatype.
Creating the container 'Name_Basics'¶
We use the dataframe 'Name_Basics_subset' to determine the columns and datatypes of the container that is going to be created.
data_obj = {
    "container_name": "Name_Basics",
    "container_type": "COLLECTION",
    "rowkey": False,
    "columns": []
}
input_variables = [
"nconst", "primaryName", "birthYear", "deathYear", "primaryProfession", "knownForTitles"
]
data_types = [
"STRING", "STRING", "STRING", "STRING", "STRING", "STRING"
]
for variable, data_type in zip(input_variables, data_types):
    column = {
        "name": variable,
        "type": data_type
    }
    data_obj["columns"].append(column)
We then pass the 'data_obj' to the request.
#Set up the GridDB WebAPI URL
url = base_url + 'containers'
#Invoke the GridDB WebAPI with the headers and the request body
x = requests.post(url, json = data_obj, headers = header_obj)
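It is good practice to check the response before moving on; a status code in the 2xx range indicates that the container was created:
print(x.status_code, x.text) #expect a 2xx status code if the container was created successfully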
Using the WebAPI to perform Row Registration (Loading data into GridDB)¶
In GridDB, the process of loading data into a container is referred to as "Row Registration". When you register a row, you are essentially inserting a new record into the container, similar to an 'INSERT' DML statement in SQL. Refer to this resource to learn more - https://www.toshiba-sol.co.jp/en/pro/griddb/docs-en/v4_6/GridDB_Web_API_Reference.html#row-registration.
The good thing about using Python in conjunction with GridDB is that we can use both built-in Python functions and packages such as json and pandas to manipulate data easily and effectively. In this case, we use the 'to_json' method of the pandas DataFrame.
In this case, we use batch loading in GridDB. Here is a resource on batch loading - https://griddb.net/en/blog/griddb-optimization-with-multi-put-and-query/. If you use the WebAPI, then you can just use a simple 'for' loop to load your subsets of data into your GridDB container. Batch Loading enables users to import data in large chunks, reducing the overall time and resources required for the data migration process. GridDB supports parallelized batch loading, which can significantly enhance the efficiency and speed of data loading operations. Parallelization in GridDB is achieved through various mechanisms such as Multi-Node Architecture, Container Partitioning, Multi-Threaded Loading and Parallel Query Execution. Read on to know more - https://docs.griddb.net/.
Loading the Name_Basics container¶
To perform batch loading, let's use np.array_split to break the data into chunks.
#creating 12 subsets of the data
Name_Basics_Subsets = np.array_split(Name_Basics_subset, 12)
With np.array_split, we can access the individual subsets as Name_Basics_Subsets[0], Name_Basics_Subsets[1] ... Name_Basics_Subsets[11], each representing one of the 12 parts of the original data. This kind of data splitting is often done for parallel processing or for distributing work across multiple processors or nodes in a computing cluster. We use it here for our bulk loading.
http.client.HTTPConnection.debuglevel = 0 #Disable verbose HTTP logging before the bulk load
#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/Name_Basics/rows'
#Invoke the GridDB WebAPI using the request constructed
for subset in Name_Basics_Subsets:
    Name_Basics_Subsets_json = subset.to_json(orient='values')
    request_body_subset = Name_Basics_Subsets_json
    x = requests.put(url, data=request_body_subset, headers=header_obj)
    print('______________________________________')
    print('_______________', x.text, '___________')
    print('______________________________________')
    if x.status_code > 299: #HTTP response codes of 200, 201 etc. indicate success
        print(x.status_code) #an error has occurred, so let's stop the process here
        print(Name_Basics_Subsets_json)
        break
    else:
        print('Success for chunk..')
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
(output repeated for all 12 chunks: the first 8 chunks loaded 2065 rows each and the last 4 loaded 2064 rows each)
Using the JayDeBeApi to create the 'Titles' Container¶
The JayDeBeApi is a Python module that enables connecting to databases that support the Java Database Connectivity (JDBC) standard. It acts as a bridge between Python and the database using JDBC, allowing users to interact with standalone and cloud databases. In this article, we use the GridDB Cloud environment, which is currently only available in Japan but will be available globally in the near future. Refer to the GridDB Cloud documentation for more information.
The JayDeBeApi library requires the JDBC driver specific to the database you want to connect to. In the case of GridDB, you should have the GridDB JDBC driver (gridstore-jdbc.jar) installed.
Here are the steps to do so -
- Download the GridDB JDBC drivers from the GridDB Cloud portal. Upon logging into the Cloud portal, the drivers can be downloaded by accessing the 'Supports' page on the Cloud portal and scrolling down to the "GridDB Cloud Library and Plugin download" section.
- Note that the file downloaded is a zip archive. On extracting it, the JDBC driver files can be found within the JDBC folder. Below is a screenshot of the 'Downloads' section in the GridDB Cloud Portal -
##Specify the path to your image file
image_path = 'Griddb_Libraries_Snapshot.png'
width = 400
## Display the image
Image(filename=image_path, width=width)
- The GridDB JDBC driver files to be downloaded are -
- gridstore-5.2.0.jar
- gridstore-jdbc-5.2.0.jar
- gridstore-advanced-5.2.0.jar
- Save the jar files in the same folder location and in a location accessible by your Python environment.
- Add the location of the jar files to the CLASSPATH system environment variable.
- To do this on a Windows machine, access the 'Environment Variables' from the 'Control Panel'.
- Under the 'System Variables', create a new variable called 'CLASSPATH' and mention the locations of the 3 jar files.
- Install the JayDeBeApi Python package in your Python environment.
- You are now ready to connect to the GridDB database from Python. Once the connection is established, you can execute SQL queries and perform DML database operations using the JayDeBeApi connection object.
- It is important to note that the drivers for GridDB Cloud are different from GridDB OSS.
- Ensure that the SSL Mode is set to 'sslMode=PREFERRED'. If this doesn't work, try with '&sslMode=VERIFY'. Ensure that the connectionRoute is set to PUBLIC.
Notification Provider and Environment Variable Setup¶
We will use the GridDB notification provider method to connect to GridDB. First, make a note of the notificationProvider, the GridDB cluster name, the GridDB database name and the username/password to access the database. These values will be used to construct the connection string and establish the connection to GridDB.
One important consideration is that the JayDeBeApi relies on Java. Specifically, it requires a JAVA_HOME environment variable that correctly references the installation location of the Java JRE (Java Runtime Environment). Before attempting the connection, confirm that JAVA_HOME is set and points to the JRE location.
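Both environment variables can be checked from within Python:
print(os.environ.get('JAVA_HOME')) #should point to your JRE installation
print(os.environ.get('CLASSPATH')) #should include the locations of the three GridDB jar files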
Establishing the connection to GridDB¶
notification_provider = "https://dbaasshareextconsta.blob.core.windows.net/dbaas-share-extcon-blob/trial2002.json?sv=2015-04-05&sr=b&st=2024-01-16T00%3A00%3A00.0000000Z&se=2074-01-16T00%3A00%3A00.0000000Z&sp=r&sig=39U8P%2B%2BqzKAcUQNRw6K54dZHBhyojfiFmgHTmcVOHlY%3D"
np_encode = urllib.parse.quote(notification_provider)
cluster_name = "XX" ## Specify the cluster name here
cn_encode = urllib.parse.quote(cluster_name)
database_name = "XX" ## Specify the database name here
dbn_encode = urllib.parse.quote(database_name)
sslMode = "&sslMode=PREFERRED" #sslMode should be PREFERRED and connectionRoute should be PUBLIC
sm_encode = urllib.parse.quote(sslMode)
username = 'XX'
password = 'XX'
#Construct the JDBC URL to be used to connect to the GridDB database. The format is -
# jdbc:gs:///clusterName/databaseName?notificationProvider=theNotificationProviderURL&sslMode=PREFERRED
url = "jdbc:gs:///" + cn_encode + "/" + dbn_encode + "?notificationProvider=" + np_encode + sm_encode
#print("JDBC URL is " + url)
conn = jaydebeapi.connect("com.toshiba.mwcloud.gs.sql.Driver",
                          url,
                          {'user': username, 'password': password,
                           'connectionRoute': 'PUBLIC',
                           'loginTimeout': '20'},
                          "gridstore-jdbc-5.2.0.jar") #ensure to provide the correct location of the gridstore-jdbc JAR library
print('success!')
success!
Creating the container 'Titles'¶
We use the dataframe 'filtered_titles_df' to determine the column names and datatypes.
As discussed in the section 'Important Points Regarding the WebAPI and the JayDeBeApi', while using the JayDeBeApi we can use GridDB's SQL-like syntax for container creation and data extraction. In other words, we can run CREATE TABLE statements to create a container. GridDB supports standard DDL and DML operations in its SQL feature, in addition to TQL. Refer to the GridDB SQL Reference to know more.
sql_query1 = (f"""
CREATE TABLE Titles
(
titleId STRING,
ordering STRING,
title STRING,
region STRING,
language STRING,
types STRING,
attributes STRING,
isOriginalTitle STRING
)
""")
Note that you can specify either VARCHAR or STRING for string-based datatypes. If you specify VARCHAR, the length limit needs to be provided, e.g. VARCHAR(15).
with conn.cursor() as cursor:
    cursor.execute(sql_query1)
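To confirm that the container was created, a simple count query can be run against it:
with conn.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM Titles")
    print(cursor.fetchall()) #expect [(0,)] for a freshly created, empty container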
Using the JayDeBeApi to perform Row Registration¶
While using the JayDeBeApi to load data into containers, we use the 'INSERT' SQL statement. Refer to this GridDB resource to learn more. We iterate through each row of the pandas dataframe 'filtered_titles_df' to insert rows into GridDB. Each row is converted into a tuple and loaded into the container 'Titles' in GridDB Cloud.
Loading the Titles Container¶
# Prepare the SQL statement for insertion
sql_query2 = "INSERT INTO Titles (titleId, ordering, title, region, language, types,attributes, isOriginalTitle) VALUES (?,?,?,?,?,?,?,?)"
# Create a list of tuples containing the data to be inserted
data_to_insert = [tuple(row) for _, row in filtered_titles_df.iterrows()]
# Use a loop to insert each row of data and record the time taken
cursor = conn.cursor()
try:
    start_time = time.time() # Record the start time
    for row in data_to_insert:
        cursor.execute(sql_query2, row)
    end_time = time.time() # Record the end time
    execution_time = end_time - start_time
    print(f"Time taken for insertion: {execution_time:.6f} seconds")
except Exception as e:
    # Handle any exceptions that may occur during execution
    print("Error:", e)
# Commit the changes
conn.commit()
Time taken for insertion: 7971.316895 seconds
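The row-by-row loop above took over two hours. As an alternative, the DB-API cursor.executemany() call submits all parameter tuples in one batch and may reduce the per-row overhead, though the actual speedup depends on the driver - a minimal sketch assuming the same connection, query and data:
cursor = conn.cursor()
start_time = time.time()
cursor.executemany(sql_query2, data_to_insert) #submit all parameter tuples in a single batched call
conn.commit()
print(f"Batch insert took {time.time() - start_time:.2f} seconds")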
Now that we have loaded both containers with data, we are good to begin our analysis.
Analysis of Movies¶
What titles are living American actors known for?¶
In this case, we take each 'titleId' from the 'Titles' table and look for it in the 'knownForTitles' column of the 'Name_Basics' table. To do this, we use the INSTR function in GridDB, which searches for one string within another. Since 'knownForTitles' is a list of title IDs, INSTR lets us check whether a given titleId appears in that list. Refer to this resource to learn more. The INSTR condition here is effectively an INNER JOIN.
GridDB also supports the 'IN' operator and the 'like' operator. The 'IN' operator is used here to select specific regions. Refer to this GridDB resource to learn more. The 'like' operator is used to filter down to those celebrities who have been actors in addition to having other primary professions.
sql_query3 = '''SELECT names.PrimaryName,Titles.title FROM Titles,
Name_Basics names
WHERE INSTR(names.knownForTitles,Titles.titleId)>0
AND names.primaryProfession like '%actor%'
AND names.deathYear = \'\\N\'
AND Titles.region IN ('US','IN','CA','NZ','AU')
AND Titles.language = 'en'
'''
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query3)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
Known_for_Titles = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
pd.set_option('display.max_rows', None)
Known_for_Titles
 | primaryName | title
---|---|---
0 | Logan Arens | The Land Before Time XIII: The Wisdom of Friends |
1 | Lee Jung-Hyun | Mr. Sunshine |
2 | Kaine Applegate | File 13 |
3 | Mathew Thomas | Watermelon Days |
4 | Abraham Evan Anil | Mr. and Ms. Rowdy |
5 | Narendra Singh Dhami | Laxmmi Bomb |
6 | Alex Knoll | Star Wars: Episode 2 - Attack of the Clones |
7 | Ankur Rastogi | Laxmmi Bomb |
8 | Yu Teng Yang | We Best Love |
9 | Young Woo Chu | Police University |
10 | Alain-Fabien Delon | Savage Days |
11 | Derek Savage | LetsPostIt |
12 | Niklas Maienschein | Eighteen Ninety-Nine |
13 | Karan Sharma | Dil Bechara - Hotstar |
14 | Jack McElhone | Young Adam |
15 | Anuj Pandit | I've... Found Someone |
16 | Amar Kumar Dang | A Silent Voice: The Movie |
17 | Darcy Ray Flavell-Hudson | Choice |
18 | Oliver Blaha-McGillick | The Devil's Ground |
19 | Noah Ryan Scott | Rick Mercer's Monday Report |
20 | Robert Döhlert | Cherry Blossoms |
21 | Vladimir Kuznetsov | Gold Diggers |
22 | Giulio Mezza | The Hunter |
23 | Olivier Rousseau | Death Dive |
24 | Jeffrey Nham | Tony Watt's Acid Head |
25 | Anthony Clerveaux | Exit 67 |
26 | Quinton Shore | Black Box |
27 | Ryôsuke Ikeoka | Our Tomorrow |
28 | Takuma Hiraoka | Blade of the Immortal |
29 | Callum Seagram Airlie | Roald Dahl's the BFG |
30 | Andrew Cox | Dollface |
31 | Shunpei Kawagoishi | Dance with Me |
32 | Daniel Malik | The Witch |
33 | Armaan Ralhan | Carefree |
34 | Armaan Ralhan | Carefree |
35 | Sam Lin | We Best Love |
36 | Yûta Koseki | If Tomorrow Comes |
37 | Lee Tae-Hwan | W - Two Worlds |
38 | Ryo Narita | The Cornered Mouse Yearns Cheese |
39 | Ryunosuke Watanuki | A Silent Voice: The Movie |
40 | Martin Krupa | Shadowplay |
41 | Yuya Hirose | Demon Slayer: Kimetsu no Yaiba - The Movie: Mu... |
42 | Gokul Suresh | 21st Century |
43 | Gandharv Dewan | Shiddat: Journey Beyond Love |
44 | Miguel Bernardeau | Eighteen Ninety-Nine |
45 | Pavle Mensur | Romantic Lover |
46 | Alberto Lanzafame | The Hunter |
47 | Hariman Dhanjal | Realisation |
48 | Seo Ji-Hoon | Signal |
49 | Lorenzo Adorni | The Hunter |
50 | Jang Dong-yoon | Mr. Sunshine |
51 | Zakaria Kaakour | Perhaps Today... |
52 | Zakaria Kaakour | Perhaps Today... |
53 | Krystof Bartos | Shadowplay |
54 | Krish Chauhan | Devi Adi Parashakti |
55 | James Taylor-Watts | Live! with Kelly |
56 | Tariq Azees | Rick Mercer's Monday Report |
57 | Artemiy Egorov | The Rising Hawk: Battle for the Carpathians |
58 | Minyeong Choi | W - Two Worlds |
59 | Minyeong Choi | Mr. Sunshine |
60 | Zhao Ziluo | Monkey King and the City of Demons |
61 | Kashish Thakur Pundir | Oh Dear |
What is the breakdown of titles screened in film festivals by Region?¶
GridDB supports aggregate functions like COUNT and AVG, together with the GROUP BY clause. Refer to this resource to know more.
sql_query4 = '''SELECT Titles.region, COUNT(Titles.title)
FROM Titles
WHERE attributes like '%festival%'
GROUP BY 1
'''
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query4)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
Fest_Titles_Region = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
Fest_Titles_Region.rename(columns={'':'count'}, inplace=True) #the aggregate column comes back with an empty name
sorted_fest_titles = Fest_Titles_Region.sort_values(by='count', ascending=False)
sorted_fest_titles
 | region | count
---|---|---
2 | XEU | 270 |
1 | XWW | 43 |
0 | XAS | 1 |
3 | BD | 1 |
4 | PH | 1 |
5 | DE | 1 |
6 | US | 1 |
7 | CA | 1 |
Insight(s): The largest number of festival-screened titles carry the region code XEU (Europe). Another 43 festival titles are tagged XWW (worldwide).
Which English titles are associated with Gen-Z film personalities?¶
In this case, we use subqueries and the 'EXISTS' feature of GridDB. Refer to this GridDB resource to learn more on 'Exists'. To know more on supported subqueries in GridDB, refer to this resource.
sql_query5 = '''
SELECT titleId,title
FROM Titles
WHERE EXISTS (
SELECT *
FROM Name_Basics
WHERE INSTR(knownForTitles,titleId) >0
AND
birthYear != \'\\N\'
AND
CAST(birthYear AS INT) BETWEEN 1997 AND 2012
)
AND language = 'en'
AND Titles.region IN ('US','IN','CA','NZ','AU')
'''
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query5)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
Genz_titles = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
pd.set_option('display.max_rows', None)
Genz_titles
 | titleId | title
---|---|---
0 | tt0096636 | Live! with Kelly |
1 | tt0293429 | Maut Ka Sangraam |
2 | tt0396347 | Rick Mercer's Monday Report |
3 | tt0910559 | Cherry Blossoms |
4 | tt10350922 | Laxmmi Bomb |
5 | tt10551608 | Watermelon Days |
6 | tt1127012 | Garnier Fructis Presents: Nach Baliye |
7 | tt1134854 | George A. Romero's Survival of the Dead |
8 | tt1135083 | Black Box |
9 | tt11417728 | VRBangers |
10 | tt12251982 | Devi Adi Parashakti |
11 | tt12850322 | Record of Youth |
12 | tt13490944 | [CSCT-003] the Masterpiece Girl: Aoi Kuruki |
13 | tt1440273 | Romantic Lover |
14 | tt14518136 | Police University |
15 | tt14575182 | BarelyLegal |
16 | tt15669564 | Jalwayu Enclave |
17 | tt1576686 | File 13 |
18 | tt26934390 | LetsPostIt |
19 | tt3061202 | Blue Sky, Green Ocean, Red Earth |
20 | tt3691740 | Roald Dahl's the BFG |
21 | tt3868240 | All or Nothing |
22 | tt4190482 | If Tomorrow Comes |
23 | tt4192662 | Kidnapped Chick |
24 | tt4263482 | The Witch |
25 | tt5084170 | Blade of the Immortal |
26 | tt5323662 | A Silent Voice: The Movie |
27 | tt5332206 | Signal |
28 | tt5797194 | W - Two Worlds |
29 | tt5813366 | Bleeding Steel |
30 | tt6212854 | Voice |
31 | tt6221256 | PropertySex |
32 | tt6286304 | MommysGirl |
33 | tt6426832 | The New Indian Rupee |
34 | tt7094780 | Mr. Sunshine |
35 | tt7278862 | My Brilliant Friend |
36 | tt7813208 | WebYoung |
37 | tt8188872 | The Outsider |
38 | tt8956398 | Mr. and Ms. Rowdy |
39 | tt9319668 | Eighteen Ninety-Nine |
How many film personalities are associated with Gen-Z and Millennials?¶
Refer to CASE matching in this resource.
sql_query6 = '''SELECT CASE
WHEN CAST(birthYear AS INT) BETWEEN 1997 AND 2012 THEN \'GenZ\'
WHEN CAST(birthYear AS INT) BETWEEN 2013 AND 2025 THEN \'GenAlpha\'
WHEN CAST(birthYear AS INT) BETWEEN 1981 AND 1996 THEN \'Millennials\'
WHEN CAST(birthYear AS INT) BETWEEN 1965 AND 1980 THEN \'GenX\'
WHEN CAST(birthYear AS INT) BETWEEN 1946 AND 1964 THEN \'BabyBoomers \'
WHEN CAST(birthYear AS INT) BETWEEN 1925 AND 1945 THEN \'SilentGeneration\'
WHEN CAST(birthYear AS INT) BETWEEN 1901 AND 1924 THEN \'GreatestGeneration\'
END AS Generation,
COUNT(primaryName) as Number_personalities
FROM Name_Basics
WHERE birthYear != \'\\N\'
GROUP BY 1
'''
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query6)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
generations_df = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
generations_df
 | Generation | Number_personalities
---|---|---
0 | GenZ | 8642 |
1 | Millennials | 16134 |
Concluding Remarks¶
In this article, we looked at the different options to connect to GridDB. We compared the JayDeBeApi and the WebAPI by creating a container and performing row registration with each. We then used the JayDeBeApi to extract data from the containers, exercising GridDB SQL features such as string functions (INSTR), the IN and LIKE operators, subqueries with EXISTS, aggregation with GROUP BY, and CASE expressions.
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.