In [15]:

#Libraries for basic data manipulation
import os
import pandas as pd
import numpy as np

#Libraries for the webAPI requests
import http
http.client.HTTPConnection.debuglevel = 1
import json
import requests

#libraries for the JayDeBeAPI
import jaydebeapi
import urllib.parse

#library to calculate time
import time

#Library for loading files into dataframes
import dask.dataframe as dd

#Libraries for graphs
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
from IPython.display import Image, display, Markdown

#os.chdir("Your Working Directory") #Put your working directory here

About the Dataset¶

The dataset we use for this analysis can be accessed at https://developer.imdb.com/non-commercial-datasets/.

In [7]:

##Specify the path to your image file
image_path = 'About the Dataset.png'

width = 500

## Display the image
Image(filename=image_path, width=width)

Out[7]:

The instructions to download the dataset are -

Click the link 'https://datasets.imdbws.com/' as shown in the above snapshot.
Click to download the specific files 'name.basics.tsv.gz' and 'title.akas.tsv.gz'.
Save the downloaded files from the 'Downloads' folder to your working directory.

In [3]:

##Specify the path to your image file
image_path = 'Datasets_Step2.png'

width = 500

## Display the image
Image(filename=image_path, width=width)

Out[3]:

If using a Windows machine, Right-click each tsv.gz file and click 'Extract All' to decompress the file.
If using a MacOS, double-clicking the file should automatically decompress it and create the corresponding .tsv file.
If using a Linux-based system, you can use the gunzip command in the terminal - gunzip filename.tsv.gz

In [4]:

##Specify the path to your image file
image_path = 'Extract_file.png'

width = 500

## Display the image
Image(filename=image_path, width=width)

Out[4]:

The data gets downloaded as folders shown below. Each folder has the corresponding data file in a tsv format (tab-separated-values).

In [5]:

##Specify the path to your image file
image_path = 'After_Extraction.png'

width = 200

## Display the image
Image(filename=image_path, width=width)

Out[5]:

In the context of the IMDb (Internet Movie Database) dataset, the files title.akas.tsv and name.basics.tsv are part of the IMDb datasets and provide information about movie titles and individuals involved in the film industry, respectively.

title.akas.tsv:

Purpose: This file contains information about the alternate titles of movie titles in different regions and languages.

titleId: A unique identifier for the movie title.
ordering: The order of the title in the list of titles associated with the same titleId.
title: The title itself, which might be an alternate title in a different language or region.
region: The region or country for which the title is applicable.
language: The language of the title.
Other attributes providing additional details.

name.basics.tsv:

Purpose: This file contains information about individuals involved in the film industry, such as actors, directors, writers, etc.

nconst: A unique identifier for the person.
primaryName: The primary name of the person.
birthYear and deathYear: The birth and death years of the person (if available).
primaryProfession: The primary profession of the person (e.g., actor, director).
knownForTitles: Titles for which the person is known, indicating their notable works.

Loading data into Python¶

The '.tsv' files in this analysis are of around 1 GB in size. To handle data load effectively, we use the dask library.

In [41]:

tsv_file_path = 'name.basics.tsv/data.tsv'
ddf1 = dd.read_csv(tsv_file_path, sep='\t', assume_missing=True, dtype={'nconst':'object','primaryName':'object','birthYear':'object','deathYear':'object','primaryProfession':'object','knownForTitles':'object'})
Name_Basics_df = ddf1.compute()

In [42]:

tsv_file_path = 'title.akas.tsv/data.tsv'
ddf2 = dd.read_csv(tsv_file_path, sep='\t', assume_missing=True, dtype={'titleId':'object', 'ordering':'object', 'title':'object', 'region':'object', 'language':'object', 'types':'object',
'attributes':'object', 'isOriginalTitle':'object'})
titles_df = ddf2.compute()

Subsetting the data¶

We subset the Name_Basics_df to only include personalities where the birthyear is available and where it is between 1990 and 2005. We also include only actors and actresses for this analysis.

In [70]:

subset_valid_years = Name_Basics_df[Name_Basics_df['birthYear']!='\\N']
subset_valid_years = subset_valid_years[(subset_valid_years['birthYear'].astype(int) >= 1990) & (subset_valid_years['birthYear'].astype(int) <= 2005)]
Name_Basics_subset = subset_valid_years[(subset_valid_years['primaryProfession']=='actor') | (subset_valid_years['primaryProfession']=='actress')]

Regarding titles_df, we only consider titles in the 'english' language. Moreover, we only include titles that were either screened in a film festival or specific titles that had one of the attributes - original, new, complete, literal, alternative or transliterated.

In [71]:

#titles_df[titles_df['attributes']=='\\N']
subset_english_titles = titles_df[titles_df['language'] == 'en']

In [72]:

selected_attributes = ['complete title','literal title','literal English title','alternative spelling','transliterated title','new title','original script title','Cannes festival title', 'Berlin film festival title', 'Venice film festival title', 'Locarno film festival title', 'Bilbao festival title']
filtered_titles_df = subset_english_titles[subset_english_titles['attributes'].isin(selected_attributes)]

Now that we have subset our datasets, we are good to create containers and load them into GridDB. Containers is one of the data structures within GridDB and we learn more about them in the section 'A short note on Containers in GridDB'.

Ways to Connect to GridDB¶

We load the data from the Python DataFrame onto GridDB. There are a number of ways to load data into GridDB and query the data -

JaydebeAPI - GridDB provides a JDBC (Java Database Connectivity) driver that allows you to connect to GridDB using Java. You can use the JDBC driver to establish a connection, execute SQL queries, and perform database operations programmatically. JayDeBeApi is a Python library that works with Java JDBC drivers, allowing you to connect to and interact with Java-based databases from your Python applications.
RESTful Web API - GridDB offers a RESTful Web API that allows you to interact with GridDB using HTTP requests. You can use standard HTTP methods like GET, POST, PUT, DELETE to perform operations on GridDB. This option provides a platform-independent way to connect to GridDB from various programming languages and platforms.
griddb-python library - This is a python library that can be used to connect to a GridDB database and perform query operations.

Connection Approaches used in this article¶

In this article, we use a hybrid approach. We will be using WebAPI to create the first container and to load rows into the container. We then use the JaydebeAPI to use the second container and to load rows into the second container. This will help us compare the processes involved in both the WebAPI and JaydebeAPI.

But before we get started, below are two diagrams explaining the WebAPI and the JayDeBeAPI.

A schema explaining the WebAPI¶

In [22]:

##Specify the path to your image file
image_path = 'Schema_Working_of_WebAPI.png'

width = 600

display(Markdown("## Schema - Interaction between the WebAPI and GridDB"))

## Display the image
display(Image(filename=image_path, width=width))

Schema - Interaction between the WebAPI and GridDB¶

Drivers required for JayDeBeAPI¶

In [23]:

##Specify the path to your image file
image_path = 'Pre-requisites_Jaydebeapi.png'

width = 700

display(Markdown("## Pre-requisites - Drivers required for JaydebeAPI"))

## Display the image
display(Image(filename=image_path, width=width))

Pre-requisites - Drivers required for JaydebeAPI¶

A schema explaining the JayDeBeAPI¶

In [24]:

##Specify the path to your image file
image_path = 'Schema_Working_of_JaydebeAPI.png'

width = 800

display(Markdown("## Schema - Interaction between JaydebeAPI and GridDB"))

## Display the image
display(Image(filename=image_path, width=width))

Schema - Interaction between JaydebeAPI and GridDB¶

Note that dbn_encode, sm_encode, cn_encode and np_encode are URL encoded versions of the database_name, SSL Mode, Cluster name and Notification Provider URL. URL encoding is accomplished using the Python library urllib to convert special characters and non-ASCII characters into a format that can be safely transmitted within a URL.

Important Points Regarding WebAPI and the JayDeBeAPI¶

Setting up the connection to GridDB using the WebAPI is pretty quick as it primarily uses http requests. But the structure of the container needs to be passed as a JSON via the http post request. This can be time-consuming in terms of constructing the JSON, for large containers with several columns and mixed datatypes. However, calls made to query/retrieve data from containers follow standard SQL 'select' syntax.

Setting up the JayDeBeAPI requires some time as drivers need to be downloaded and the class path should be setup with the location of the driver files. This can take up some time in terms of setup. But both the container creation and data retrieval are simple as JayDeBeAPI uses SQL Create Table syntax similar to relational databases.

If you are comfortable with SQL and relational database syntax, JayDeBeAPI is the best approach whereas if you are comfortable with JSON, WebAPI might be the better option.

A short note on Containers in GridDB¶

In GridDB, a container is a fundamental data structure used to organize and store data. It serves as a logical unit for storing and managing sets of related data within a GridDB database. Containers in GridDB can be compared to tables in a relational database, but they offer some unique features and capabilities tailored to GridDB's architecture.

Key characteristics of containers in GridDB include:

Highly Scalable: GridDB is designed to be a highly scalable and distributed database. Containers can be partitioned across multiple nodes in a GridDB cluster, enabling horizontal scaling as the data volume grows.

Data Distribution: Containers can be distributed across multiple nodes for load balancing and improved performance. GridDB supports sharding, which allows data to be distributed based on a specified sharding key.

Container Types: GridDB supports various types of containers, each optimized for specific use cases. The main container types are:

Collection : A basic container type for storing data.
TimeSeries: Optimized for time-series data with a focus on efficient time-based queries.

CRUD Operations: Containers support standard CRUD (Create, Read, Update, Delete) operations. You can insert, retrieve, update, and delete records within a container.

Transaction Support: GridDB provides support for ACID transactions within containers, ensuring the consistency and integrity of data. GridDB has its query language called TQL (Time Series Query Language) and also supports SQL querying similar to relational databases.

Connecting to GridDB using the WebAPI¶

In this article, we use the WebAPI to connect to GridDB. WebAPIs are simple http POST requests which use basic authentication. To connect to the GridDB cloud's WebAPI, the Request that is constructed should have the username and password encoded to base64 standards. To determine the base64 encoded credentials for your username and password, a tool such as https://www.base64encode.org/ can be used. Refer to the resource listed here to learn more about the different entities involved in creating a request - https://griddb.net/en/blog/using-griddb-cloud-version-1-2/. The basic authentication format in GridDB is 'username:password' where the case-sensitive username and password should be used for encoding.

Construct the base URL based on your GridDB cluster you'd like to connect to. Ensure that you replace the placeholders in the base_url below with the correct values that correspond to your GridDB instance.

Setting up the Authorization and Request¶

In [75]:

#Construct an object to hold the request headers 
#Ensure that you replace the XXXX placeholder with the correct value that matches the credentials for your GridDB instance)
header_obj = {"Authorization":"XXXX","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}

#Construct the base URL based on your GridDB cluster you'd like to connect to
#ensure that you replace the placeholders in the URL below with the correct values that correspond to your GridDB instance)

base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'

Using the WebAPI to create the 'Name_Basics' Container¶

With the WebAPI, requests to create containers are usually passed as a JSON object. A request to create a container typically contains the following elements - container_name, container_type, rowkey and the columns in the container passed as a list of dictionaries.

For Collection containers, the rowkey can is usually set to False whereas for Timeseries containers, the rowkey has to be set to True. When we set 'rowkey' to True, we are effectively telling GridDB that the container has a primary key column. Timeseries containers require the timestamp variable to be the primary key. While Collection Containers can either have a rowkey of True or False, when the rowkey is set to False, GriDB creates a Collection Container by default.

For ease of use and simplicity, we use Python's zip() function. Here, we create a dictionary named 'data_obj'. We then create a list named 'input_variables' and another list named 'data_types'. We then iterate through the zipped function of input_variables and data_types which helps map the column name and datatype that corresponds to each column name.

Creating the container 'Name_Basics'¶

We use the dataframe 'Name_Basics_subset' to determine the columns and datatypes of the container that is going to be created.

In [79]:

data_obj = {
    "container_name": "Name_Basics",
    "container_type": "COLLECTION",
    "rowkey": False,
    "columns": []
}

input_variables = [
    "nconst", "primaryName", "birthYear", "deathYear", "primaryProfession", "knownForTitles"
]


data_types = [
    "STRING", "STRING", "STRING", "STRING", "STRING", "STRING"
]

for variable, data_type in zip(input_variables, data_types):
    column = {
        "name": variable,
        "type": data_type
    }
    data_obj["columns"].append(column)

We then pass the 'data_obj' to the request.

In [ ]:

#Set up the GridDB WebAPI URL
url = base_url + 'containers'

#Invoke the GridDB WebAPI with the headers and the request body
x = requests.post(url, json = data_obj, headers = header_obj)

Using the WebAPI to perform Row Registration (Loading data into GridDB)¶

In GridDB, the process of loading data into a container is referred to as "Row Registration". When you register a row, you are essentially inserting a new record into the container. This is similar to an 'Insert' DDL task in SQL. Refer to this resource to learn more- https://www.toshiba-sol.co.jp/en/pro/griddb/docs-en/v4_6/GridDB_Web_API_Reference.html#row-registration.

The good thing about using Python in conjunction with GridDB is that we can use both inbuilt functions in Python and packages such as Json and Pandas to manipulate data easily and effectively. In this case, we use the 'to_json' function of the 'json' module.

In this case, we use batch loading in GridDB. Here is a resource on batch loading - https://griddb.net/en/blog/griddb-optimization-with-multi-put-and-query/. If you use the WebAPI, then you can just use a simple 'for' loop to load your subsets of data into your GridDB container. Batch Loading enables users to import data in large chunks, reducing the overall time and resources required for the data migration process. GridDB supports parallelized batch loading, which can significantly enhance the efficiency and speed of data loading operations. Parallelization in GridDB is achieved through various mechanisms such as Multi-Node Architecture, Container Partitioning, Multi-Threaded Loading and Parallel Query Execution. Read on to know more - https://docs.griddb.net/.

Loading the Name_Basics container¶

To perform batch loading, let's use the np.array_split to break data into chunks.

In [106]:

#creating 12 subsets of the data
Name_Basics_Subsets = np.array_split(Name_Basics_subset, 12)

With the np.array_split, we can access individual subsets like Name_Basics_Subsets[0], Name_Basics_Subsets[1] ... Name_Basics_Subsets[11], each representing a portion of the original data divided into 12 parts. This kind of data splitting is often done for parallel processing or distributing work across multiple processors or nodes in a computing cluster. We use this for our bulk loading initiative.

In [109]:

http.client.HTTPConnection.debuglevel = 0
#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/Name_Basics/rows'

#Invoke the GridDB WebAPI using the request constructed
for subset in Name_Basics_Subsets:
    Name_Basics_Subsets_json = subset.to_json(orient='values')
    request_body_subset = Name_Basics_Subsets_json
    x = requests.put(url, data=request_body_subset, headers=header_obj)
    print('______________________________________')
    print('_______________',x.text,'___________')
    print('______________________________________')
    if x.status_code > 299: #To account for HTTP response codes indicating a success - 200 and 201 etc.
        print(x.status_code) # an error has occurred; so, lets stop the process here
        print(Name_Basics_Subsets_json)
        break
    else:
        print('Success for chunk..')

______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2065} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2064} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2064} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2064} ___________
______________________________________
Success for chunk..
______________________________________
_______________ {"count":2064} ___________
______________________________________
Success for chunk..

Using the JayDeBeApi to create the 'Titles' Container¶

The JayDeBeApi is a Python module that enables connecting to databases that support the Java Database Connectivity (JDBC) standard. It acts as a bridge between Python and the database using JDBC, allowing users to interact with standalone and cloud databases. In this article, we use the GridDB Cloud environment which is currently only available in Japan but it will be available globally in the near-future. Refer to the GridDB cloud documentation for more information.

The JayDeBeApi library requires the JDBC driver specific to the database you want to connect to. In the case of GridDB, you should have the GridDB JDBC driver (gridstore-jdbc.jar) installed.

Here are the steps to do so -

Download the GridDB JDBC drivers from the GridDB Cloud portal. Upon logging into the Cloud portal, the drivers can be downloaded by accessing the 'Supports' page on the Cloud portal and scrolling down to the "GridDB Cloud Library and Plugin download" section.
Note that the file downloaded is a zip archive. On extracting it, the JDBC driver files can be found within the JDBC folder. Below is a screenshot of the 'Downloads' section in the GridDB Cloud Portal -

In [89]:

##Specify the path to your image file
image_path = 'Griddb_Libraries_Snapshot.png'

width = 400

## Display the image
Image(filename=image_path, width=width)

Out[89]:

gridstore-5.2.0.jar
gridstore-jdbc-5.2.0.jar
gridstore-advanced-5.2.0.jar

Save the jar files in the same folder location and in a location accessible by your Python environment.
Add the location of the jar files to the CLASSPATH system environment variable.
To do this on a Windows machine, access the 'Environment Variables' from the 'Control Panel'.
Under the 'System Variables', create a new variable called 'CLASSPATH' and mention the locations of the 3 jar files.
Install the JayDeBeApi Python package in your Python environment.
You are now ready to connect to the GridDB database from Python.Once the connection is established, you can execute SQL queries and perform DML database operations using the JayDeBeApi connection object.
It is important to note that the drivers for GridDB Cloud are different from GridDB OSS.
Ensure that the SSL Mode is set to 'sslMode=PREFERRED'. If this doesn't work, try with '&sslMode=VERIFY'. Ensure that the connectionRoute is set to PUBLIC.

Notification Provider and Environment Variable Setup¶

We will use the GridDB notification provider method to connect to GridDB. First, make a note of the notificationProvider, the GridDB cluster name, the GridDB database name and the username/password to access the database. These values will be used to construct the connection string and establish the connection to GridDB.

One important consideration is that JayDeBeApi relies on Java. Specifically, it necessitates the presence of a JAVA_HOME environment variable, which should correctly reference the installation location of the Java JRE (Java Runtime Environment). To ensure a smooth operation, it is crucial to confirm that the JAVA_HOME environment variable is set up and that it points to the designated JRE location.

Establishing the connection to GridDB¶

In [8]:

notification_provider = "https://dbaasshareextconsta.blob.core.windows.net/dbaas-share-extcon-blob/trial2002.json?sv=2015-04-05&sr=b&st=2024-01-16T00%3A00%3A00.0000000Z&se=2074-01-16T00%3A00%3A00.0000000Z&sp=r&sig=39U8P%2B%2BqzKAcUQNRw6K54dZHBhyojfiFmgHTmcVOHlY%3D"
np_encode = urllib.parse.quote(notification_provider)

cluster_name = "XX" ## Specify the cluster name here
cn_encode = urllib.parse.quote(cluster_name)

database_name = "XX" ## Specify the database name here
dbn_encode = urllib.parse.quote(database_name)

sslMode = "&sslMode=PREFERRED"  #sslMode should be PREFERRED and connectionRoute should be PUBLIC
sm_encode = urllib.parse.quote(sslMode)

username = 'XX'
password = 'XX'

#Construct the JDBC URL to be used to connect to the GridDB database. The format is -
# jdbc:gs:///clusterName/databaseName?notificationProvider=theNotificationProviderURL&sslMode=PREFERRED
url = "jdbc:gs:///" + cn_encode +  "/" + dbn_encode + "?notificationProvider=" + np_encode + sm_encode 

#print("JDBC URL is " + url)

conn = jaydebeapi.connect("com.toshiba.mwcloud.gs.sql.Driver",
    url, 
    {'user': username, 'password': password,
      'connectionRoute':'PUBLIC',
     'loginTimeout': '20'}
    , "gridstore-jdbc-5.2.0.jar")  #ensure to provide the correct location of the gridstore-jdbc JAR library

print('success!')

success!

Creating the container 'Titles'¶

We use the dataframe 'filtered_titles_df' to determine the column names and datatypes.

As discussed in the section on 'Important Points Regarding WebAPI and the JayDeBeAPI'; while using the JaydebeAPI, we can use GridDB's SQL like syntax for container creation and data extraction. In other words, we can run CREATE TABLE statements to create a container. GridDB supports standard DDL and DML operations in its SQL feature in addition to TQL. Refer to this GridDB SQL Reference to know more.

In [116]:

sql_query1 = (f"""
    CREATE TABLE Titles
        (
        titleId STRING,
        ordering STRING,
        title STRING,
        region STRING,
        language STRING,
        types STRING,
        attributes STRING,
        isOriginalTitle STRING
        )
""")

Note that you can either specify VARCHAR or STRING for string based datatypes. If you specify VARCHAR, the limit needs to be specified. e.g. VARCHAR(15).

In [117]:

with conn.cursor() as cursor:
    cursor.execute(sql_query1)

Using the JayDeBeAPI to perform Row Registration¶

While using JayDeBeAPI to load data into containers, we use the 'Insert' SQL statement. Refer to this GridDB resource to learn more. We iterate through each row of the pandas dataframe 'filtered_titles_df' to insert rows into GridDB. Each row is taken in as a tuple and loaded into the container 'Titles' in GridDB Cloud.

Loading the Titles Container¶

In [122]:

# Prepare the SQL statement for insertion
sql_query2 = "INSERT INTO Titles (titleId, ordering, title, region, language, types,attributes, isOriginalTitle) VALUES (?,?,?,?,?,?,?,?)"

# Create a list of tuples containing the data to be inserted
data_to_insert = [tuple(row) for _, row in filtered_titles_df.iterrows()]

# Use a loop to insert each row of data and record the time taken
cursor = conn.cursor()
try:
    start_time = time.time()  # Record the start time
    for row in data_to_insert:
        #print(row)
        cursor.execute(sql_query2, row)
    end_time = time.time()  # Record the end time
    execution_time = end_time - start_time
    print(f"Time taken for insertion: {execution_time:.6f} seconds")
except Exception as e:
    # Handle any exceptions that may occur during execution
    print("Error:", e)

# Commit the changes
conn.commit()

Time taken for insertion: 7971.316895 seconds

Now that we have loaded both containers with data, we are good to begin our analysis.

Analysis of Movies¶

What titles are living American actors known for?¶

In this case, we attempt to find the 'titleID' from the 'Titles' table and look for these titleIDs in the 'knownForTitles' column from the 'Names' table. To do this, we use the INSTR function in GridDB. The INSTR function helps search a string in another string. In this case, we search the titleId from titles in the 'knownforTitles' from names. The 'knownforTitles' is a list of titles. The INSTR function helps search the titleID in the list of titles. Refer to this resource to learn more. The INSTR function in this case is equivalent to an INNER JOIN.

GridDB also supports the 'IN' operator and the 'like' operator. The 'IN' operator is used here to select specific regions. Refer to this GridDB resource to learn more. The 'like' operator is used to filter down to those celebrities who have been actors in addition to having other primary professions.

In [156]:

sql_query3 = '''SELECT names.PrimaryName,Titles.title FROM Titles, 
     Name_Basics names 
     WHERE INSTR(names.knownForTitles,Titles.titleId)>0  
     AND names.primaryProfession like '%actor%'
     AND names.deathYear = \'\\N\'
     AND Titles.region IN ('US','IN','CA','NZ','AU')
     AND Titles.language = 'en'
     '''

# Create a cursor to execute the query
cursor = conn.cursor()

# Execute the query
cursor.execute(sql_query3)

# Fetch the column names
column_names = [desc[0] for desc in cursor.description]

# Fetch the query results
results = cursor.fetchall()

# Convert the results to a Pandas DataFrame
Known_for_Titles = pd.DataFrame(results, columns=column_names)

# Close the cursor and connection
cursor.close()

In [157]:

pd.set_option('display.max_rows', None)
Known_for_Titles

Out[157]:

	primaryName	title
0	Logan Arens	The Land Before Time XIII: The Wisdom of Friends
1	Lee Jung-Hyun	Mr. Sunshine
2	Kaine Applegate	File 13
3	Mathew Thomas	Watermelon Days
4	Abraham Evan Anil	Mr. and Ms. Rowdy
5	Narendra Singh Dhami	Laxmmi Bomb
6	Alex Knoll	Star Wars: Episode 2 - Attack of the Clones
7	Ankur Rastogi	Laxmmi Bomb
8	Yu Teng Yang	We Best Love
9	Young Woo Chu	Police University
10	Alain-Fabien Delon	Savage Days
11	Derek Savage	LetsPostIt
12	Niklas Maienschein	Eighteen Ninety-Nine
13	Karan Sharma	Dil Bechara - Hotstar
14	Jack McElhone	Young Adam
15	Anuj Pandit	I've... Found Someone
16	Amar Kumar Dang	A Silent Voice: The Movie
17	Darcy Ray Flavell-Hudson	Choice
18	Oliver Blaha-McGillick	The Devil's Ground
19	Noah Ryan Scott	Rick Mercer's Monday Report
20	Robert Döhlert	Cherry Blossoms
21	Vladimir Kuznetsov	Gold Diggers
22	Giulio Mezza	The Hunter
23	Olivier Rousseau	Death Dive
24	Jeffrey Nham	Tony Watt's Acid Head
25	Anthony Clerveaux	Exit 67
26	Quinton Shore	Black Box
27	Ryôsuke Ikeoka	Our Tomorrow
28	Takuma Hiraoka	Blade of the Immortal
29	Callum Seagram Airlie	Roald Dahl's the BFG
30	Andrew Cox	Dollface
31	Shunpei Kawagoishi	Dance with Me
32	Daniel Malik	The Witch
33	Armaan Ralhan	Carefree
34	Armaan Ralhan	Carefree
35	Sam Lin	We Best Love
36	Yûta Koseki	If Tomorrow Comes
37	Lee Tae-Hwan	W - Two Worlds
38	Ryo Narita	The Cornered Mouse Yearns Cheese
39	Ryunosuke Watanuki	A Silent Voice: The Movie
40	Martin Krupa	Shadowplay
41	Yuya Hirose	Demon Slayer: Kimetsu no Yaiba - The Movie: Mu...
42	Gokul Suresh	21st Century
43	Gandharv Dewan	Shiddat: Journey Beyond Love
44	Miguel Bernardeau	Eighteen Ninety-Nine
45	Pavle Mensur	Romantic Lover
46	Alberto Lanzafame	The Hunter
47	Hariman Dhanjal	Realisation
48	Seo Ji-Hoon	Signal
49	Lorenzo Adorni	The Hunter
50	Jang Dong-yoon	Mr. Sunshine
51	Zakaria Kaakour	Perhaps Today...
52	Zakaria Kaakour	Perhaps Today...
53	Krystof Bartos	Shadowplay
54	Krish Chauhan	Devi Adi Parashakti
55	James Taylor-Watts	Live! with Kelly
56	Tariq Azees	Rick Mercer's Monday Report
57	Artemiy Egorov	The Rising Hawk: Battle for the Carpathians
58	Minyeong Choi	W - Two Worlds
59	Minyeong Choi	Mr. Sunshine
60	Zhao Ziluo	Monkey King and the City of Demons
61	Kashish Thakur Pundir	Oh Dear

What is the breakdown of titles screened in film festivals by Region?¶

GridDB supports aggregate functions like Count, Average, Groupby, etc. Refer to this resource to know more.

In [9]:

sql_query4 = '''SELECT Titles.region, COUNT(Titles.title) 
                FROM Titles
                WHERE attributes like '%festival%'
                GROUP BY 1
             '''

# Create a cursor to execute the query
cursor = conn.cursor()

# Execute the query
cursor.execute(sql_query4)

# Fetch the column names
column_names = [desc[0] for desc in cursor.description]

# Fetch the query results
results = cursor.fetchall()

# Convert the results to a Pandas DataFrame
Fest_Titles_Region = pd.DataFrame(results, columns=column_names)

# Close the cursor and connection
cursor.close()

In [10]:

Fest_Titles_Region.rename(columns={'region':'region','':'count'}, inplace=True)

In [36]:

sorted_fest_titles = Fest_Titles_Region.sort_values(by='count', ascending=False)
sorted_fest_titles

Out[36]:

	region	count
2	XEU	270
1	XWW	43
0	XAS	1
3	BD	1
4	PH	1
5	DE	1
6	US	1
7	CA	1

Insight(s): The most number of titles that went to film festivals seem to be from the European Union (XEU). Of these, 43 titles that went to film festivals were worldwide.

Which English titles are associated with Gen-Z film personalities?¶

In this case, we use subqueries and the 'EXISTS' feature of GridDB. Refer to this GridDB resource to learn more on 'Exists'. To know more on supported subqueries in GridDB, refer to this resource.

In [62]:

sql_query5 = '''
                SELECT titleId,title
                FROM Titles
                WHERE EXISTS (
                SELECT *
                FROM Name_Basics
                WHERE INSTR(knownForTitles,titleId) >0 
                AND
                birthYear != \'\\N\'
                AND
                CAST(birthYear AS INT) BETWEEN 1997 AND 2012
                )
                AND language = 'en'
                AND Titles.region IN ('US','IN','CA','NZ','AU')
              '''  

# Create a cursor to execute the query
cursor = conn.cursor()

# Execute the query
cursor.execute(sql_query5)

# Fetch the column names
column_names = [desc[0] for desc in cursor.description]

# Fetch the query results
results = cursor.fetchall()

# Convert the results to a Pandas DataFrame
Genz_titles = pd.DataFrame(results, columns=column_names)

# Close the cursor and connection
cursor.close()

In [63]:

pd.set_option('display.max_rows', None)
Genz_titles

Out[63]:

	titleId	title
0	tt0096636	Live! with Kelly
1	tt0293429	Maut Ka Sangraam
2	tt0396347	Rick Mercer's Monday Report
3	tt0910559	Cherry Blossoms
4	tt10350922	Laxmmi Bomb
5	tt10551608	Watermelon Days
6	tt1127012	Garnier Fructis Presents: Nach Baliye
7	tt1134854	George A. Romero's Survival of the Dead
8	tt1135083	Black Box
9	tt11417728	VRBangers
10	tt12251982	Devi Adi Parashakti
11	tt12850322	Record of Youth
12	tt13490944	[CSCT-003] the Masterpiece Girl: Aoi Kuruki
13	tt1440273	Romantic Lover
14	tt14518136	Police University
15	tt14575182	BarelyLegal
16	tt15669564	Jalwayu Enclave
17	tt1576686	File 13
18	tt26934390	LetsPostIt
19	tt3061202	Blue Sky, Green Ocean, Red Earth
20	tt3691740	Roald Dahl's the BFG
21	tt3868240	All or Nothing
22	tt4190482	If Tomorrow Comes
23	tt4192662	Kidnapped Chick
24	tt4263482	The Witch
25	tt5084170	Blade of the Immortal
26	tt5323662	A Silent Voice: The Movie
27	tt5332206	Signal
28	tt5797194	W - Two Worlds
29	tt5813366	Bleeding Steel
30	tt6212854	Voice
31	tt6221256	PropertySex
32	tt6286304	MommysGirl
33	tt6426832	The New Indian Rupee
34	tt7094780	Mr. Sunshine
35	tt7278862	My Brilliant Friend
36	tt7813208	WebYoung
37	tt8188872	The Outsider
38	tt8956398	Mr. and Ms. Rowdy
39	tt9319668	Eighteen Ninety-Nine

How many film personalities are associated with GenZ and Millenials?¶

Refer to CASE matching in this resource.

In [65]:

sql_query6 = '''SELECT CASE 
                WHEN CAST(birthYear AS INT) BETWEEN 1997 AND 2012 THEN \'GenZ\'                
                WHEN CAST(birthYear AS INT) BETWEEN 2013 AND 2025 THEN \'GenAlpha\'                
                WHEN CAST(birthYear AS INT) BETWEEN 1981 AND 1996 THEN \'Millennials\'                
                WHEN CAST(birthYear AS INT) BETWEEN 1965 AND 1980 THEN \'GenX\'                
                WHEN CAST(birthYear AS INT) BETWEEN 1946 AND 1964 THEN \'BabyBoomers \'                
                WHEN CAST(birthYear AS INT) BETWEEN 1925 AND 1945 THEN \'SilentGeneration\'                
                WHEN CAST(birthYear AS INT) BETWEEN 1901 AND 1924 THEN \'GreatestGeneration\'
                END AS Generation,
                COUNT(primaryName) as Number_personalities
                FROM Name_Basics
                WHERE birthYear != \'\\N\'
                GROUP BY 1
              '''  

# Create a cursor to execute the query
cursor = conn.cursor()

# Execute the query
cursor.execute(sql_query6)

# Fetch the column names
column_names = [desc[0] for desc in cursor.description]

# Fetch the query results
results = cursor.fetchall()

# Convert the results to a Pandas DataFrame
generations_df = pd.DataFrame(results, columns=column_names)

# Close the cursor and connection
cursor.close()

In [66]:

generations_df

Out[66]:

	Generation	Number_personalities
0	GenZ	8642
1	Millennials	16134

Concluding Remarks¶

In this article, we looked at the different options to connect to GridDB. We then compared the JayDeBeAPI and the WebAPI by creating containers using both of them. Similarly, we used these two methods to perform row registration. We then used the JayDeBeAPI to extract data from the containers. We looked at different GridDB features that included character functions, subqueries and string operations.

CRUD Operations on GridDB with...

Building AI Voice Note

Normalization in Java

Real-time Infrastructure Monitoring Using GridDB...

Analyzing Movies on IMDB using GridDB