Packages required¶
import os
import pandas as pd
import numpy as np
from IPython.display import Image
import jaydebeapi
import urllib.parse
import sys
import matplotlib.pyplot as plt
import seaborn as sns
Overview¶
Marvel and DC comics have been around for several decades. Both Marvel and DC are known for their iconic characters, super-heroes and supervillains. While Marvel produced Spider-Man, Iron Man and Hulk; DC produced Superman, Batman and the Wonder Woman. Both of these publishers have diversified their content and expanded beyond traditional comic books to encompass various forms of media and storytelling.
DC and Marvel have created characters that transcend nationality and resonate with people worldwide. The iconic characters created by these brands have become a part of popular culture, prompting passionate debates among fans when comparing Marvel and DC heroes.
We delve deeper into the Marvel and DC universe with the help of a public dataset from Data.World called the 'Comic Characters Dataset'.
There are two data files within the dataset. Download each of them using the 'download' icon as shown below. Download the DC Dataset as shown below.
##Specify the path to your image file
image_path = 'dc-wikia-data_download_icon.png'
width = 500
## Display the image
Image(filename=image_path, width=width)
--------------------------------------------------------------------------- FileNotFoundError Traceback (most recent call last) <ipython-input-2-3581f17e51f6> in <module> 5 6 ## Display the image ----> 7 Image(filename=image_path, width=width) ~/.local/lib/python3.6/site-packages/IPython/core/display.py in __init__(self, data, url, filename, format, embed, width, height, retina, unconfined, metadata) 1223 self.unconfined = unconfined 1224 super(Image, self).__init__(data=data, url=url, filename=filename, -> 1225 metadata=metadata) 1226 1227 if self.width is None and self.metadata.get('width', {}): ~/.local/lib/python3.6/site-packages/IPython/core/display.py in __init__(self, data, url, filename, metadata) 628 self.metadata = {} 629 --> 630 self.reload() 631 self._check_data() 632 ~/.local/lib/python3.6/site-packages/IPython/core/display.py in reload(self) 1254 """Reload the raw data from file or URL.""" 1255 if self.embed: -> 1256 super(Image,self).reload() 1257 if self.retina: 1258 self._retina_shape() ~/.local/lib/python3.6/site-packages/IPython/core/display.py in reload(self) 653 """Reload the raw data from file or URL.""" 654 if self.filename is not None: --> 655 with open(self.filename, self._read_flags) as f: 656 self.data = f.read() 657 elif self.url is not None: FileNotFoundError: [Errno 2] No such file or directory: 'dc-wikia-data_download_icon.png'
Similarly download the Marvel Dataset.
##Specify the path to your image file
image_path = 'marvel-wikia-data_download_icon.png'
width = 500
## Display the image
Image(filename=image_path, width=width)
Each data file (.csv) can be downloaded using the 'Download' button as shown below -
Do the same for the Marvel Dataset as well.
Loading the data into a dataframe and Data Cleaning¶
Loading the DC Dataframe¶
##Loading the data into a dataframe in Python
dc_df = pd.read_csv('dc-wikia-data.csv',sep=',') #Load the dataframe
dc_df['YEAR']= dc_df['YEAR'].fillna(0).astype(int)
dc_df['APPEARANCES']= dc_df['APPEARANCES'].fillna(0).astype(int)
dc_df = dc_df.rename(columns={'ID': 'identity'})
dc_df.fillna('', inplace=True)
Loading the Marvel Dataframe¶
##Loading the data into a dataframe in Python
marvel_df = pd.read_csv('marvel-wikia-data.csv',sep=',') #Load the dataframe
marvel_df['Year']= marvel_df['Year'].fillna(0).astype(int)
marvel_df['APPEARANCES']= marvel_df['APPEARANCES'].fillna(0).astype(int)
marvel_df = marvel_df.rename(columns={'ID': 'identity'})
marvel_df.fillna('', inplace=True)
Why GridDB?¶
GridDB is a suitable choice for this analysis for the below reasons -
- High-performance and scalability: GridDB is designed to handle large-scale data processing with high-speed data ingestion and low-latency data retrieval.
- Real-time analytics: GridDB offers real-time analytics capabilities
- to enable real-time decision-making and the ability to derive actionable insights from your data in a timely manner.
- Integration with other technologies: GridDB offers support for various programming languages and provides APIs for easy integration with different applications and systems. It also supports integration with popular frameworks and tools commonly used in data processing and analytics workflows.
Containers in GridDB¶
GridDB supports the concept of containers. Containers in GridDB are logical containers that can hold data and define the schema for the stored data. They are analogous to tables in a relational database or collections in a NoSQL database.
You can create multiple containers within a GridDB database to organize and store different types of data. Each container can have its own schema, columns, and indexes based on your data requirements. This allows you to efficiently manage and access different types of data within the same GridDB database instance.
By using multiple containers, you can logically separate and structure your data according to your application's needs, making it easier to query, analyze, and manipulate the data independently within each container.
Options to connect to GridDB¶
There are three options to connect to GridDB -
- GridDB JDBC Driver
- RESTful Web API
- GridDB Python Libraries
- GridDB JDBC Driver:
- GridDB provides a JDBC (Java Database Connectivity) driver that allows you to connect to GridDB using Java.
- You can use the JDBC driver to establish a connection, execute SQL queries, and perform database operations programmatically.
- RESTful Web API:
- GridDB offers a RESTful Web API that allows you to interact with GridDB using HTTP requests.
- You can use standard HTTP methods like GET, POST, PUT, DELETE to perform operations on GridDB.
- This option provides a platform-independent way to connect to GridDB from various programming languages and platforms.
This is the official resource for the WebAPI. - Python Libraries:
- griddb-python: The 'griddb-python' library provides a mechanism to connect to GridDB and perform operations using Python.
Note that the Python JayDeBeApi library can be utilized to connect to GridDB using the GridDB JDBC driver. This allows Python programs to connect to GridDB and run queries using JDBC. In this article, we use Jaydebe API to connect to GridDB.
Establishing the connection to GridDB using Jaydebe API¶
JayDeBeApi is a Python module that enables connecting to databases that support the Java Database Connectivity (JDBC) standard. It acts as a bridge between Python and the database using JDBC, allowing users to execute SQL queries and interact with databases from their Python code. Moreover, JaydebeAPI enables connecting to both a standalone deployment of GridDB or cloud deployment. In this article, we use the GridDB Cloud environment.
Please note that GridDB Cloud is currently only available in Japan but it will be available globally in the near-future. If you have a cloud-based GridDB deployment, ensure that you have the necessary information to establish the JDBC connection, such as the JDBC URL, username, and password provided by your cloud service provider. Also make sure you whitelist your IP Address
Refer to GridDB cloud documentation for more information.
The JayDeBeApi library requires the JDBC driver specific to the database you want to connect to. In the case of GridDB, you need to have the GridDB JDBC driver (gridstore-jdbc.jar) installed.
Here are the steps to do so -
- Download the GridDB JDBC drivers from the GridDB Cloud portal by accessing the Supports page on the Cloud portal and scrolling down to the "GridDB Cloud Library and Plugin download" section.
- Note that the file downloaded is a zip archive. On extracting it, the JDBC driver files can be found within the JDBC folder.
- The GridDB JDBC driver files are -
- gridstore-5.2.0.jar
- gridstore-jdbc-5.2.0.jar
- gridstore-advanced-5.2.0.jar
We will use the GridDB notification provider method to connect to GridDB. First, make a note of the notificationProvider, the GridDB cluster name, the GridDB database name and the username/password to access the database. These values will be used to construct the connection string and establish the connection to GridDB.
notification_provider = "https://dbaasshareextconsta.blob.core.windows.net/dbaas-share-extcon-blob/trial1602.json?sv=2015-04-05&sr=b&st=2023-03-14T00%3A00%3A00.0000000Z&se=2073-03-14T00%3A00%3A00.0000000Z&sp=r&sig=h2VJ0xAqsnRsqWV5CAS66RifPIZ1PDCJ0x%2FiXb2FOhA%3D"
np_encode = urllib.parse.quote(notification_provider)
cluster_name = "gs_clustertrial1602" ## Specify the cluster name here
cn_encode = urllib.parse.quote(cluster_name)
database_name = "python_blogs" ## Specify the database name here
dbn_encode = urllib.parse.quote(database_name)
sslMode = "&sslMode=PREFERRED" #sslMode should be PREFERRED and connectionRoute should be PUBLIC
sm_encode = urllib.parse.quote(sslMode)
username = "Admin"
password = "O6cx2JY1"
#Construct the JDBC URL to be used to connect to the GridDB database. The format is -
# jdbc:gs:///clusterName/databaseName?notificationProvider=theNotificationProviderURL&sslMode=PREFERRED
url = "jdbc:gs:///" + cn_encode + "/" + dbn_encode + "?notificationProvider=" + np_encode + sm_encode
#print("JDBC URL is " + url)
os.chdir("/home/israel/development/upwork/GridDB/lib/")
os.environ['CLASSPATH'] = '/home/israel/development/upwork/GridDB/lib'
conn = jaydebeapi.connect("com.toshiba.mwcloud.gs.sql.Driver",
url,
{'user': username, 'password': password,
'connectionRoute':'PUBLIC'}
, "gridstore-jdbc.jar") #ensure to provide the correct location of the gridstore-jdbc JAR library
print('success!')
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-12-0458df6b53ce> in <module> 24 {'user': username, 'password': password, 25 'connectionRoute':'PUBLIC'} ---> 26 , "gridstore-jdbc.jar") #ensure to provide the correct location of the gridstore-jdbc JAR library 27 28 print('success!') ~/.local/lib/python3.6/site-packages/jaydebeapi/__init__.py in connect(jclassname, url, driver_args, jars, libs) 410 else: 411 libs = [] --> 412 jconn = _jdbc_connect(jclassname, url, driver_args, jars, libs) 413 return Connection(jconn, _converters) 414 ~/.local/lib/python3.6/site-packages/jaydebeapi/__init__.py in _jdbc_connect_jpype(jclassname, url, driver_args, jars, libs) 219 return jpype.JArray(jpype.JByte, 1)(data) 220 # register driver for DriverManager --> 221 jpype.JClass(jclassname) 222 if isinstance(driver_args, dict): 223 Properties = jpype.java.util.Properties ~/.local/lib/python3.6/site-packages/jpype/_jclass.py in __new__(cls, jc, loader, initialize) 97 98 # Pass to class factory to create the type ---> 99 return _jpype._getClass(jc) 100 101 TypeError: Class com.toshiba.mwcloud.gs.sql.Driver is not found
You're all set now. The connection established above can be used throughout for DDL and DML process within the database. You are good to use SQL like statements provided by GridDB to create the container and run queries on the same.
If you face any issues with the connection, here are some pointers on what to check -
1. Confirm that you have all 3 GridDB JDBC driver files downloaded and accessible in your system. 2. Ensure that all 3 GridDB JDBC driver files are accessible in the CLASSPATH environment system variable. 3. Ensure to use the correct notificationProvider URL. The notificationProvider URL can be found in the GridDB Cloud Clusters view. 4. Ensure that your IP Address is specified in the exception list maintained on the GridDB Cloud instance. 5. Ensure that your Python environment can access the GridDB JDBC driver files. If you are using an Anaconda environment, ensure to restart the environment after setting up the CLASSPATH environment system variable.
Creating the Containers¶
DC Dataset¶
sql_query1 = (f"""
CREATE TABLE DC_Dataset
(
page_id INT,
name VARCHAR(100),
urlslug VARCHAR(100),
identity VARCHAR(20),
align VARCHAR(20),
eye VARCHAR(20),
hair VARCHAR(30),
sex VARCHAR(25),
gsm VARCHAR(20),
alive VARCHAR(20),
appearances INT,
first_appearance VARCHAR(20),
YEAR INT,
PRIMARY KEY (page_id)
)
""")
with conn.cursor() as cursor:
cursor.execute(sql_query1)
Marvel Dataset¶
sql_query2 = (f"""
CREATE TABLE Marvel_Dataset
(
page_id INT,
name VARCHAR(100),
urlslug VARCHAR(100),
identity VARCHAR(20),
align VARCHAR(20),
eye VARCHAR(20),
hair VARCHAR(30),
sex VARCHAR(25),
gsm VARCHAR(20),
alive VARCHAR(20),
appearances INT,
first_appearance VARCHAR(20),
YEAR INT,
PRIMARY KEY (page_id)
)
""")
with conn.cursor() as cursor:
cursor.execute(sql_query2)
That's it. The containers have been created using SQL create table syntax.
Loading the Containers¶
One of the advantages of using JaydebeAPI and GridDB together is the ability to use GridDB's SQL-like syntaxes. Here is GridDB's SQL Reference.
Loading the Container 'DC_Dataset'¶
# Prepare the SQL statement for insertion
sql_query2 = "INSERT INTO DC_Dataset (page_id,name,urlslug,identity,align,eye,hair,sex,gsm,alive,appearances,first_appearance,YEAR) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)"
cursor = conn.cursor()
for _, row in dc_df.iterrows():
try:
# Execute the INSERT statement for each row of data
with conn.cursor() as cursor:
cursor.execute(sql_query2, tuple(row))
except Exception as e:
# Handle any exceptions that may occur during execution
print("Error:", e, ' ', row)
# Commit the changes
conn.commit()
Loading the Container 'Marvel_Dataset'¶
# Prepare the SQL statement for insertion
sql_query3 = "INSERT INTO Marvel_Dataset (page_id,name,urlslug,identity,align,eye,hair,sex,gsm,alive,appearances,first_appearance,Year) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)"
cursor = conn.cursor()
for _, row in marvel_df.iterrows():
try:
# Execute the INSERT statement for each row of data
with conn.cursor() as cursor:
cursor.execute(sql_query3, tuple(row))
except Exception as e:
# Handle any exceptions that may occur during execution
print("Error:", e, ' ', row)
# Commit the changes
conn.commit()
Visualization & Analysis of the Data in GridDB¶
How many Characters are alive and how many are dead in the DC Universe?¶
sql_query4 = """
SELECT CASE WHEN alive = '' THEN 'Not Known' ELSE alive END AS Living_Status, count(alive) as num_characters
FROM DC_Dataset
GROUP BY Living_Status
"""
cursor = conn.cursor()
cursor.execute(sql_query4)
results = cursor.fetchall()
# Convert the query results to a DataFrame
alive_dead_dc_df = pd.DataFrame(results, columns=['alive', 'num_characters'])
# Data for the pie plot
labels = alive_dead_dc_df['alive']
sizes = alive_dead_dc_df['num_characters']
# Create the pie plot
plt.pie(sizes, labels=labels, autopct='%1.2f%%', startangle=90)
# Set aspect ratio to be equal so that pie is drawn as a circle
plt.axis('equal')
# Set a title for the plot
plt.title('DC Universe - Distribution of Alive and Dead Characters')
# Display the plot
plt.show()
Insight(s): Around 75% of the characters in the DC world are alive while 25% of the characters are dead.There are some characters whose living status is not yet clear (0.04%).
Now, let's see how many in the Marvel World are alive and how many are dead.
How many Characters are alive and how many are dead in the Marvel Universe?¶
sql_query5 = """
SELECT CASE WHEN alive = '' THEN 'Not Known' ELSE alive END AS Living_Status, count(alive) as num_characters
FROM Marvel_Dataset
GROUP BY Living_Status
"""
cursor = conn.cursor()
cursor.execute(sql_query5)
results = cursor.fetchall()
# Convert the query results to a DataFrame
alive_dead_marvel_df = pd.DataFrame(results, columns=['alive', 'num_characters'])
# Data for the pie plot
labels = alive_dead_marvel_df['alive']
sizes = alive_dead_marvel_df['num_characters']
# Create the pie plot
plt.pie(sizes, labels=labels, autopct='%1.2f%%', startangle=90)
# Set aspect ratio to be equal so that pie is drawn as a circle
plt.axis('equal')
# Set a title for the plot
plt.title('Marvel Universe - Distribution of Alive and Dead Characters')
# Display the plot
plt.show()
Insight(s): About 77% of the characters in the Marvel Universe are alive while 23% of the characters are dead.There are some characters whose living status is not yet clear (0.02%).
What is the Gender Distribution of Characters in Marvel and DC Universes?¶
# Create a cursor to execute SQL statements
cursor = conn.cursor()
# Execute SELECT queries to count characters by sex in each container
sql_query_marvel = """
SELECT CASE WHEN Sex = '' THEN 'Not Available' ELSE Sex END as Gender, COUNT(*) as num_characters
FROM Marvel_Dataset
GROUP BY Sex
"""
cursor.execute(sql_query_marvel)
result_marvel = cursor.fetchall()
sql_query_dc = """
SELECT CASE WHEN Sex = '' THEN 'Not Available' ELSE Sex END as Gender, COUNT(*) as num_characters
FROM DC_Dataset
GROUP BY Sex
"""
cursor.execute(sql_query_dc)
result_dc = cursor.fetchall()
# Separate the sexes and counts from the Marvel and DC results
sexes_marvel = [row[0] for row in result_marvel]
counts_marvel = [row[1] for row in result_marvel]
sexes_dc = [row[0] for row in result_dc]
counts_dc = [row[1] for row in result_dc]
# Create a wider figure with a larger width
plt.figure(figsize=(17, 6))
# Generate unique colors for 'Marvel' and 'DC'
colors = ['tab:blue', 'tab:red']
# Create the Marvel subplot
plt.subplot(1, 2, 1)
bars_marvel = plt.barh(sexes_marvel, counts_marvel, color=colors[0])
plt.xlabel('Number of Characters')
plt.ylabel('Sex')
plt.title('Marvel Gender Distribution')
# Display values on the bars in the Marvel subplot
for bar in bars_marvel:
width = bar.get_width()
plt.text(width, bar.get_y() + bar.get_height() / 2, width, ha='left', va='center')
# Create the DC subplot
plt.subplot(1, 2, 2)
bars_dc = plt.barh(sexes_dc, counts_dc, color=colors[1])
plt.xlabel('Number of Characters')
plt.title('DC Gender Distribution')
# Display values on the bars in the DC subplot
for bar in bars_dc:
width = bar.get_width()
plt.text(width, bar.get_y() + bar.get_height() / 2, width, ha='left', va='center')
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.5, hspace=None)
plt.suptitle('Marvel vs DC Gender Distribution', fontsize=16, fontweight='bold')
plt.show()
Insight(s):
- It is interesting to note that the Marvel Universe has significantly more male characters than the DC Universe.
- Both DC and Marvel have characters across different genders thereby representing gender diversity fairly well.
From Secret to Public: Decoding Character Identities in DC and Marvel¶
Let's look at some interesting stats here to understand how many characters are secret identities Vs how many have an identity that is known to the public.
# Create a cursor to execute SQL statements
cursor = conn.cursor()
# Execute SELECT queries to count characters by sex in each container
sql_query_marvel = """
SELECT 'Marvel',CASE WHEN identity = '' THEN 'Not Available' ELSE identity END as identity, COUNT(*) as num_characters
FROM Marvel_Dataset
GROUP BY 1,2
UNION ALL
SELECT 'DC',CASE WHEN identity = '' THEN 'Not Available' ELSE identity END as Gender, COUNT(*) as num_characters
FROM DC_Dataset
GROUP BY 1,2
"""
cursor.execute(sql_query_marvel)
identity_data = cursor.fetchall()
# Extract data for DC and Marvel separately
dc_data = [(identity, count) for universe, identity, count in identity_data if universe == 'DC']
marvel_data = [(identity, count) for universe, identity, count in identity_data if universe == 'Marvel']
# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Plot for DC
bars_dc = ax1.barh([x[0] for x in dc_data], [x[1] for x in dc_data], color='blue')
ax1.set_xlabel('Count')
ax1.set_ylabel('Identity')
ax1.set_title('DC Identity Distribution')
# Display values on the bars in the DC subplot
for bar in bars_dc:
width = bar.get_width()
ax1.text(width, bar.get_y() + bar.get_height() / 2, width, ha='left', va='center')
# Plot for Marvel
bars_marvel = ax2.barh([x[0] for x in marvel_data], [x[1] for x in marvel_data], color='red')
ax2.set_xlabel('Count')
ax2.set_ylabel('Identity')
ax2.set_title('Marvel Identity Distribution')
# Display values on the bars in the Marvel subplot
for bar in bars_marvel:
width = bar.get_width()
ax2.text(width, bar.get_y() + bar.get_height() / 2, width, ha='left', va='center')
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.5, hspace=None)
plt.suptitle('Marvel vs DC - Exploring Identities', fontsize=16, fontweight='bold')
# Show the plot
plt.show()
Insight(s):
- From the above plot, it is clear that Marvel has more characters with a secrety identity than DC.
- In the context of Marvel characters, the term "Known to Authorities" identity refers to a character whose true identity is known and recognized by government or law enforcement agencies within the Marvel Universe. This status typically indicates that the character's secret identity or alter ego is publicly known to certain official entities. Marvel has 15 characters with this status
- The term "No Dual Identity" refers to a character who does not have a secret or alter ego. These characters typically operate solely under their superhero or supervillain identities without concealing their true identities behind a civilian persona. Marvel has 1788 characters without a dual identity.
Physical traits in each of the Marvel & DC universes¶
# Create a cursor to execute SQL statements
cursor = conn.cursor()
# Execute SELECT queries to count characters by sex in each container
sql_query_traits = """
SELECT 'Marvel',
CASE WHEN eye = '' THEN 'Not Known' ELSE eye END as eye,
CASE WHEN hair = '' THEN 'Not Known' ELSE hair END as hair,
COUNT(*) as num_characters
FROM Marvel_Dataset
GROUP BY 1,2
UNION ALL
SELECT 'DC',
CASE WHEN eye = '' THEN 'Not Known' ELSE eye END as eye,
CASE WHEN hair = '' THEN 'Not Known' ELSE hair END as hair,
COUNT(*) as num_characters
FROM DC_Dataset
GROUP BY 1,2
order by 4 desc
"""
cursor.execute(sql_query_marvel)
physicaltraits_data = cursor.fetchall()
physicaltraits_df = pd.DataFrame(physicaltraits_data, columns=['Category', 'Eye_Color', 'Hair_Color','Cnt'])
# Assuming you have the DataFrame named top10_physical_traits_df with the provided data
# Sort the DataFrame in descending order by the count column
sorted_data = physicaltraits_df.sort_values(by='Cnt', ascending=False)
# Filter the sorted DataFrame for DC and Marvel categories separately
dc_data = sorted_data[sorted_data['Category'] == 'DC']
marvel_data = sorted_data[sorted_data['Category'] == 'Marvel']
#dc_data
#marvel_data
# Assuming you have the DataFrame named top10_physical_traits_df with the provided data
# Sort the DataFrame in descending order by the count column
sorted_data = physicaltraits_df.sort_values(by='Cnt', ascending=True)
# Filter the sorted DataFrame for DC and Marvel categories separately
dc_data = sorted_data[sorted_data['Category'] == 'DC']
marvel_data = sorted_data[sorted_data['Category'] == 'Marvel']
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot for DC category
bars_dc = ax1.barh(dc_data['Eye_Color'] + ' | ' + dc_data['Hair_Color'], dc_data['Cnt'])
#sorted_bars_dc = [bar for _, bar in sorted(zip(dc_data['Cnt'], bars_dc), reverse=True)]
ax1.set_xlabel('Count')
ax1.set_ylabel('Eye Color | Hair Color')
ax1.set_title('Top 10 Physical Traits - DC Category')
for bar in bars_dc:
width = bar.get_width()
ax1.text(width, bar.get_y() + bar.get_height() / 2, int(width), ha='left', va='center')
# Plot for Marvel category
bars_marvel = ax2.barh(marvel_data['Eye_Color'] + ' | ' + marvel_data['Hair_Color'], marvel_data['Cnt'])
#sorted_bars_marvel = [bar for _, bar in sorted(zip(marvel_data['Cnt'], bars_marvel), reverse=True)]
ax2.set_xlabel('Count')
ax2.set_ylabel('Eye Color | Hair Color')
ax2.set_title('Top 10 Physical Traits - Marvel Category')
for bar in bars_marvel:
width = bar.get_width()
ax2.text(width, bar.get_y() + bar.get_height() / 2, int(width), ha='left', va='center')
# Rotate the x-axis tick labels for better readability
ax1.tick_params(axis='y', rotation=30)
ax2.tick_params(axis='y', rotation=30)
plt.subplots_adjust(wspace=0.5)
plt.suptitle('Top 10 Physical Traits - Marvel Vs DC', fontsize=16, fontweight='bold')
# Show the plot
plt.show()
Insight(s):
- Among the physical traits of DC characters, Black Hair seems to be the most popular.
- Among the physical traits of Marvel characters, Blond Hair seems to be the most popular.
- It is interesting to note that the most popular eye colors in both DC and Marvel are 'Blue Eyes' and 'Brown Eyes'.
- Among the physical traits of Marvel characters, Blonde Hair seems to be the most popular.
- An interesting eye color in the DC universe tend to be Photocellular eyes. 'Compound Eyes', 'Multiple Eyes' seem to be some unique physical traits among Marvel characters.
- Similarly, Auburn Hair in the DC world and Magenta Hair in the Marvel world are some unique hair colors.
What are the top 5 overall traits across both Marvel and DC?¶
To answer this, we use subqueries supported by GridDB. Read this resource to know more.
# Create a cursor to execute SQL statements
cursor = conn.cursor()
# Execute SELECT queries to count characters by sex in each container
sql_query_overall_traits = """
select * from
(
SELECT 'Marvel' as Category,
CASE WHEN eye = '' THEN 'Not Known' ELSE eye END as eye,
CASE WHEN hair = '' THEN 'Not Known' ELSE hair END as hair,
CASE WHEN align = '' THEN 'Not Known' ELSE align END as align,
COUNT(*) as num_characters
FROM Marvel_Dataset
WHERE eye !='' and hair !='' and align !=''
GROUP BY 1,2,3
UNION ALL
SELECT 'DC' as Category,
CASE WHEN eye = '' THEN 'Not Known' ELSE eye END as eye,
CASE WHEN hair = '' THEN 'Not Known' ELSE hair END as hair,
CASE WHEN align = '' THEN 'Not Known' ELSE align END as align,
COUNT(*) as num_characters
FROM DC_Dataset
WHERE eye !='' and hair !='' and align !=''
GROUP BY 1,2,3
) df_table
order by num_characters desc
LIMIT 5
"""
cursor.execute(sql_query_overall_traits)
top5_overall_traits_data = cursor.fetchall()
top5_overall_traits_df = pd.DataFrame(top5_overall_traits_data, columns=['Category', 'Eye_Color', 'Hair_Color','Align','Cnt'])
# Assuming you have a dataframe called "df"
# Set up the figure and axis
fig, ax = plt.subplots(figsize=(8, 4))
# Hide axis
ax.axis('off')
# Create the table
table = ax.table(cellText=top5_overall_traits_df.values, colLabels=top5_overall_traits_df.columns, loc='center', cellLoc='center', colWidths=[0.2, 0.2, 0.2, 0.2, 0.2])
# Set table properties
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 1.2) # Adjust table scale for better readability
# Color code the cells based on 'cnt' column
colormap = plt.cm.Spectral # Choose a colormap
max_cnt = top5_overall_traits_df['Cnt'].max() # Get the maximum value in the 'Cnt' column
colors = [colormap(1 - value / max_cnt) for value in top5_overall_traits_df['Cnt']]
# Iterate over the cells and set the facecolor
for i, cell in enumerate(table.get_celld().values()):
theIndex = int(i/5) #as there are 5 columns
if theIndex == 0: # Skip the header cell
cell.set_facecolor('Brown')
else:
cell.set_facecolor(colors[theIndex-1])
cell.set_facecolor(colors[theIndex-1])
plt.title('Top 5 Overall Traits across both Marvel and DC')
plt.show()
Insight(s):
- It's interesting to note that the most number of characters among the Top 5 come from Marvel.
- Most Marvel Characters either have brown/blue eyes and black/brown/blonde hair.
- Most DC Characters have brown eyes and black hair.
When did Batman and Superman from DC first appear and how many appearances have they given so far?¶
GridDB supports aggregate functions such as MIN and SUM. Refer to this document to know more.
# Create a cursor to execute SQL statements
cursor = conn.cursor()
# Execute SELECT queries to count characters by sex in each container
sql_query_overall_traits = """
SELECT CASE
WHEN name LIKE '%Batman%' THEN 'Batman'
WHEN name LIKE '%Superman%' THEN 'Superman'
END AS name,
min(first_appearance) as earliest_appearance, sum(appearances)
FROM DC_Dataset
WHERE (name like '%Batman%' or name like '%Superman%')
Group by 1
"""
cursor.execute(sql_query_overall_traits)
superheroes_dc = cursor.fetchall()
superheroes_dc_df = pd.DataFrame(superheroes_dc, columns=['Superhero', 'First Appearance', 'Number of Apperances'])
superheroes_dc_df
Superhero | First Appearance | Number of Apperances | |
---|---|---|---|
0 | Superman | 1986, October | 2496 |
1 | Batman | 1939, May | 3096 |
Insight(s): Superman first appeared as of October 1986 and has made 2496 appearances so far. Batman made his first apperance way before that in May 1939. Batman has given a total of 3096 appearances.
When did Spiderman and Ironman from Marvel first appear and how many appearances have they given so far?¶
GridDB supports the CASE WHEN statement, which allows conditional logic and branching within SQL queries. The CASE WHEN statement is used to evaluate conditions and return different values based on those conditions. It is commonly used for conditional data transformations or aggregations. Refer to this document.
# Create a cursor to execute SQL statements
cursor = conn.cursor()
# Execute SELECT queries to count characters by sex in each container
sql_query_marvel_superheroes = """
SELECT CASE
WHEN name LIKE '%Spider%' THEN 'Spider-Man'
WHEN name LIKE '%Iron%' THEN 'Iron Man'
END AS name,
min(first_appearance) as earliest_appearance, sum(appearances)
FROM Marvel_Dataset
WHERE (name like '%Spider%' or name like '%Iron%')
and first_appearance != ''
Group by 1
"""
cursor.execute(sql_query_marvel_superheroes)
superheroes_marvel = cursor.fetchall()
superheroes_marvel_df = pd.DataFrame(superheroes_marvel, columns=['Superhero', 'First Appearance', 'Number of Apperances'])
superheroes_marvel_df
Superhero | First Appearance | Number of Apperances | |
---|---|---|---|
0 | Iron Man | Apr-79 | 2998 |
1 | Spider-Man | Aug-62 | 4132 |
Insight(s): Spiderman first appeared as of August 1962 whereas Iron Man first appeared as of April 1979. Spiderman has appeared 4132 times so far.
What were the earliest movies that referenced villains in the movie name?¶
# Create a cursor to execute SQL statements
cursor = conn.cursor()
# Execute SELECT queries to count characters by sex in each container
sql_query_bad_characters = """
select * from
(
SELECT 'DC' as Category,
CASE WHEN name LIKE '%joker%' THEN 'Joker'
WHEN name LIKE '%Dark%' THEN 'Darkseid'
END AS villain,
name as movie_name,
first_appearance as movie_year
FROM DC_Dataset
WHERE (name like '%joker%' or name like '%darkseid%')
and first_appearance != ''
UNION ALL
SELECT 'Marvel' as Category,
CASE WHEN name LIKE '%Thanos%' THEN 'Thanos'
WHEN name LIKE '%Doom%' THEN 'Dr. Doom'
END AS villain,
name as movie_name,
first_appearance as movie_year
FROM Marvel_Dataset
WHERE (name like '%thanos%' or name like '%doom%')
and first_appearance != ''
) villains_table
order by movie_year asc
"""
cursor.execute(sql_query_bad_characters)
villains_data = cursor.fetchall()
villains_df = pd.DataFrame(villains_data, columns=['Category','Villain','Movie Name','Movie Year'])
villains_df = villains_df.sort_values(by='Movie Year', ascending=True)
villains_df
Category | Villain | Movie Name | Movie Year | |
---|---|---|---|---|
0 | DC | Joker | Joker (New Earth) | 1940, June |
1 | DC | Joker | Bizarro Joker (New Earth) | 2007, November |
2 | Marvel | Dr. Doom | Foot of Doom (Earth-616) | Apr-07 |
3 | Marvel | Dr. Doom | Box of Doom (Earth-616) | Apr-52 |
4 | Marvel | Dr. Doom | Thulsa Doom (Earth-616) | Apr-72 |
5 | Marvel | Dr. Doom | Doomsman II (Earth-616) | Apr-75 |
6 | Marvel | Dr. Doom | Victor von Doom (Clone) (Earth-616) | Apr-78 |
9 | Marvel | Thanos | X (Thanosi) (Earth-616) | Aug-02 |
10 | Marvel | Thanos | Omega (Thanosi) (Earth-616) | Aug-02 |
7 | Marvel | Thanos | Armour (Thanosi) (Earth-616) | Aug-02 |
8 | Marvel | Thanos | Warrior (Thanosi) (Earth-616) | Aug-02 |
11 | Marvel | Dr. Doom | Robert von Doom (Earth-616) | Aug-90 |
12 | Marvel | Dr. Doom | Mechadoom (Earth-616) | Aug-91 |
13 | Marvel | Dr. Doom | Robert Bruce Banner (Doom's clone) (Earth-616) | Dec-11 |
14 | Marvel | Dr. Doom | Amanda von Doom (Earth-616) | Dec-11 |
15 | Marvel | Dr. Doom | Symbol of Doom (Earth-616) | Dec-43 |
16 | Marvel | Dr. Doom | Doomsday Man (Earth-616) | Feb-70 |
17 | Marvel | Dr. Doom | Victor von Doom (Counter-Earth) (Earth-616) | Feb-73 |
18 | Marvel | Thanos | Thanos (Earth-616) | Feb-73 |
19 | Marvel | Dr. Doom | Braggadoom (Earth-616) | Jan-76 |
20 | Marvel | Dr. Doom | Victor von Doom (Earth-616) | Jul-62 |
21 | Marvel | Dr. Doom | Peddler of Doom (Earth-616) | Jun-44 |
22 | Marvel | Thanos | Thanos (Doppelganger) (Earth-616) | Jun-92 |
23 | Marvel | Dr. Doom | Vandoom's Monster (Earth-616) | Mar-61 |
34 | Marvel | Thanos | Aries (Thanos' Zodiac) (Earth-616) | May-12 |
33 | Marvel | Thanos | Gemini (Thanos' Zodiac) (Earth-616) | May-12 |
31 | Marvel | Thanos | Pisces (Thanos' Zodiac) (Earth-616) | May-12 |
30 | Marvel | Thanos | Taurus (Thanos' Zodiac) (Earth-616) | May-12 |
29 | Marvel | Thanos | Sagittarius (Thanos' Zodiac) (Earth-616) | May-12 |
32 | Marvel | Thanos | Libra (Thanos' Zodiac) (Earth-616) | May-12 |
27 | Marvel | Thanos | Scorpio (Thanos' Zodiac) (Earth-616) | May-12 |
26 | Marvel | Thanos | Leo (Thanos' Zodiac) (Earth-616) | May-12 |
25 | Marvel | Thanos | Capricorn (Thanos' Zodiac) (Earth-616) | May-12 |
24 | Marvel | Thanos | Virgo (Thanos' Zodiac) (Earth-616) | May-12 |
28 | Marvel | Thanos | Aquarius (Thanos' Zodiac) (Earth-616) | May-12 |
35 | Marvel | Dr. Doom | Nedra (Doom Maidens) (Earth-616) | May-13 |
36 | Marvel | Dr. Doom | Victor von Doom (Heroes Reborn) (Earth-616) | Nov-96 |
37 | Marvel | Dr. Doom | Doombringer (Earth-616) | Nov-97 |
38 | Marvel | Dr. Doom | Cynthia von Doom (Earth-616) | Oct-71 |
39 | Marvel | Dr. Doom | Voice of Doom (Earth-616) | Sep-44 |
40 | Marvel | Dr. Doom | Mask of Doom (Earth-616) | Sep-55 |
41 | Marvel | Dr. Doom | Werner von Doom (Earth-616) | Sep-64 |
42 | Marvel | Dr. Doom | Stallion of Doom (Earth-616) | Sep-66 |
43 | Marvel | Dr. Doom | Victor von Doom (Impostor) (Earth-616) | Sep-92 |
Insight(s): The above DataFrame shows the information about villains in movies, including their category (DC or Marvel), the name of the villain, the movie name, and the movie year. By analyzing this data, we can gain insights into the presence and portrayal of villains in both DC and Marvel movies, their appearances in different movies over the years, and potentially identify recurring or iconic villains in the superhero cinematic universe.
Concluding Remarks¶
Concluding Remarks:
- The analysis of the superhero dataset using GridDB and Jaydebeapi has provided valuable insights into various aspects of the superhero world.
- The dataset encompassed information from both DC and Marvel universes, allowing for a comprehensive analysis of characters, traits, identities, and more.
- By leveraging the power of GridDB, a distributed in-memory database, and Jaydebeapi, a Python package for JDBC connectivity, we were able to efficiently query and manipulate the dataset for analysis.
- The analysis revealed interesting patterns and trends. We explored character demographics such as gender distribution and living status, gaining insights into the portrayal of characters in both DC and Marvel universes.
- We also examined physical traits such as eye color and hair color, identifying the top traits among characters in each universe. This analysis shed light on the diversity and uniqueness of superheroes and villains.
- Furthermore, we delved into the identities of characters, categorizing them as secret, public, or unknown. This exploration provided insights into how characters present themselves to the world and maintain their alter egos.
- The analysis of villains in movies highlighted the presence and significance of these antagonistic characters in both DC and Marvel cinematic universes.
This analysis serves as a valuable resource for understanding the characteristics, traits, and dynamics of superheroes and villains, contributing to the broader appreciation of the superhero genre. Overall, the utilization of GridDB and Jaydebeapi allowed for seamless data retrieval, analysis, and visualization, enabling a comprehensive exploration of the superhero dataset.
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.