import os
import pandas as pd
import numpy as np
from IPython.display import Image
import jaydebeapi
import urllib.parse
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import time
from tabulate import tabulate
import plotly.express as px
os.chdir("C:\\GridDB") #Use your working directory here
About the Datasets¶
For this purpose, we download two datasets.
- The first dataset to be downloaded is the 'Toughest Sport by Skill' dataset. This can be downloaded from the website - https://data.world/coreyhermanson/toughest-sport-by-skill. As shown in the image below, download the dataset using the 'download' icon -
##Specify the path to your image file
image_path = 'Download_dataset_snapshot.png'
width = 500
## Display the image
Image(filename=image_path, width=width)
The file that gets downloaded has the filename 'toughestsport.xlsx'.
The second dataset to be downloaded is the '2017 ESPN World Fame 100' dataset. This can also be downloaded from data.world using the URL - https://data.world/wesmagee/2017-espn-world-fame-100. Use the download icon similar to the dataset above. The downloaded dataset has the file name '2017 ESPN World Fame 100.xlsx'.
The skills are as below -
- ENDURANCE: The ability to continue to perform a skill or action for long periods of time.
- STRENGTH: The ability to produce force.
- POWER: The ability to produce strength in the shortest possible time.
- SPEED: The ability to move quickly.
- AGILITY: The ability to change direction quickly.
- FLEXIBILITY: The ability to stretch the joints across a large range of motion.
- NERVE: The ability to overcome fear.
- DURABILITY: The ability to withstand physical punishment over a long period of time.
- HAND-EYE COORDINATION: The ability to react quickly to sensory perception.
- ANALYTIC APTITUDE: The ability to evaluate and react appropriately to strategic situations.
Loading the data into a DataFrame¶
toughestsport = pd.read_excel('toughestsport.xlsx')
world_famous_sports = pd.read_excel('2017 ESPN World Fame 100.xlsx')
Understanding the GridDB Data Model¶
In GridDB, a container is a fundamental data storage unit used to organize and store data. It is analogous to a table in a relational database or a collection in NoSQL databases. Containers are a core concept in GridDB's data model and serve as the primary means of structuring and managing data within the database.
GridDB, while being a NoSQL database, incorporates concepts and features that are reminiscent of traditional relational databases (RDBMS) to some extent. Here are some of the RDBMS-like concepts and features you can find in GridDB:
- Views: GridDB supports the concept of "Materialized Views." These are similar to views in RDBMS in that they allow you to create predefined queries or projections of data. Materialized views store the results of a query in a separate table, which can be useful for optimizing query performance and simplifying data access.
- Primary Keys: Like RDBMS, GridDB allows you to define primary keys for containers (similar to tables in RDBMS). Primary keys enforce uniqueness constraints and facilitate efficient data retrieval through indexed access.
- Indexes: GridDB provides indexing capabilities similar to RDBMS. You can create indexes on columns in containers to speed up data retrieval and query performance.
- ACID Transactions: GridDB offers ACID (Atomicity, Consistency, Isolation, Durability) transaction support, which is a characteristic often associated with RDBMS. ACID transactions ensure data consistency and reliability in GridDB.
- SQL-Like Query Language: GridDB supports a SQL-like query language called "TQL" (Time-Series Query Language). This language allows you to query and manipulate data in a way that is familiar to users of RDBMS SQL.
- Partitions: GridDB allows you to partition data across multiple nodes in a cluster. This is a concept akin to sharding in some NoSQL databases but also shares similarities with horizontal partitioning in RDBMS for distributing data and achieving scalability.
- Stored Procedures: GridDB supports the creation of stored procedures, which are similar to stored procedures in RDBMS. Stored procedures allow you to define custom server-side logic for data manipulation and processing.
While GridDB incorporates these RDBMS-like features, it still retains its NoSQL characteristics, such as schema flexibility (especially within containers), support for semi-structured and unstructured data, and scalability for high-velocity data streams (common in IoT and time-series data). This hybrid approach makes GridDB versatile, suitable for various use cases, and capable of accommodating both structured and semi-structured data while offering transactional support and query capabilities similar to RDBMS when needed. To learn more about GridDB's data model, refer to this resource.
GridDB supports two types of containers namely 'Collections' and 'Time Series' Containers. While time series containers are used for time-sensitive analysis like IoT, sensor applications; collections can be used for all types of other analysis - from historic analysis to applications requiring complex data relationships and varied data structures. The choice between Time Series and Collections containers depends on the nature of your data and the specific analysis needs of your application. Refer to this resource to learn more. In this case, we use Collections.
Connecting to GriDB Cloud using JayDeBe API¶
notification_provider = "https://dbaasshareextconsta.blob.core.windows.net/dbaas-share-extcon-blob/trial1602.json?sv=2015-04-05&sr=b&st=2023-03-14T00%3A00%3A00.0000000Z&se=2073-03-14T00%3A00%3A00.0000000Z&sp=r&sig=h2VJ0xAqsnRsqWV5CAS66RifPIZ1PDCJ0x%2FiXb2FOhA%3D"
np_encode = urllib.parse.quote(notification_provider)
cluster_name = "XX" ## Specify the cluster name here
cn_encode = urllib.parse.quote(cluster_name)
database_name = "XX" ## Specify the database name here
dbn_encode = urllib.parse.quote(database_name)
sslMode = "&sslMode=PREFERRED" #sslMode should be PREFERRED and connectionRoute should be PUBLIC
sm_encode = urllib.parse.quote(sslMode)
username = 'XX' # Put your username and password here
password = 'XX' # Put your username and password here
#Construct the JDBC URL to be used to connect to the GridDB database. The format is -
# jdbc:gs:///clusterName/databaseName?notificationProvider=theNotificationProviderURL&sslMode=PREFERRED
url = "jdbc:gs:///" + cn_encode + "/" + dbn_encode + "?notificationProvider=" + np_encode + sm_encode
#print("JDBC URL is " + url)
conn = jaydebeapi.connect("com.toshiba.mwcloud.gs.sql.Driver",
url,
{'user': username, 'password': password,
'connectionRoute':'PUBLIC',
'loginTimeout': '20'}
, "gridstore-jdbc-5.2.0.jar") #ensure to provide the correct location of the gridstore-jdbc JAR library
print('success!')
success!
Creating Containers in GridDB¶
GridDB uses standard DDL statements similar to traditional relational databases along with the JayDeBeAPI. Unlike the WebAPI where the container creation should be passed as a JSON request, the JayDeBeAPI makes things simple by supporting Native SQL operations. Refer to this resource to learn more - DDL Operations in GridDB.
sql_query1 = (f"""
CREATE TABLE toughestsport
(
Sport VARCHAR(55),
END FLOAT,
STR FLOAT,
PWR FLOAT,
SPD FLOAT,
AGI FLOAT,
FLX FLOAT,
NER FLOAT,
DUR FLOAT,
HAN FLOAT,
ANA FLOAT,
TOTAL FLOAT,
RANK numeric
)
""")
with conn.cursor() as cursor:
cursor.execute(sql_query1)
columns_list = world_famous_sports.columns.to_list()
world_famous_sports.rename(columns={'PY Rank':'PY_Rank','Last Name':'Last_Name','First Name':'First_Name','Instgram Followers':'Instagram_Followers_million','Facebook Followers':'Facebook_Followers_million','Twitter Followers':'Twitter_Followers_million','Endorsements.1':'Endorsements__raw','Instagram Followers.1':'Instagram_Followers_raw','Facebook Followers.1':'Facebook_Followers_raw',"Twitter.Followers.1":"Twitter.Followers_raw",'Instgram Followers.1':'Instgram_Followers_raw','Twitter Followers.1':'Twitter_Followers_raw'},inplace=True)
Data Cleaning¶
world_famous_sports['PY_Rank'].unique()
world_famous_sports.loc[world_famous_sports["PY_Rank"] == "NR", "PY_Rank"] = 0
world_famous_sports['PY_Rank'].unique()
world_famous_sports['PY_Rank'] = world_famous_sports['PY_Rank'].astype('int')
sql_query2 = (f"""
CREATE TABLE world_famous_sports
(
Rank int,
PY_Rank int,
Last_Name varchar(50),
First_Name varchar(50),
Sport varchar(50),
Country varchar(50),
Team varchar(60),
Endorsements DOUBLE,
Instagram_Followers_million DOUBLE,
Facebook_Followers_million DOUBLE,
Twitter_Followers_million DOUBLE,
Endorsements__raw long,
Instgram_Followers_raw DOUBLE,
Facebook_Followers_raw long,
Twitter_Followers_raw DOUBLE
)
""")
with conn.cursor() as cursor:
cursor.execute(sql_query2)
Row Registration in GridDB¶
Here, we use the 'INSERT SQL' command in GridDB to insert rows into each of the GridDB containers. Each row is taken in as a tuple and loaded into the container 'commodity_trade_stats' in GridDB Cloud. To learn more about DML operations in GriDB, refer to this resource - GridDB DML Operations.
# Prepare the SQL statement for insertion
sql_query3 = "INSERT INTO toughestsport (SPORT,END,STR,PWR,SPD,AGI,FLX,NER,DUR,HAN,ANA,TOTAL,RANK) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)"
# Create a list of tuples containing the data to be inserted
data_to_insert = [tuple(row) for _, row in toughestsport.iterrows()]
# Use a loop to insert each row of data and record the time taken
cursor = conn.cursor()
try:
start_time = time.time() # Record the start time
for row in data_to_insert:
cursor.execute(sql_query3, row)
end_time = time.time() # Record the end time
execution_time = end_time - start_time
print(f"Time taken for insertion: {execution_time:.6f} seconds")
except Exception as e:
# Handle any exceptions that may occur during execution
print("Error:", e)
# Commit the changes
conn.commit()
# Prepare the SQL statement for insertion
sql_query4 = "INSERT INTO world_famous_sports (Rank, PY_Rank, Last_Name, First_Name, Sport, Country, Team, Endorsements, Instagram_Followers_million, Facebook_Followers_million, Twitter_Followers_million, Endorsements__raw, Instgram_Followers_raw, Facebook_Followers_raw, Twitter_Followers_raw) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
# Create a list of tuples containing the data to be inserted
data_to_insert = [tuple(row) for _, row in world_famous_sports.iterrows()]
# Use a loop to insert each row of data and record the time taken
cursor = conn.cursor()
try:
start_time = time.time() # Record the start time
for row in data_to_insert:
#print(row)
cursor.execute(sql_query4, row)
end_time = time.time() # Record the end time
execution_time = end_time - start_time
print(f"Time taken for insertion: {execution_time:.6f} seconds")
except Exception as e:
# Handle any exceptions that may occur during execution
print("Error:", e)
# Commit the changes
conn.commit()
Now that we have loaded both containers, we are good to start analyzing the data in GridDB.
What are the Top 5 toughest sports in terms of toughness rank?¶
sql_query5 = '''SELECT SPORT,Rank from toughestsport ORDER BY RANK ASC LIMIT 5'''
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query5)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
Top_5_Toughest_Sports = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
title = "Top 5 Toughest Scores"
table = tabulate(Top_5_Toughest_Sports, headers='keys', tablefmt='grid',showindex='Never')
print(title)
print(table)
Top 5 Toughest Scores +------------+--------+ | Sport | RANK | +============+========+ | Boxing | 1 | +------------+--------+ | Ice Hockey | 2 | +------------+--------+ | Football | 3 | +------------+--------+ | Basketball | 4 | +------------+--------+ | Wrestling | 5 | +------------+--------+
Insights:
- Boxing has been ranked 1st as the toughest sport followed by Ice Hockey,Football,Basketball and Wrestling in terms of toughness.
- These rankings are based on an assessment of the level of physical and mental demand that each sport places on athletes, with higher ranks indicating higher perceived toughness.
What are the 5 easiest sports in terms of toughness rank?¶
sql_query6 = '''SELECT SPORT,Rank from toughestsport ORDER BY RANK desc LIMIT 5'''
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query6)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
Easiest_Sports = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
title = "Top 5 Easiest Sports"
table = tabulate(Easiest_Sports, headers='keys', tablefmt='grid',showindex='Never')
print(title)
print(table)
Top 5 Easiest Sports +-----------+--------+ | Sport | RANK | +===========+========+ | Fishing | 60 | +-----------+--------+ | Billiards | 59 | +-----------+--------+ | Shooting | 58 | +-----------+--------+ | Bowling | 57 | +-----------+--------+ | Curling | 56 | +-----------+--------+
Insight(s): Fishing, Billiards, Shooting, Bowling and Curling are the Top 5 easiest sports.
Let's delve into a more in-depth analysis.
Strongest and Weakest Skills: Identify which sports excel in specific skills and which ones struggle.¶
For example, which sport has the highest "END" (Endurance) score, and which has the lowest? Are there common strengths or weaknesses across sports?
sql_query7 = '''SELECT END, STR, PWR, SPD, AGI, FLX, NER, DUR, HAN, ANA from toughestsport ORDER BY RANK desc LIMIT 5'''
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query7)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
skills_matrix = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
# Calculate the correlation matrix
correlation_matrix = skills_matrix[['END', 'STR', 'PWR', 'SPD', 'AGI', 'FLX', 'NER', 'DUR', 'HAN', 'ANA']].corr()
# Print the correlation matrix
#print(correlation_matrix)
# Get the upper triangular part of the correlation matrix
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(np.bool))
# Create a heatmap of the upper triangular part
plt.figure(figsize=(10, 8))
sns.heatmap(upper_triangle, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title('Upper Triangular Correlation Heatmap')
plt.show()
C:\Users\mg_su\AppData\Local\Temp/ipykernel_6952/1629317424.py:7: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
Insight(s): There is a strong positive correlation between Hand-eye coordination and Nerve (the ability to overcome fear) . Athletes who excel in hand-eye coordination may also tend to be good at overcoming fear (nerve) in high-pressure situations. This could be because both skills are essential for performing well in many sports, especially those that require precision and composure under stress, such as archery or marksmanship. It might also suggest that sports that emphasize hand-eye coordination tend to develop nerve as a byproduct of that training. To know more, we can try looking into whether there are specific sports where this relationship is more pronounced. Another strong positive correlation is between Flexibility and Analytical Aptitude; and between Endurance and Durability .
While looking at the negative correlations, there is a strong negative correlation between the following:
- Durability and Analytical Aptitude
- Endurance and Analytical Aptitude
- Strength and Analytical Aptitude
- Hand-eye coordination and Analytical Aptitude.
- Agility and Nerve (the ability to overcome fear).
- Agility and Hand-eye coordination
Are there specific sports where the strong positive correlations are more pronounced?¶
We now delve into HAN and NER; FLEX AND ANA; and END AND DUR to see if there are specific sports where these correlations are more pronounced. For this, we use CASE WHEN Statements. Refer to this GridDB resource to learn more.
sql_query8 = (f'''
SELECT Sport,
CASE WHEN (HAN >= 7 AND NER >= 7) THEN 'Yes'
ELSE 'No'
END AS "HAN-NER Relationship",
CASE WHEN (FLX >= 6 AND ANA >= 6) THEN 'Yes'
ELSE 'No'
END AS "FLX-ANA Relationship",
CASE WHEN (END >= 7 AND DUR >= 7) THEN 'Yes'
ELSE 'No'
END AS "END-DUR Relationship"
FROM toughestsport
''')
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query8)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
sports_positive_correlations = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
subset_sports = sports_positive_correlations[['Sport','HAN-NER Relationship']]
subset_sports[subset_sports['HAN-NER Relationship'] == 'Yes']
Sport | HAN-NER Relationship | |
---|---|---|
0 | Boxing | Yes |
31 | Auto Racing | Yes |
Insight(s): The relationship between Hand-Eye Coordination and the ability to overcome fear (Nerve) are more pronounced in Boxing and Auto Racing.
subset_sports = sports_positive_correlations[['Sport','FLX-ANA Relationship']]
subset_sports[subset_sports['FLX-ANA Relationship'] == 'Yes']
Sport | FLX-ANA Relationship | |
---|---|---|
4 | Wrestling | Yes |
5 | Martial Arts | Yes |
Insight(s): The relationship between Flexibility and the Analytical Aptitude are more pronounced in Wrestling and Martial Arts.
subset_sports = sports_positive_correlations[['Sport','END-DUR Relationship']]
subset_sports[subset_sports['END-DUR Relationship'] == 'Yes']
Sport | END-DUR Relationship | |
---|---|---|
0 | Boxing | Yes |
1 | Ice Hockey | Yes |
3 | Basketball | Yes |
Insight(s): The relationship between Endurance and Durability are more pronounced in Boxing, Ice Hockey and Basketball.
Who are the sportspersons with the most endorsements?¶
GridDB supports standard string concatenation through the use of the '||' (pipe operator).
sql_query9 = (f'''
SELECT First_Name || ' ' || Last_Name AS Full_Name,
Sport,SUM(COALESCE(endorsements, 0)) AS Endorsements
FROM world_famous_sports
WHERE endorsements is not null
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10
''')
# Create a cursor to execute the query
cursor = conn.cursor()
# Execute the query
cursor.execute(sql_query9)
# Fetch the column names
column_names = [desc[0] for desc in cursor.description]
# Fetch the query results
results = cursor.fetchall()
# Convert the results to a Pandas DataFrame
Top_Endorsements = pd.DataFrame(results, columns=column_names)
# Close the cursor and connection
cursor.close()
Top_Endorsements = Top_Endorsements.dropna()
# Assuming you have a DataFrame named Top_Endorsements
# Extract the columns you want to plot
full_name = Top_Endorsements["Full_Name"]
endorsements = Top_Endorsements["Endorsements"]
sport = Top_Endorsements["Sport"]
# Create a color map for sports
sports = sport.unique()
colors = plt.cm.tab20(np.arange(len(sports)))
# Map sports to colors
sport_to_color = dict(zip(sports, colors))
# Map each sport to its corresponding color
bar_colors = [sport_to_color[s] for s in sport]
# Create a bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(full_name, endorsements, color=bar_colors)
# Add labels and title
plt.xlabel("Full Name")
plt.ylabel("Endorsements")
plt.title("Endorsements by Sport")
# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha="right")
# Add values on top of the bars
for bar, endorsement in zip(bars, endorsements):
plt.text(
bar.get_x() + bar.get_width() / 2,
endorsement + 1, # Adjust the vertical position of the text
str(endorsement), # Convert the value to a string
ha="center",
va="bottom",
fontsize=8, # Adjust the font size
)
# Create a color legend
legend_labels = [plt.Line2D([0], [0], color=sport_to_color[s], lw=4, label=s) for s in sports]
plt.legend(handles=legend_labels, title="Sport", loc="upper right")
# Show the plot
plt.tight_layout()
plt.show()