Exploring Alzheimer's: Cognitive Decline and Mental Distress through the Lens of GridDB¶
Alzheimer's disease is a progressive brain disorder that leads to the degeneration of brain cells. It is the most common cause of dementia and is characterized by memory loss, cognitive decline, and behavioral changes that gradually impair one's ability to perform daily activities. If you’ve watched the award-winning movie 'Still Alice', you will know how deep the personal impact of Alzheimer's disease can be. While risk factors for Alzheimer's include age, genetics, and lifestyle choices, it is crucial to delve deeper to understand and manage the symptoms effectively.
In this article, we will perform a comprehensive analysis of Alzheimer's risk factors using GridDB and Python. By leveraging GridDB’s robust data management and analytical capabilities, researchers can efficiently analyze large datasets, uncover valuable insights, and better understand the patterns and correlations associated with Alzheimer's disease. In this case, we use the GridDB WebAPI. To learn more, refer to the official Web API - GridDB resource.
#Libraries for basic data manipulation
import os
import pandas as pd
import numpy as np
#Libraries for the webAPI requests
import http
http.client.HTTPConnection.debuglevel = 0
import json
import requests
import pyarrow as pa
import pyarrow.csv as csv
import time
## Setting
from IPython.display import Image, Markdown,HTML, display
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
#warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)
os.environ['OMP_NUM_THREADS'] = '1' # to run KMeans clustering without memory leak issues
# Visualizations & Unsupervised Learning
import folium
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import re # Importing the regex module for parsing
# Mention your working directory here
os.chdir("C:\\XX\\")
About the Dataset¶
The dataset being used is from the U.S. Department of Health & Human Services. To access the dataset, navigate to the URL - https://catalog.data.gov/dataset/alzheimers-disease-and-healthy-aging-data. Use the 'Comma Separated Values' option to download the .csv file.
# Specify the path to your image file
current_directory = os.getcwd()
# Construct the image path using os.path.join()
image_path = os.path.join(current_directory, 'Intro_Fig2.Dowloading the Alzheimers dataset.png')
width = 700
display(Markdown("### Downloading the Alzheimers' Dataset"))
# Display the image
display(Image(filename=image_path, width=width))
Downloading the Alzheimer's Dataset¶
The Alzheimer's Disease and Healthy Aging Data dataset is a comprehensive resource compiled by the CDC. It focuses on the health and well-being of older Americans and includes a wide range of data on demographics, health conditions, and behaviors. The dataset is sourced from national surveys like BRFSS, NHANES, and NHIS. It emphasizes Alzheimer's disease, covering prevalence, incidence, risk factors, and outcomes. This data is valuable for identifying trends and exploring risk factors related to Alzheimer's. It is also useful in identifying at-risk populations and tracking the effectiveness of public health interventions for older adults.
Dataset Columns¶
- RowId: A unique identifier for each row in the dataset, used to distinguish between different records.
- YearStart: The starting year in which the data was collected. This helps in identifying the time period of the data.
- YearEnd: The ending year of data collection. It is useful for datasets that span multiple years.
- LocationAbbr: The abbreviation for the geographic location (e.g., state or territory). This allows quick identification of where the data was collected.
- LocationDesc: A description or full name of the location corresponding to the abbreviation, providing more detailed information about the location.
- Datasource: The source from which the data was collected (e.g., Behavioral Risk Factor Surveillance System (BRFSS)). This helps in understanding the origin and context of the data.
- Class: The broad category of health information that the data point belongs to (e.g., Mental Health). It groups related data points together.
- Topic: A more specific subject within the class (e.g., Frequent mental distress). This narrows down the focus within the broader class.
- Question: The specific question asked during data collection or the metric being measured (e.g., "In the past 30 days, how often did you feel mentally distressed?"). This provides context for the data values.
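Once the CSV is downloaded, a quick pandas check confirms that these columns are present and shows the data types pandas infers for them. The relative path below matches the one used later in this article; adjust it to your setup.
# Optional: verify the downloaded file's columns and inferred dtypes
import pandas as pd
preview = pd.read_csv('../Alzheimer_s_Disease_and_Healthy_Aging_Data.csv', nrows=5)
print(preview.columns.tolist())
print(preview.dtypes)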
Containers in GridDB¶
In GridDB, there are two types of containers: Time-Series Containers, designed for storing time-stamped data and optimized for time-based queries, and Collection Containers, which are more flexible and can hold various data types without time constraints, allowing for general-purpose data storage and retrieval.
Choosing a Collection Container over a Time-Series Container is preferable when you need greater flexibility in data structure or have to store heterogeneous data that is not keyed on timestamps. That is the situation here, so we will store our data in a Collection Container. Here is a useful resource to better understand the GridDB model architecture.
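To make the distinction concrete, here is a minimal sketch of the JSON definitions the WebAPI expects for each container type; the container and column names below are illustrative placeholders rather than part of this article's dataset.
# Illustrative container definitions (placeholder names and columns)
collection_container = {
    "container_name": "example_collection",
    "container_type": "COLLECTION",
    "rowkey": False,
    "columns": [
        {"name": "LocationDesc", "type": "STRING"},
        {"name": "Data_Value", "type": "DOUBLE"}
    ]
}
timeseries_container = {
    "container_name": "example_timeseries",
    "container_type": "TIME_SERIES",
    "rowkey": True,  # time-series containers key rows on a TIMESTAMP column
    "columns": [
        {"name": "ts", "type": "TIMESTAMP"},
        {"name": "reading", "type": "DOUBLE"}
    ]
}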
Creating a Data Structure for the Container in GridDB¶
The dataset is first loaded into Python as a dataframe. It is then loaded into a GridDB 'Collection' container. We create a function to map out the datatypes from the dataframe to the structure required for a Collection container in GridDB.
#Read the CSV dataset file that was downloaded in the earlier step
Alzheimer_Data = pd.read_csv('../Alzheimer_s_Disease_and_Healthy_Aging_Data.csv')
#Subset the data to include the data that we want to use
Alzheimer_subset = Alzheimer_Data[
(Alzheimer_Data['Topic'] == 'Frequent mental distress') |
(Alzheimer_Data['Topic'].str.contains('Cognitive Decline', case=False, na=False))]
# Defining a function to map pandas dtypes to GridDB types
def map_dtype_to_griddb(pandas_dtype):
"""
# Maps pandas data types to GridDB-compatible data types.
e.g. pandas_dtype (str): Data type of a pandas column; Returns str: The corresponding GridDB data type.
"""
dtype_map = {
'int64': 'INTEGER',
'float64': 'DOUBLE',
'object': 'STRING', # for strings/objects
'datetime64[ns]': 'TIMESTAMP',
'bool': 'BOOL'
}
return dtype_map.get(str(pandas_dtype), 'STRING') # Default to STRING if no match
# Function to create the GridDB collection structure from a DataFrame
def create_griddb_structure(df, container_name, container_type="COLLECTION", rowkey=False):
# Initializing the container structure
container_structure = {
"container_name": container_name,
"container_type": container_type,
"rowkey": rowkey,
"columns": []
}
# Looping through columns in the DataFrame and mapping their data types to GridDB types
for col_name, col_dtype in df.dtypes.items():
column_structure = {
"name": col_name,
"type": map_dtype_to_griddb(col_dtype)
}
container_structure["columns"].append(column_structure)
return container_structure
# Create the GridDB container structure
griddb_container_structure = create_griddb_structure(Alzheimer_subset, "Alzheimer_Analysis", container_type="COLLECTION")
data_obj1 = griddb_container_structure
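Before sending anything to GridDB, it can be useful to inspect the container definition the helper functions produced; a quick sanity check:
# Preview the generated container structure (optional)
print(json.dumps(data_obj1, indent=2))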
Connecting to GridDB using WebAPI¶
We use the WebAPI to connect to GridDB. With the WebAPI, we can make HTTP requests with the 'requests' library in Python to interact with the GridDB cluster. To create a container, we need to send a POST request with a JSON payload that defines the container's structure. If you are interested in learning about the setup of the WebAPI, refer to these pages from the GitHub repo - WebAPI.
#Construct an object to hold the request headers (ensure that you replace the XXX placeholder with the correct value that matches the credentials for your GridDB instance)
header_obj = {"Authorization":"Basic XXX","Content-Type":"application/json; charset=UTF-8","User-Agent":"PostmanRuntime/7.29.0"}
#Construct the base URL based on your GridDB cluster you'd like to connect to (ensure that you replace the placeholders in the URL below with the correct values that correspond to your GridDB instance)
base_url = 'https://[host]:[port]/griddb/v2/[clustername]/dbs/[database_name]/'
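If you are unsure what to place in the Authorization header, the 'Basic' scheme is simply the Base64 encoding of 'username:password'. Below is a minimal sketch using placeholder credentials, which you would replace with those of your own GridDB instance.
import base64
# Placeholder credentials - replace with the username and password of your GridDB instance
username, password = 'admin', 'your_password'
token = base64.b64encode(f"{username}:{password}".encode()).decode()
# 'token' is the value that replaces the XXX placeholder in the Authorization header above
header_obj = {"Authorization": "Basic " + token, "Content-Type": "application/json; charset=UTF-8", "User-Agent": "PostmanRuntime/7.29.0"}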
Running the Request to create a GridDB container¶
#Set up the GridDB WebAPI URL
url = base_url + 'containers'
#Invoke the GridDB WebAPI with the headers and the request body (constructed in the earlier step)
x = requests.post(url, json = data_obj1, headers = header_obj)
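A quick check of the response tells us whether the container was created: a 2xx status code indicates success, while the body carries the error details if the call fails (for example, if the container already exists).
# Validate the container-creation response
print(x.status_code)
print(x.text)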
Loading the Container with Data (Row Registration)¶
The process of loading data into a container is also known as 'Row Registration' . To read more about it, refer to the GridDB resource here. We use the 'to_json' function in the Pandas library to construct the JSON structure required for row registration.
# Convert the DataFrame to a JSON array of row arrays (orient='values'), the structure expected for row registration
Alzheimer_subset_json = Alzheimer_subset.to_json(orient='values')
# The resulting JSON string is used directly as the request body
request_body_Alzheimer_Data = Alzheimer_subset_json
#Setup the URL to be used to invoke the GridDB WebAPI to register rows in the container created previously
url = base_url + 'containers/Alzheimer_Analysis/rows'
#Invoke the GridDB WebAPI using the request constructed
x = requests.put(url, data=request_body_Alzheimer_Data, headers=header_obj)
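As with container creation, it is worth confirming that the row registration succeeded; the status code should be 2xx, and the response body should indicate how many rows were registered.
# Validate the row-registration response
print(x.status_code)
print(x.text)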
Mental Distress by State¶
Top 10 locations with the highest percentages of frequent mental distress recorded.¶
The query below fetches the top 10 locations based on the highest percentage of frequent mental distress records in the Alzheimer_Analysis table, calculating the percentage as a fraction of the total number of records for each location. As you can see, GridDB supports CASE WHEN statements, string matching, and aggregate functions. Refer to this official GridDB resource to learn more.
sql_query1 = (f"""
SELECT LocationDesc, Geolocation,
(COUNT(CASE
WHEN LOWER(Topic) LIKE '%distress%' THEN 1
END
) * 100.0 / COUNT(*)
) AS Percentage_Frequent_Mental_Distress
FROM Alzheimer_Analysis
GROUP BY LocationDesc, Geolocation
ORDER BY Percentage_Frequent_Mental_Distress DESC
LIMIT 10;
""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query1
}
])
# Validate the constructed request body
#print(request_body)
#Invoke the GridDB WebAPI
data_req1 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req1.json()
Mental_Distress_by_state = pd.DataFrame(
    myJson[0]["results"],
    columns=[col["name"] for col in myJson[0]["columns"]]
)
# Ensure the Geolocation column is of type string and handle NaN values
Mental_Distress_by_state['Geolocation'] = Mental_Distress_by_state['Geolocation'].astype(str)
# Extract latitude and longitude from Geolocation column
Mental_Distress_by_state['Longitude'] = Mental_Distress_by_state['Geolocation'].apply(
lambda x: float(x.split(' ')[1].strip('()')) if isinstance(x, str) and 'POINT' in x else None
) # Longitude
Mental_Distress_by_state['Latitude'] = Mental_Distress_by_state['Geolocation'].apply(
lambda x: float(x.split(' ')[2].strip('()')) if isinstance(x, str) and 'POINT' in x else None
) # Latitude
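# Alternative sketch: the same coordinates could be pulled out with the regex
# module imported earlier, assuming the WKT 'POINT (lon lat)' format; kept
# commented out since the split-based extraction above already does the job.
#point_pattern = re.compile(r'POINT \((-?\d+\.?\d*) (-?\d+\.?\d*)\)')
#coords = Mental_Distress_by_state['Geolocation'].str.extract(point_pattern)
#Mental_Distress_by_state['Longitude'] = coords[0].astype(float)
#Mental_Distress_by_state['Latitude'] = coords[1].astype(float)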
# Create a map centered around the United States
us_latitude = 37.0902 # Latitude for the U.S. center
us_longitude = -95.7129 # Longitude for the U.S. center
# Create a map centered around the United States, filtering out any NaN values
valid_locations = Mental_Distress_by_state.dropna(subset=['Latitude', 'Longitude'])
if not valid_locations.empty:
m = folium.Map(location=[us_latitude, us_longitude], zoom_start=4) # Adjust zoom level as needed
# Add markers to the map
for idx, row in valid_locations.iterrows():
folium.Marker(
location=[row['Latitude'], row['Longitude']],
popup=f"{row['LocationDesc']}<br>Percentage: {row['Percentage_Frequent_Mental_Distress']}%",
tooltip=row['LocationDesc']
).add_to(m)
# Display the title above the map
display(Markdown("## Mental Distress by State - Geographical Distribution"))
# Display the states with markers
states_with_markers = ', '.join(valid_locations['LocationDesc'].tolist())
display(Markdown(f"**Top States with people reporting Frequent Mental Distress:** {states_with_markers}"))
# Display the map in Jupyter Notebook directly
display(m)
else:
print("No valid locations to display.")
Mental Distress by State - Geographical Distribution¶
Top States with people reporting Frequent Mental Distress: Virgin Islands, Guam, Montana, Massachusetts, West Virginia, North Dakota, South Dakota, Wyoming, New Hampshire, Kentucky
Cognitive Decline by State¶
In the query below, we use a subquery (nested query). The outer query selects LocationDesc, Geolocation, and Data_Value (renamed as Cognitive_Decline_Percentage) from the Alzheimer_Analysis table, where the Topic is related to "Cognitive Decline" and the Data_Value exceeds a certain threshold. The inner query calculates the average of Data_Value for records where either Class or Topic contains "cognitive decline", and its result is used as the threshold in the outer query. This style is commonly used in relational database queries when we need to compare values in a dataset against an aggregate function (like an average) or another calculated result. Refer to this resource to learn more.
sql_query2 = (f"""
SELECT LocationDesc, Geolocation,Data_Value as Cognitive_Decline_Percentage
FROM Alzheimer_Analysis
WHERE Topic LIKE '%Cognitive Decline%'
AND Data_Value > (
SELECT AVG(Data_Value)
FROM Alzheimer_Analysis
WHERE (LOWER(Class) LIKE '%cognitive decline%'
OR LOWER(Topic) LIKE '%cognitive decline%')
);
""")
#Setup the URL to be used to invoke the GridDB WebAPI to retrieve data from the container
url = base_url + 'sql'
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query2
}
])
# Validate the constructed request body
#print(request_body)
#Invoke the GridDB WebAPI
data_req2 = requests.post(url, data=request_body, headers=header_obj)
#Process the response received and construct a Pandas dataframe with the data from the response
myJson = data_req2.json()
Cognitive_Decline = pd.DataFrame(
    myJson[0]["results"],
    columns=[col["name"] for col in myJson[0]["columns"]]
)
Since the 'LocationDesc' has both states and broader regions in the US (like West, Northeast, etc.), we create a separate column to map out individual states to broader U.S. regions.
# Mapping of states and regions
state_to_region = {
'Idaho': 'West',
'Arizona': 'West',
'Oklahoma': 'South',
'Tennessee': 'South',
'South': 'South',
'Kentucky': 'South',
'Wyoming': 'West',
'Maryland': 'Northeast',
'Midwest': 'Midwest',
'West': 'West',
'Arkansas': 'South',
'Delaware': 'Northeast',
'Illinois': 'Midwest',
'Rhode Island': 'Northeast',
'Connecticut': 'Northeast',
'Washington': 'West',
'Pennsylvania': 'Northeast',
'Texas': 'South',
'Oregon': 'West',
'Ohio': 'Midwest',
'Georgia': 'South',
'New Jersey': 'Northeast',
'New Mexico': 'West',
'New York': 'Northeast',
'South Carolina': 'South',
'United States, DC & Territories': 'DC & Territories',
'California': 'West',
'Michigan': 'Midwest',
'Virginia': 'South',
'District of Columbia': 'Northeast',
'South Dakota': 'Midwest',
'Hawaii': 'West',
'Indiana': 'Midwest',
'Vermont': 'Northeast',
'Iowa': 'Midwest',
'Kansas': 'Midwest',
'New Hampshire': 'Northeast',
'Utah': 'West',
'Alabama': 'South',
'Nevada': 'West',
'Louisiana': 'South',
'Maine': 'Northeast',
'Nebraska': 'Midwest',
'Alaska': 'West',
'Mississippi': 'South',
'Wisconsin': 'Midwest',
'North Carolina': 'South',
'West Virginia': 'South',
'Florida': 'South',
'Missouri': 'Midwest',
'North Dakota': 'Midwest',
'Minnesota': 'Midwest',
'Colorado': 'West',
'Montana': 'West',
'Massachusetts': 'Northeast',
'Puerto Rico': 'DC & Territories',
'Guam': 'DC & Territories',
'U.S. Virgin Islands': 'DC & Territories',
'American Samoa': 'DC & Territories',
'Northern Mariana Islands': 'DC & Territories',
'Midway Atoll': 'DC & Territories',
'Baker Island': 'DC & Territories',
'Howland Island': 'DC & Territories',
'Jarvis Island': 'DC & Territories',
'Johnston Atoll': 'DC & Territories',
'Kingman Reef': 'DC & Territories',
'Palmyra Atoll': 'DC & Territories'
}
# Step 1: Create a Region Column in Cognitive_Decline DataFrame
Cognitive_Decline['Region'] = Cognitive_Decline['LocationDesc'].map(state_to_region)
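# Locations missing from the mapping above end up with a NaN Region;
# uncomment the line below to inspect any unmapped locations before aggregating
#print(Cognitive_Decline[Cognitive_Decline['Region'].isna()]['LocationDesc'].unique())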
# Step 2: Aggregate the data by Region
region_data = Cognitive_Decline.groupby('Region')['Cognitive_Decline_Percentage'].mean().reset_index()
# Step 3: Extract Data for Plotting
categories = region_data['Region']
values = region_data['Cognitive_Decline_Percentage']
# Handle NaN values if any
values = values.fillna(0)
# Number of categories
N = len(categories)
# Creating a list of angles for the plot
angles = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()
values = values.tolist()
# Ensure the plot is circular
angles += angles[:1]
values += values[:1]
# Initializing the radar chart
fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
# Plotting the data
ax.fill(angles, values, color='skyblue', alpha=0.4)
ax.plot(angles, values, color='blue', linewidth=2)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
plt.title('Average Cognitive Decline Percentage by Region', size=15, color='darkblue', y=1.1)
plt.show()