About the Dataset
BRFSS stands for Behavioral Risk Factor Surveillance System. The objective of the BRFSS is to assess behavioural risk factors for non-institutionalized adults (aged 18 or older) residing in the United States. To collect this health data, all 50 US states conduct telephone-based surveys in which one adult is selected at random from each household.
Based on this information, we can say that random sampling is used. Therefore, conclusions drawn from this sample can be generalized to the larger relevant population.
However, since this is a survey, participation depends on whether the selected adult volunteers to respond, so some non-response bias is possible. More importantly, respondents are only observed and are not randomly assigned to any condition. Establishing a causal relationship between two parameters based on this data would therefore not be appropriate.
There are a total of 330 parameters in the dataset. These parameters record age, physical health, mental health, medication, etc. The total number of instances in the dataset is 491,775.
We will use the public dataset for the year 2013, which can be downloaded from Kaggle. The website also provides more recent versions of this dataset, from 2015, in both CSV and JSON formats. Nevertheless, this tutorial can be extended to any dataset.
The following tutorial is carried out in RStudio (R version 4.0.2) on Windows 10. However, the syntax remains the same irrespective of the operating system used.
Install and Load the necessary packages
install.packages("dplyr")
install.packages("ggplot2")
library(dplyr)
library(ggplot2)
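Calling install.packages() on every run downloads packages again even when they are already present. A minimal sketch of one way to avoid that, assuming you only want to install what is missing; the helper name missing_packages is hypothetical and not part of any package:

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "ggplot2"))

# Only install in an interactive session, so that a scripted run never
# triggers a download by surprise.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The library() calls are still needed afterwards, because installing a package does not attach it to the session.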
Load the Dataset
The dataset needs to be in your R working directory; otherwise, the load command will output an error message.
.RData is a file extension for datasets saved in R’s own format.
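As a rough sketch of the loading step: base R's load() restores the objects stored in an .RData file and returns their names. The file name brfss2013.RData and the helper load_rdata are assumptions made for illustration; use whatever name your downloaded file has.

```r
# Hypothetical helper: load an .RData file and report which objects it restored.
load_rdata <- function(path, envir = globalenv()) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory is ", getwd(), ")")
  }
  # load() returns the names of the restored objects (invisibly)
  loaded <- load(path, envir = envir)
  loaded
}

# Assumed file name; adjust it to match the file you downloaded.
rdata_path <- "brfss2013.RData"
if (file.exists(rdata_path)) {
  print(load_rdata(rdata_path))
}
```

Checking file.exists() first gives a clearer message than the default error, since it also reports the working directory that R searched.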
Get your working directory using:
getwd()
In case your working directory is different from the location of your dataset, set your working directory using the following command. Alternatively, move the dataset to the location printed by the command above.
setwd(dir)
Where dir is a character string specifying your desired path. For more information on how to set your path, see ?setwd.
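Changing the working directory affects the rest of the session, which can be surprising in longer scripts. One possible pattern, sketched here with a hypothetical helper named in_dir, is to switch directories only for the duration of one task and then restore the previous one. It relies on the fact that setwd() returns the directory that was current before the call.

```r
# Hypothetical helper: run `fn` with `dir` as the working directory,
# then restore the previous working directory even if `fn` fails.
in_dir <- function(dir, fn) {
  old <- setwd(dir)                # setwd() returns the previous directory
  on.exit(setwd(old), add = TRUE)  # always switch back on exit
  fn()
}

# Example: list the files in the session's temporary directory
# without permanently leaving the current working directory.
files_in_temp <- in_dir(tempdir(), function() list.files())
```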
An easy alternative – GridDB
The above-mentioned dataset is nearly 600 MB. While this may not seem like a big deal, uploading this much data to RCloud does not work well: the upload takes so long that the server eventually times out. This matters because it means the tutorial has to be carried out on your local system. That leaves little room for scalability, especially if you work in an organization, which more often than not involves collaborating with multiple people.
To solve this problem, there is a solution well suited to such data: GridDB. GridDB is a highly scalable, reliable and fast tool for data storage. It also supports a number of programming languages, including Java, C, Python, etc.
Filtering and Visualization
Our main research focus for this tutorial is to visualize the distribution of a person’s general health and whether or not it varies with marital status. This research question focuses on two parameters from the dataset: general health (denoted by genhlth in the dataset) and marital status (denoted by marital in the dataset).
General health is subdivided into 5 main categories: Excellent, Very good, Good, Fair and Poor. For some of the entries, this information might not be available (NA).
Similarly, the possible values for marital status are: Married, Divorced, Widowed, Separated, Never married, or A member of an unmarried couple. Again, some of the participants might not have responded, and those entries are filled with NA.
Let us first review how respondents are distributed across the general health categories.
brfss2013 %>% group_by(genhlth) %>% summarise(count = n())
Gives the following output:
## # A tibble: 6 x 2
##   genhlth    count
##   <fct>      <int>
## 1 Excellent  85482
## 2 Very good 159076
## 3 Good      150555
## 4 Fair       66726
## 5 Poor       27951
## 6 <NA>        1985
We can see that 1,985 people did not answer this question. Hence, we will filter out such entries.
brfss2013 %>% filter(!is.na(genhlth)) %>% group_by(genhlth) %>% summarise(count = n())
The output should look something like this now.
## # A tibble: 5 x 2
##   genhlth    count
##   <fct>      <int>
## 1 Excellent  85482
## 2 Very good 159076
## 3 Good      150555
## 4 Fair       66726
## 5 Poor       27951
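Missing values in R are the special value NA, not the string "NA". The sketch below uses a small made-up data frame (toy is invented for illustration, not BRFSS data) to show what happens under the hood. Comparing a column to the string "NA" yields NA for the missing entries, and filter() drops rows whose condition is NA. Testing with is.na() removes the same rows but states the intent directly.

```r
library(dplyr)

# Made-up data: two of the five entries are missing
toy <- data.frame(
  genhlth = factor(c("Excellent", "Good", NA, "Good", NA),
                   levels = c("Excellent", "Good"))
)

# The comparison itself evaluates to NA wherever genhlth is missing
comparison <- toy$genhlth != "NA"

# filter() keeps only rows where the condition is TRUE,
# so rows with an NA condition are dropped as a side effect
kept_implicit <- toy %>% filter(genhlth != "NA")

# Explicit test for missing values
kept_explicit <- toy %>% filter(!is.na(genhlth))
```

The explicit form is also safer if a column ever contains the literal text "NA" as a legitimate value.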
We shall now have a look at the other parameter of interest: Marital Status. Again, we will be filtering out any missing entries.
brfss2013 %>% filter(!is.na(marital)) %>% group_by(marital) %>% summarise(count = n())
## # A tibble: 6 x 2
##   marital                          count
##   <fct>                            <int>
## 1 Married                         253329
## 2 Divorced                         70376
## 3 Widowed                          65745
## 4 Separated                        10662
## 5 Never married                    75070
## 6 A member of an unmarried couple  13173
Now, to visualize every possible combination, we need to aggregate the data by these two parameters. For this purpose, the dplyr library offers a function, count, which counts the number of respondents for each combination of general health and marital status.
agg <- count(brfss2013, genhlth, marital) %>% filter(!is.na(genhlth), !is.na(marital))
Let’s see what agg has stored.
head(agg)
##     genhlth                         marital     n
## 1 Excellent                         Married 49682
## 2 Excellent                        Divorced 10049
## 3 Excellent                         Widowed  7007
## 4 Excellent                       Separated  1221
## 5 Excellent                   Never married 14419
## 6 Excellent A member of an unmarried couple  2364
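To make the aggregation step concrete, here is a small sketch on a made-up data frame (toy and its values are invented for illustration, not taken from BRFSS). It suggests that count() is a shorthand for the group_by() plus summarise(n()) pattern used earlier, applied to two columns at once.

```r
library(dplyr)

# Made-up responses, including one missing value in each column
toy <- data.frame(
  genhlth = c("Excellent", "Excellent", "Good", "Good", "Good", NA),
  marital = c("Married", "Married", "Married", "Divorced", NA, "Married")
)

# Keep only rows where both answers are present
complete <- toy %>% filter(!is.na(genhlth), !is.na(marital))

# One row per (genhlth, marital) combination, with the count in column n
agg_count <- complete %>% count(genhlth, marital)

# The same aggregation written out with group_by() and summarise()
agg_summarise <- complete %>%
  group_by(genhlth, marital) %>%
  summarise(n = n(), .groups = "drop")
```

Both results contain one row per observed combination; combinations that never occur in the data do not appear at all.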
Now that we have the counts, let’s see how married respondents are distributed across the general health categories.
ggplot(agg %>% filter(marital=="Married")) + geom_bar(aes(x = genhlth, y = n), stat = "identity")
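The above command produces a bar chart with the general health categories (genhlth) on the x-axis and the number of married respondents in each category (n) on the y-axis.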
Similarly, you could visualize this distribution for people with a marital status other than married. Go ahead, plot some fancy plots with this simple code!
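Raw counts make groups of different sizes hard to compare directly. One possible extension, sketched here on invented numbers (toy_agg is not BRFSS data), is to convert the counts to proportions within each marital status and draw one panel per status with facet_wrap(). The column name prop is an assumption introduced for this example.

```r
library(dplyr)
library(ggplot2)

# Invented counts with the same shape as agg: one row per combination
toy_agg <- data.frame(
  genhlth = factor(rep(c("Excellent", "Good", "Poor"), times = 2),
                   levels = c("Excellent", "Good", "Poor")),
  marital = rep(c("Married", "Divorced"), each = 3),
  n = c(50, 30, 20, 10, 20, 10)
)

# Proportion of each health category within its marital-status group
toy_prop <- toy_agg %>%
  group_by(marital) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# geom_col() draws bars whose height is the y value,
# equivalent to geom_bar(stat = "identity")
p <- ggplot(toy_prop, aes(x = genhlth, y = prop)) +
  geom_col() +
  facet_wrap(~ marital) +
  labs(x = "General health", y = "Proportion within marital status")
```

Printing p (or calling print(p) inside a script) renders the chart. Because each panel sums to 1, the shapes of the distributions can be compared even when one group is much larger than another.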
In this tutorial, we analyzed the public BRFSS dataset using the R programming language. We discussed some limitations of using a huge dataset on your local system and an easy, much faster alternative.
Following that discussion, we also saw how to filter out missing information in R, after which we drew a bar plot based on two parameters using the ggplot2 library.