Introduction
In this document, we’ll first go over some notes about setting up GridDB and connecting to it from R. Then, we’ll ingest some gene expression data from The Cancer Genome Atlas (TCGA), and query the GridDB backend using dplyr
to generate simple summary statistics.
Prerequisites
You can follow along by cloning the source code from here: https://github.com/griddbnet/Blogs/tree/gene_analysis
git clone https://github.com/griddbnet/Blogs.git --branch gene_analysis
Installing and starting GridDB
To install GridDB we simply follow the instructions provided in the documentation. On an Ubuntu machine, the easiest is to download the deb
package and install with dpkg -i
. Thereafter, we need to start the GridDB service with sudo systemctl start gridstore
. After this, we are ready to connect to the service from R in order to load and query data.
Installing R dependencies
Running the code below would require installation (if not already available) of several R packages. Below, I am using the attachment
package to detect the dependencies used by this Quarto markdown document.
attachment::att_from_rmd("01.Rmd")
One can install these dependencies via the usual install.packages
method, except for Bioconductor
packages that would require BiocManager::install
:
deps <- attachment::att_from_rmd("01.Rmd")
install.packages(deps)
or, in an Ubuntu
environment (as it might be the case in a cloud setup), afaster approach might be to get pre-compiled versions using Ubuntu
's package manager:
# replace `PKG1/2/3` with the required package names
sudo apt install r-cran-PKG1 r-cran-PKG2 r-cran-PKG3 ...
Connecting R to a local GridDB cluster
As documented previously, we'll use the RJDBC
package (Java database connectivity) to connect to the GridDB cluster. As the cluster is running locally (as would be the case if you followed the installation and getting started guide), be sure to use the localhost IP address (127.0.0.1
) and the port 20001
.
For ease of use down the line, we can wrap these two steps, creating a jdbc
driver and a GridDB connection, in a function called connect_to_griddb
. Note that we are loading the entire namespaces of RJDBC
and DBI
, which might be OK for an exploratory interactive session, but for more serious projects, we'd of course create an R package and import only the needed functions explicitly with importFrom
or use an approach such as box
to create modules with specific imports.
suppressPackageStartupMessages({
library(RJDBC)
library(DBI)
library(dplyr)
library(tidyr)
library(purrr)
library(qs)
})
connect_to_griddb <- function() {
drv <- JDBC(driverClass = "com.toshiba.mwcloud.gs.sql.Driver",
# Point this to your gridstore jar
classPath = "~/src/jdbc/bin/gridstore-jdbc.jar")
conn <-
dbConnect(drv,
"jdbc:gs://127.0.0.1:20001/myCluster/public",
"admin",
"admin")
return(conn)
}
Data
We'll work with some gene expression data from human cancer studies as published by The Cancer Genome Atlas (TCGA) program. In particular, we'll focus on data packaged by the Bioconductor package GSEABenchmarkeR
and intended to use for benchmarking gene-expression studies. The data were downloaded, pre-formatted into rectangular sets and saved as a qs
archive for faster saving and loading.
A quick peak at the data sets:
TCGA <- qs::qread("data/expressions.qs")
sapply(TCGA, nrow) |> summary()
We have expression data for 10 cancers, with a mean of 1.2M observations (rows). Each data set has 5 columns,
TCGA[[1]] |> head() |> knitr::kable()
and the represented cancers are
sapply(TCGA, "[[", 1, 5) |> unlist() |> knitr::kable()
Loading our data into GridDB
To ingest the data, we first create a connection to the GridDB cluster using our function defined above. Then, we define two functions, one to create a table and another to insert data into the table. These functions will allow us to cycle overa list of data frames and load each data frame into the database seamlessly with base R's *apply
family of functions or purrr
's map_*
family of functions.
griddb <- connect_to_griddb()
create_table <- function(table_name) {
dbSendUpdate(
griddb,
sprintf(
"CREATE TABLE IF NOT EXISTS %s (Gene STRING, Sample STRING, Expression INTEGER, Code STRING, Name STRING);",
table_name
)
)
}
insert_table <- function(conn, name, df, append = TRUE) {
for (i in seq_len(nrow(df))) {
dbWriteTable(conn, name, df[i, ], append = append)
}
}
# using `invisible` here to hide non-informative output
invisible(lapply(names(TCGA), create_table))
DBI::dbListTables(griddb)
# 6 minutes for 10K rows per table. So 6 min for 100K rows
# 1h for 100K rows per table. So 1h for 1M rows.
invisible(pbapply::pblapply(seq_along(TCGA), function(i)
insert_table(
conn = griddb,
name = names(TCGA)[[i]],
df = TCGA[[i]][1:100000, ],
append = TRUE
))
)
DBI::dbDisconnect(griddb)
griddb <- connect_to_griddb()
iter <- setNames(names(TCGA), names(TCGA))
map_dfr(iter, function(x) {
griddb |> tbl(x) |> collect() |> nrow()
}) |> knitr::kable()
Data analysis
Some summary simple stats. Nothing biologically meaningful at this point. The main idea is to use the familiar R packages like dplyr
, tidyr
and purrr
toquery the fast GridDB backend database. The R code below, could well have been developed for a MariaDB
or PostgresSQL
database, but it will work equally well with a GridDB setup.
Genes per cancer dataset that have average expression > 2000
high_expression_genes <- purrr::map_dfr(iter, .id = "Cancer",
function(x) {
griddb |>
tbl(x) |>
group_by(Gene) |>
summarise(Mean_expression = mean(Expression, na.rm = TRUE)) |>
filter(Mean_expression >= 2000) |>
collect()
}
)
dim(high_expression_genes)
head(high_expression_genes) |> knitr::kable()
Genes per cancer dataset that have average expression < 300
low_expression_genes <- purrr::map_dfr(iter, .id = "Cancer",
function(x) {
griddb |>
tbl(x) |>
group_by(Gene) |>
summarise(Mean_expression = mean(Expression, na.rm = TRUE)) |>
filter(Mean_expression <= 300) |>
collect()
}
)
dim(low_expression_genes)
tail(low_expression_genes) |> knitr::kable()
Genes shared among all cancer data sets
unique_genes_per_dataset <- purrr::map(iter,
function(x) {
griddb |>
tbl(x) |>
collect() |>
pull(Gene) |>
unique()
}
)
shared_genes <- Reduce(intersect, unique_genes_per_dataset)
shared_genes
Median expression per cancer data set of the first 10 genes shared for all data sets
first10 <- shared_genes[1:10]
iter2 <- expand.grid(iter, first10, stringsAsFactors = FALSE)
first10_expression <-
purrr::map2_dfr(iter2$Var1, iter2$Var2, function(.x, .y) {
griddb |>
tbl(.x) |>
filter(Gene == .y) |>
collect()
})
first10_expression |>
group_by(Code, Name, Gene) |>
summarise(Mean_expression = mean(Expression, na.rm = TRUE)) |>
ungroup() |>
select(-Name) |>
pivot_wider(names_from = Code, values_from = Mean_expression) |>
knitr::kable()
Clean up
DBI::dbDisconnect(griddb)
Conclusion
In this vignette, we saw how we can use GridDB database from R. We ingested one million rows of gene expression data in ten tables in about 1 hour and then queried the data seamlessly using familiar tools like DBI
and dplyr
.
Appendix
R / Bioconductor
code used to download and preprocess the gene expression data.
# This has a lot of dependencies and can take a few minutes to install
# BiocManager::install("GSEABenchmarkeR")
library(GSEABenchmarkeR)
tcga <- loadEData("tcga", nr.datasets = 10)
cancer_abbreviations <- names(tcga)
cancer_names <- c("BLCA" = "Bladder Urothelial Carcinoma",
"BRCA" = "Breast Invasive Carcinoma",
"COAD" = "Colon Adenocarcinoma",
"HNSC" = "Head and Neck Squamous Cell Carcinoma",
"KICH" = "Kidney Chromophobe",
"KIRC" = "Kidney Renal Clear Cell Carcinoma",
"KIRP" = "Kidney Renal Papillary Cell Carcinoma",
"LIHC" = "Liver Hepatocellular Carcinoma",
"LUAD" = "Lung Adenocarcinoma",
"LUSC" = "Lung Squamous Cell Carcinoma")
expressions <- lapply(seq_along(tcga), function(i) {
tcga[[i]]@assays@data@listData |>
as.data.frame() |>
tibble::rownames_to_column(var = "Gene") |>
tidyr::pivot_longer(cols = -Gene, names_to = "Sample", values_to = "Expression") |>
mutate(cancer = cancer_abbreviations[[i]], cancer_name = cancer_names[[i]])
})
qs::qsave(expressions, "expressions.qs")
Session info
sessionInfo()
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.