Analysis with R and BigQuery

Computing on the BigQuery side; making correlation matrices

In this example, we are going to compute a correlation matrix (or co-expression) entirely on the BigQuery side. Since we’re in R-land, we can easily visualize the matrix as a heatmap. We are also going to query for gene lists using Gene Ontology annotation, and access the End Points API to get a list of samples.

Loading libraries

require(bigrquery, quietly = TRUE) || install.packages('bigrquery',verbose = FALSE)
require(httpuv, quietly = TRUE) || install.packages('httpuv',verbose=FALSE)
require(ggplot2, quietly = TRUE) || install.packages('ggplot2',verbose = FALSE)

The ISB-CGC R Examples Package

The collection of examples can be found at: https://github.com/isb-cgc/examples-R

To install the package, we use the devtools package.

require(devtools, quietly = TRUE) || install.packages('devtools',verbose = FALSE)
require(ISBCGCExamples, quietly = TRUE) || install_github("isb-cgc/examples-R", build_vignettes=F)

If you needed to install any of the packages, remember, you still need to import it!

Your project ID

You will be using your own project ID. At certain points in the code, it will be necessary to complete the code.

my_cloud_project   = "your_project_id"
tcga_data_set      = "TCGA_hg38_data_v0"

First query

Now let’s see if things are working.

bigrquery::list_tables("isb-cgc", tcga_data_set)

Using BigQuery to compute correlation matrices

A correlation matrix using gene expression is going to take the correlation between each pair of genes. Usually, this matrix can then be clustered to discover gene modules.

In this query, we’re going to use the BigQuery CORR function (one of the many built-in mathematical functions) to compute a Pearson’s correlation.

You can find all the function information at: https://cloud.google.com/bigquery/query-reference

The general structure of the query is going to be:

select
  genes and correlation
from
  subtable with genes and samples of interest
joined to
  same subtable of genes and samples

q <- "
    SELECT
      a.gene_name AS gene1,
      b.gene_name AS gene2,
      CORR(a.HTSeq__FPKM,
        b.HTSeq__FPKM) AS corr
    FROM (
      SELECT
        *
      FROM
        `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
      WHERE
        gene_name IN ('APLN',
          'CCL26',
          'IL19',
          'IL37')
        AND project_short_name = 'TCGA-COAD' ) AS a
    JOIN (
      SELECT
        *
      FROM
        `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
      WHERE
        gene_name IN ('APLN',
          'CCL26',
          'IL19',
          'IL37')
        AND project_short_name = 'TCGA-COAD' ) AS b
    ON
      a.aliquot_barcode = b.aliquot_barcode
    GROUP BY
      gene1,
      gene2"

corrs <- query_exec(q,my_cloud_project, use_legacy_sql=F)

# transform to a matrix, and give it rownames
library(tidyr)
corrmat <- spread(corrs, gene1, corr)
rownames(corrmat) <- corrmat$gene2

# visualize the matrix
library(pheatmap)
pheatmap(corrmat[,-1])

It’s easy to make a couple changes to this query, enabling a correlation matrix per study. Try it!

Getting a list of high variance genes

When we make queries from R, the results come back as a data.frame. Let’s use the GO annotation, and get a list of genes that are related to the immune system. The GO Annotation table is found in the genome_reference data set, and GO:0006955 references the immune response.

q <- "
SELECT
  DB_Object_Symbol
FROM
  `isb-cgc.genome_reference.GO_Annotations`
WHERE
  GO_ID = 'GO:0006955'"

query_exec(q, my_cloud_project, use_legacy_sql=F)

That query returns 472 genes. But let’s suppose we want the top 50 by coefficient of variance.

q <- "
    SELECT
      gene_name,
      STDDEV(HTSeq__FPKM+1) / AVG(HTSeq__FPKM+1) AS cv
    FROM
      `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
    WHERE
      gene_name IN (
      SELECT
        DB_Object_Symbol
      FROM
        `isb-cgc.genome_reference.GO_Annotations`
      WHERE
        GO_ID = 'GO:0006955')
      AND project_short_name = 'TCGA-BRCA'
    GROUP BY
      gene_name
    ORDER BY
      cv DESC
    LIMIT
      50"

result <- query_exec(q, my_cloud_project, use_legacy_sql=F)
genes <- result$gene_name

Now we have a list of genes that we can carry to further analysis.

Getting a list of samples from the endpoints

From R we can access the cohorts we created using the web app. To do that we use the End Points API. The API is essentially a set of html requests. A small wrapper is included as part of the isb-cgc examples-R package.

https://github.com/isb-cgc/examples-R/blob/master/inst/doc/Working_With_Barcode_Lists.md

To get started, import the ISBCGCExamples library.

library(ISBCGCExamples)

The first step is creating a token. This token contains your authentication status, and lets the service know about what information is available to you.

my_token <- isb_init()

To get a listing of the previously created cohorts, we can use the list_cohorts function that takes a token, and returns a list with items including ‘count’, ‘items’, ‘kind’, and ‘etag’. The count shows the number of saved cohorts and the items contains information about the cohorts.

# first get a list of my saved cohorts.
my_cohorts <- list_cohorts(my_token)
names(my_cohorts)

# to get the names of my saved cohorts
lapply(my_cohorts$items, function(x) x$name)

Now that we have the cohort IDs, we can collect the various barcodes contained in the cohort. These include patient barcodes, sample barcodes, and aliquot barcodes. To do this, we can use the barcodes_from_cohort function.

HERE I’m using my cohort #4, but change this to whatever you have saved.

# get the cohort IDs
my_cohort_id <- lapply(my_cohorts$items, function(x) x$id)[[4]]

# then ping the endpoints with the cohort ID
my_barcodes <- cohort_barcodes(my_cohort_id, my_token)
names(my_barcodes)

The object returned from barcodes_from_cohort is again a list, this time with elements ‘cohort_id’, ‘sample_count’, ‘case_count’, ‘cases’, and ‘samples’. The cases and samples elements are also lists, but lists of cases or sample barcodes.

samples <- unlist(my_barcodes$samples)
# 836 samples

Programmatically constructing Queries

One of the great things about working in a scripting environment, is that our analysis – the queries – we write, can be constructed programmatically. That makes it easy to apply the same structured queries to many questions. But also we can incorporate long lists of samples or genes into a query.

#function for formatting lists..
sqf <- function(x) {
    paste("('",paste(x, collapse="','"),"')", sep="")
}

q <- paste("
SELECT
  a.gene_name as gene1,
  b.gene_name as gene2,
  CORR(a.HTSeq__FPKM, b.HTSeq__FPKM) as corr
FROM (
  SELECT
    *
  FROM
    `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
  WHERE
    gene_name IN ", sqf(genes), "
    AND sample_barcode IN ", sqf(samples), "
  ) AS a
JOIN (
  SELECT
    *
  FROM
    `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
  WHERE
    gene_name IN ", sqf(genes), "
    AND sample_barcode IN ", sqf(samples), "
  ) AS b
ON
  a.aliquot_barcode = b.aliquot_barcode
GROUP BY
  gene1,
  gene2", sep=" ")

corrs <- query_exec(q,my_cloud_project, use_legacy_sql=F)

# transform to a matrix, and give it rownames
library(tidyr)
corrmat <- spread(corrs, gene1, corr)
rownames(corrmat) <- corrmat$gene2

# visualize the matrix
library(pheatmap)
pheatmap(corrmat[,-1])

From Lists to Matrices

Transform gexp_affected_genes_df into a gexp-by-samples feature matrix

gexp_fm = tidyr::spread(gexp_affected_genes,HGNC_gene_symbol,normalized_count)

gexp_fm[1:5,1:5]

Have feedback or corrections? Please email us at feedback@isb-cgc.org. Follow us on BlueSky and X!