.. _R-workshop: *************** Analysis with R *************** Differential gene expression associated with HPV integration ############################################################ In this example, we will study the potential effects of HPV integration on the expression of recurrent target genes in CESC and HNSC tumors. This example demonstrates using R to issue BigQuery queries involving multiple tables across multiple data sets. We will also show users how to bring in their own data to use in conjunction with the TCGA data already available as BigQuery tables. In this exercise, we will reproduce some figures from Tang et. al. [1] to visualize altered expression of host genes frequently targeted by HPV. References: Tang et. al. The landscape of viral expression and host gene fusion and adaptation in human cancer. Nature Communications 4, Article number:2513|doi:10.1038/ncomms3513 Loading libraries ================= .. code-block:: r require(bigrquery,quietly = TRUE) || install.packages('bigrquery',verbose = FALSE) require(httpuv, quietly = TRUE) || install.packages('httpuv',verbose=FALSE) require(ggplot2,quietly = TRUE) || install.packages('ggplot2',verbose = FALSE) Your project ID =============== You will be using your own project ID. At certain points in the code, it will be necessary to complete the code. .. code-block:: r main_cloud_project = "isb-cgc" my_cloud_project = "your_project_id" tcga_data_set = "TCGA_hg38_data_v0" First query =========== Now let's see if things are working. .. code-block:: r bigrquery::list_tables(main_cloud_project, tcga_data_set) In this tutorial, we will be investigating two studies using two existing BigQuery tables. Additionally, we're going to BYOD "Bring your own data". .. code-block:: r study=c('TCGA-CESC','TCGA-HNSC') clinical_table = "`isb-cgc.TCGA_bioclin_v0.Clinical`" Constructing Queries ==================== One of the great things about working in a scripting environment, is that our analysis -- the queries -- we write, can be constructed programmatically. That makes it easy to apply the same structured queries to many questions. In the next code block is an example of how to do that. .. code-block:: r sqlQuery = paste("SELECT case_barcode, project_short_name AS project, hpv_calls, hpv_status ", "FROM ", clinical_table, " WHERE project_short_name in (",paste(shQuote(study),collapse = ','),")",sep="") sqlQuery hpv_table = query_exec(sqlQuery,project = my_cloud_project) dim(hpv_table) head(hpv_table) # We can do some quality control ... # Assert that if hpv_calls is NA, hpv_status is Negative stopifnot((is.na(hpv_table$hpv_calls) && hpv_table$hpv_status=="Negative") || !is.na(hpv_table$hpv_calls)) # Let's explore the cohort ggplot(data=hpv_table, aes(x=hpv_status, fill=project)) + geom_bar(stat="count", position=position_dodge()) Uploading Data ============== The exact location of HPV integration is not available in the TCGA data, so instead we'll get a list of frequently targeted genes that was published with this paper: *Ka-Wei Tang et. al. The Landscape of viral expression and host gene fusion and adaptation in human cancer. doi:10.1038/ncomms3513* (Supplementary Data 2: Integration analysis results) We will access the data from our workshop bucket using the command line or from the Google Cloud Console. Using the cloud console, go to https://console.cloud.google.com and find the workshop bucket. Using the google command line tool: .. code-block:: none gsutil cp gs://isb-cgc-workshop/data/Larsson/ncomms3513-s3.tsv . gsutil cp gs://isb-cgc-workshop/data/Larsson/ncomms3513-s3_Schema.json . Now the data is in our directory, but we need to transform it into a BQ table. To do that, we need to create a data set in our project. We can do this from within the BigQuery web UI by clicking on the little blue triangle next to your project ID on the left. Or we can do this on the command line using the bq command line tool. .. code-block:: none gcloud init bq help bq ls bq mk workspace bq load --source_format CSV --field_delimiter "\t" --schema ncomms3513-s3_Schema.json workspace.ncomms3513_s3 ncomms3513-s3.tsv Integrating with the expression data ==================================== Now we can directly query our own data, and start to combine it with other tables. Let's try it out! This next query is going to select the genes that were associated with HPV integration in CESC and HNSC tumors. .. code-block:: r sqlQuery = " SELECT Overlapping_genes, Cancer FROM `isb-cgc-04-0030.workspace.ncomms3513_s3` WHERE Cancer IN ('TCGA-CESC','TCGA-HNSC') AND Overlapping_genes <> 'Intergenic' GROUP BY Cancer, Overlapping_genes " affected_genes = query_exec(sqlQuery,project = my_cloud_project) head(affected_genes) table(affected_genes$Cancer) Next, with those offen affected genes, we will query gene expression data. .. code-block:: r query <- " SELECT project_short_name, gene_name, AVG(HTSeq__FPKM) as mean_expression FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression` WHERE project_short_name IN ('TCGA-CESC','TCGA-HNSC') AND gene_name IN ( SELECT Overlapping_genes AS gene_name FROM `isb-cgc-04-0030.workspace.ncomms3513_s3` WHERE Cancer IN ('TCGA-CESC','TCGA-HNSC') AND Overlapping_genes <> 'Intergenic' GROUP BY gene_name ) GROUP BY project_short_name, gene_name ORDER BY mean_expression" # running the query. mean_affected_genes = query_exec(query, project = my_cloud_project) # we'll create some more meaningful x-axis labels mean_affected_genes$xlabel <- paste0(mean_affected_genes$project_short_name, "_", mean_affected_genes$gene_name) # Now we can visualize it. qplot(data=mean_affected_genes, x=factor(x = xlabel, ordered = T, levels = xlabel), y=mean_expression, col=project_short_name) + theme(axis.text.x = element_text(angle = 90, hjust = 1, size=4)) + xlab("Project_Gene") Computing Statistics ==================== Instead, if we want to get the actual gene expression values, we could query for that, and retrieve it as a data.frame. .. code-block:: r sqlQuery = " SELECT case_barcode, sample_barcode, project_short_name, gene_name, HTSeq__FPKM FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression` WHERE project_short_name IN ('CESC','HNSC') AND gene_name IN ( SELECT Overlapping_genes as gene_name FROM `isb-cgc-04-0030.workspace.ncomms3513_s3` WHERE Cancer IN ('TCGA-CESC','TCGA-HNSC') AND Overlapping_genes <> 'Intergenic' GROUP BY gene_name ) " gexp_affected_genes = query_exec(sqlQuery,project = my_cloud_project) #view results head(gexp_affected_genes) # a couple different ways to look at the results #qplot(data=gexp_affected_genes, x=project_short_name, y=HTSeq__FPKM, col=gene_name, geom="boxplot") #qplot(data=gexp_affected_genes, x=project_short_name, y=log2(HTSeq__FPKM), col=gene_name, geom="boxplot") qplot(data=gexp_affected_genes, x=log2(HTSeq__FPKM+1), col=gene_name, geom="density") + facet_wrap(~ project_short_name) Not all the samples listed in the clinical data have gene expression data, however. Let's filter the hpv_table to match the samples to those in gexp_affected_genes .. code-block:: r require(tidyr,quietly = TRUE) || install.packages('tidyr',verbose = FALSE) require(dplyr,quietly = TRUE) || install.packages('dplyr',verbose = FALSE) require(broom,quietly = TRUE) || install.packages('broom',verbose = FALSE) # let's get rid of 'indeterminate' samples hpv_table = dplyr::filter(hpv_table, hpv_status != "Indeterminate", case_barcode %in% gexp_affected_genes$case_barcode) T-tests ======= Now, we are going to perform t.tests on expression by hpv_status and study. .. code-block:: r gxps <- merge(x=gexp_affected_genes, y=hpv_table, by=c("project_short_name","case_barcode")) # Performing a t-test between hpv+ and hpv- by project_short_name and gene res0 <- gxps %>% group_by(project_short_name, gene_name) %>% do(tidy(t.test(log2(HTSeq__FPKM+1) ~ hpv_status, data=.))) %>% ungroup() %>% arrange(desc(statistic)) # These are the top 5 results ... top5 <- select(top_n(res0, 5, statistic), project_short_name, gene_name) # Let's subset the data by the top 5 results... res1 <- merge(x=top5, y=gxps) %>% mutate( Project_Gene = paste0(project_short_name, "_", gene_name)) # now we can plot the results... ggplot(res1, aes(x=Project_Gene, y=log2(HTSeq__FPKM+1), fill=hpv_status)) + geom_boxplot() Making BigQueries ================= Previously, we downloaded data and performed some work on it. But another way to work is to compute as much as possible in the cloud, and use R to visualize summary results. Please see: https://cloud.google.com/bigquery/query-reference .. code-block:: r sqlQuery = " SELECT case_barcode, sample_barcode, project_short_name, gene_name, HTSeq__FPKM FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression` WHERE project_short_name = 'TCGA-CESC' AND case_barcode IN ( SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical` WHERE hpv_status = 'Positive' ) AND gene_name IN ( SELECT Overlapping_genes AS gene_name FROM `isb-cgc-04-0030.workspace.ncomms3513_s3` WHERE Cancer = 'TCGA-CESC' AND Overlapping_genes <> 'Intergenic' GROUP BY gene_name ) " q1 = query_exec(sqlQuery,project = cloud_project_workshop) dim(q1) Now lets make a small change, and get gene expression for subjects that are hpv negative. .. code-block:: r sqlQuery = " SELECT case_barcode, sample_barcode, project_short_name, gene_name, HTSeq__FPKM FROM `isb-cgc:TCGA_hg38_data_v0.RNAseq_Gene_Expression` WHERE project_short_name = 'TCGA-CESC' AND case_barcode IN ( SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical` WHERE hpv_status = 'Negative' ) AND gene_name IN ( SELECT Overlapping_genes AS gene_name FROM `isb-cgc-04-0030.workspace.ncomms3513_s3` WHERE Cancer = 'TCGA-CESC' AND Overlapping_genes <> 'Intergenic' GROUP BY gene_name ) " q2 <- query_exec(sqlQuery,project = cloud_project_workshop) dim(q2) Now we merge the previous two queries, and compute T statistics using BigQuery built in functions, SQRT, MEAN, STDDEV, POW, COUNT, and LOG2. Please see: https://cloud.google.com/bigquery/query-reference .. code-block:: r q <- " WITH ## we use the 'WITH' keyword in order to create a few ## preliminary tables that we will use later on in the ## query -- this helps in writing 'modular' SQL that ## is easier to read -- ## start by getting the list of genes of interest from ## the paper's table (will result in a list of 106 genes) geneList AS ( SELECT Overlapping_genes AS gene_name FROM `isb-cgc-04-0030.workspace.ncomms3513_s3` WHERE Overlapping_genes <> 'Intergenic' GROUP BY gene_name ), ## next, get the identifiers (barcodes) for all HPV+ cases ## (will result in a list of 383 cases) posCaseList AS ( SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical` WHERE hpv_status = 'Positive' ), ## now do the same for all HPV- cases ## (will result in a list of 664 cases) negCaseList AS ( SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical` WHERE hpv_status = 'Negative' ) ## ## Now we being our main SELECT statement, which actually ## wraps a pair of SELECTs that are JOINed together ## SELECT p.gene_name AS gene, p.project_short_name, p.x AS x, p.sx2 AS sx2, p.nx AS nx, o.y AS y, o.sy2 AS sy2, o.ny AS ny, (p.x-o.y) / SQRT((p.sx2/p.nx) + (o.sy2/o.ny)) AS T FROM ( # first the gene expression summaries for hpv+ tumors SELECT project_short_name, gene_name, AVG(LOG(HTSeq__FPKM+1,2)) AS y, POW(STDDEV(LOG(HTSeq__FPKM+1,2)),2) AS sy2, COUNT(case_barcode) AS ny FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression` WHERE project_short_name = 'TCGA-CESC' AND case_barcode IN ( SELECT case_barcode FROM posCaseList ) AND gene_name IN ( SELECT gene_name FROM geneList ) GROUP BY project_short_name, gene_name HAVING ny>0 AND sy2>0) AS o JOIN ( # Then we get the gene expression summaries from hpv- SELECT project_short_name, gene_name, AVG(LOG(HTSeq__FPKM+1,2)) AS x, POW(STDDEV(LOG(HTSeq__FPKM+1,2)),2) AS sx2, COUNT(case_barcode) AS nx FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression` WHERE project_short_name = 'TCGA-CESC' AND case_barcode IN ( SELECT case_barcode FROM negCaseList ) AND gene_name IN ( SELECT gene_name FROM geneList ) GROUP BY project_short_name, gene_name HAVING nx>0 AND sx2>0) AS p ON p.gene_name = o.gene_name AND p.project_short_name = o.project_short_name GROUP BY gene, project_short_name, x, sx2, nx, y, sy2, ny, T ORDER BY T DESC " t_test_result <- query_exec(q, project = cloud_project_workshop) head(t_test_result) # and we can see the same results in the previously done work. res0 Extras ====== Transform gexp_affected_genes_df into a gexp-by-samples feature matrix .. code-block:: r gexp_fm = tidyr::spread(gexp_affected_genes,gene_name,HTSeq__FPKM) gexp_fm[1:5,1:5]