Statistical Notebooks

Integrated statistical analysis and exploration of multiple genomic and clinical data types provides researchers with a great possibility to expand our current knowledge of cancer. ISB-CGC offers a great source of diverse data types including gene expression, somatic mutations, clinical data, etc. We have developed a series of notebooks that use BigQuery to compute the statistical associations between different combinations of the data types available in ISB-CGC.

Bioinformatics notebooks

Significant correlations and their p-values using BigQuery Python  
One-way ANOVA with BigQuery Python R
Score gene sets in BigQuery Python R
Nearest Centroid Classification using BigQuery Python R

Standard pairwise statistics

The following table lists notebooks that compute associations between pairs of data types available in ISB-CGC. They assess the statistical significance for an association using rank-ordered data and a statistical test appropriate to each data type pair depending on categorical or numerical categorization. The Regulome Explorer inspired notebook is a special notebook that allows computation of associations between all possible data types available in the TCGA dataset; more details are below.

Data type Data type Statistical test/notebook
Gene expression Clinical Kruskal-Wallis score
Gene expression Somatic mutation T-test score
Gene expression Gene expression Spearman Correlation
Somatic mutation Clinical Chi Square test
Somatic mutation Somatic Mutation Fisher’s exact test
All types All types Regulome Explorer inspired notebook

Regulome Explorer Inspired Notebook

Regulome Explorer is a well-established web tool for the exploration and visualization of associations between clinical and molecular features of TCGA data. Regulome Explorer was developed in 2012 in close collaboration between the Institute for Systems Biology and the MD Anderson Cancer Center. It enables users to search and visualize precomputed statistical data filtered according to user-specified parameters. Although Regulome Explorer’s broad functionality and high-quality graphics make it a valuable tool for exploring and visualizing 20 of the 33 TCGA data sets, it does not yet contain analysis of recent releases of TCGA and cannot be easily applied to data sets other than TCGA.

We developed a more flexible version, replicating capabilities of Regulome Explorer, as a Python notebook that uses Google Cloud resources. Rather than working with precomputed, fixed cohorts and fixed results, statistical analyses are dynamically performed in the cloud, with user defined patient cohorts. Moreover, the notebook can be extended so that users can analyze additional data sets available as part of the ‘ISB-CGC BigQuery ecosystem’ such as TCGA, TARGET, CCLE, COSMIC, and others. The notebook can be accessed in Regulome Explorer inspired notebook.


Have feedback or corrections? Please email us at feedback@isb-cgc.org.