How To Get Started on ISB-CGC

The ISB-CGC provides both interactive (through a web application) and programmatic access to data hosted by institutes such as the Genomic Data Commons (GDC) of the National Cancer Institute (NCI) and the Wellcome Trust Sanger Institute, leveraging many aspects of the Google Cloud Platform (GCP).

More about ISB-CGC: ISB-CGC Main Landing Page and FAQs.

I. Data Access and Google Cloud Project Setup

II. Accessing and Analyzing Data via BigQuery

  • BigQuery is Google’s serverless, highly scalable data warehouse. It lets researchers analyze large datasets quickly and at low cost using standard SQL queries.
  • ISB-CGC has loaded multiple cancer genomics datasets into BigQuery tables that are open to the public. See ISB-CGC Datasets in BigQuery and the regularly updated Data Release Notes and Future Plans.
  • To obtain access to the ISB-CGC open access project tables in BigQuery, users can link these tables to their GCP project as described here.
  • To obtain access to the ISB-CGC controlled access project tables in BigQuery, users can link these tables to their GCP project as described here.
  • ISB-CGC provides quickstart guides, tutorials and examples in both R and Jupyter notebooks for BigQuery in the Tutorials and Community Notebooks sections of the documentation page.
  • Every month, ISB-CGC provides an example analysis of cancer genomics data using BigQuery in our Query of the Month blog.
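
As a rough illustration of the workflow described above, the following Python sketch builds a standard SQL query against a public ISB-CGC table and runs it with the google-cloud-bigquery client, billing the query to your own GCP project. The table name, the column name and the helper function names are assumptions chosen for illustration and may not match the current data release; check the ISB-CGC BigQuery documentation for the actual table and column names before running it.

```python
"""Sketch: count rows per group in a public ISB-CGC BigQuery table."""

# Assumption: this table and column are illustrative placeholders only.
# Look up the current names in the ISB-CGC BigQuery documentation.
DEFAULT_TABLE = "isb-cgc-bq.TCGA.clinical_gdc_current"
DEFAULT_GROUP_COLUMN = "proj__project_id"


def build_count_query(table=DEFAULT_TABLE,
                      group_column=DEFAULT_GROUP_COLUMN,
                      limit=10):
    """Return a standard SQL query that counts rows per group_column value."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        f"SELECT {group_column} AS grp, COUNT(*) AS n_rows\n"
        f"FROM `{table}`\n"
        "GROUP BY grp\n"
        "ORDER BY n_rows DESC\n"
        f"LIMIT {limit}"
    )


def run_query(billing_project, sql):
    """Run sql, billed to billing_project, and return the result rows."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return list(client.query(sql).result())


if __name__ == "__main__":
    # Printing the query needs no credentials; run_query() does.
    print(build_count_query(limit=5))
```

To execute the query, call run_query("your-gcp-project-id", build_count_query()) from an environment where Google application default credentials are configured; "your-gcp-project-id" is a placeholder for your own project.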

III. Accessing and Analyzing Data Stored in Google Cloud Storage

  • All open-access data on ISB-CGC are stored in a publicly available Google Cloud Storage (GCS) bucket (gs://isb-cgc-open).
  • All controlled-access data are stored in GCS in their original form as obtained from the GDC.
  • To access controlled data, users must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user’s dbGaP authorization is verified. Both steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.
  • Summary of programs, data types and data formats available
  • Working with large-scale data hosted by the ISB-CGC in Google Cloud Storage requires some familiarity with tools such as the Google Cloud SDK, Google Compute Engine, Virtual Machines and Docker.
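
As a starting point for programmatic access to the open-access bucket named above, the following Python sketch lists a few object names using the google-cloud-storage client without credentials. It assumes the bucket permits anonymous listing; if it does not, an authenticated storage.Client() would be needed instead. The helper names split_gs_uri and list_open_objects are hypothetical and used only for this illustration.

```python
"""Sketch: list a few objects in the open-access ISB-CGC bucket."""

OPEN_BUCKET_URI = "gs://isb-cgc-open"  # bucket named in the list above


def split_gs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError("expected a URI starting with gs://")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("missing bucket name")
    return bucket, prefix


def list_open_objects(uri=OPEN_BUCKET_URI, max_results=5):
    """Return up to max_results object names under the given gs:// URI."""
    # Imported lazily so split_gs_uri works without the client library.
    from google.cloud import storage

    bucket_name, prefix = split_gs_uri(uri)
    # Assumption: anonymous (unauthenticated) listing is allowed here.
    client = storage.Client.create_anonymous_client()
    blobs = client.list_blobs(
        bucket_name, prefix=prefix or None, max_results=max_results
    )
    return [blob.name for blob in blobs]


if __name__ == "__main__":
    for name in list_open_objects():
        print(name)
```

Controlled-access data cannot be read this way; it requires the NIH authentication and dbGaP authorization steps described above.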

Have feedback or corrections? Please email us at