Programs and Data Sets

Between the ISB-CGC and the Genomic Data Commons (GDC), many cancer data sets are available on the Google Cloud Platform. ISB-CGC hosts high-level clinical, biospecimen, and molecular data in a series of carefully curated datasets and tables in BigQuery, and radiology and pathology images in Google Cloud Storage. The GDC hosts several more data sets, including low-level sequencing data.

The ISB-CGC started with The Cancer Genome Atlas (TCGA) data sets but has expanded to include data sets from other programs, such as Therapeutically Applicable Research To Generate Effective Treatments (TARGET). Alongside the NCI data sets, ISB-CGC hosts data sets from other sources, such as the Catalogue Of Somatic Mutations In Cancer (COSMIC) from the Wellcome Trust Sanger Institute. We are always interested in adding new data sets, so if you have any suggestions or requests for additional data, please let us know (feedback@isb-cgc.org).

NCI Genomic Data Commons

The NCI’s Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

Reference Data Sets

ISB-CGC hosts a series of reference tables in BigQuery with information that describes or annotates the human (or other) genome(s) or is necessary to work with data generated by specific platforms.

Storage Platforms

As part of its mission, the ISB-CGC has been exploring the best ways to use available cloud technologies to provide access to the data. To this end, the data is made available using these three main Google Cloud Platform technologies:

Google BigQuery

Google BigQuery (BQ) is a massively parallel analytics engine that is ideal for working with data that is essentially tabular in nature. This includes the high-level clinical, biospecimen, and molecular data from the main NCI programs. It is also where we store a large amount of metadata about files that are more appropriately stored in Google Cloud Storage, as well as genome reference sources (e.g. GENCODE and miRBase). All of these datasets and tables are completely open access and available to the research community.
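Because these tables are tabular and open access, a typical first step is a simple aggregate query. The following is a minimal sketch, not an official ISB-CGC example: the helper names build_case_count_query and run_query are hypothetical, the table and column names in the usage note are assumptions that should be checked against the current BigQuery catalog, and run_query assumes the google-cloud-bigquery package is installed and that you have a Google Cloud project to bill the query to.

```python
def build_case_count_query(table: str, group_column: str) -> str:
    """Return a Standard SQL query that counts rows per value of group_column.

    The table and column names are interpolated directly into the SQL text,
    so they must come from a trusted source (not from end-user input).
    """
    return (
        f"SELECT {group_column}, COUNT(*) AS n_cases\n"
        f"FROM `{table}`\n"
        f"GROUP BY {group_column}\n"
        f"ORDER BY n_cases DESC"
    )


def run_query(sql: str, billing_project: str):
    """Run sql and return the result rows as a list.

    Assumes google-cloud-bigquery is installed and credentials are configured.
    The import is done lazily so the query builder works without the package.
    """
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return list(client.query(sql).result())
```

For example, build_case_count_query("isb-cgc.TCGA_bioclin_v0.Clinical", "project_short_name") would produce a query that counts clinical records per project, assuming that table and column exist under those names. Although the tables are open access, the query itself is billed to the project passed to run_query.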

Google Cloud Storage

Google Cloud Storage (GCS) is a cloud-based object store that holds other types of data, typically binary files that are processed by custom software pipelines. The data hosted by GDC is contained within Google Cloud Storage.
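Objects in Google Cloud Storage are addressed by URIs of the form gs://bucket/object-name, and file metadata tables commonly record locations in this form. The sketch below shows one way to split such a URI into its bucket and object name using only the Python standard library. The function name split_gcs_uri and the example URI are hypothetical.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first '/' after the prefix is the bucket;
    # the remainder (which may itself contain '/') is the object name.
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, object_name


# Hypothetical location, for illustration only.
bucket, name = split_gcs_uri("gs://example-bucket/dir/sample.bam")
```

The bucket and object name can then be passed to whichever tool or client library your pipeline uses to read the file.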

Google Genomics

Google Genomics (GG) provides a storage platform and a way to work with sequence-level data, which can also be accessed through the Global Alliance for Genomics and Health (GA4GH) APIs. GA4GH is a policy-framing and technical standards-setting organization that seeks to enable responsible genomic data sharing within a human rights framework. GA4GH tools can be found here.

Security

We recommend that you review the important information about data security and data access before working with these data sets.

A note about legacy and harmonized data sets

Programs like TCGA that predate the Genomic Data Commons have both legacy data sets (data as originally generated by the program) and harmonized data sets created by the Genomic Data Commons. While these data sets have much in common, the GDC harmonization process can introduce several changes, including the removal or addition of cases and samples and changes in terminology. One of the goals of the ISB-CGC is to stay current with changes introduced by GDC, so you may find differences between legacy data and harmonized data.
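When moving an analysis from legacy to harmonized data, it can help to check which cases were added or removed. The following is a minimal sketch using Python sets. The function name compare_case_sets and the case identifiers are hypothetical; in practice the two lists would come from the corresponding legacy and harmonized tables.

```python
def compare_case_sets(legacy_cases, harmonized_cases):
    """Report which case identifiers appear in only one of the two data sets."""
    legacy, harmonized = set(legacy_cases), set(harmonized_cases)
    return {
        "only_legacy": sorted(legacy - harmonized),      # removed during harmonization
        "only_harmonized": sorted(harmonized - legacy),  # added during harmonization
        "shared": sorted(legacy & harmonized),           # present in both
    }


# Hypothetical identifiers, for illustration only.
diff = compare_case_sets(
    ["CASE-A", "CASE-B", "CASE-C"],
    ["CASE-B", "CASE-C", "CASE-D"],
)
```

This only compares identifiers. Changes in terminology, such as renamed fields or recoded values, need to be checked separately.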


Have feedback or corrections? Please email us at feedback@isb-cgc.org.