Pan-Cancer Atlas BigQuery Data

The Pan-Cancer Atlas BigQuery data set was produced in collaboration with the TCGA research network, the GDC, and the NCI. This rich data set allows for an integrated examination of the full set of tumors characterized in the robust TCGA dataset and provides a new way to explore and analyze the processes driving cancer.

The availability of Pan-Cancer Atlas data in BigQuery enables easy integration of this resource with other public data sets in BigQuery, including other open-access datasets made available by the ISB-CGC (see this and that for more details on other publicly accessible BigQuery data sets).

About Pan-Cancer Atlas Data

The Pan-Cancer Atlas BigQuery tables (accessed here) mirror most of the files shared by the Pan-Cancer Atlas initiative on the GDC PanCanAtlas Publications page.

The tables are generally unmodified uploads of the files in the GDC Pan-Cancer Atlas. The Filtered_* tables were annotated as appropriate with ParticipantBarcode, SampleBarcode, AliquotBarcode, SampleTypeLetterCode, SampleType and TCGA Study. Subsequently the tables were filtered using the Pan-Cancer Atlas whitelist (which is the list of TCGA barcodes included in the Pan-Cancer Atlas). Two exceptions are the (public) MC3 MAF file and the TCGA-CDR resource.

Use of the tables starting with Filtered_* is recommended.

See examples of statistical Jupyter notebooks using the Pan-Cancer Atlas data here.

Adding the Pan-Cancer Atlas tables to your workspace

If you are new to using ISB-CGC Google BigQuery data sets, see the Quickstart Guide to learn how to obtain a Google identity and how to set up a Google Cloud Project.

To add public BigQuery data sets and tables to your “view” in the Google BigQuery Console you need to know the name of the GCP project that owns the dataset(s). To add the publicly accessible ISB-CGC datasets (project name: isb-cgc) which includes the Pan-Cancer Atlas data set ( dataset name: pancancer_atlas) follow these steps.

You should now be able to see and explore all of the Pan-Cancer Atlas tables and also tables of other ISB-CGC data sets. Clicking on the blue triangle next to a dataset name will open it and show the list of tables in the data set. Clicking on a table name will open up information about the table in main panel, where you can view the Schema, Details, or a Preview of the table.

Additional projects with public BigQuery data sets which you may want to explore (repeating the same process will add these to your BigQuery side-panel) include genomics-public-data and google.com:biggene.

You can also search for and learn about Pan-Cancer Atlas tables through the ISB-CGC BigQuery Table Search UI. Type ‘pancancer’ in the Search box in the upper right-hand corner to filter for them.

Pan-Cancer Atlas BiqQuery Query Example

Ready to query? Follow the steps below to run a query in the Google BigQuery Console. More details are here.

Let’s query using the MC3 somatic mutation table.

  • Click on COMPOSE NEW QUERY button.
  • Paste the sample query below into the text-box.
  • Within a second or two you should see a green circle with a checkmark below the lower right corner of the New Query text-box. – If instead you see a red circle with an exclamation mark, click on it to see what your Syntax Error is.
  • Once you do have the green circle, you can click on it to see a message like: “Valid: This query will process 76.3 MB when run.”
  • To execute the query, click on RUN!
WITH
mutCounts AS (
  SELECT
     COUNT(DISTINCT( Tumor_SampleBarcode )) AS CaseCount,
     Hugo_Symbol,
     HGVSc
  FROM
     `isb-cgc.pancancer_atlas.Filtered_MC3_MAF_V5_one_per_tumor_sample`
  GROUP BY
     Hugo_Symbol,
     HGVSc
),
mutRatios AS (
  SELECT
     HGVSc,
     Hugo_Symbol,
     CaseCount,
     (CaseCount/SUM(CaseCount) OVER (PARTITION BY Hugo_Symbol)) AS ratio
  FROM
     mutCounts
)
SELECT  *
FROM
   mutRatios
WHERE
   CaseCount>=10
   AND ratio>=0.2
   AND HGVSc is not null
ORDER BY
   ratio DESC

Have feedback or corrections? Please email us at feedback@isb-cgc.org.