Pan-Cancer Atlas BigQuery Data¶
The Pan-Cancer Atlas BigQuery data set was produced in collaboration with the TCGA research network, the GDC, and the NCI. This rich data set allows for an integrated examination of the full set of tumors characterized in the robust TCGA dataset and provides a new way to explore and analyze the processes driving cancer.
The availability of Pan-Cancer Atlas data in BigQuery enables easy integration of this resource with other public data sets in BigQuery, including other open-access datasets made available by the ISB-CGC (see this and that for more details on other publicly accessible BigQuery data sets).
About Pan-Cancer Atlas Data¶
The tables are generally unmodified uploads of the files in the GDC Pan-Cancer Atlas. The Filtered_* tables were annotated as appropriate with ParticipantBarcode, SampleBarcode, AliquotBarcode, SampleTypeLetterCode, SampleType and TCGA Study. Subsequently the tables were filtered using the Pan-Cancer Atlas whitelist (which is the list of TCGA barcodes included in the Pan-Cancer Atlas). Two exceptions are the (public) MC3 MAF file and the TCGA-CDR resource.
Use of the tables starting with Filtered_* is recommended.
See examples of statistical Jupyter notebooks using the Pan-Cancer Atlas data here.
Adding the Pan-Cancer Atlas tables to your workspace¶
If you are new to using ISB-CGC Google BigQuery data sets, see the Quickstart Guide to learn how to obtain a Google identity and how to set up a Google Cloud Project.
To add public BigQuery data sets and tables to your “view” in the Google BigQuery Console you need to know the name of the GCP project that owns the dataset(s).
To add the publicly accessible ISB-CGC datasets (project name:
isb-cgc-bq) which includes the Pan-Cancer Atlas data set ( dataset name:
follow these steps. (Note that these tables also exist in project
isb-cgc, but that ISB-CGC is migrating current data to project
isb-cgc-bq. If you are using the pancancer_atlas tables in
isb-cgc, they are still available for you.)
You should now be able to see and explore all of the Pan-Cancer Atlas tables and also tables of other ISB-CGC data sets. Clicking on the blue triangle next to a dataset name will open it and show the list of tables in the data set. Clicking on a table name will open up information about the table in main panel, where you can view the Schema, Details, or a Preview of the table.
Additional projects with public BigQuery data sets which you may want to explore (repeating the same process will add these to your BigQuery side-panel) include genomics-public-data and google.com:biggene.
You can also search for and learn about Pan-Cancer Atlas tables through the ISB-CGC BigQuery Table Search UI. Type ‘pancancer’ in the Search box in the upper right-hand corner to filter for them.
Pan-Cancer Atlas BigQuery Query Example¶
Ready to query? Follow the steps below to run a query in the Google BigQuery Console. More details are here.
Let’s query using the MC3 somatic mutation table.
Click on COMPOSE NEW QUERY button.
Paste the sample query below into the text-box.
Within a second or two you should see a green circle with a checkmark below the lower right corner of the New Query text-box. – If instead you see a red circle with an exclamation mark, click on it to see what your Syntax Error is.
Once you do have the green circle, you can click on it to see a message like: “Valid: This query will process 76.3 MB when run.”
To execute the query, click on RUN!
WITH mutCounts AS ( SELECT COUNT(DISTINCT( Tumor_SampleBarcode )) AS CaseCount, Hugo_Symbol, HGVSc FROM `isb-cgc-bq.pancancer_atlas.Filtered_MC3_MAF_V5_one_per_tumor_sample` GROUP BY Hugo_Symbol, HGVSc ), mutRatios AS ( SELECT HGVSc, Hugo_Symbol, CaseCount, (CaseCount/SUM(CaseCount) OVER (PARTITION BY Hugo_Symbol)) AS ratio FROM mutCounts ) SELECT * FROM mutRatios WHERE CaseCount>=10 AND ratio>=0.2 AND HGVSc is not null ORDER BY ratio DESC