******************************* Pan-Cancer Atlas BigQuery Data ******************************* `The Pan-Cancer Atlas BigQuery data set `_ was produced in collaboration with the `TCGA research network `_, the `GDC `_, and the `NCI `_. This rich data set allows for an integrated examination of the full set of tumors characterized in the robust TCGA dataset and provides a new way to explore and analyze the processes driving cancer. The availability of Pan-Cancer Atlas data in BigQuery enables easy integration of this resource with other public data sets in BigQuery, including other open-access datasets made available by the ISB-CGC (see `this `_ and `that `_ for more details on other publicly accessible BigQuery data sets). About Pan-Cancer Atlas Data ########################### The Pan-Cancer Atlas BigQuery tables (`accessed here `_) mirror most of the files shared by the Pan-Cancer Atlas initiative on the `GDC PanCanAtlas Publications page `_. The tables are generally unmodified uploads of the files in the `GDC Pan-Cancer Atlas `_. The Filtered_* tables were annotated as appropriate with ParticipantBarcode, SampleBarcode, AliquotBarcode, SampleTypeLetterCode, SampleType and TCGA Study. Subsequently the tables were filtered using the Pan-Cancer Atlas whitelist (which is the list of TCGA barcodes included in the Pan-Cancer Atlas). Two exceptions are the (public) *MC3 MAF file* and the *TCGA-CDR resource*. Use of the tables starting with Filtered_* is recommended. See examples of statistical Jupyter notebooks using the Pan-Cancer Atlas data `here `_. Adding the Pan-Cancer Atlas tables to your workspace #################################################### If you are new to using ISB-CGC Google BigQuery data sets, see the `Quickstart Guide `_ to learn how to obtain a Google identity and how to set up a Google Cloud Project. To add public BigQuery data sets and tables to your "view" in the `Google BigQuery Console `_ you need to know the name of the GCP project that owns the dataset(s). To add the publicly accessible ISB-CGC datasets (project name: ``isb-cgc-bq``) which includes the Pan-Cancer Atlas data set ( dataset name: ``pancancer_atlas``) follow these steps_. (Note that these tables also exist in project ``isb-cgc``, but that ISB-CGC is migrating current data to project ``isb-cgc-bq``. If you are using the pancancer_atlas tables in ``isb-cgc``, they are still available for you.) .. _steps: http://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/progapi/bigqueryGUI/LinkingBigQueryToIsb-cgcProject.html You should now be able to see and explore all of the Pan-Cancer Atlas tables and also tables of other ISB-CGC data sets. Clicking on the blue triangle next to a dataset name will open it and show the list of tables in the data set. Clicking on a table name will open up information about the table in main panel, where you can view the Schema, Details, or a Preview of the table. Additional projects with public BigQuery data sets which you may want to explore (repeating the same process will add these to your BigQuery side-panel) include genomics-public-data and google.com:biggene. You can also search for and learn about Pan-Cancer Atlas tables through the `ISB-CGC BigQuery Table Search UI `_. Type 'pancancer' in the **Search** box in the upper right-hand corner to filter for them. Pan-Cancer Atlas BigQuery Query Example ####################################### Ready to query? Follow the steps below to run a query in the Google BigQuery Console. More details are `here `_. * `Login `_ to your Google account (`Chrome `_ is the preferred browser); * Go to the `BigQuery Console `_. Let's query using the MC3 somatic mutation table. * Click on **COMPOSE NEW QUERY** button. * Paste the sample query below into the text-box. * Within a second or two you should see a green circle with a checkmark below the lower right corner of the New Query text-box. -- If instead you see a red circle with an exclamation mark, click on it to see what your Syntax Error is. * Once you do have the green circle, you can click on it to see a message like: "Valid: This query will process 76.3 MB when run." * To execute the query, click on **RUN**! .. code-block:: sql WITH mutCounts AS ( SELECT COUNT(DISTINCT( Tumor_SampleBarcode )) AS CaseCount, Hugo_Symbol, HGVSc FROM `isb-cgc-bq.pancancer_atlas.Filtered_MC3_MAF_V5_one_per_tumor_sample` GROUP BY Hugo_Symbol, HGVSc ), mutRatios AS ( SELECT HGVSc, Hugo_Symbol, CaseCount, (CaseCount/SUM(CaseCount) OVER (PARTITION BY Hugo_Symbol)) AS ratio FROM mutCounts ) SELECT * FROM mutRatios WHERE CaseCount>=10 AND ratio>=0.2 AND HGVSc is not null ORDER BY ratio DESC