COSMIC in BigQuery hosted by ISB-CGC

../_images/COSMIC.png

The COSMIC tables in BigQuery, produced in collaboration with the Wellcome Trust Sanger Institute, provide a new way to explore and understand the mutations driving cancer. The availability of COSMIC in BigQuery enables easy integration of this resource with other public datasets in BigQuery, including other open-access datasets made available by the ISB-CGC (see this and that for more details on other publicly accessible BigQuery datasets).

COSMIC Release v81 update: Two new BigQuery datasets now contain all of the tables available for download from the COSMIC ftp site (not just the “Mutant” table which was included for the v80 release). The availability of these additional tables will support many more types of queries – please explore them at (after registering for access as described below):

Details about the underlying COSMIC export files used to create these BigQuery tables can be found in README files for GRCh38 and GRCh37.

Getting Started

Register for access to COSMIC in BigQuery:

  • if you are already a registered user of COSMIC, you will need to go to your account page and add a valid “Google identity” in the Google ID box: when you are signed in to COSMIC, your name in the upper-right corner is a pull-down menu from which you can access your Account Settings;
  • if the Email Address that you initially used when registering for COSMIC is already a valid Google identity, you may simply re-enter the same email address into the Google ID box;
  • if you are not sure whether your institutional (or other) email address is a Google identity, you can check by entering it in the Google password-assistance page; or by asking your IT staff;
  • if you are not currently a registered COSMIC user, you will first need to register, agree to the Terms and Conditions, and supply a valid Google identity in the Google ID box;

Once you have completed these steps, ISB-CGC will obtain the Google identity that you provided and you will be given “viewer” access to the COSMIC tables in BigQuery. You will also be added to an exploratory Google Cloud Platform (GCP) project called isb-cgc-cosmic which will allow you to run queries at no cost to you.

A few important notes:

  • When you register with COSMIC, you create a password for your COSMIC account – which is associated with whatever email address you provided. This password is your COSMIC password, please avoid reusing any other password.
  • If you are not sure what a “Google ID” is, it is the name associated wth a “Google account” – this includes any gmail address. If you do not already have a Google account, you can create one.
  • If you mistype your Google ID, or enter a string that is not a valid Google ID, you will not be able to access the COSMIC tables in BigQuery. Google IDs are not being automatically verified at this time, so please double-check that the Google ID you provided is correct.
  • Avoid using an alias: eg the base account tb@mylab.org might have a longer-form alias like thomas.brown@mylab.org – please enter the ‘base’ name;

Interactive Web-based Exploration

NB: After going through the registration process described above, there will be a short delay before your Google identity is granted the necessary access to BigQuery and the COSMIC data resources. If you get an error when running the sample query in this section, please wait 10-15 minutes and then try again. If you are still not successful, please verify that the Google ID you have provided is a valid Google account. If you are still not able to run the sample query given below, please contact us at feedback@isb-cgc.org.

  • login to your Google account (Chrome is the preferred browser);
  • go to the BigQuery web UI – if you see a welcome screen inviting you to Create a Project then your ISB-CGC registration process is not yet complete;
../_images/COSMIC-sql-00.png
  • click on the big red COMPOSE QUERY button in the upper left corner;
  • click on the Show Options button below the New Query text-box;
  • un-check the Use Legacy SQL check-box (the bottom-most “option”);
  • click on the Hide Options button;
  • paste the sample query below into the New Query text-box;
  • within a second or two you should see a green circle with a check-mark below the lower-right-corner of the New Query text-box – if instead you see a red circle with an exclamation mark, click on it to see what your Syntax Error is;
  • once you do have the green circle, you can click on it to see a message like: “Valid: This query will process 131 MB when run.”
  • to execute the query, click on RUN QUERY !
WITH
  mutCounts AS (
  SELECT
    COUNT(DISTINCT(ID_tumour)) AS CaseCount,
    Mutation_AA,
    Gene_name
  FROM
    `isb-cgc.COSMIC.grch37_v80`
  GROUP BY
    Mutation_AA,
    Gene_name ),
  mutRatios AS (
  SELECT
    Mutation_AA,
    Gene_name,
    CaseCount,
    (CaseCount/SUM(CaseCount) OVER (PARTITION BY Gene_name)) AS ratio
  FROM
    mutCounts )
SELECT
  *
FROM
  mutRatios
WHERE
  CaseCount>=1000
  AND ratio>=0.10
  AND NOT ( Mutation_AA LIKE "%?%" )
ORDER BY
  Gene_name,
  ratio DESC

About the COSMIC BigQuery Tables

The COSMIC BigQuery tables are based on the “CosmicMutantExport” files downloaded from the Sanger ftp site. This file is a tab-separated table containing all COSMIC point mutations from targeted and genome-wide screens. The ISB-CGC COSMIC dataset in BigQuery currently includes the latest COSMIC release (v80) as well as the previous release (v79) for both GRCh37 and GRCh38.

BigQuery Usage Costs

More details about BigQuery costs can be found in the Google documentation. There are two basic types of costs: storage costs and usage costs. ISB-CGC is hosting these COSMIC tables in BigQuery and is paying for the storage costs (with support from NCI). The size of each COSMIC table is less than 1.5 GB and therefore costs less than $0.25 per year to store.

The main costs associated with using BigQuery are the query costs. BigQuery is a cloud-based massively parallel analytic engine which can scan terabytes of data in seconds. Query costs start at $5 (USD) per TB of data scanned, but can be higher for more computationally intensive queries (eg those that include complex user-defined-functions).

For the sample query above, we saw that clicking on the check-mark in the green circle produced this message: Valid: This query will process 131 MB when run. The cost of this specific query can be estimated using this information: ($5/TB) x (131 MB / (1000000 MB/TB)) = $0.000655. This cost is very (perhaps suprisingly) low, but it is always important to think carefully about your queries and to make them as efficient as possible. If you want to derive summary information about all ~20,000 genes, for example, you could do that with a single query that might cost a few pennies, or you might write a less-clever query that returns information only about a single gene and then programmatically loop over all genes, running that single-gene query 20,000 times. Your overall query costs using this less-clever approach, instead of being a few pennies would be several hundred dollars! This latter approach would also take significantly more time.

As your queries become more complex and you begin to join in other resources such as the ISB-CGC genomic-reference or molecular-data tables, the amount of data processed by a single query may increase into the GB or even TB range.

During this introductory period (for at least the next 6 months), all registered COSMIC users will be added to the isb-cgc-cosmic Google Cloud Platform (GCP) project so that they will be able to perform exploratory queries at no cost. (These costs will be paid by ISB-CGC, again with funding from NCI.) Please note that users who perform large numbers of queries and incur significant costs will be removed from the isb-cgc-cosmic GCP project and will be required to create their own GCP projects prior to performing additional queries. If you want to be able to upload your own data to BigQuery or save the results of your queries as new BigQuery tables, you will need to have your own GCP project. (All new GCP users are welcome to take advantage of the Google free trial which includes up to $300 in funding to be used over a period of 60 days.)

Additional Public BigQuery Datasets

There are many public BigQuery datasets containing genomic information, and you can combine any of these resources into your SQL queries on the COSMIC tables – all you need is the name of the table.

In the example query above, the table being queried is isb-cgc.COSMIC.grch37_v80; a complete BigQuery table name has three components:

  • the first part of the name (isb-cgc) is the Google Cloud Platform (GCP) project name;
  • the second part (COSMIC) is the dataset name; and
  • the third part (grch37_v80) is the table name.

To add public BigQuery datasets and tables to your “view” in the BigQuery web UI you need to know the name of the GCP project that owns the dataset(s). To add the publicly accessible ISB-CGC datasets (project name: isb-cgc) follow these steps.

You should now be able to see and explore all of the ISB-CGC public datasets, including the COSMIC dataset, if you are a registered COSMIC user. Clicking on the blue triangle next to a dataset name will open it and show the list of tables in the dataset. Clicking on a table name will open up information about the table in main panel, where you can view the Schema, Details, or a Preview of the table.

Additional projects with public BigQuery datasets which you may want to explore (repeating the same process will add these to your BigQuery side-panel) include genomics-public-data and google.com:biggene.

Additional BigQuery Documentation

The main Google BigQuery documentation can be found here.

Legacy SQL vs Standard SQL

BigQuery introduced support for Standard SQL in 2016. The previous version of SQL supported by BigQuery is now known as Legacy SQL. Note that when you first go to the BigQuery web UI, Legacy SQL will be activated by default and you will need to enable Standard SQL if you want to use Standard SQL. For simple queries, the same syntax will work in both, except for one important detail which is how you specify the table name. A simple Standard SQL query might look like:

SELECT *
  FROM `isb-cgc.COSMIC.grch37_v80`
  LIMIT 1000

whereas the same query in Legacy SQL requires square brackets around the table name and a colon between the project name and the dataset name, like this:

SELECT *
  FROM [isb-cgc:COSMIC.grch37_v80]
  LIMIT 1000

SQL functions

Standard SQL includes a large variety of built-in functions and operators including logical and statistical aggregate functions, and mathematical functions, just to name a few. User-defined functions (UDFs) are also supported and can be used to further extend the types of analyses possible in BigQuery.

Using the bq Command Line Tool

The bq command line tool is part of the cloud SDK and can be used to interact directly with BigQuery from the command line. The cloud SDK is easy to install and is available for most operating systems. You can use bq to create and upload your own tables into BigQuery (if you have your own GCP project), and you can run queries at the command-line like this:

bq query --allow_large_results \
         --destination_table="myproj:dataset:query_output" \
         --nouse_legacy_sql \
         --nodry_run \
         "$(cat myQuery.sql)"

(where myQuery.sql is a plain-text file containing the SQL, and the destination table is in an existing BigQuery dataset in your project).

Using BigQuery from R

BigQuery can be accessed from R using one of two powerful R packages: bigrquery and dplyr. Please refer to the documentation provided with these packages for more information.

Using BigQuery from Python

BigQuery client libraries are available that let you interact with BigQuery from Python or other languages. In addition, the experimental pandas.io.gbq module provides a wrapper for BigQuery.

Getting Help

Aside from the documentation, the best place to look for help using BigQuery and tips and tricks with SQL is StackOverflow. If you tag your question with google-bigquery your question will quickly get the attention of Google BigQuery experts. You may also find that your question has already been asked and answered among the nearly 10,000 questions that have already been asked about BigQuery on StackOverflow.

More SQL Examples

Let’s start with a few simple examples to get some practice using BigQuery, and to explore some of the available fields in these COSMIC tables.

../_images/COSMIC-sql-01.png

Note that all of these examples are in “Standard SQL”, so make sure that you have that enabled. (See instructions above regarding un-checking the “Legacy SQL” box in the BigQuery web UI.)

1. How many mutations have been observed in KRAS?

SELECT
  COUNT(DISTINCT(ID_sample)) AS numSamples,
  COUNT(DISTINCT(ID_tumour)) AS numTumours
FROM
  `isb-cgc.COSMIC.grch37_v80`
WHERE
  Gene_name="KRAS"

You can simply copy-and-paste any of the SQL queries on this page into the BigQuery web UI . The screen-shot shown here shows the query in the “New Query” box, and the results down below. Just click on the “RUN QUERY” button to run the query. Notice the green check-mark indicating that the query looks good.

2. What other information is available about these KRAS mutant tumours?

In addition to answering the question above, this next query also illustrates usage of the WITH construct to create an intermediate table on the fly, and then use it in a follow-up SELECT:

WITH
  t1 AS (
  SELECT
    ID_tumour,
    Primary_site,
    Primary_histology,
    Mutation_AA,
    Mutation_Description,
    FATHMM_prediction,
    Sample_source
  FROM
    `isb-cgc.COSMIC.grch37_v80`
  WHERE
    Gene_name="KRAS"
  GROUP BY
    ID_tumour,
    Primary_site,
    Primary_histology,
    Mutation_AA,
    Mutation_Description,
    FATHMM_prediction,
    Sample_source )
SELECT
  COUNT(*) AS n,
  Primary_site,
  Primary_histology,
  Mutation_AA,
  Mutation_Description,
  FATHMM_prediction,
  Sample_source
FROM
  t1
GROUP BY
  Primary_site,
  Primary_histology,
  Mutation_AA,
  Mutation_Description,
  FATHMM_prediction,
  Sample_source
ORDER BY
  n DESC

3. What are the most frequently observed mutations and how often do they occur?

WITH
  t1 AS (
  SELECT
    ID_tumour,
    Gene_name,
    Mutation_AA,
    Mutation_Description
  FROM
    `isb-cgc.COSMIC.grch37_v80`
  GROUP BY
    ID_tumour,
    Gene_name,
    Mutation_AA,
    Mutation_Description )
SELECT
  COUNT(*) AS n,
  Gene_name,
  Mutation_AA,
  Mutation_Description
FROM
  t1
GROUP BY
  Gene_name,
  Mutation_AA,
  Mutation_Description
HAVING
  n >=1000
ORDER BY
  n DESC

4. Joining COSMIC to Kaviar and Ensembl – all in BigQuery!

Now let’s try something a bit more complicated! We’re going to query the COSMIC database and then join the intermediate results with the Kaviar table and the Ensembl gene-set. (Note that since most of the available TCGA data is based on GRCH37/hg19, we’re still frequently using those older reference sources, but newer gene-sets are also available in the ISB-CGC genome_reference dataset in BigQuery.

Kaviar is a large database of known variants which is also available in BigQuery, hosted by the ISB-CGC. In the complex query below, we will extract a subset of commonly observed mutations in cancer from COSMIC and then see how many of them have also been observed in “normal” genomes (Kaviar excludes cancer genomes but includes some data from cell lines and individuals affected by disease.)

WITH
  --
  -- *COSMIC_t1*
  -- Our first subquery intermediate table extracts just the sample-name, nucleotide-change
  -- and genomic coordinates from the COSMIC table for all single-nucleotide mutations.
  -- The resulting intermediate table contains ~3.7M rows
  COSMIC_t1 AS ( SELECT
    -- some of the TCGA identifiers are 12-characters long and some 15 -- this CASE statement
    -- just strips off the additional 3 characters from the longer identifiers
    (CASE
        WHEN (Sample_name LIKE 'TCGA-%' AND CHAR_LENGTH(Sample_name)>12) THEN SUBSTR(Sample_name,1,12)
        ELSE Sample_name END) AS Sample_name,
    -- here we split off just the nucleotide-change, eg "G>T"
    SUBSTR(Mutation_CDS,-3,3) AS COSMIC_nucChange,
    -- here we're splitting up the genomic coordinate into it's three component parts:
    SPLIT(Mutation_genome_position,':')[OFFSET(0)] AS chr,
    CAST(SPLIT(SPLIT(Mutation_genome_position,':')[OFFSET(1)],'-')[OFFSET(0)] AS INT64) AS startPos,
    CAST(SPLIT(SPLIT(Mutation_genome_position,':')[OFFSET(1)],'-')[OFFSET(1)] AS INT64) AS endPos
  FROM
    `isb-cgc.COSMIC.grch37_v80`
  WHERE
    Mutation_genome_position IS NOT NULL
    AND GRCh=37
    AND SUBSTR(Mutation_CDS,-2,1)='>'
  GROUP BY
    Sample_name,
    Mutation_CDS,
    Mutation_genome_position ),
  --
  -- *COSMIC_t2*
  -- Next, we want to count up how frequently these mutations have been observed, and keep
  -- only those mutations that are observed in at least 100 samples in COSMIC: this brings
  -- our number of "interesting" mutations down to just 167, with caseCounts ranging from
  -- over 40,000 down to 100.
  COSMIC_t2 AS (
  SELECT
    COUNT(*) AS caseCounts,
    COSMIC_nucChange,
    chr,
    startPos,
    endPos
  FROM
    COSMIC_t1
  GROUP BY
    COSMIC_nucChange,
    chr,
    startPos,
    endPos
  HAVING
    caseCounts>=100 ),
  --
  -- *fromKaviar*
  -- Now we want to bring the Kaviar database into our analysis: we're going to extract most of the
  -- columns from the Kaviar table, while adjusting the 0-based coordinates and keeping only the
  -- single-nucleotide variants that were seen at least 10 times.
  -- The resulting intermediate table has ~33.5M rows.
  fromKaviar AS (
  SELECT
    reference_name AS chr,
    (start_pos+1) AS startPos,
    (end_pos+0) AS endPos,
    reference_bases,
    alternate_bases,
    MAX(AC) AS AC,
    MAX(AF) AS AF,
    MAX(AN) AS AN
  FROM
    `isb-cgc.genome_reference.Kaviar_160204_Public_hg19`
  WHERE
    (end_pos-start_pos)=1
    AND CHAR_LENGTH(reference_bases)=1
    AND CHAR_LENGTH(alternate_bases)=1
  GROUP BY
    reference_name,
    start_pos,
    end_pos,
    reference_bases,
    alternate_bases
  HAVING
    AC>=10 ),
  --
  -- *join1*
  -- Now we're going to join the table of frequent COSMIC variants to the intermediate Kaviar table,
  -- requring that the genomic coordinates and the nucleotides match.
  join1 AS (
  SELECT
    c.caseCounts AS caseCounts,
    c.COSMIC_nucChange AS nucChange,
    c.chr AS chr,
    c.startPos AS startPos,
    c.endPos AS endPos,
    k.AC AS Kaviar_AC,
    k.AF AS Kaviar_AF,
    k.AN AS Kaviar_AN
  FROM
    COSMIC_t2 c
  JOIN
    fromKaviar k
  ON
    c.chr=k.chr
    AND c.startPos=k.startPos
    AND c.endPos=k.endPos
    -- just in case the reference and the alternate have been swapped,
    -- we check for both kinds of matches:
    AND ( (reference_bases=SUBSTR(c.COSMIC_nucChange,1,1)
        AND alternate_bases=SUBSTR(c.COSMIC_nucChange,3,1))
      OR (reference_bases=SUBSTR(c.COSMIC_nucChange,3,1)
        AND alternate_bases=SUBSTR(c.COSMIC_nucChange,1,1)) ) ),
  --
  -- *Ensembl*
  -- Before we finish, we want to also pull in some information from Ensembl,
  -- so we're going to select a few columns from the Ensembl_GRCh37_75 table
  -- (also publicly available in BigQuery).  This subquery will create a
  -- table with information about ~132k exons:
  Ensembl AS (
  SELECT
    gene_name,
    exon_id,
    seq_name,
    start,
    `end`
  FROM
    `isb-cgc.genome_reference.Ensembl_GRCh37_75`
  WHERE
    exon_number IS NOT NULL
    AND feature="exon"
    AND transcript_source="ensembl"
  GROUP BY
    gene_name,
    exon_id,
    seq_name,
    start,
    `end` )
  --
  -- In our final step, we will join the results of the earlier join with the
  -- Ensembl reference information obtained above.
  -- We're down to just a handful of mutations which, for the most part occur
  -- frequently in COSMIC and quite rarely in Kaviar.
SELECT
  caseCounts,
  nucChange,
  chr,
  startPos AS pos,
  Kaviar_AC,
  Kaviar_AF,
  Kaviar_AN,
  gene_name,
  exon_id
FROM
  join1 j
JOIN
  Ensembl r
ON
  j.chr=r.seq_name
  AND r.start<=j.startPos
  AND r.`end`>=j.endPos
ORDER BY
  caseCounts DESC,
  Kaviar_AF DESC

Our final result includes only 8 mutations that are found relatively frequently in COSMIC and are also found (though generally rarely) in Kaviar. Notice the frequently-mutated gene PRSS3 (which encodes a member of the trypsin family of serine proteases), and the major histocompatibility gene HLA-A.

../_images/COSMIC-Kaviar-sql-01.png

You can also click on the “Explanation” button to see diagnostic information about the completed query’s execution plan. This feature is similar to the EXPLAIN statement available in some other query engines. You can often use this information to improve query performance.

../_images/COSMIC-Kaviar-sql-02.png


Stay-tuned, more examples coming soon!

If you have a specific use-case that you need help with, feel free to contact us!


Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.