BigQuery Syntax

Legacy SQL vs Standard SQL

BigQuery introduced support for Standard SQL in 2016. The previous version of SQL supported by BigQuery is now known as Legacy SQL.

Note that when you first go to the BigQuery web UI, Standard SQL will be activated by default and you will need to enable Legacy SQL if you want to use Legacy SQL. For simple queries, the same syntax will work in both, except for one important detail which is how you specify the table name. A simple Standard SQL query might look like:

SELECT *
  FROM `isb-cgc.TCGA_hg38_data_v0.Somatic_Mutation_DR10`
  LIMIT 1000

whereas the same query in Legacy SQL requires square brackets around the table name and a colon between the project name and the dataset name, like this:

SELECT *
  FROM [isb-cgc:TCGA_hg38_data_v0.Somatic_Mutation_DR10]
  LIMIT 1000

(Although please note that you can use the “Preview” feature in the BigQuery web UI, at no cost, instead of doing a SELECT * which will do a full table scan!)

Query Syntax Examples

Simple Query Examples

Let’s start with a few simple examples to get some practice using BigQuery. Note that all of these examples are in “Standard SQL”. You can simply copy-and-paste any of the SQL queries on this page into the BigQuery web UI https://console.cloud.google.com/bigquery.

1. How many mutations have been observed in KRAS?

SELECT
  COUNT(DISTINCT(sample_barcode_tumor)) AS numSamples
FROM
  `isb-cgc.TCGA_hg38_data_v0.Somatic_Mutation_DR10`
WHERE
  Hugo_Symbol="KRAS"

The following screen-shot below shows the query in the “Query Editor” box, and the results down below. Just click on the “RUN QUERY” button to run the query. Notice the green check-mark indicating that the SQL query syntax looks good.

../../../_images/SimpleSQLExample1.png

2. What other information is available about these KRAS mutant tumours?

In addition to answering the question above, this next query also illustrates usage of the WITH construct to create an intermediate table on the fly, and then use it in a follow-up SELECT:

WITH
  t1 AS (
  SELECT
    project_short_name,
    sample_barcode_tumor,
    Hugo_Symbol,
    Variant_Classification,
    Variant_Type,
    SIFT,
    PolyPhen
  FROM
    `isb-cgc.TCGA_hg38_data_v0.Somatic_Mutation_DR10`
  WHERE
    Hugo_Symbol="KRAS"
  GROUP BY
    project_short_name,
    sample_barcode_tumor,
    Hugo_Symbol,
    Variant_Classification,
    Variant_Type,
    SIFT,
    PolyPhen )
SELECT
  COUNT(*) AS n,
  Hugo_Symbol,
  Variant_Classification,
  Variant_Type,
  SIFT,
  PolyPhen
FROM
  t1
GROUP BY
  Hugo_Symbol,
  Variant_Classification,
  Variant_Type,
  SIFT,
  PolyPhen
ORDER BY
  n DESC
../../../_images/SimpleSQLExample2.png

3. What are the most frequently observed mutations and how often do they occur?

WITH
  t1 AS (
  SELECT
    sample_barcode_tumor,
    Hugo_Symbol,
    Variant_Classification,
    Variant_Type,
    SIFT,
    PolyPhen
  FROM
    `isb-cgc.TCGA_hg38_data_v0.Somatic_Mutation_DR10`
  GROUP BY
    sample_barcode_tumor,
    Hugo_Symbol,
    Variant_Classification,
    Variant_Type,
    SIFT,
    PolyPhen )
SELECT
  COUNT(*) AS n,
  Hugo_Symbol,
  Variant_Classification,
  Variant_Type,
  SIFT,
  PolyPhen
FROM
  t1
GROUP BY
  Hugo_Symbol,
  Variant_Classification,
  Variant_Type,
  SIFT,
  PolyPhen
ORDER BY
  n DESC
../../../_images/SQLSimpleExample3.png

Querying from more than one table (Joining)

Q: For bladder cancer patients that have mutations in the CDKN2A (cyclin-dependent kinase inhibitor 2A) gene, what types of mutations are they, what is their gender, vital status, and days to death - and for 3 downstream genes (MDM2 (MDM2 proto-oncogene), TP53 (tumor protein p53), CDKN1A (cyclin-dependent kinase inhibitor 1A)), what are the gene expression levels for each patient?

This question was chosen as an interesting example because the p53/Rb pathway is commonly involved in bladder cancer (see TCGA Network paper “Comprehensive Molecular Characterization of Urothelial Bladder Carcinoma”, Figure 4).

This is a complex question that requires information from four tables. We will build up this complex query in legacy SQL three steps. Change the query settings to legacy SQL.

Step 1

Finding the patients with bladder cancer that have mutations in the CDKN2A gene, and displaying the patient ID and the type of mutation

SELECT
  mutation.case_barcode,
  mutation.Variant_Type
FROM
  [isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC] AS mutation
WHERE
  mutation.Hugo_Symbol = 'CDKN2A'
  AND project_short_name = 'TCGA-BLCA'
GROUP BY
  mutation.case_barcode,
  mutation.Variant_Type
ORDER BY
  mutation.case_barcode
../../../_images/BigQueryExample1.png

We now have the list of patients that have a mutation in the CDKN2A gene and the type of mutation.

Notice that we have named the “isb-cgc:TCGA_hg19_data_v0.Somatic_Mutation_DCC” table “mutation” using the AS statement. This is useful for easier reading and composing of complex queries.

Step 2

Bringing in the patient data from the ISB-CGC TCGA Clinical table so that we can see each patient’s gender, vital status and days to death.

SELECT
  case_list.mutation.case_barcode AS case_barcode,
  case_list.mutation.Variant_Type AS Variant_Type,
  clinical.gender,
  clinical.vital_status,
  clinical.days_to_death
FROM
  /* this will get the unique list of cases having the TP53 gene mutation in BRCA cases*/ (

  SELECT
    mutation.case_barcode,
    mutation.Variant_Type
  FROM
    [isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC] AS mutation
  WHERE
    mutation.Hugo_Symbol = 'CDKN2A'
    AND project_short_name = 'TCGA-BLCA'
  GROUP BY
    mutation.case_barcode,
    mutation.Variant_Type
  ORDER BY
    mutation.case_barcode,
    ) AS case_list /* end case_list */
JOIN
  [isb-cgc.TCGA_bioclin_v0.Clinical] AS clinical
ON
  case_list.case_barcode = clinical.case_barcode
../../../_images/BigQueryExample2.png

We now have combined information from two tables through a join. Notice in particular the join syntax, and the fact that for the join (inner join by default), the fields that are identiical between the mutation table and the clinical table is “case_barcode”.

Step 3

Show the gene expression levels for the 4 genes of interest, and order them by case id (Case Barcode) and gene name (HGNC_gene_symbol).

SELECT
  genex.case_barcode AS case_barcode,
  genex.sample_barcode AS sample_barcode,
  genex.aliquot_barcode AS aliquot_barcode,
  genex.HGNC_gene_symbol AS HGNC_gene_symbol,
  case_list.Variant_Type AS Variant_Type,
  genex.gene_id AS gene_id,
  genex.normalized_count AS normalized_count,
  genex.project_short_name AS project_short_name,
  clinical_info.clinical.gender AS gender,
  clinical_info.clinical.vital_status AS vital_status,
  clinical_info.clinical.days_to_death AS days_to_death
FROM ( /* This will get the clinical information for the cases*/
  SELECT
    case_list.mutation.Variant_Type AS Variant_Type,
    case_list.mutation.case_barcode AS case_barcode,
    clinical.gender,
    clinical.vital_status,
    clinical.days_to_death
  FROM
    /* this will get the unique list of casess having the CDKN2A gene mutation in bladder cancer BLCA cases*/ (

    SELECT
      mutation.case_barcode,
      mutation.Variant_Type
    FROM
      [isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC] AS mutation
    WHERE
      mutation.Hugo_Symbol = 'CDKN2A'
      AND project_short_name = 'TCGA-BLCA'
    GROUP BY
      mutation.case_barcode,
      mutation.Variant_Type
    ORDER BY
      mutation.case_barcode,
      ) AS case_list /* end case_list */
  INNER JOIN
    [isb-cgc.TCGA_bioclin_v0.Clinical] AS clinical
  ON
    case_list.case_barcode = clinical.case_barcode /* end clinical annotation */ ) AS clinical_info
INNER JOIN
  [isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM] AS genex
ON
  genex.case_barcode = case_list.case_barcode
WHERE
  genex.HGNC_gene_symbol IN ('MDM2',
    'TP53',
    'CDKN1A',
    'CCNE1')
ORDER BY
  case_barcode,
  HGNC_gene_symbol
../../../_images/BigQueryExample3.png

We have now gotten all the data together in one table for further analysis.

Note that the final join surrounds the previous join top and bottom. This is common method of doing joins.

You can either download the results from a query in either CSV or JSON format, or save it for further analysis in Google BigQuery by the “Save as Table” button. As the next section describes, large queries continuing to combine multiple tables in a gene query may be limited by cost and resources, saving results as intermediate tables is a solution to these issues.

Saving Query Results to other BigQuery Tables

You can easily save query results in intermediate tables in your project, allowing others to view and use them. Details from Google on how to do that is here. If your query gets too complex it can take too long to run. Creating intermediate result tables can be a good approach to obtain the same result more quickly and at a lower cost.

SQL Functions

Standard SQL includes a large variety of built-in functions and operators including logical and statistical aggregate functions, and mathematical functions, just to name a few. User-defined functions (UDFs) are also supported and can be used to further extend the types of analyses possible in BigQuery.

Using the bq Command Line Tool

The bq command line tool is part of the cloud SDK and can be used to interact directly with BigQuery from the command line. The cloud SDK is easy to install and is available for most operating systems. You can use bq to create and upload your own tables into BigQuery (if you have your own GCP project), and you can run queries at the command-line like this:

bq query --allow_large_results \
         --destination_table="myproj:dataset:query_output" \
         --nouse_legacy_sql \
         --nodry_run \
         "$(cat myQuery.sql)"

(where myQuery.sql is a plain-text file containing the SQL, and the destination table is in an existing BigQuery dataset in your project).

Using BigQuery from R

BigQuery can be accessed from R using one of two powerful R packages: bigrquery and dplyr. Please refer to the documentation provided with these packages for more information.

Using BigQuery from Python

BigQuery client libraries are available that let you interact with BigQuery from Python or other languages. In addition, the pandas.io.gbq module provides a wrapper for BigQuery.

Getting Help

Aside from the documentation, the best place to look for help using BigQuery and tips and tricks with SQL is StackOverflow. If you tag your question with google-bigquery your question will quickly get the attention of Google BigQuery experts. You may also find that your question has already been asked and answered among the nearly 10,000 questions that have already been asked about BigQuery on StackOverflow.


Have feedback or corrections? Please email us at feedback@isb-cgc.org.