Variant Data SQL Query Examples¶
Here are examples of how to leverage SQL queries to analyze the variant data in our tables on the Google Cloud Console. They include some SQL snippets which emulate the commands from VCFTools. VCFTools is a suite of functions for processing, validating, analyzing, and more on VCF files. Because this is controlled data, it can only be queried if you have authorized access; see here for more information.
Note
Best practice to keep costs down for queries is to avoid using ‘*’ and instead select specific columns.
Note
In our VCF tables, the POS column is an integer; no quotes are necessary when querying for this column. The CHROM column is a string and requries quotes when querying, for example, CHROM = ‘chr1’.
Emulating VCFTools¶
These SQL queries replicate the functionality of the listed VCFTools command.
—chr: Filter by position by including or excluding sites of interest; for example, searching for positions found at chromosome 22.
SELECT
CHROM, POS, REF, ALT
FROM
`isb-cgc-cbq.TARGET_versioned.vcf_hg38_gdc_r22`
WHERE
CHROM = 'chr22'
LIMIT 1000
—remove-filter-all: Removes the sites which do not have the tag PASS in the FILTER column.
SELECT
CHROM, POS, REF, ALT, FILTER
FROM
`isb-cgc-cbq.TARGET_versioned.vcf_hg38_gdc_r22`
WHERE
FILTER = 'PASS'
LIMIT 1000
—maxDP: This function requires the “DP” tag to exist in the FORMAT column. The option will locate genotypes less than or equal to the “–maxDP” value.
SELECT
CHROM, POS, REF, ALT, DP_Tumor
FROM
`isb-cgc-cbq.TARGET_versioned.vcf_hg38_gdc_r22`
WHERE
AND DP_Tumor <= '50'
LIMIT 1000
In-Depth Queries¶
In this query, let’s find information for patients who have ALL-P2 and a Thymine mutation at position 161550724 on Chromosome 1.
SELECT
CHROM, POS, REF, ALT, project_short_name
FROM
`isb-cgc-cbq.TARGET_versioned.vcf_hg38_gdc_r22`
WHERE
project_short_name = "TARGET-ALL-P2"
AND CHROM = "chr1"
AND POS = 161550724
AND ALT = "T"
In this query, let’s look at chromosome 1. We want to find positions between twenty thousand and five million. We are interested in chromosome and position from a specific project and with a certain analysis workflow type. In this case, we want to look at the project TARGET-WT. These are patients who are diagnosed with Wilms tumor. For the analysis workflow type, we are interested in MuTect2.
SELECT
CHROM,POS,REF,ALT,GT_TUMOR,GT_NORMAL
FROM
`isb-cgc-cbq.TARGET_versioned.vcf_hg38_gdc_r22`
WHERE
CHROM = 'chr1'
AND POS BETWEEN 20000 and 5000000
AND project_short_name = "TARGET-WT"
AND analysis_workflow_type = "MuTect2"
The query below returns the ref and alt alleles found between base positions 20,000 and 5,000,000 on chromosome 1 along with genotype information for whole genome tumor and normal samples (using filter analysis_workflow_type like %LiftOver%) across all TARGET projects.
SELECT
CHROM,POS,REF,ALT,project_short_name, GT_TUMOR,GT_NORMAL
FROM
`isb-cgc-cbq.TARGET_versioned.vcf_hg38_gdc_r22`
WHERE
CHROM = 'chr1'
AND POS BETWEEN 20000 and 5000000
AND analysis_workflow_type like "%LiftOver%"
We demonstrate a join in the query below between the TARGET VCF table and the TARGET RNAseq table to get information for the TARGET-ALL-P3 to identify mutations in the FOXD4 gene.
SELECT
CHROM,POS,REF,ALT,vcf.project_short_name, HTSeq__FPKM, GT_TUMOR,GT_NORMAL
FROM
`isb-cgc-cbq.TARGET_versioned.vcf_hg38_gdc_r22` as vcf
join `isb-cgc-bq.TARGET.RNAseq_hg38_gdc_current` as rna
on rna.case_barcode = vcf.case_barcode
WHERE
vcf.project_short_name = "TARGET-ALL-P3"
AND gene_name = "FOXD4"
ORDER By CHROM