Reference Data

ISB-CGC Hosted Reference Data

To facilitate working with the TCGA data tables that the ISB-CGC hosts in BigQuery, additional reference data tables have also been created. Other reference data is hosted by Google Genomics, and suggestions for more are welcome at

Genome Reference Data

This section covers reference data that describes or annotates the human (or other) genome(s). Reference data hosted by the ISB-CGC in BigQuery tables is available in the isb-cgc.genome_reference data set. Tables based on gene-sets such as Ensembl and GENCODE can be used, for example, to find the genomic coordinates and identifiers for genes of interest, or to write queries that join tables of gene-symbol based data to tables that use genomic coordinates or other gene identifiers.
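As a rough illustration of the gene-symbol lookup described above, the following Python sketch builds a BigQuery Standard SQL string that retrieves coordinates for a few genes. The table name (isb-cgc.genome_reference.GENCODE_v19) and the column names (gene_name, seq_name, start, end, feature) are assumptions made for this example, not taken from this page; check them against the Schema page of the table you actually use.

```python
import re

# Assumed table name; confirm it in the isb-cgc.genome_reference data set.
DEFAULT_TABLE = "isb-cgc.genome_reference.GENCODE_v19"

# Gene symbols are restricted to a conservative character set so they can be
# embedded safely as SQL string literals.
_SYMBOL = re.compile(r"^[A-Za-z0-9_.\-]+$")


def gene_coordinates_query(gene_symbols, table=DEFAULT_TABLE):
    """Build a Standard SQL query for gene-level coordinates.

    Column names (gene_name, seq_name, start, end, feature) are assumptions;
    compare them with the table's Schema page before running the query.
    """
    symbols = list(gene_symbols)
    if not symbols:
        raise ValueError("at least one gene symbol is required")
    for symbol in symbols:
        if not _SYMBOL.match(symbol):
            raise ValueError(f"unexpected gene symbol: {symbol!r}")
    quoted = ", ".join(f"'{s}'" for s in symbols)
    # `start` and `end` are back-quoted because END is a reserved keyword.
    return (
        "SELECT gene_name, seq_name, `start`, `end`\n"
        f"FROM `{table}`\n"
        "WHERE feature = 'gene'\n"
        f"  AND gene_name IN ({quoted})\n"
        "ORDER BY gene_name"
    )


if __name__ == "__main__":
    print(gene_coordinates_query(["TP53", "BRCA1"]))
```

The resulting string can be pasted into the BigQuery web UI. The same result could be joined, on the gene symbol, to any table that is keyed by gene symbol.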

For additional details about any of these tables, open the table in the BigQuery web UI and look at the information on the Details page. (Look for the Details button between the Schema and Preview buttons, beneath the table name.)

  • Ensembl:
    • GRCh37: Release 75, the final build of the Ensembl gene-set mapped to GRCh37
    • GRCh38: Release 87, the most recent Ensembl gene-set mapped to GRCh38
  • GENCODE:
    • GRCh37: Release 19, the final build of the GENCODE gene-set mapped to GRCh37
    • GRCh38: Releases 22, 23, and 24 from GENCODE are all available (because the TCGA data has been reprocessed by at least one center using each of these three different releases)
  • Gene Ontology Consortium: Tables based on GO annotations and the GO ontology.
  • Kaviar: The latest hg19- and hg38-based Kaviar databases are available. Kaviar is a compilation of SNVs, indels, and complex variants observed in humans, designed to facilitate testing for the novelty and frequency of observed variants.
  • liftOver_hg19_to_hg38: This table provides a mapping of each hg19 position to the corresponding position in hg38, and can be used to perform a liftOver operation in BigQuery.
  • miRBase:
    • GRCh37: The human portion of version 20 of the miRBase database, including genomic coordinates for human microRNAs.
    • GRCh38: The human portion of version 21 of the miRBase database, including genomic coordinates for human microRNAs.
  • miRTarBase: The recently updated miRTarBase database (release 6.1)
  • Reactome:
    • Ensembl2Reactome
    • miRBase2Reactome
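To make the liftOver_hg19_to_hg38 entry in the list above more concrete, here is a minimal in-memory Python sketch of what a position-level liftOver amounts to: each hg19 (chromosome, position) pair is looked up in a mapping and replaced by its hg38 counterpart, with unmapped positions left empty. The coordinates below are invented toy values, not rows from the real table, and the real table's column names are not shown on this page. In BigQuery the same operation would presumably be a LEFT JOIN on chromosome and position.

```python
from typing import Dict, List, Optional, Tuple

Position = Tuple[str, int]  # (chromosome, position)


def lift_over(
    positions: List[Position],
    mapping: Dict[Position, Position],
) -> List[Optional[Position]]:
    """Map each hg19 position to hg38; None where the mapping has no entry.

    This mirrors a LEFT JOIN: every input position yields one output row,
    whether or not a match exists.
    """
    return [mapping.get(position) for position in positions]


# Toy stand-in for the liftOver table; these coordinates are invented.
toy_mapping: Dict[Position, Position] = {
    ("chr1", 1000): ("chr1", 1100),
    ("chr1", 2000): ("chr1", 2100),
}

if __name__ == "__main__":
    hg19_positions = [("chr1", 1000), ("chr2", 5)]
    for src, dst in zip(hg19_positions, lift_over(hg19_positions, toy_mapping)):
        print(src, "->", dst)
```

Returning None for unmapped positions, rather than dropping them, keeps the output aligned with the input so that regions lost between builds remain visible.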

Platform Reference Data

Some reference data is necessary to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section provides links to existing sources of information elsewhere on the web and describes additional resources hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at

  • DNA Methylation Platform:
    • Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older, 27k array.
    • Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc.platform_reference.methylation_annotation.
    • Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.
    • The original Illumina-provided CpG coordinates have been “lifted over” from hg19 to hg38.
  • Genome-Wide SNP Array:
    • The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.
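The cross-reference between the TCGA DNA methylation data and the platform annotation table, described in the DNA Methylation Platform entry above, could be sketched as a join on the CpG probe identifier. The Python helper below only builds the SQL string. The annotation table name comes from this page, but the data table name passed in and both probe column names (defaulting to probe_id) are hypothetical placeholders that need to be replaced with the names shown on each table's Schema page.

```python
import re

# Annotation table named in the text above.
ANNOTATION_TABLE = "isb-cgc.platform_reference.methylation_annotation"

_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_TABLE = re.compile(r"^[A-Za-z0-9_.\-]+$")


def annotated_methylation_query(
    data_table,
    data_probe_col="probe_id",   # hypothetical column name
    annot_probe_col="probe_id",  # hypothetical column name
    limit=10,
):
    """Build a Standard SQL join of a methylation table to its annotation."""
    if not _TABLE.match(data_table):
        raise ValueError(f"unexpected table name: {data_table!r}")
    for column in (data_probe_col, annot_probe_col):
        if not _COLUMN.match(column):
            raise ValueError(f"unexpected column name: {column!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Selecting the alias `a` returns the whole annotation row as one STRUCT,
    # which avoids column-name clashes between the two tables.
    return (
        f"SELECT d.{data_probe_col} AS probe_id, a AS annotation\n"
        f"FROM `{data_table}` AS d\n"
        f"JOIN `{ANNOTATION_TABLE}` AS a\n"
        f"  ON d.{data_probe_col} = a.{annot_probe_col}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # "my-project.my_dataset.methylation" is a placeholder, not a real table.
    print(annotated_methylation_query("my-project.my_dataset.methylation"))
```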

Other Reference Data Sources

Google Genomics maintains a list of publicly available data sets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.
