Reference Data

ISB-CGC Hosted Reference Data

To facilitate working with the TCGA and other program data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have been created. Others are hosted by Google Cloud Life Sciences. Suggestions for more are welcome at feedback@isb-cgc.org.

For additional details about each of these tables, please use the BigQuery Table Search. To find the reference tables, select Genomic Reference Database under Category.

Genome Reference Data

This section covers reference data that describes or annotates the human genome and other genomes. Reference data hosted by the ISB-CGC in BigQuery is available in the isb-cgc.genome_reference data set. Tables based on gene sets such as Ensembl and GENCODE can be used, for example, to find the genomic coordinates and identifiers for genes of interest, or to write queries that join tables keyed by gene symbol to tables keyed by genomic coordinates or by other gene identifiers.
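As a rough illustration of the coordinate lookup described above, the following Python sketch builds a BigQuery Standard SQL query string. The data set name isb-cgc.genome_reference comes from this page; the table name GENCODE_v22, the column names (gene_name, seqname, start, end, feature), and the helper build_gene_coordinate_query are assumptions made for illustration, so please confirm the actual schema in the BigQuery Table Search before running anything.

```python
def build_gene_coordinate_query(table="isb-cgc.genome_reference.GENCODE_v22"):
    """Return SQL that looks up genomic coordinates for a list of gene symbols.

    The table and column names are assumptions; adjust them to the real schema.
    Gene symbols are supplied through the array query parameter @genes
    rather than being pasted into the SQL text.
    """
    return f"""
        SELECT gene_name, seqname, start, `end`
        FROM `{table}`
        WHERE feature = 'gene'
          AND gene_name IN UNNEST(@genes)
        ORDER BY seqname, start
    """


sql = build_gene_coordinate_query()
print(sql)
```

To execute the query, one option is the google-cloud-bigquery client: pass the string to Client.query together with a QueryJobConfig whose query_parameters contains an ArrayQueryParameter named "genes" of type "STRING". This requires a Google Cloud project that can be billed for the query.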

Program/Source Description
ClinVar
  • ClinVar contains reports of the relationships among human variations and phenotypes.
  • GRCh37
  • GRCh38
Cytoband/UCSC
  • Cytoband to Genomic Coordinate Conversion
  • liftOver_hg19_to_hg38 - This table provides a mapping of each hg19 position to the corresponding position in hg38, and can be used to perform a liftOver operation in BigQuery.
dbSNP
  • dbSNP contains human single nucleotide variations, microsatellites, and small-scale insertions and deletions, along with publication, population frequency, molecular consequence, and genomic and RefSeq mapping information for both common variations and clinical mutations.
  • B150 GRCH37P13
  • B151 GRCH37P13
Ensembl
  • GRCh37: Release 75, the final build of the Ensembl gene-set mapped to GRCh37
  • GRCh38: Release 87, the most recent Ensembl gene-set mapped to GRCh38
GENCODE
  • GRCh37: Release 19, the final build of the GENCODE gene-set mapped to GRCh37
  • GRCh38: Releases 22, 23, and 24 from GENCODE are all available (because the TCGA data has been reprocessed by at least one center using each of these three different releases)
Gene Ontology Consortium
  • Tables based on GO annotations and the GO ontology.
Genome-Wide SNP Array
  • The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.
gnomAD
  • gnomAD aggregates and harmonizes both exome and genome sequencing data from a wide variety of large-scale sequencing projects.
  • GRCH37
ICD
Infinium
  • Infinium EPIC HG19 and HG38 Manifests
  • Infinium HM27 HG19 and HG38 Manifests
  • Infinium HM450 HG19 and HG38 Manifests
ISB-CGC
  • Gene Names Mapping: Data was loaded from multiple sources, including NCBI, HGNC, and Ensembl, in February 2018 to simplify mapping between HGNC IDs, HGNC symbols, Entrez Gene IDs, Ensembl Gene IDs, PubMed IDs, and RefSeq IDs.
Kaviar
  • The latest hg19- and hg38-based Kaviar databases are available. Kaviar is a compilation of SNVs, indels, and complex variants observed in humans, designed to facilitate testing for the novelty and frequency of observed variants.
miRBase
  • GRCh37: The human portion of version 20 of the miRBase database, including genomic coordinates for human microRNAs.
  • GRCh38: The human portion of version 21 of the miRBase database, including genomic coordinates for human microRNAs.
  • GRCh38: The human portion of version 22 of the miRBase database, including genomic coordinates for human microRNAs.
miRTarBase
Reactome
  • Ensembl2Reactome
  • miRBase2Reactome
UniProtKB
  • UniProtKB is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation.
  • UniProtKB Mapping
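
The liftOver table listed above maps each hg19 position to its hg38 counterpart, so a liftOver amounts to a join on chromosome and hg19 position. The sketch below mimics that join in plain Python with a small in-memory dictionary standing in for the table. The function name lift_over, the key layout, and the toy coordinates are all hypothetical; the real table's column names are not given on this page.

```python
def lift_over(positions, mapping):
    """Map (chromosome, hg19_position) pairs to hg38 positions.

    `mapping` stands in for the liftOver table: a dict keyed by
    (chromosome, hg19_position) whose value is the hg38 position.
    Positions missing from the mapping yield None, as a LEFT JOIN would.
    """
    return [(chrom, pos, mapping.get((chrom, pos))) for chrom, pos in positions]


# Toy rows for illustration only; these are not real coordinates.
toy_mapping = {("chr1", 1000): 1500, ("chr1", 2000): 2500}
variants = [("chr1", 1000), ("chr1", 3000)]

lifted = lift_over(variants, toy_mapping)
print(lifted)  # [('chr1', 1000, 1500), ('chr1', 3000, None)]
```

Keeping unmapped positions as None, rather than dropping them, makes it easy to see which variants have no hg38 equivalent in the mapping.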

Platform Reference Data

Some reference data is necessary to work with data generated by specific platforms such as the Illumina DNA Methylation array. The platform_reference data set contains information on the Illumina DNA Methylation Platform.

Program/Source Description
GDC
  • HG38 DNA Methylation - Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.
Infinium
  • Illumina DNA Methylation Annotation - Platform annotation information has been uploaded into BigQuery; each CpG locus is uniquely identified as described in this technical note and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.
Cytoband/UCSC
  • DNA Methylation Annotation Liftover to HG38 Coordinates - The original Illumina-provided CpG coordinates have been “lifted over” from hg19 to hg38.
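
Because each CpG locus carries a unique identifier, methylation measurements can be cross-referenced with the platform annotation by joining on that identifier. The following is a minimal Python sketch of such a join over in-memory records. The field names (probe_id, beta_value, chromosome, position), the function annotate_methylation, and every value shown are made up for illustration and are not taken from the actual tables.

```python
def annotate_methylation(beta_rows, annotation_rows):
    """Attach platform annotation to methylation rows via the shared probe ID."""
    # Index the annotation by probe ID for constant-time lookups.
    by_probe = {row["probe_id"]: row for row in annotation_rows}
    annotated = []
    for row in beta_rows:
        info = by_probe.get(row["probe_id"])
        if info is not None:  # inner join: skip probes with no annotation
            annotated.append({**row, **info})
    return annotated


# Toy records; IDs, coordinates, and beta values are made up.
betas = [
    {"probe_id": "cg_toy_1", "sample": "S1", "beta_value": 0.82},
    {"probe_id": "cg_toy_2", "sample": "S1", "beta_value": 0.10},
]
annotation = [{"probe_id": "cg_toy_1", "chromosome": "chr1", "position": 12345}]

result = annotate_methylation(betas, annotation)
print(result)
```

In BigQuery the equivalent operation would be a JOIN between the methylation data table and the annotation table on the probe identifier column.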

Genotype Tissue Expression (GTEx) Project Data

The GTEx_v7 data set contains tables with molecular and clinical data (gene read, gene expression, sample attributes, subject phenotype) loaded from the Genotype-Tissue Expression (GTEx) Project Data Portal in November 2017. See the GTEx Portal for more information.

University of California Santa Cruz (UCSC) TOIL RNA-seq recompute project Data

The Toil_recompute data set contains data made available by the UCSC TOIL RNA-seq recompute project. The goal of the project was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. This data set is best used to compare TCGA cohorts to TARGET or GTEx cohorts. For more details, see the Xena Browser Data Pages.

Other Reference Data Sources

Google Cloud Life Sciences maintains a list of publicly available data sets, including Reference Genomes, the Illumina Platinum Genomes, and information about the Tute Genomics Annotation table.


Have feedback or corrections? Please email us at feedback@isb-cgc.org.