Reference Data
ISB-CGC Hosted Reference Data
To facilitate working with the TCGA and other program data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have been created. Others are hosted by Google Cloud Life Sciences. Suggestions for more are welcome at feedback@isb-cgc.org.
For additional details about each of these tables, please use the BigQuery Table Search. To find the reference tables, select Genomic Reference Database under Category.
Genome Reference Data
This section covers reference data that describes or annotates the human genome and other genomes.
Reference data hosted by the ISB-CGC in BigQuery tables is available in the isb-cgc.genome_reference data set. Tables based on gene sets such as Ensembl and GENCODE can be used, for example, to find the genomic coordinates and identifiers for genes of interest, or to write queries that join tables of gene-symbol based data to tables of genomic-coordinate based data or to tables that use other gene identifiers.
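As a rough illustration of the coordinate lookup described above, the sketch below builds a BigQuery Standard SQL query string for a list of gene symbols. The table name `GENCODE_v22`, the column names (`gene_name`, `seq_name`, `start`, `end`, `feature`), and the helper `build_gene_coordinate_query` are assumptions made for this example and are not confirmed by this page; check the actual schema in the BigQuery Table Search before running the query.

```python
# Minimal sketch: build a query that looks up genomic coordinates by gene symbol.
# ASSUMPTIONS: the default table name and all column names are illustrative only.

def build_gene_coordinate_query(
    gene_symbols,
    table="isb-cgc.genome_reference.GENCODE_v22",  # assumed table name
):
    """Return a BigQuery Standard SQL string selecting coordinates for the given genes."""
    if not gene_symbols:
        raise ValueError("gene_symbols must not be empty")

    # Accept only simple symbols (letters, digits, '-', '.', '_') so that the
    # quoted literals placed in the query cannot break out of their quotes.
    for symbol in gene_symbols:
        stripped = symbol.replace("-", "").replace(".", "").replace("_", "")
        if not stripped.isalnum():
            raise ValueError(f"unexpected characters in gene symbol: {symbol!r}")

    symbol_list = ", ".join(f"'{symbol}'" for symbol in gene_symbols)

    # `end` is a reserved word in BigQuery, so the coordinate columns are backticked.
    return (
        "SELECT gene_name, seq_name, `start`, `end`\n"
        f"FROM `{table}`\n"
        "WHERE feature = 'gene'\n"
        f"  AND gene_name IN ({symbol_list})\n"
        "ORDER BY seq_name, `start`"
    )


if __name__ == "__main__":
    print(build_gene_coordinate_query(["TP53", "BRCA1"]))
```

The resulting string can be pasted into the BigQuery console or used as a subquery to join against a table keyed by gene symbol; a production version would more likely pass the symbols as query parameters than embed them in the SQL text.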
Program/Source | Description |
---|---|
ClinVar | |
Cytoband/UCSC | |
dbSNP | |
Ensembl | |
GENCODE | |
Gene Ontology Consortium | |
Genome-Wide SNP Array | |
gnomAD | |
ICD | |
Infinium | |
ISB-CGC | |
Kaviar | |
miRBase | |
miRTarBase | |
Reactome | |
UniProtKB | |
Platform Reference Data
Some reference data is necessary to work with data generated by specific platforms such as the Illumina DNA Methylation array. The platform_reference data set contains information on the Illumina DNA Methylation Platform.
Program/Source | Description |
---|---|
GDC | |
Infinium | |
Cytoband/UCSC | |
Genotype Tissue Expression (GTEx) Project Data
The GTEx_v7 data set contains tables with molecular and clinical data (gene read, gene expression, sample attributes, subject phenotype) loaded from the Genotype-Tissue Expression (GTEx) Project Data Portal in November 2017. See the GTEx Portal for more information.
University of California Santa Cruz (UCSC) TOIL RNA-seq recompute project Data
The Toil_recompute data set contains data made available by the UCSC TOIL RNA-seq recompute project. The goal of the project was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. The data set is best used to compare TCGA cohorts to TARGET or GTEx cohorts. For more details, see the Xena Browser Data Pages.
Other Reference Data Sources
Google Cloud Life Sciences maintains a list of publicly available data sets, including Reference Genomes, the Illumina Platinum Genomes, and the Tute Genomics Annotation table, among others.