CPTAC Data Set

About the NCI Clinical Proteomic Tumor Analysis Consortium

The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through large-scale proteome and genome analysis, or proteogenomics.

About the NCI Clinical Proteomic Tumor Analysis Consortium Data Set

The CPTAC data set includes whole-genome sequencing, whole-exome sequencing, RNA sequencing, and miRNA sequencing data. The program analyzed more than 700 cases. The Genomic Data Commons (GDC) currently has controlled-access VCF, TSV, and BAM data available. The Project IDs in the GDC Data Portal are CPTAC-2 and CPTAC-3.

For more information on the CPTAC data, please refer to these sites:

Accessing the NCI Clinical Proteomic Tumor Analysis Consortium Data on the Cloud

Besides accessing the files on the GDC Data Portal, you can also access them from the GDC Google Cloud Storage Bucket, which means that you don’t need to download them to perform analysis. ISB-CGC stores the cloud file locations in tables in the isb-cgc-bq.GDC_case_file_metadata data set in BigQuery.

  • To access these metadata tables, go to the Google BigQuery console.
  • Perform SQL queries to find the CPTAC files. Here is an example:
SELECT active.*, GCSurl.file_gdc_url
FROM `isb-cgc-bq.GDC_case_file_metadata.fileData_active_current` AS active
JOIN `isb-cgc-bq.GDC_case_file_metadata.GDCfileID_to_GCSurl_current` AS GCSurl
  ON active.file_gdc_id = GCSurl.file_gdc_id
WHERE active.program_name = 'CPTAC'
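
The same lookup can also be run from Python rather than the console. The following is a minimal sketch, not an official ISB-CGC client: it assumes the google-cloud-bigquery package is installed, that Google credentials are configured, that program_name lives in the fileData_active_current table and file_gdc_url in the GDCfileID_to_GCSurl_current table, and that file_gdc_url values are gs:// URLs. The function names, the billing project argument, and the example bucket name are hypothetical.

```python
from urllib.parse import urlparse

# Table names taken from the example query above.
FILE_TABLE = "isb-cgc-bq.GDC_case_file_metadata.fileData_active_current"
URL_TABLE = "isb-cgc-bq.GDC_case_file_metadata.GDCfileID_to_GCSurl_current"


def build_cptac_file_query(limit=None):
    """Return SQL that joins file metadata to cloud locations for one program.

    The program name is passed as the named query parameter @program.
    """
    sql = (
        "SELECT active.*, GCSurl.file_gdc_url\n"
        f"FROM `{FILE_TABLE}` AS active\n"
        f"JOIN `{URL_TABLE}` AS GCSurl\n"
        "  ON active.file_gdc_id = GCSurl.file_gdc_id\n"
        "WHERE active.program_name = @program"
    )
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql


def split_gcs_url(url):
    """Split 'gs://bucket/path/to/object' into (bucket, object_path).

    Assumes the URL uses the gs:// scheme; anything else raises ValueError.
    """
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def fetch_cptac_files(billing_project, program="CPTAC", limit=10):
    """Run the query and return the rows as a list of dicts.

    billing_project is a placeholder for your own Google Cloud project ID.
    """
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("program", "STRING", program)
        ]
    )
    rows = client.query(build_cptac_file_query(limit), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Calling fetch_cptac_files with your own project ID returns a small sample of rows. Passing a row's file_gdc_url to split_gcs_url then gives the bucket and object path separately, which is the form most cloud storage tools expect.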

Accessing the CPTAC Data in Google BigQuery

ISB-CGC stores CPTAC data, such as clinical and protein expression data, in Google BigQuery tables. To find information about these tables, use the ISB-CGC BigQuery Table Search with CPTAC selected in the PROGRAM filter. To learn more about this tool, see the ISB-CGC BigQuery Table Search documentation.

The CPTAC tables are in the projects isb-cgc-bq and isb-cgc. To learn more about how to view and query tables in the Google BigQuery console, see the ISB-CGC BigQuery Tables documentation.

  • Data set isb-cgc-bq.CPTAC contains the latest tables for each data type.
  • Data set isb-cgc-bq.CPTAC_versioned contains previously released tables, as well as the most current table.
  • Data set isb-cgc.hg19_data_previews contains protein expression data.
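
The data sets listed above can also be browsed programmatically. The sketch below is one possible approach, assuming the google-cloud-bigquery package is installed and your credentials can read these data sets. The helper names and the billing project argument are hypothetical, and no specific table names are assumed.

```python
# Data sets named in this section, with the descriptions given above.
CPTAC_DATASETS = {
    "isb-cgc-bq.CPTAC": "latest tables for each data type",
    "isb-cgc-bq.CPTAC_versioned": "previously released tables plus the most current table",
    "isb-cgc.hg19_data_previews": "protein expression data",
}


def full_table_id(dataset, table_name):
    """Build a fully qualified 'project.dataset.table' ID."""
    return f"{dataset}.{table_name}"


def list_table_ids(dataset, billing_project):
    """Return the fully qualified IDs of every table in one data set.

    billing_project is a placeholder for your own Google Cloud project ID.
    """
    # Imported lazily so full_table_id works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [
        full_table_id(dataset, table.table_id)
        for table in client.list_tables(dataset)
    ]
```

Looping over CPTAC_DATASETS and calling list_table_ids on each key prints an inventory of the available tables. The resulting IDs can be wrapped in backticks and used directly in a FROM clause.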

Have feedback or corrections? Please email us at feedback@isb-cgc.org.