Data in Cloud Storage
At this time, all controlled-access data files are stored in Google Cloud Storage (GCS) in their original form, as obtained from the data repository. This includes these major data types and formats:
- RNA-Seq FASTQ files (unaligned reads, typically compressed tar-files)
- DNA-Seq and RNA-Seq BAM files (aligned reads)
- Genome-Wide SNP6 array CEL files
- Variant-calls in VCF files
In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user’s dbGaP authorization will be verified. These two steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.
Summary of Data Available in GCS
| Format | Data Type  | # of Files | Total Size |
|--------|------------|------------|------------|
| CEL    | DNA (SNP6) | 22529      | 1.6 TB     |
Working with data in GCS
Please see our DIY Workshop and in particular the section on “Computing in the Cloud” for additional references and tutorial material.
Our metadata tables in BigQuery can be used to explore the available data and choose which BAM files you’re most interested in working with – before you take on processing an entire petabyte of data! Feel free to email us at email@example.com with questions.
BAM-slicing in the Cloud
BAM files can vary in size from close to 1 TB down to 1 MB, and frequently a researcher
is only interested in extracting a small slice of the entire sequence. This is referred
to as “BAM-slicing”, and the latest release (1.4) of the
htslib library adds the capability to
perform BAM-slicing directly on BAM files in GCS to widely used tools such as samtools.
(You will need to build htslib with libcurl support
to enable access to data in both GCS and S3.)
This new functionality allows you to run, for example:
$ ./samtools view -X gs://gdc-ccle-open/0a109993-2d5b-4251-bcab-9da4a611f2b1/C836.Calu-3.2.bam gs://gdc-ccle-open/96b56036-b278-45cd-a7ca-265f589ff951/C836.Calu-3.2.bam.bai 7:140453130-140453140
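The command above writes the sliced reads to the terminal. As a minimal sketch, the two hypothetical helpers below (`format_region` and `print_slice_cmd` are illustrative names, not part of samtools) build the region string and print a command that would save the slice to a local BAM file using the standard `-b` and `-o` options of `samtools view`. The command is printed rather than executed so that it can be inspected first.

```shell
# Hypothetical helper: build a samtools region string, "chrom:start-end".
format_region() {
    printf '%s:%s-%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: print (rather than run) a samtools command that
# writes the slice to a local BAM file.
# Arguments: BAM URL, BAI URL, region, output path.
print_slice_cmd() {
    printf 'samtools view -b -o %s -X %s %s %s\n' "$4" "$1" "$2" "$3"
}

# The open-access CCLE BAM file and index used in the example above.
bam=gs://gdc-ccle-open/0a109993-2d5b-4251-bcab-9da4a611f2b1/C836.Calu-3.2.bam
bai=gs://gdc-ccle-open/96b56036-b278-45cd-a7ca-265f589ff951/C836.Calu-3.2.bam.bai

# slice.bam is an arbitrary local output name chosen for illustration.
print_slice_cmd "$bam" "$bai" "$(format_region 7 140453130 140453140)" slice.bam
```

Copying the printed line into a shell (with `./samtools` in place of `samtools` if you are running a local build) should perform the slice and leave the result in `slice.bam`.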
If you want to access a controlled-access BAM file, you’ll need to provide credentials first:
$ export GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token`
$ gsutil ls -l gs://gdc-ccle-open/0a109993-2d5b-4251-bcab-9da4a611f2b1/C836.Calu-3.2.bam
$ gsutil ls -l gs://gdc-ccle-open/96b56036-b278-45cd-a7ca-265f589ff951/C836.Calu-3.2.bam.bai
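Because the token is read from the `GCS_OAUTH_TOKEN` environment variable, a script that slices controlled-access files may want to confirm that the variable is set before starting. The following is a small sketch of such a check; `have_gcs_token` is a hypothetical helper name, and the check only tests that the variable is non-empty, not that the token is still valid.

```shell
# Hypothetical helper: succeed only if GCS_OAUTH_TOKEN is set and non-empty.
# This does not verify that the token is valid or unexpired.
have_gcs_token() {
    [ -n "${GCS_OAUTH_TOKEN:-}" ]
}

if have_gcs_token; then
    echo "GCS_OAUTH_TOKEN is set"
else
    echo "GCS_OAUTH_TOKEN is not set; export it before slicing controlled data" >&2
fi
```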
Other Options for BAM-slicing
The NCI-GDC has also implemented a BAM-slicing API on top of their data repository. This API can be accessed programmatically as documented here or interactively on any of the file-specific data-portal pages like this one for a TCGA-BRCA whole-exome BAM file. (The “BAM Slicing” button is in the upper right corner of the page.)
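As a rough sketch of the programmatic route, the helper below builds a slicing request URL from a file UUID and a region. The base URL `https://api.gdc.cancer.gov/slicing/view` and the `region` query parameter are assumptions based on the GDC API documentation and should be checked against it. `gdc_slice_url` is a hypothetical helper name, and `FILE_UUID` is a placeholder rather than a real file identifier.

```shell
# Assumed base URL of the GDC slicing endpoint; verify against the GDC docs.
GDC_SLICE_BASE=https://api.gdc.cancer.gov/slicing/view

# Hypothetical helper: build the slicing URL for a file UUID and a region.
gdc_slice_url() {
    printf '%s/%s?region=%s\n' "$GDC_SLICE_BASE" "$1" "$2"
}

# FILE_UUID is a placeholder, not a real file identifier.
gdc_slice_url FILE_UUID chr7:140453130-140453140
```

For controlled-access files, the resulting URL would typically be fetched with a GDC authentication token supplied in a request header, for example with `curl --header "X-Auth-Token: $token" --output slice.bam "$url"`. The header name here is likewise an assumption to confirm in the GDC documentation.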
The GA4GH API provides another option for BAM-slicing and has been implemented by Google on top of the database-backed Google Genomics technology. You can find more information about the GA4GH API here, along with information about some open-access data hosted by the ISB-CGC, which you are welcome to experiment with.