Accessing Controlled Data¶
Controlled data can be accessed in two ways, depending on whether you are working interactively (e.g. through the Web App or R Studio) or programmatically (e.g. from a program running on a Google Compute Engine virtual machine that you have started). In some cases you will be using your personal credentials, while in other cases a “service account” will be acting on your behalf, using its own credentials. Both methods are described below. They are not mutually exclusive: you can use both at the same time.
Interactive Access to Controlled Data¶
Before you can access any controlled-data hosted by the ISB-CGC, you must first associate (or “link”) your Google identity (which you use to sign in to the ISB-CGC Web App and access the Google Cloud) with a valid NIH login associated with a dbGaP data-access request (either an eRA account ID or an NIH account User ID). This is done through the Web App: you will first be redirected to an NIH login page, and once you have successfully authenticated, ISB-CGC will store an association between your NIH identity and your Google identity. (Note that this should be a one-to-one association.)
Once you have authenticated, ISB-CGC will check which dataset(s) (e.g. TCGA controlled data and/or TARGET controlled data) you have been authorized by dbGaP to access. ISB-CGC obtains an updated whitelist for each of the hosted datasets from dbGaP every day. If you have only recently been granted access by dbGaP, there may be a delay of up to 24 hours before you are able to request access to these data on ISB-CGC.
Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may use their NIH login. For additional instructions, please refer to Data Access Request Instructions, dbGaP Data Access Request Portal, and Understanding Data Security. Please be sure to review the Data Use Certification Agreement for TCGA controlled data and TARGET controlled data.
Once you have authenticated to NIH via the Web App, and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-data for 24 hours.
Extending Your Access by 24 hours¶
Once you have received permission to view controlled access data, your user login page will look like the screenshot below. If you need to extend your access to controlled data for another 24 hours (e.g. because a compute job that uses these Google credentials to access controlled data is still running), select the link “Extend controlled access period to 24 hours from now” (red box on figure below). Your access will then last for 24 hours from the time you select the link.
Accessing Controlled Data from a GCE VM¶
This section only applies to ISB-CGC users with access to a Google Cloud Platform (GCP) project. GCP projects are automatically configured with a “Compute Engine default service account” which you can find on the IAM & Admin page of the Cloud Console. You can create additional service accounts for special purposes, but most users will be able to just use this one default service account.
When running on a Google Compute Engine (GCE) VM (virtual machine), a “service account” associated with your Google Cloud Project (GCP) is generally acting on your behalf and those are the credentials being used rather than your personal credentials. (If you want to learn more about service accounts, please refer to the Google documentation.)
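To see which service account a VM is actually using, you can ask the Compute Engine metadata server from inside the VM. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of ISB-CGC, and the lookup only succeeds when the code runs on a GCE VM.

```python
import urllib.request

# Documented GCE metadata endpoint for the email of the VM's default service account.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_metadata_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Build the request; the metadata server requires the Metadata-Flavor header."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def get_vm_service_account(timeout: float = 2.0) -> str:
    """Return the service account email. Only works when run on a GCE VM."""
    with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

The email returned by `get_vm_service_account()` is the identity whose credentials your programs use on that VM.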
In order for this service account to access controlled data, you must register it with ISB-CGC. Once this process has completed successfully, this service account will be able to access controlled data for up to 7 days.
- To allow flexibility when working with different research teams and different processes, you can have many GCPs registered with ISB-CGC, as well as many service accounts registered per GCP.
- If the service account (i.e. any program running on a VM using the service account’s credentials) tries to access controlled data after the 7-day expiration, it will get an Access Denied error. To prevent this from causing problems with long-running jobs, you can extend access by another 7 days (see below).
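For long-running jobs it can help to track when the 7-day window closes, so that you extend access before an Access Denied error occurs. The sketch below only does the date arithmetic. The extension itself is still done through the Web App, and the 12-hour safety margin and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(days=7)    # access duration stated in the text above
RENEW_MARGIN = timedelta(hours=12)   # hypothetical safety margin, adjust as needed


def access_expires_at(registered_at: datetime) -> datetime:
    """Return the time at which the service account's access lapses."""
    return registered_at + ACCESS_WINDOW


def should_extend(registered_at: datetime, now: datetime,
                  margin: timedelta = RENEW_MARGIN) -> bool:
    """True once 'now' is within 'margin' of the expiration (or past it)."""
    return now >= access_expires_at(registered_at) - margin


if __name__ == "__main__":
    registered = datetime(2020, 1, 1, 9, 0, tzinfo=timezone.utc)  # example value
    print("Access expires at:", access_expires_at(registered).isoformat())
    print("Extend now?", should_extend(registered, datetime.now(timezone.utc)))
```

Record the time you registered or last extended the service account and pass it as `registered_at`.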
Requirements for Registering a Google Cloud Project Service Account¶
To register your GCP and at least one service account for access to controlled data, the following must all be true:
- You must be an owner of the GCP project (because you will need to add an ISB-CGC service account as a new project member and a DCF service account as a new project member)
- At all times, ALL members of the project MUST be authorized to use the data set (i.e. be a registered dbGaP “PI” or “downloader”) (see dbGaP Data Access Request Portal and Understanding Data Security for more details).
- All members of the project have signed in to the ISB-CGC Web App at least once
- All members of the project have authenticated via the NIH login page and thereby linked their NIH identity to their Google identity
- The GCP project cannot be associated with an Organization
- No Google Groups or other multi-member identifiers (e.g. all authenticated Google users) have been provided with a project role
- The GCP project must have the ISB-CGC monitoring service account (SA) assigned to an Editor role
- All SAs with roles in the project must belong to the project, with the exception of the ISB-CGC monitoring SA; this means that all Google-managed SAs with project roles must belong to the project as well
- The SA you are registering cannot be the ISB-CGC monitoring SA, or SAs from other projects
- You have not created any keys for any SAs in the project
- No IDs have been assigned roles on any SAs in the project
If ANY of these requirements are not met, your GCP and ANY associated service accounts will not be able to access controlled data. An automated email will be sent to the GCP project owner(s) if data access is revoked.
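Some of these requirements can be pre-checked against your project's IAM policy before you start registration. The sketch below assumes you have exported the policy as JSON (for example with `gcloud projects get-iam-policy PROJECT_ID --format=json`) and loaded it into a dictionary. It flags groups and other multi-member identifiers, and lists the individual users who each need dbGaP authorization. Treating `domain:` members as multi-member identifiers is an assumption, and this is not the check that ISB-CGC itself runs.

```python
from typing import Dict, List

# IAM member forms that stand for more than one identity.
MULTI_MEMBER_PREFIXES = ("group:", "domain:")
MULTI_MEMBER_SPECIALS = ("allUsers", "allAuthenticatedUsers")


def find_multi_member_bindings(policy: Dict) -> List[str]:
    """Return 'role: member' strings for every multi-member identifier with a role."""
    problems = []
    for binding in policy.get("bindings", []):
        role = binding.get("role", "?")
        for member in binding.get("members", []):
            if member in MULTI_MEMBER_SPECIALS or member.startswith(MULTI_MEMBER_PREFIXES):
                problems.append(f"{role}: {member}")
    return problems


def list_individual_users(policy: Dict) -> List[str]:
    """Return the sorted, de-duplicated emails of all 'user:' members."""
    users = set()
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member.startswith("user:"):
                users.add(member[len("user:"):])
    return sorted(users)
```

An empty result from `find_multi_member_bindings` means no group-style identifiers hold a project role. Every address returned by `list_individual_users` must still satisfy the sign-in, NIH linking, and dbGaP authorization requirements listed above.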
Registering your Google Cloud Project Service Account¶
To register your GCP and its Service Account with ISB-CGC, select the “persona” icon next to your login name (see first image above), which takes you to the following page:
Select the “Register a Google Cloud Project” link. That takes you to the following page:
Please fill out the form following the instructions that are provided. You can “hide” the instructions by selecting the blue Instructions button. You must enter your GCP ID and enable the isb-cgc service account as an editor in your project to move on to the next step.
Please be sure to add both service accounts listed below. If you don’t add both service accounts you will run into issues viewing the controlled data in ISB-CGC.
Once you have completed these steps, a listing of the members of the GCP you are registering will be shown at the bottom of the same page (see screenshot below):
Pushing the “Register” button will take you to the next screen:
Select “Register Service Account” from the drop-down menu to the left of the GCP you want to add a service account to. By default, the “Enter the service account ID” text box contains the Compute Engine default service account (see screenshot below). Then select the checkbox for each Controlled Dataset you plan to access. Currently the available datasets are controlled TCGA data and controlled TARGET data.
If you receive the error message listed below, you need to enable the Default Compute Engine API for your Google Cloud Project. For more information on how to enable all the APIs you will need to work on a Google Cloud Project please go here.
Once you click “Verify Service Account Users” at the bottom of the page, you will be presented with several lists: the Verification Results, Google Cloud Project User ISB-CGC Registration and Identity Linkages, Dataset Permissions Verification, Registered Service Account Verification Results, Google Cloud Project Verification Results, and the Google Cloud Project Service Account Verification Results (see screenshots below). Every column MUST show a green check mark for each user before your service account can be registered.
If all the requirements for registering a service account are met, the account will be registered. If not, the service account will only be registered for Open Datasets. The final screen below shows the final registered data set (shown by selecting the drop-down menu beside the service account count highlighted in red).
Managing your Google Cloud Project(s) and Service Account(s)¶
Once your GCP(s) and Service Account(s) are registered, you can add or remove additional service accounts by following the instructions below. You can also extend the use of a service account for another 7 days, or reauthorize a service account after you have corrected errors that previously caused it to have its permissions revoked.
Adding additional Google Cloud Projects¶
To register additional Google Cloud Projects (GCPs) that you own so that programs running in them can access controlled data, select the “+ Register New Google Cloud Project” button on the “Registered Google Cloud Projects” page (see screenshot below).
Deleting Google Cloud Projects¶
To delete a GCP that is registered, select “Unregister Project” from the dropdown menu beside the project you are removing on the “Registered Google Cloud Projects” page (see screenshot below).
Adding additional service accounts to a given Google Cloud Project¶
To add additional service accounts to a given GCP, select “Register Service Account” again from the dropdown menu beside the project that has the service account (see screenshot below).
Adjusting a Service Account using the Adjust Service Account page¶
To add or remove a controlled dataset for one specific service account, select the plus “+” icon next to the trash can (see screenshot below).
Deleting Service Accounts from Google Cloud Projects¶
To delete a service account from a GCP (not allowing it to be used to programmatically access controlled data), push the “trashcan” icon beside the service account (see screenshot below).
Extending Your Service Account Access by 7 Days¶
Once you have registered a Service Account, you have 7 days before the access is automatically revoked. To extend the service account access another 7 days (e.g. if your program is still running), select the “refresh” icon beside the service account (see screenshot below).
Google Cloud Project Associated to an Organization Will NOT Work with controlled data¶
If your Google Cloud Project is associated with an organization, you will be unable to register the service account for controlled data. You will receive an error message similar to this: “GCP cgc-08-0126 was found to be in organization ID 8784632854871; its service accounts cannot be registered for use with controlled data.” This is mainly because ISB-CGC cannot see the permissions associated with the organization, which makes the project a security risk. We are currently working with Google to resolve this issue.
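You can check in advance whether a project sits under an organization by inspecting its `parent` field. The sketch below assumes you have saved the output of `gcloud projects describe PROJECT_ID --format=json`, and that a project inside an organization reports a `parent` whose `type` is `organization` or `folder` (folders exist only within organizations). The function name is illustrative.

```python
import json
from typing import Optional


def organization_parent(describe_json: str) -> Optional[str]:
    """Return 'type/id' of the project's parent if it has one, else None."""
    project = json.loads(describe_json)
    parent = project.get("parent") or {}
    if parent.get("type") in ("organization", "folder"):
        return f"{parent['type']}/{parent.get('id')}"
    return None


if __name__ == "__main__":
    # Hypothetical sample output for a project that belongs to an organization.
    sample = '{"projectId": "my-project", "parent": {"type": "organization", "id": "123456789"}}'
    parent = organization_parent(sample)
    if parent:
        print("Project has a parent and cannot be registered:", parent)
    else:
        print("Project is not under an organization.")
```

A result of `None` suggests the project is standalone, although the ISB-CGC verification step remains the authoritative check.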
You should think about securing controlled data within the context of your GCP project in the same way that you would think about securing controlled data that you might download to a file-server or compute-cluster at your own institution. Your responsibilities for data protection are the same in a cloud environment. For more information, please refer to the NIH Security Best Practices for Controlled-Access Data.
“NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs) and the IT staff who will be supporting these projects, to make sure they understand their responsibilities.” (Ref: The Cloud, dbGaP and the NIH blog post 03.27.2015)