Accessing Controlled Data

Controlled data can be accessed in two ways, depending on whether you are working interactively (e.g. through the Web App or R Studio) or programmatically (e.g. from a program running on a Google Compute Engine virtual machine that you have started). In some cases you will be using your personal credentials, while in other cases a “service account” will be acting on your behalf, using its own credentials. Both methods are described below. Please note that the two methods are not mutually exclusive; you can use both at the same time.

Interactive Access to Controlled Data

Before you can access any controlled-data hosted by the ISB-CGC, you must first associate (or “link”) your Google identity (which you use to sign in to the ISB-CGC Web App and access the Google Cloud) with a valid NIH login associated with a dbGaP data-access request (either an eRA account ID or an NIH account User ID). This is done through the Web App: you will first be redirected to an NIH login page, and once you have successfully authenticated, ISB-CGC will store an association between your NIH identity and your Google identity. (Note that this should be a one-to-one association.)

Once you have authenticated, ISB-CGC will check which dataset(s) (e.g. TCGA controlled data and/or TARGET controlled data) you have been authorized (by dbGaP) to access. ISB-CGC obtains an updated whitelist for each of the hosted datasets from dbGaP every day. If you have only recently been granted access by dbGaP, there may be a 24-hour delay before you are able to request access to these data on ISB-CGC.

Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may use their NIH login. (For additional instructions, please refer to Data Access Request Instructions, dbGaP Data Access Request Portal, and Understanding Data Security.) Please be sure to review the Data Use Certification Agreement for TCGA controlled data and TARGET controlled data.

Once you have authenticated to NIH via the Web App, and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-data for 24 hours.

For more information on applying for dbGaP authorization to access controlled data, please see our Frequently Asked Questions (FAQ) page or the “How to” Apply for Controlled Access Data Video.

Linking your NIH and Google identities

To link your NIH identity with your Google identity (i.e. the Google account you used to log in to the ISB-CGC system), select the “persona” icon next to your login name (A in the image below) after you have signed in to the ISB-CGC Web App.

../_images/personaeicon-NIHLoginAssoc.png

You will then see the following page:

../_images/NIHAssociationPage.png

You will see a pop-up describing all the steps needed to link your NIH identity to the Data Commons Framework (DCF).

../_images/LinkNIHIDInstructions.PNG

Now you need to associate your Google identity with your NIH identity. (Your NIH identity is the one associated with your dbGaP application and authorization to work with controlled data.) To do this, select the “Associate with eRA Commons Account” link (highlighted in the diagram above, and labeled A). You will then be redirected to an NIH login page to be authenticated by NIH:

../_images/iTrust.png

If you have an eRA identification, use this to sign in through panel A (see example above). If you have an NIH PIV card, use that to sign in through panel B on this page (see above). Once you have been authenticated by NIH, and your NIH identity has been verified to be on the current dbGaP whitelist, you will have access to controlled data for 24 hours.

../_images/Gen3authPage.PNG

Select the “Yes, I Authorize” button at the bottom right of the page to allow the Data Commons Framework to authorize your Google identity for controlled data access.

../_images/datacommons.ioLogIn.PNG

Select the email you used to originally log into the ISB-CGC web application to finalize the authorization.

Once you have logged in with your eRA identification, you are redirected to the user details page and shown a Warning Notice about abiding by the rules and regulations of the Data Use Certification Agreement (DUCA). Please refer to the image below.

../_images/warningNotice.png

Please note: the ISB-CGC system will enforce a one-to-one relationship between NIH identities and Google identities. In other words, a single NIH identity may not be used to attempt to gain access to controlled data by multiple, different Google identities. If you need to unlink your eRA account from your Google account (for example if you want to change which Google identity you use to sign in to the ISB-CGC platform), you may do so by selecting “Unlink <GoogleID> from the NIH username <eRA Commons ID>” (link B in the screen above).

In the unusual instance that your NIH identity has already been registered with another Google identity (e.g. with another Google identity you own), you will see the screen below:

../_images/eRAlinkedtoAnotherGoogle.png

If this happens, please sign in with that other account and “unlink” your eRA account from it (see description above). You will then be able to register your eRA account with the desired Google identity. If you are not able to resolve the issue, contact us at feedback@isb-cgc.org and we will help you resolve it.
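The one-to-one rule described above can be pictured as a pair of lookups that must always agree. The following is only an illustrative in-memory sketch, not the ISB-CGC implementation; the class name, method names, and example identities are all hypothetical.

```python
class IdentityLinks:
    """Illustrative one-to-one mapping between NIH and Google identities."""

    def __init__(self):
        self._google_by_nih = {}
        self._nih_by_google = {}

    def link(self, nih_id, google_id):
        # Refuse the link if either side is already paired with someone else.
        current_google = self._google_by_nih.get(nih_id)
        if current_google is not None and current_google != google_id:
            raise ValueError(
                f"{nih_id} is already linked to {current_google}; unlink it first"
            )
        current_nih = self._nih_by_google.get(google_id)
        if current_nih is not None and current_nih != nih_id:
            raise ValueError(
                f"{google_id} is already linked to {current_nih}; unlink it first"
            )
        self._google_by_nih[nih_id] = google_id
        self._nih_by_google[google_id] = nih_id

    def unlink(self, nih_id):
        # Removing the pairing frees the NIH identity for a different Google identity.
        google_id = self._google_by_nih.pop(nih_id, None)
        if google_id is not None:
            del self._nih_by_google[google_id]

    def google_for(self, nih_id):
        return self._google_by_nih.get(nih_id)


links = IdentityLinks()
links.link("era_user", "first@example.org")
try:
    links.link("era_user", "second@example.org")   # rejected: already linked
except ValueError as err:
    print(err)
links.unlink("era_user")
links.link("era_user", "second@example.org")       # allowed after unlinking
```

This mirrors the recovery path in the text: the second link attempt only succeeds once the first pairing has been removed.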

To end your Web App session, just “Sign Out” by using the pull-down below your name (see image below, A). After you sign out from the ISB-CGC Web App, your Google identity may still be signed in to your browser, so you may want to also sign out of the browser.

../_images/SignOut.png

Extending Your Access by 24 hours

Once you have received permission to view controlled access data, your user login page will look like the screenshot below. If you need to extend your access to controlled data for another 24 hours from now (e.g. if you have a compute job which is using these Google credentials to access controlled data and it is still running), select the link “Extend controlled access period to 24 hours from now” (red box on figure below). Your access will then expire 24 hours from the time you select the link.

../_images/24hrExtension.png
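Note that the extension resets the window to 24 hours from the moment you select the link; it does not add 24 hours to whatever time was left. A small sketch of that arithmetic, using made-up timestamps and a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)


def extend_access(now):
    """Return the new expiry time: 24 hours from the moment of extension."""
    return now + ACCESS_WINDOW


# Access first granted at 09:00 UTC (example timestamp only).
granted = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
first_expiry = extend_access(granted)

# Extending 14 hours later, while 10 hours of access still remain.
later = granted + timedelta(hours=14)
new_expiry = extend_access(later)

print(new_expiry - later)         # 1 day, 0:00:00 of access from now
print(new_expiry - first_expiry)  # 14:00:00 gained over the old expiry
```

In other words, extending early does not bank extra time; the most access you can hold at any moment is 24 hours.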

Accessing Controlled Data from a GCE VM

This section only applies to ISB-CGC users with access to a Google Cloud Platform (GCP) project. GCP projects are automatically configured with a “Compute Engine default service account” which you can find on the IAM & Admin page of the Cloud Console. You can create additional service accounts for special purposes, but most users will be able to just use this one default service account.

When a program runs on a Google Compute Engine (GCE) VM (virtual machine), a “service account” associated with your GCP project is generally acting on your behalf, and its credentials are used rather than your personal credentials. (If you want to learn more about service accounts, please refer to the Google documentation.)
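If you are unsure which service account a VM is using, one way to check is to ask the Compute Engine metadata server from inside the VM. The sketch below uses only the Python standard library; the function names are illustrative, and the call only succeeds when run on a GCE VM.

```python
import urllib.request

# Metadata server endpoint for the email of the VM's default service account.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_metadata_request():
    # The metadata server requires this header on every request.
    return urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )


def active_service_account(timeout=2.0):
    """Return the service account email the VM runs as (GCE VMs only)."""
    with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

On a VM, calling `active_service_account()` returns the email address to enter when registering the service account with ISB-CGC. Off a VM the request fails, because the metadata host is not reachable.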

In order for this service account to access controlled data, you must register it with ISB-CGC. Once this process has completed successfully, this service account will be able to access controlled data for up to 7 days.

NOTES:

  • To allow flexibility while working with different research teams and different processes, you can have many GCPs registered with ISB-CGC, as well as many service accounts registered per GCP.
  • If the service account (i.e. any program running on a VM using the service account’s credentials) tries to access controlled data after the 7-day expiration, it will get an Access Denied error. To prevent this from causing problems with long-running jobs, you can extend access by another 7 days (see below).
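Before launching a long-running job, it can help to check whether the job is likely to finish inside the service account's 7-day window. The following is a minimal sketch with hypothetical function names; the registration time and runtime estimate are values you would supply yourself.

```python
from datetime import datetime, timedelta, timezone

SERVICE_ACCOUNT_WINDOW = timedelta(days=7)


def access_expiry(registered_at):
    """Time at which a service account registered at `registered_at` loses access."""
    return registered_at + SERVICE_ACCOUNT_WINDOW


def job_outlives_access(registered_at, job_start, expected_runtime):
    """True if the job is expected to still be running when access expires."""
    return job_start + expected_runtime > access_expiry(registered_at)


# Example values only: registered on 1 Jan, job started on 6 Jan.
registered = datetime(2024, 1, 1, tzinfo=timezone.utc)
job_start = registered + timedelta(days=5)

print(job_outlives_access(registered, job_start, timedelta(days=1)))  # False
print(job_outlives_access(registered, job_start, timedelta(days=4)))  # True
```

When the check returns `True`, extend the service account's access before the expiry time rather than waiting for the job to hit an Access Denied error.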

Requirements for Registering a Google Cloud Project Service Account

To register your GCP project and at least one service account for access to controlled data, all of the following must be true:

  • You must be an owner of the GCP project (because you will need to add an ISB-CGC service account as a new project member and a DCF service account as a new project member)
  • At any time, ALL members of the project MUST be authorized to use the data set (i.e. be a registered dbGaP “PI” or “downloader”) (see dbGaP Data Access Request Portal, and Understanding Data Security for more details).
  • All members of the project have signed in to the ISB-CGC Web App at least once
  • All members of the project have authenticated via the NIH login page and thereby linked their NIH identity to their Google identity
  • The GCP project cannot be associated with an Organization
  • No Google Groups or other multi-member identifiers (e.g. all authenticated Google users) have been provided with a project role
  • The GCP project must have the ISB-CGC monitoring service account (SA) assigned to an Editor role
  • All SAs with roles in the project must belong to the project, with the exception of the ISB-CGC monitoring SA; this means that all Google-managed SAs with project roles must belong to the project as well
  • The SA you are registering cannot be the ISB-CGC monitoring SA, or SAs from other projects
  • You have not created any keys for any SAs in the project
  • No IDs have been assigned roles on any SAs in the project

If ANY of these requirements are not met, your GCP and ANY associated service accounts will not be able to access controlled data. An automated email will be sent to the GCP project owner(s) if data access is revoked.
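Some of the requirements above can be checked yourself before you start the registration. The sketch below inspects a project IAM policy in the JSON shape printed by `gcloud projects get-iam-policy PROJECT_ID --format=json`. It covers only three of the listed requirements, the helper names are illustrative, and the monitoring service account address is a hypothetical placeholder (use the one shown on the registration form).

```python
# Member strings follow the IAM policy format, e.g. "user:alice@example.org".
MULTI_MEMBER_PREFIXES = ("group:", "domain:")
MULTI_MEMBER_IDS = {"allUsers", "allAuthenticatedUsers"}


def iter_members(policy):
    """Yield (role, member) pairs from an IAM policy dict."""
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            yield binding["role"], member


def multi_member_identities(policy):
    """Members that stand for more than one person (groups, domains, all users)."""
    return sorted({
        member for _, member in iter_members(policy)
        if member.startswith(MULTI_MEMBER_PREFIXES) or member in MULTI_MEMBER_IDS
    })


def unauthorized_users(policy, authorized_emails):
    """Individual project users who are missing from the authorized set."""
    users = {
        member.split(":", 1)[1]
        for _, member in iter_members(policy)
        if member.startswith("user:")
    }
    return sorted(users - set(authorized_emails))


def has_role(policy, member, role):
    return (role, member) in set(iter_members(policy))


# Hypothetical placeholder address, not the real ISB-CGC monitoring SA.
MONITORING_SA = "serviceAccount:monitoring-sa@example-project.iam.gserviceaccount.com"

example_policy = {
    "bindings": [
        {"role": "roles/owner", "members": ["user:pi@example.org"]},
        {"role": "roles/editor", "members": [MONITORING_SA, "user:postdoc@example.org"]},
        {"role": "roles/viewer", "members": ["group:lab@example.org"]},
    ]
}

print(multi_member_identities(example_policy))                  # ['group:lab@example.org']
print(unauthorized_users(example_policy, {"pi@example.org"}))   # ['postdoc@example.org']
print(has_role(example_policy, MONITORING_SA, "roles/editor"))  # True
```

In this example the project would fail registration twice over: a Google Group holds a project role, and one member is not in the set of dbGaP-authorized users supplied by the caller.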

Registering your Google Cloud Project Service Account

To register your GCP and its Service Account with ISB-CGC, select the “persona” icon next to your login name (see first image above), which takes you to the following page:

../_images/RegisteredGCPs.png

Select the “Register a Google Cloud Project” link. That takes you to the following page:

../_images/RegisterAGCPForm.png

Please fill out the form following the instructions that are provided. You can “hide” the instructions by selecting the blue Instructions button. You must enter your GCP ID and enable the isb-cgc service account as an editor in your project to move on to the next step.

../_images/project_info1.PNG

Please be sure to add both service accounts listed below. If you don’t add both service accounts you will run into issues viewing the controlled data in ISB-CGC.

../_images/RegisterServiceAccountsList.PNG

Once you have completed these steps, a listing of the members of the GCP you are registering will be shown at the bottom of the same page (see screenshot below):

../_images/GCPMembers.png

Pushing the “Register” button will take you to the next screen:

../_images/0007projectregistered.PNG

Select “Register Service Account” from the drop-down menu to the left of the GCP you want to add a service account to. By default, the Compute Engine default service account appears in the “Enter the service account ID” text box (see screenshot below). Then select the checkbox for each Controlled Dataset you plan to access. Currently the available options are controlled TCGA data and controlled TARGET data.

../_images/RegisterAServiceAccountFirstScreen.PNG

If you receive the error message shown below, you need to enable the Compute Engine API for your Google Cloud Project. For more information on how to enable all the APIs you will need to work on a Google Cloud Project, please go here.

../_images/EnableComputeEngineError.PNG

Once you select the “Verify Service Account Users” button at the bottom of the page, you will be presented with several lists: the Verification Results, Google Cloud Project User ISB-CGC Registration and Identity Linkages, Dataset Permissions Verification, Registered Service Account Verification Results, Google Cloud Project Verification Results, and Google Cloud Project Service Account Verification Results (see screenshots below). Every column MUST show a green check mark for each user before your service account can be registered.

../_images/ServiceAcctRegTable.png ../_images/ServiceAcctRegTable2.png
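The rule applied to these tables is all-or-nothing: a single failed column for a single user blocks registration for controlled data. As a rough illustration, with hypothetical column names standing in for the real table headers:

```python
def failing_checks(results):
    """Map each user with at least one failed check to the list of failed checks.

    `results` has the shape {user: {check_name: passed}}.
    """
    return {
        user: [name for name, passed in checks.items() if not passed]
        for user, checks in results.items()
        if not all(checks.values())
    }


def can_register_for_controlled_data(results):
    # Every check must pass for every user.
    return not failing_checks(results)


# Column names below are placeholders, not the actual table headers.
example_results = {
    "pi@example.org": {"registered": True, "nih_linked": True, "dataset_authorized": True},
    "postdoc@example.org": {"registered": True, "nih_linked": False, "dataset_authorized": True},
}

print(failing_checks(example_results))                    # {'postdoc@example.org': ['nih_linked']}
print(can_register_for_controlled_data(example_results))  # False
```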

If all the requirements for registering a service account are met, the account will be registered. If not, the service account will only be registered for Open Datasets. The final screen below shows the final registered data set (shown by selecting the drop-down menu beside the service account count highlighted in red).

../_images/ServiceAcctRegSuccess.png

Managing your Google Cloud Project(s) and Service Account(s)

Once your GCP(s) and Service Account(s) are registered, you can add or remove additional service accounts by following the instructions below. You can also extend the use of a service account for another 7 days, or reauthorize a service account after you have corrected errors that previously caused it to have its permissions revoked.

Adding additional Google Cloud Projects

To register additional Google Cloud Projects (GCPs) that you own, so that programs running in them can access controlled data, select the “+ Register New Google Cloud Project” button from the “Registered Google Cloud Projects” page (see screenshot below).

../_images/RegisterAnotherGCP.PNG

Deleting Google Cloud Projects

To delete a GCP that is registered, select the “Unregister Project” button from the dropdown menu beside the project you are removing on the “Registered Google Cloud Projects” page (see screenshot below).

../_images/UnregisterAGCP.PNG

Adding additional service accounts to a given Google Cloud Project

To add additional service accounts to a given GCP, select “Register Service Account” again from the dropdown menu beside the project that will hold the service account (see screenshot below).

../_images/0007projectregistered.PNG

Adjusting a Service Account using the Adjust Service Account page

To add or remove a controlled dataset for one specific service account, select the plus “+” icon next to the trash can (see screenshot below).

../_images/AdjustServiceAccount.png

Deleting Service Accounts from Google Cloud Projects

To delete a service account from a GCP (not allowing it to be used to programmatically access controlled data), push the “trashcan” icon beside the service account (see screenshot below).

../_images/DeleteServiceAccount.png

Extending Your Service Account Access by 7 Days

Once you have registered a Service Account, you have 7 days before the access is automatically revoked. To extend the service account access another 7 days (e.g. if your program is still running), select the “refresh” icon beside the service account (see screenshot below).

../_images/RefreshServiceAccount.png

Reauthorizing Google Cloud Project Service Account(s)

Your service account may have its permissions revoked (because, for example, the 7-day limit has expired, or you have added a member to the GCP who is not authorized to use that controlled data). If permissions were revoked because an unauthorized user was added to the project, the Google Cloud Project owner will be sent an email specifying the Service Account, GCP Project, and user which resulted in their access being revoked. To reauthorize the service account 1) remedy the problem that resulted in access being denied, and 2) select the “adjust” icon beside the service account (see screenshot below) and add the controlled datasets to the service account.

../_images/AdjustServiceAccount.png

Google Cloud Projects Associated with an Organization Will NOT Work with Controlled Data

If your Google Cloud Project is associated with an organization, you will be unable to register its service accounts for controlled data. You will receive an error message similar to this: “GCP cgc-08-0126 was found to be in organization ID 8784632854871; its service accounts cannot be registered for use with controlled data.” This is mainly because ISB-CGC cannot see the permissions associated with the organization, which makes such a project a security risk. We are currently working with Google to resolve this issue.

../_images/OrganizationFound.PNG
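You can check for this condition before attempting registration. The sketch below assumes the project description has the JSON shape printed by `gcloud projects describe PROJECT_ID --format=json`, in which a project that sits under an organization or folder carries a `parent` entry with `type` and `id` fields. The function name and the standalone project ID are illustrative, and folders are treated as organization membership on the assumption that folders only exist inside an organization.

```python
def organization_ancestor(project):
    """Return the parent resource if the project sits under an organization
    or folder, otherwise None."""
    parent = project.get("parent")
    if parent and parent.get("type") in {"organization", "folder"}:
        return parent
    return None


# Values taken from the example error message above.
in_org = {
    "projectId": "cgc-08-0126",
    "parent": {"type": "organization", "id": "8784632854871"},
}
# Hypothetical standalone project with no parent resource.
standalone = {"projectId": "my-standalone-project"}

print(organization_ancestor(in_org))      # {'type': 'organization', 'id': '8784632854871'}
print(organization_ancestor(standalone))  # None
```

A non-`None` result indicates that the project's service accounts are likely to be rejected for controlled data.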

Your Responsibilities

You should think about securing controlled data within the context of your GCP project in the same way that you would think about securing controlled data that you might download to a file-server or compute-cluster at your own institution. Your responsibilities for data protection are the same in a cloud environment. For more information, please refer to the NIH Security Best Practices for Controlled-Access Data.

“NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs) and the IT staff who will be supporting these projects, to make sure they understand their responsibilities.” (Ref: The Cloud, dbGaP and the NIH blog post 03.27.2015)

