In April 2016, Repositive sponsored a workshop organised by DNAdigest and the University of Cambridge Open Data Team on 'Finding and Accessing Human Genomic Data for Research'. At that event, I met Mae, a postdoctoral researcher, and after a quick chat it became clear that she had a wealth of experience in applying for access to restricted datasets from both dbGaP and EGA.
Our previous blog posts on accessing data in dbGaP1 and EGA2 give a detailed overview of the process. However, as another of our blog posts explains3, applying for access to restricted data is complicated and varies depending on who you are, where you are and what data you want. I was therefore keen to document the specific perspective of a researcher at the University of Cambridge.
This blog post describes the most relevant and informative points for researchers at the University of Cambridge who want to apply for access to dbGaP or EGA data. I have written it from the perspective of Mae.
Understanding the application process
For EGA, the website was straightforward to navigate. As soon as I found the dataset I wanted, the next steps I had to take were obvious.
Navigating the dbGaP pages was quite hard at first; I got lost going from one webpage to the next. Once I found the dataset I wanted, I hit a wall: ‘login with an eRA Commons account to apply’.
"... I was confused; how does one get an eRA Commons account?"
"It’s not like things are written in steps of what comes 1st, 2nd, and 3rd, you have to scramble around until you figure it out."
To their credit, the helpdesk email responses were helpful and had a relatively quick turn-around time.
There were three main ‘things’ I needed before I could apply for access to dbGaP:
- An eRA Commons account - available only to PIs; it needs to be set up by the University’s research office.
- A signing official - who will sign off the legal consents to guarantee the data will be used appropriately.
- An IT manager - a formally named contact responsible for ensuring that the institution/department provides the appropriate security measures for downloading and storing the data, accessible to that particular applicant only.
In the long run, once I had finally figured it out, the dbGaP process was easier than applying for the EGA datasets, though for different reasons.
I think the process varies from institution to institution, and even from department to department. But my experience followed these steps:
- I spoke to my PI and established that an eRA account had already been set up, which saved us some time. The process of applying for an eRA account is detailed here
- I went to my Departmental Administration to see if they could sign the documents.
- They sent me to the School Research Office, which, at the University of Cambridge, are in charge of arranging legal documentation including contracts, grants, material transfer agreements (MTA) and data transfer agreements (DTA).
- The Research Office has a designated individual who processes the dbGaP applications. The DTA for dbGaP is always the same, irrespective of the dataset, because all of the datasets are held under the same NIH legal restrictions and regulations.
- For EGA applications, each DTA must be processed separately because there is no common template for these ‘contracts’ across datasets.
- I went to the IT helpdesk with the form – they didn’t know who was responsible for such applications.
- They sent me to a higher tier computing management team – they understood the IT-relevant bits in the DTA and confirmed that they could support those requirements, but they didn’t feel comfortable being the named individuals.
- Finally, I was forwarded further up to the head of IT for the University, who was happy to be my named individual. I know that in some departments, the departmental head of IT can serve as the named individual – as I said it varies.
In the end...
The thing to note about EGA is that, unlike dbGaP, access to each dataset is controlled by its own specific custodian. So although the initial process was easy to understand, each application was a separate entity. The forms and legal contracts varied, which caused quite a headache and delays in the Research Office’s review queue.
The overall processing time depended on two things:
- Whether the Research Office was happy with the legal requirements, or whether they needed to clarify or negotiate the terms.
- How responsive the custodian or Research Office on the other side was.
After I got my head around it, it was relatively smooth. The Research Office knew what to expect from the consent forms and the designated dbGaP application ‘signing official’ was comfortable with the process.
I struggled with the TCGA applications because the setup was confusing:
- The first TCGA dataset I wanted was within the consortium, so first I had to apply for access to the broad TCGA consortium project umbrella.
- The second dataset was present on one of the TCGA data portals but was not included in the consortium project umbrella.
So I had to do two dbGaP data applications and wait for each relevant Data Access Committee (DAC) to approve access. The data download itself was also remarkably different, because the TCGA consortium has its own download portal.
The first time around, the whole process of application submission took just over a month, with limited progress on any given day. The timings for application approval varied from dataset to dataset depending on the DAC. The quickest was 2 weeks while the longest approval took about 2 months.
The research proposal
It doesn’t have to be long - only about two paragraphs. With dbGaP you will also be asked to write a short version that the general public should be able to understand: the NIH, which manages dbGaP, is publicly funded, and they want the public to be aware of the research they are facilitating and funding. It is pretty clear from the forms what you need to write, and examples can usually be found by looking at other people’s proposals.
It can be difficult to write, however, as you have to keep in mind that this will be read by scientifically astute individuals you do not know. You need to explain the reason for needing the data and the thrust of your research proposal, without giving away any IP that a competitor could use. This is especially true with dbGaP where all proposals are published online and in the public domain.
EGA vs dbGaP:
- With EGA it was easier to understand the application process but completing the process was a lot smoother with dbGaP (once you have figured it out).
- EGA requires you to apply to each owner of the data separately, but it is clear from the start where to go and who to contact.
- dbGaP is centralised, so all dataset contracts are the same, but it is not clear where you have to start.
Think about your storage space:
- You need to know how big the data is that you are applying for.
- Will you be able to fit it in your given folder on a server?
- If not, where will you get the space?
- Remember the space must also be protected as per the DTA you will be signing.
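As a quick sanity check before applying, you can compare the dataset size quoted on its summary page against the free space on your server. A minimal sketch, assuming GNU coreutils `df`; the dataset size and download directory below are placeholders you would substitute with your own:

```shell
# Placeholder values - take the real size from the dataset's summary page,
# and use your actual (DTA-compliant) download directory.
DATASET_GB=500
DOWNLOAD_DIR=.

# Free space, in whole GB, on the filesystem holding the download directory
# (the --output flag is GNU-specific and not available on macOS/BSD df).
AVAIL_GB=$(df -BG --output=avail "$DOWNLOAD_DIR" | tail -1 | tr -dc '0-9')

if [ "$AVAIL_GB" -lt "$DATASET_GB" ]; then
  echo "Not enough space: need ${DATASET_GB}G, have ${AVAIL_GB}G"
else
  echo "OK: ${AVAIL_GB}G free for a ${DATASET_GB}G dataset"
fi
```

Remember that raw download size is only a lower bound - leave headroom for the files your analysis will generate.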
Think about what sort of analysis and processing you are going to do with the data once you do have it:
- Be ready, so that you can start work as soon as you get the data.
- Back to space again - how big will the data get once you have analysed and processed it?
Give yourself plenty of time, especially if this is your first application:
- Understanding the process, finding the right people, filling in the forms and waiting for approval takes time.
- If you are under time pressure you will get frustrated, and might not get the data in time.
Understand what you need before you start the application process:
- Read the application help documentation.
- Read the legal consents and DTAs.
- Read some blog posts 1,2.
Keep in mind that you only have 1 year at a time:
- You will have to renew your application and authorisation keys each year, to show you still need access to the data – this is far easier than the first time around, but still…
- Don’t waste it.