I have previously written a series of posts detailing the issues with dbGaP's application and access procedures. But dbGaP isn't the only repository of matched genotype and phenotype data. The European Bioinformatics Institute (EBI) maintains the "European Genome-phenome Archive" (EGA).
Like dbGaP, the data stored in the EGA is only available to those who have successfully applied for access. So to round out my thrilling survey of genomic repository access procedures, I have had a look at the EGA. What follows is some practical advice on how to apply for access to EGA data and how to make use of it if your access request is approved.
A different kind of beast
The NCBI's dbGaP both hosts genetic data and takes responsibility for considering researchers' requests for access to it. Getting access to their system is a cumbersome procedure that involves collecting a half dozen credentials. However, once you have gone through this pain, all requests are made and considered through dbGaP directly.
The EGA takes a different approach. Although it stores the genetic and phenotypic data, it hands responsibility for approving or rejecting researchers' access requests to external "Data Access Committees" (DACs). When someone submits data to the EGA, they must also provide information about the DAC that is responsible for granting access to it.
What this means is that an attempt to access data from the EGA looks like this:
- Find a data set you are interested in.
- The EGA will supply you with a link to the access procedure of the institution responsible for that data set, or the email address of someone who considers data access requests.
- You do whatever is asked of you by the DAC appropriate to your data set.
- The DAC tells the EGA to supply you with an EGA login.
- You log in to the EGA with the supplied login and download the data that you have been approved to access.
In practice, this has a number of consequences. Firstly, the amount of time it takes to get access to a given data set depends very much on the procedures and responsiveness of the DAC responsible for that data set. Furthermore, if you want to access multiple data sets overseen by different DACs, you presumably need to make separate applications to each DAC. I say "presumably" because my experience is limited to applying for access to a single data set, but given the overall philosophy, I can't see how else it could work.
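To make the "one application per DAC" point concrete, here is a minimal sketch that groups a wish list of data sets by the committee overseeing each one. The data set IDs and DAC names are placeholders made up for illustration, and the idea that each group needs its own application is the same presumption as above, not something I have confirmed.

```python
from collections import defaultdict


def applications_needed(dataset_to_dac):
    """Group data sets by the DAC that oversees them (one application per DAC)."""
    by_dac = defaultdict(list)
    for dataset_id, dac in dataset_to_dac.items():
        by_dac[dac].append(dataset_id)
    return dict(by_dac)


# Placeholder IDs and committee names, for illustration only.
wanted = {
    "DATASET_A": "DAC_X",
    "DATASET_B": "DAC_X",
    "DATASET_C": "DAC_Y",
}

for dac, datasets in applications_needed(wanted).items():
    print(f"Apply to {dac} for: {', '.join(datasets)}")
```

Three data sets spread over two committees would mean two separate applications, each with its own procedure and turnaround time.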
How long will it take?
As discussed above, the time taken to access a data set depends very strongly on the Data Access Committee responsible for it. Nevertheless, I have included a flow chart giving my rough estimates of the time taken to get access to data overseen by the Sanger DAC. I can't say how representative these estimates are of other committees, but they should at least be in the ballpark for people applying via the Sanger's DAC 1.
Making your application
Once again, what you'll need to make your application will depend on the requirements of the DAC you have to apply to. If you're unsure which DAC you need to apply to or what their procedures are, the EGA provides a step by step guide here.
In my case, I applied for data overseen by the Sanger's DAC, which required me to register an account here. Once this was done, I had only to fill in a fairly straightforward web form which constituted my research proposal.
As part of this, I had to give the details of someone senior in my organisation who needed to sign off on my request. This has the potential to be a time-consuming step, but it's a largely unavoidable part of these types of controlled access procedures.
Other than that, the only thing that struck me as potentially difficult about the application procedure was that it asks you to "list your five most relevant publications". I don't know how strictly having five relevant publications is treated as a requirement, but if it is enforced, it effectively excludes anyone early in their career from accessing data in the EGA. Which may be the point, but it is still worth noting.
Accessing the data
At this point I'll assume you've successfully done whatever your DAC asked you to do and have been given access to the data you were after. Hooray! However, getting the data onto your system in a useable format is more involved than you might expect.
To proceed, you will need a valid EGA login, which should have been issued to you as part of the application procedure. It also appears to be possible to log in using a "Federated Identity" recognised by the EGA 2.
To get started downloading data, go here and log in.
Why is this so complicated?
Unfortunately, downloading the data you have access to is not as simple as clicking a link and saving the resulting files to disc. The rationale for this additional complexity is that genomic data sets are often very large and can benefit from complex systems that only make data available when it's needed and prevent these large downloads from breaking.
To me, it seems like an over-engineered solution that creates more problems than it solves. But if you want the data you've just applied for, you have to work with what you're given.
How to download files
The download procedure is essentially a three part process, where you need to:
- Log in using your EGA account details
- Request a download "link" for the dataset (or files within a dataset) that you're interested in
- Download the data using the link returned in the previous step.
Which sounds fairly straightforward. The catch is that these steps cannot be done through a simple web interface. Instead, you have to use a command line tool written in Java. The full documentation can be found here.
If you are comfortable writing custom requests to web servers (using tools like curl), downloading files can be done by making a series of requests to a REST API. The documentation for the API can be found here and is pretty easy to follow provided you're comfortable with this sort of thing. However, I suspect most researchers trying to download data from EGA aren't comfortable working this way, which leaves using the Java command line tool as the only option 3.
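For readers wondering what "a series of requests" looks like in practice, here is a minimal sketch of the three-step flow (log in, request, download) in Python. Everything specific in it is an assumption: the base URL, the paths, the parameter names and the shape of the replies are placeholders, not the real EGA API, so take the real ones from the EGA's documentation. The network call is passed in as a function, so the sketch runs against a stand-in without contacting any server.

```python
from typing import Callable, Dict

# Placeholder address: NOT the real EGA API. Take the real paths and
# parameter names from the EGA's API documentation.
BASE_URL = "https://ega.example.org/api"

# A transport takes a URL and a dict of parameters and returns the parsed reply.
Transport = Callable[[str, Dict[str, str]], Dict[str, str]]


def fetch_dataset(send: Transport, username: str, password: str,
                  dataset_id: str, encryption_key: str,
                  label: str) -> Dict[str, str]:
    """Run the three steps in order: log in, request, download."""
    # Step 1: log in and keep whatever session token comes back.
    session = send(f"{BASE_URL}/login",
                   {"username": username, "password": password})
    token = session["token"]

    # Step 2: ask for the data set, giving the key it should be encrypted with.
    ticket = send(f"{BASE_URL}/request",
                  {"token": token, "dataset": dataset_id,
                   "key": encryption_key, "label": label})

    # Step 3: download using the reference returned by step 2.
    return send(f"{BASE_URL}/download",
                {"token": token, "ticket": ticket["ticket"]})


def fake_send(url: str, params: Dict[str, str]) -> Dict[str, str]:
    """Stand-in transport so the sketch runs without touching the network."""
    step = url.rsplit("/", 1)[-1]
    replies = {
        "login": {"token": "demo-token"},
        "request": {"ticket": "demo-ticket"},
        "download": {"saved_as": params.get("ticket", "") + ".encrypted"},
    }
    return replies[step]


result = fetch_dataset(fake_send, "user@example.org", "not-a-real-password",
                       "EGAD00010000498", "my_encryption_key",
                       "my_first_data_request")
print(result)
```

To use something like this for real, you would swap fake_send for a function that makes actual HTTP requests and adjust each step to match the documented API.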
Using the Java client
I won't go into great detail describing how to use the command line tool, because the EGA documents how to use it here. That said, there are a few things that are worth mentioning, so I'll give a brief overview. Once you have the command line tool running, you need to log in by typing:
login <EGA_username> <EGA_password>
Yes, that's your username and password. Yes, they are displayed in the open in the command line tool. For a process that puts lots of annoying obstacles in your way in the name of security, the standard workflow involves some pretty poor security practices (this being one of them). Anyway, you're logged in now, so moving on.
The next step is to make a "download request" (basically, ask the client to give you a direct link to some files). You can request a full dataset using "request dataset", or just some files within a dataset using "request file". To each command, you will need to supply the ID of the dataset or file you want, an "encryption key" and a label for this request. For example, to request the data set "EGAD00010000498", you would type:
request dataset EGAD00010000498 my_encryption_key my_first_data_request
You should think of the "encryption key" as a password you specify that you'll need to decrypt the data once you've downloaded it (so you can actually do something useful with it). The label is just a name that is meaningful to you to give to this download request.
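If you have several requests to make, it can help to generate the command lines rather than type each one. The following is a minimal sketch that assembles a request line in the format shown above. The helper name build_request is my own invention, and the rule that arguments may not contain spaces is an assumption based on the space-separated format, not something taken from the EGA documentation.

```python
def build_request(kind, item_id, encryption_key, label):
    """Return a 'request dataset' or 'request file' line for the client."""
    if kind not in ("dataset", "file"):
        raise ValueError("kind must be 'dataset' or 'file'")
    parts = [item_id, encryption_key, label]
    # The arguments are separated by spaces, so we assume none of them
    # may be empty or contain whitespace.
    for part in parts:
        if not part or any(ch.isspace() for ch in part):
            raise ValueError(f"argument must be non-empty with no spaces: {part!r}")
    return " ".join(["request", kind] + parts)


print(build_request("dataset", "EGAD00010000498",
                    "my_encryption_key", "my_first_data_request"))
# prints: request dataset EGAD00010000498 my_encryption_key my_first_data_request
```

Choosing a distinct label for each request makes it easier to tell them apart later.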
Once you've made your request, you can download it by typing:
The last hurdle: decrypting the data
At this point, the data is finally sitting on your hard drive (or wherever else you chose to save it). But it is still encrypted, so it can't be used for anything yet. To use the data, you need to decrypt it. This is where you need the "encryption key" discussed in the previous section, the one you specified when you created your download request. Using the command line client, enter
decryptkeep <filename> <encryption_key>
If everything works as it should, the file specified should now be decrypted and ready to be used! Repeat this procedure for other files and datasets as necessary.
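Repeating the decryption step by hand gets tedious for a large dataset. Here is a minimal sketch that writes out one decryptkeep line per file, using the command format given above. The file names are placeholders, the helper name decrypt_commands is hypothetical, and I haven't checked whether the client accepts a batch of pasted lines, so treat the output as a checklist to work through.

```python
def decrypt_commands(filenames, encryption_key):
    """Return one 'decryptkeep' line per downloaded file."""
    return [f"decryptkeep {name} {encryption_key}" for name in filenames]


# Placeholder file names; use the names of the files you actually downloaded.
downloaded = ["sample_1.encrypted", "sample_2.encrypted"]

for line in decrypt_commands(downloaded, "my_encryption_key"):
    print(line)
```

The key must be the same one you supplied in the download request that produced those files.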
1. This sounds rather limited, but the Sanger DAC is actually responsible for quite a sizeable number of important and interesting data sets hosted by the EGA. I don't know what fraction of applications go through the Sanger DAC, but I suspect it's fairly substantial.
2. Federated Identities appear to be external institutional logins that the EGA will accept to authenticate you and log you in to the EGA platform.
3. Given that the REST API is well documented, there's nothing stopping a third party from writing and maintaining their own tools for working with the EGA. A quick Google search didn't turn up anything, but if you know of any such tools, please let me know.