Getting data out of the EGA



I’m Jeff Almeida-King, the Senior User Support Officer from the European Genome-phenome Archive (EGA). My role is to monitor, maintain and improve the EGA data flow, working with users to facilitate submissions from initial deposition intent and study design through to delivering the data securely to the customer.

Thank you to Adrian Alexa for inviting me to write a guest blog about EGA access, in response to some comments I added to the blog by Matthew Young.

The EGA in a nutshell

The EGA is a collaborative partnership between the European Bioinformatics Institute (EBI) in Hinxton and the Centre for Genomic Regulation (CRG) in Barcelona.

We provide secure archiving and distribution services for human biomedical –omics data, including sequencing and array-based files and associated phenotypes, deemed subject to controlled access by the data providers.

As of now, we host 2887 datasets, spread across 1447 studies, that cover controls and a number of broad disease types including cancers, cardiovascular diseases and neurological and infectious diseases.



Top-level view of study disease types in EGA



Study sub-types for Cancer at the EGA


Our data depositions vary in scope, from single one-off submissions that support a publication to large on-going submissions from international consortia such as the International Cancer Genome Consortium (ICGC) or RD-Connect.

As an archive, we are used to working with large, ever-increasing volumes. We have 3.6 PB of files in distribution at the moment and are archiving new data at a rate of 100 TB/month in 2016 (up from 40 TB/month in 2013). We distribute around 500 TB/month to EGA account holders.


The Dataset concept

Clearly, we have a large data pool, representing a mixture of many different file types reflecting different –omics technologies.



File types at the EGA



Experiments at the EGA


Files at the EGA are sorted into datasets, defined as collections of related files with a common access policy. The original submitter, as part of the submission process, generates datasets to which users then apply for access after release.

This approach fulfills 3 user/submission requirements.

1. Multi-omics data sorting
Take the cancer study EGAS00001000978. It has two datasets, EGAD00001001039 (sequencing) and EGAD00010000644 (array data), so the submitter can present each data type separately and applicants can request only the data they are interested in.





2. Data release orientated submissions
Datasets can be used to reflect time-stamped data releases, which may be useful for submitters that have funding linked to a data-sharing plan. The UK10K project, with its 60-day machine to distribution policy, is a good example of this.





3. Multiple access policies
Data within the same study may be subject to different data access criteria. For example, the files within the 1958 Birth Cohort control dataset (EGAD00000000021), used as one set of the controls for the WTCCC2 studies, can be accessed only by non-commercial enterprises. The second set of controls and WTCCC2 cases are not subject to the same access criteria.
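
To make the dataset concept concrete, here is a minimal sketch of how a study, its datasets and their access policies relate to each other. The class names, field names and the "placeholder-policy" value are assumptions made for illustration only; they are not part of any EGA schema. The accessions are the ones from the first example above.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A collection of related files that share one access policy."""
    accession: str                       # EGAD... identifier
    kind: str                            # free-text data type (hypothetical field)
    policy: str = "placeholder-policy"   # hypothetical label, not a real EGA policy
    files: list = field(default_factory=list)


@dataclass
class Study:
    """A study groups one or more datasets."""
    accession: str                       # EGAS... identifier
    datasets: list = field(default_factory=list)

    def datasets_of_type(self, kind):
        """Return the accessions of the datasets holding the given data type."""
        return [d.accession for d in self.datasets if d.kind == kind]


study = Study(
    accession="EGAS00001000978",
    datasets=[
        Dataset("EGAD00001001039", "sequencing"),
        Dataset("EGAD00010000644", "array data"),
    ],
)

# An applicant interested only in the array data applies for one dataset.
print(study.datasets_of_type("array data"))
```

Because the access policy sits on the dataset rather than on the study, the same structure also covers the release-based and multiple-policy cases.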




Access: Why is the EGA different?

Each dataset falls under the governance of a Data Access Committee (DAC), to which an application must be made in order to gain access. Access, in this case, means 'download the files and metadata'.

The DAC contact details are clearly defined on the dataset page:





In most cases, users are able to identify the DAC and make their application without having to contact EGA Helpdesk to be re-directed.

It’s very important to note that the DAC retains ownership of the dataset/s that it governs. This includes the data files, associated metadata and access decisions for each dataset.

The role of the EGA is to act as a guardian of the data, to ensure that files are archived and distributed securely to the users approved by the DAC.

Whilst we play no role in the access decisions, we do regularly check that DACs are still ‘active’ and that their contact details are up to date. We also provide clear guidelines on how a DAC should behave.


What is a DAC?

A DAC is the organisation responsible for making data access decisions for the dataset, or collection of datasets, under its governance.

DACs are formed during the submission process and are generally made up of one or more individuals associated with the submitter organisation or consortium (for larger coordinated submissions). The DAC must have at least two contacts, one of which will appear as the main contact on the EGA website, as the point of contact for applicants.

Typically, EGA DACs are responsible for more than one dataset and will have their own policies regarding access granularity. For example, the ICGC DAC grants annual access to all its 262 datasets by default, regardless of the dataset specified by the applicant. And, at the other end of the spectrum, the Wellcome Trust Sanger Institute DAC grants access on a dataset-by-dataset basis.

Applications to a DAC generally consist of completing an application form (online or offline), which usually requires the applicant to describe their usage plan and to provide further details of their scientific endeavour.

Applicants will also be required to complete a Data Access Agreement (DAA), sometimes called a Material Transfer Agreement (MTA).

This is a legally binding document that sets the terms and conditions for data use, such as how the data is to be stored once downloaded or the terms of any publication embargo. This document can vary across datasets, even for a given DAC, and is often the reason why some DACs grant access on a dataset-by-dataset basis. Here is a link to a guide for completing the ICGC application form and DAA.





In our experience, most applications from bona fide researchers are approved. Once a DAC approves an application, the DAC contact either uses the EGA DAC Admin tools to create an EGA account on our system or instructs the EGA Helpdesk to create the account on behalf of the DAC.

When an EGA account is created, the account holder is sent an email with their login details and access instructions.

It is at this point that the responsibility of providing access to data shifts from the DAC to the EGA.


Consent code access: A note on the future of data access at the EGA

Matthew Young made a very good point in his blog, regarding applying for datasets across multiple DACs. As the EGA has expanded in size, it is not an uncommon occurrence for an applicant to have to make several applications across multiple DACs in order to gain access to their datasets of interest.

We are working with DACs to try to address this admin issue with the introduction of consent code based access. This methodology works on the basis that datasets within EGA can be sorted in accordance with broad consent terms. For example, some datasets may have no access restrictions within research use (RUO), whilst others may only be used for a range of limited study cases (RS-XX).

When an applicant makes a successful application for a dataset, they will automatically gain access to all the datasets that share the same consent code, even if the datasets fall across multiple DACs.
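
The idea can be sketched in a few lines. This is only an illustration of the grouping logic described above, not how the EGA implements it; the dataset identifiers and their consent-code assignments below are hypothetical placeholders.

```python
def datasets_granted(approved_dataset, consent_codes):
    """Return every dataset that shares the approved dataset's consent code."""
    code = consent_codes[approved_dataset]
    return sorted(ds for ds, c in consent_codes.items() if c == code)


# Hypothetical datasets, possibly governed by different DACs.
consent_codes = {
    "DATASET_A": "RUO",
    "DATASET_B": "RUO",
    "DATASET_C": "RS-XX",
}

# A successful application for DATASET_A also opens up DATASET_B.
print(datasets_granted("DATASET_A", consent_codes))
```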


Right, I have an EGA account. Now what?

Registered EGA users have access to an online account, accessed through the EGA Homepage. This portal may be used to display the dataset access permissions associated with the account and to download metadata (sample information and sample to file mapping files) for each dataset.





Displaying all datasets that can be accessed through your EGA account





Displaying samples and option to download metadata for a specific dataset

Log-in details also enable the user to access the EGA download microservices to query and download the files linked to each dataset that have been approved for access.


Downloading from the EGA

EGA data downloading is primarily provided via a REST API but, for added convenience, we provide a JAVA application, in the form of an EgaDemoClient that wraps around the API to enable programmatic or command line access.

There are 3 ways to download files from the EGA:
1. EgaDemoClient in direct command or interactive mode
2. Direct access using the API
3. Using the API wrapper JAVA class

I’ll touch on all 3 methods in this blog.

In practice, method 1 should fulfill the needs of most downloaders and using the EgaDemoClient also provides added functionality, such as downloading directly to Globus Online and FUSE Layer usage to access downloaded files without having to decrypt them (BETA). More on the FUSE layer a little later.


The Common approach

Regardless of the method chosen to download files, the approach for downloading is always the same.





1. First, make a request

This means providing your log-in details and specifying either a file or dataset that you wish to download. Multiple files and datasets may be included in a single request.

When making your request you must specify a label or name for your request and provide the ‘encryption key’ you wish to use to encrypt the files.

Why do I need to specify an encryption key? Files are encrypted on the EGA side BEFORE transfer, which means that only encrypted files are in-transit. This is a very important security step, which ensures that files are securely transferred. In addition, it means that plain HTTP connections (port 80) can be used for transfers, providing better performance.



2. Download your request

Specify your request label to start your download. Data can be downloaded using TCP or UDT (through client only) and a maximum of 15 parallel streams may be set.



3. Decrypt files

Once the files have been downloaded, you decrypt them using the key specified in your original download request (1). Single or multiple files can be decrypted at once.

The files are then ready to use.


Downloading methods: EgaDemoClient (version 2.22)

System requirements are listed here.

The EgaDemoClient can be run in two modes: interactive and direct.

Use of the EgaDemoClient in interactive mode has been well covered in Matthew Young’s blog. This mode is accessed via a login after starting the client:

EGA > login demo@test.org

Password>123pass

Once inside the shell, you have access to a host of commands that should fulfill most downloader needs. Just type: ‘instructions’ to display the full list of commands.

In addition to the commands required to request, download and decrypt files you also have access to the following commands:
datasets - to list all permitted datasets.
files dataset {datasetid} - to list all files in dataset {datasetid}
size {type} {id} - total size {type='file'|'dataset'|'request'}
downloadmetadata {dataset} - download metadata for specified dataset, if it exists

And so on….

Some nice features of the client include the command ‘testbestdownload’, which runs a series of short downloads using TCP and UDT protocols to determine the best protocol to use for your network.

You can also run bandwidth checks using the ‘testbandwidth’ command to determine the optimum number of parallel streams for your network and so maximise download speeds.

The direct command line mode of the client provides exactly the same functionality as the interactive mode (with a few extras!) but enables commands to be run directly from the command line.

For example, listing all files in the dataset EGAD00010000498 would look like this:

java -jar EgaDemoClient.jar -p demo@test.org 123pass -lfd EGAD00010000498

(assume: user name = demo@test.org, password = 123pass)
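
If you want to drive the direct mode from a script, one option is to build the argument list in code and hand it to a subprocess. The sketch below only uses the options shown above (-p and -lfd); the function name and the assumption that EgaDemoClient.jar sits in the current directory are mine.

```python
import shlex


def list_files_command(username, password, dataset_id, jar="EgaDemoClient.jar"):
    """Build the argument list that lists all files in one dataset."""
    return ["java", "-jar", jar, "-p", username, password, "-lfd", dataset_id]


cmd = list_files_command("demo@test.org", "123pass", "EGAD00010000498")

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it (requires Java and the client jar):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than as one string, avoids shell quoting problems if a password contains special characters.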

One nice feature of the direct mode is the -debug option, used like this:

java -jar EgaDemoClient.jar -debug demo@test.org 123pass

This command checks that your network is set up to access the download API, by ensuring that JAVA can access the internet (some firewalls block this). It also checks that your network supports TCP and UDT protocols.


Quick note on the FUSE layer

This function is available only using the direct command line. The FUSE layer allows a directory of encrypted files to be mounted in an empty directory, where they can be accessed as unencrypted files. This allows for encrypted files to be used directly, without having to be decrypted first.

This function is accessible with the ‘-fuse’ option.

sudo java -jar EgaDemoClient.jar -fuse

This command scans the source directory. Encrypted files are then wrapped in an access layer to perform on-demand random-access decryption, allowing users to access the file without applying decryption first.


Downloading methods: Direct access using the API

For users downloading large volumes of data from the EGA on a frequent basis, we would recommend interaction with the API.

EGA provides a JAVA API Wrapper class to make it easy to integrate interaction with this API into Java programs, but this example will focus on direct interactions with the API.


The workflow remains the same

Let's say we wish to download all the files associated with the dataset EGAD00001000705.

(Examples using “testuser@ebi.ac.uk” as username and “testpassword” for user’s password)

I first log in using the ‘loginrequest’ form, shown below. You may also log in using basic authentication with base64-encoded credentials or using URL parameters, for increased security. More about login here.

curl -k -X POST -F loginrequest='{"username":"testuser%40ebi.ac.uk","password":"testpassword"}' -H "Accept: application/json" https://ega.ebi.ac.uk/ega/rest/access/v2/users/login

This returns a JSONArray with two elements, the first of which indicates success or failure.

Success looks like this, with the session token returned:

"result":["success","b195b0c5-b574-43f2-9910-37d5853826ba"]

I can now use the session token to return results on the datasets my account has access to, the files within the dataset and so on. Full API documentation can be viewed here.
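
For scripted use, building the login form value and pulling the session token out of the response might look like the sketch below. It assumes that the full response is a JSON object wrapping the "result" fragment shown above, and that on failure the second element carries some detail; both are assumptions, and the function names are my own.

```python
import json
from urllib.parse import quote


def login_form_value(username, password):
    """Build the JSON string sent as the 'loginrequest' form field."""
    # As in the curl example, the username is percent-encoded ('@' -> '%40').
    payload = {"username": quote(username, safe=""), "password": password}
    return json.dumps(payload, separators=(",", ":"))


def session_token(response_text):
    """Return the session token from a login response, or raise on failure."""
    status, detail = json.loads(response_text)["result"]
    if status != "success":
        raise RuntimeError(f"EGA login failed: {detail}")
    return detail


# Assumed response shape, wrapping the fragment shown above.
sample = '{"result":["success","b195b0c5-b574-43f2-9910-37d5853826ba"]}'
print(login_form_value("testuser@ebi.ac.uk", "testpassword"))
print(session_token(sample))
```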

But right now, I just wish to make a download request for the dataset EGAD00001000705:

curl -k -X POST -F downloadrequest='{"rekey":"mykeyfordataset","downloadType":"STREAM","descriptor":"EGAD00001000705_request"}' -H "Accept: application/json" "https://ega.ebi.ac.uk/ega/rest/access/v2/requests/new/datasets/EGAD00001000705?session=b195b0c5-b574-43f2-9910-37d5853826ba"

I have set my encryption key as ‘mykeyfordataset’ and called my request ‘EGAD00001000705_request’.

I should see an acknowledgement:
{"numTotalResults":1,"result":["40"],"resultType":"us.monoid.json.JSONArray"}
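
The same request can be assembled programmatically. The sketch below only builds the URL and the 'downloadrequest' form value from the pieces used in the curl call; it does not send anything, and the function name is hypothetical.

```python
import json

ACCESS_BASE = "https://ega.ebi.ac.uk/ega/rest/access/v2"


def dataset_request(dataset_id, session, rekey, descriptor):
    """Return (url, form_value) for a new dataset download request."""
    url = f"{ACCESS_BASE}/requests/new/datasets/{dataset_id}?session={session}"
    form_value = json.dumps(
        {"rekey": rekey, "downloadType": "STREAM", "descriptor": descriptor},
        separators=(",", ":"),
    )
    return url, form_value


url, form_value = dataset_request(
    "EGAD00001000705",
    "b195b0c5-b574-43f2-9910-37d5853826ba",
    rekey="mykeyfordataset",
    descriptor="EGAD00001000705_request",
)
print(url)
print(form_value)
```

Keeping the encryption key and the request label as explicit arguments makes it harder to lose track of which key was used for which request, which matters at decryption time.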

Each file within the request is assigned a ‘ticket’. I can view the tickets associated with my request using the call:

curl -k -H "Accept: application/json" "https://ega.ebi.ac.uk/ega/rest/access/v2/requests/EGAD00001000705_request?session=b195b0c5-b574-43f2-9910-37d5853826ba"

This produces an ArrayList of EgaTicket JSON objects, listing all information about the files (tickets) associated with the specified request, which can be very long.

The EgaTicket object contains information about one requested ticket:

{"encryptionKey":"","fileID":"EGAF00000364923","fileName":"/EGAR00001158109/NCHPDIPG110.t.bam.gpg","fileSize":"132114588843","fileType":"EBI","label":"EGAD00001000705_request","ticket":"8db5ad14-c12a-4e87-bbed-2237c8c58e52","transferTarget":"","transferType":"","user":"testuser@ebi.ac.uk"}

Each ticket can then be downloaded using the following call:

curl -H "Accept: application/octet-stream" http://ega.ebi.ac.uk/ega/rest/ds/v2/downloads/8db5ad14-c12a-4e87-bbed-2237c8c58e52

This produces a binary data stream, which is the file specified by the ticket, encrypted using the encryption key specified at the time the request was made.
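
With many tickets in a request, it helps to turn the ticket list into a simple download plan first. The sketch below assumes the tickets have already been parsed into a list of dictionaries with the fields shown in the EgaTicket example; the function name and the choice of local file name are my own.

```python
DOWNLOAD_BASE = "http://ega.ebi.ac.uk/ega/rest/ds/v2/downloads"


def download_plan(tickets):
    """Map each ticket to (download URL, file name, size in bytes)."""
    plan = []
    for ticket in tickets:
        # Keep only the last path component of the archive file name.
        name = ticket["fileName"].rsplit("/", 1)[-1]
        url = f"{DOWNLOAD_BASE}/{ticket['ticket']}"
        plan.append((url, name, int(ticket["fileSize"])))
    return plan


# One ticket, using the fields from the example above.
tickets = [
    {
        "fileID": "EGAF00000364923",
        "fileName": "/EGAR00001158109/NCHPDIPG110.t.bam.gpg",
        "fileSize": "132114588843",
        "ticket": "8db5ad14-c12a-4e87-bbed-2237c8c58e52",
    }
]

plan = download_plan(tickets)
total_bytes = sum(size for _, _, size in plan)
print(plan[0][0])
print(total_bytes)
```

Summing the sizes up front is a cheap way to check that you have enough disk space before starting a long transfer.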


Verify your download

If a file has been successfully streamed to completion from the download server, the MD5 of the stream that was sent is stored temporarily.

Calling the ‘results’ URL for a download ticket, after the download is complete, retrieves that MD5 (and optionally the local MD5 can be submitted).

This way the MD5 of the received data file can be verified:

curl -H "Accept: application/json" http://ega.ebi.ac.uk/ega/rest/ds/v2/results/{downloadticket}[?md5={local_md5}]

If the local MD5 is provided in the request (optional), then the server will also know if the download was correct.

The result (in the "result" element) is a JSONArray containing the server MD5 and the size of the file that was sent.
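
On the client side, the check comes down to computing the MD5 of the file exactly as it was received (that is, still encrypted) and comparing it with the value reported by the server. Here is a minimal sketch; the function names are hypothetical, and the temporary file merely stands in for a downloaded file.

```python
import hashlib
import os
import tempfile

RESULTS_BASE = "http://ega.ebi.ac.uk/ega/rest/ds/v2/results"


def local_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file in chunks, so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def results_url(ticket, md5=None):
    """Build the 'results' URL, optionally reporting the local MD5."""
    url = f"{RESULTS_BASE}/{ticket}"
    return f"{url}?md5={md5}" if md5 else url


def download_ok(path, server_md5):
    """Compare the local MD5 with the one returned by the server."""
    return local_md5(path) == server_md5.strip().lower()


# Stand-in for a downloaded (still encrypted) file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
checksum = local_md5(tmp.name)
matches = download_ok(tmp.name, checksum.upper())
os.remove(tmp.name)

print(results_url("8db5ad14-c12a-4e87-bbed-2237c8c58e52", checksum))
```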


Overview

The EGA is a dataset-orientated secure archive for controlled access to –omics data. Access to datasets is granted by application to the affiliated DAC and is provided through an EGA account, which enables access to the EGA downloading services.

Data may be downloaded through an EgaDemoClient or by interacting directly with the download API.

If you have any questions or comments, please feel free to contact me directly at jeff@ebi.ac.uk.

