I ended the last post by drawing some comparisons between the two major repositories for submitting expression data (GEO and ArrayExpress). To continue, I will look at one of the major repositories for storing controlled access data (EGA). The diagrams shown below, which are intended to aid your understanding of the submission processes, may hint at their greater complexity and at the higher levels of frustration that users suffer when submitting data to controlled access repositories.
EGA Submissions Overview
EGA is a controlled access repository hosted by the EBI for all types of sequence and genotype data. The NCBI equivalent of EGA is dbGaP. All data submitted to EGA must be subject to controlled access as defined by the original consent forms. EGA splits the type of data for submission into three categories, with different and specific file format requirements, as shown in this figure:
Of note, EGA now also accepts metagenomic data, using the MIxS (Minimum Information about any (x) Sequence) standard to describe this type of data.
Submitting sequence data to EGA involves three main steps: file preparation, file upload and metadata submission. File preparation involves using the Java-based client egaCryptor, which encrypts each file to make it EGA compliant and generates the associated md5sum files. These files can then be uploaded using Aspera or FTP. Finally, the EGA Webin service can be used to register the study and to submit the associated metadata to the REST server.
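To make the checksum part of file preparation more concrete, here is a minimal Python sketch that computes an MD5 digest for a file and writes it to a sidecar file. This is not what egaCryptor does internally: the function names (`md5_of_file`, `write_md5_sidecar`) and the `.md5` sidecar naming are my own assumptions for illustration. It can still be handy for checking on your side that a file has not changed between preparation and upload.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use bounded for large sequence files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_sidecar(path):
    """Write the digest to '<filename>.md5' next to the file (hypothetical naming).

    Returns the path of the sidecar file.
    """
    path = Path(path)
    sidecar = path.with_name(path.name + ".md5")
    sidecar.write_text(md5_of_file(path))
    return sidecar
```

Calling `write_md5_sidecar("sample.bam")` would create `sample.bam.md5` in the same directory. Comparing that value before and after a transfer is a quick way to catch a corrupted or truncated upload.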
Submitting array-based data to EGA follows the same first two steps (file preparation and upload) as submitting sequence data. However, the final step of submitting metadata is slightly different: while you still use the EGA Webin service, you must also complete the Array-based Format sheet (AF).
As with sequence and array-based data, submitting phenotypes to EGA involves file preparation using egaCryptor and file upload using Aspera or FTP. Metadata submission again uses the EGA Webin service, but this time in combination with XML submission to the REST server.
Thoughts on the EGA submission process
Tell me about your experiences with submitting data to EGA. Comment below and get the discussion going!
I again found that one of the major difficulties with the EGA submission process was that the individual submitting the data is often not the researcher who produced the samples, but rather the researcher who performed the analysis. This resulted in communication back and forth between different people, which was confusing and often led to useful or additional information being lost or left out.
"EGA data submission involved very long forms with many input fields. I wasn't always sure what was wanted for each field and it was often unclear what information to put where."
Related Blog Posts
Submitting genomic data to repositories: a necessary nightmare?!
Submitting genomic data to repositories: Gene Expression Omnibus - GEO
Submitting genomic data to repositories: ArrayExpress
Submitting genomic data to repositories: Sequence Read Archive - SRA