ArrayExpress at EMBL-EBI - quality first!

New: Repositive is delighted to be sponsoring Amy Tang's upcoming talk with DNAdigest.org on Fri 7 July in Cambridge!



Special thanks to Amy Tang, the Functional Genomics Curation and Training Project Leader for ArrayExpress and Expression Atlas at the EMBL-EBI, for writing this guest blog post.

What is ArrayExpress?

ArrayExpress is a free, public database at EMBL-EBI for archiving functional genomics data, covering transcriptomics (expression profiling), epigenetics (e.g. ChIP-seq, bisulphite-seq, ATAC-seq) and genomics (e.g. genotyping/SNP-typing). All data come from microarray and/or next-generation sequencing (NGS) technologies, and are either deposited directly at ArrayExpress or imported via an automatic pipeline from a similar repository, the Gene Expression Omnibus (GEO) at NCBI. Transcriptomics data account for over 80% of the experiment entries in ArrayExpress.


Check out the interactive chart of experiment types released publicly in ArrayExpress since its inception, made by Robert Petryszak.


By data, do you mean just some files from a microarray scanner or NGS fastq files?

The raw scan files and fastq files are indeed an important part of the data we archive. However, these files wouldn't be very interpretable without information on the experimental conditions under which they were generated, would they?
  All data therefore come with "meta-data", which by definition "describe" the actual data. The meta-data not only provide the context for data interpretation, but are also crucial for anyone attempting to replicate the experimental procedures or reproduce the results with a comparable experimental set-up. The minimum set of data and meta-data for a given study/experiment is defined in two data standards, MIAME for microarrays and MINSEQE for NGS, and we strive to help data depositors comply with these standards.


Hang on a minute, isn't "ArrayExpress" only for micro "arrays"?

Many people have asked us the same question before.

The answer is... no. :-)

The name of the database was coined in 2000, when microarrays were the major technology used in functional genomics studies. Since 2008, with the advent of next-generation sequencing (NGS), ArrayExpress has been archiving sequencing data too, but we've kept our old name.

The "archiving" of NGS data is done in partnership with the European Nucleotide Archive (ENA), which is part of the Sequence Read Archive (SRA) collaboration together with NCBI (USA) and DDBJ (Japan). The partnership harnesses ENA's bespoke IT infrastructure for archiving very large data files, ArrayExpress's expertise in curating experiments, as well as the ArrayExpress web interface for intuitive presentation of meta-data. For every direct NGS submission, ArrayExpress curates the meta-data and validates the raw (often fastq) files, before brokering them (meta-data + data) to the ENA.


What can I do with ArrayExpress?

Search: find functional genomics data sets of interest by keywords (e.g. "diabetes mellitus", PubMed ID) and by technology (e.g. RNA-seq only). The ArrayExpress search engine uses the Experimental Factor Ontology (EFO) to extend your query to synonyms (e.g. searching for "cerebral cortex" will automatically match "adult brain cortex") and EFO child-terms (e.g. searching for "bone" will automatically return records for "rib" or "vertebra"). It also allows you to make very specific queries, either based on named fields (e.g. the term "diabetes mellitus" must be present in sample annotation), or based on a certain concept (e.g. "the experiment's intent must be about comparing disease vs normal samples").
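To give a feel for what ontology-based query expansion means, here is a minimal sketch in Python. It is not how the ArrayExpress search engine is implemented: the two dictionaries are a toy stand-in for EFO built only from the examples above, and the function names and record IDs are made up for illustration.

```python
# Toy ontology fragment. The terms come from the examples in the text,
# but the data structure is an assumption and is not the real EFO.
SYNONYMS = {
    "cerebral cortex": {"adult brain cortex"},
}
CHILDREN = {
    "bone": {"rib", "vertebra"},
}


def expand_query(term, synonyms=SYNONYMS, children=CHILDREN):
    """Return the term together with its synonyms and all descendant terms."""
    expanded = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # already visited, avoids loops in the toy ontology
        expanded.add(current)
        stack.extend(synonyms.get(current, ()))
        stack.extend(children.get(current, ()))
    return expanded


def search(records, term):
    """Return the records whose annotation matches any expanded term."""
    terms = expand_query(term)
    return [record for record in records if record["annotation"] in terms]


# Hypothetical records, for illustration only.
records = [
    {"id": "exp-1", "annotation": "rib"},
    {"id": "exp-2", "annotation": "adult brain cortex"},
    {"id": "exp-3", "annotation": "vertebra"},
]

bone_hits = [record["id"] for record in search(records, "bone")]
cortex_hits = [record["id"] for record in search(records, "cerebral cortex")]
```

In this sketch a query for "bone" picks up the records annotated with "rib" and "vertebra", and a query for "cerebral cortex" picks up the one annotated with "adult brain cortex", even though neither query string appears in the records themselves. Note that the toy synonym table only works in one direction; a real ontology lookup would be richer.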

Submit and share: Most journals now require functional genomics data sets to be deposited at a public database such as ArrayExpress or NCBI GEO in compliance with MIAME or MINSEQE standards prior to manuscript publication. The archived but unpublished data can be kept private after submission, and then be shared publicly via the ArrayExpress website once the paper is published. We understand that many researchers find submission a tedious process, especially when they're under time-pressure, so we provide a user-friendly web-form tool called Annotare that takes a lot of the submission "pain" away. The median time from creating a submission account to submitting successfully is about 3 hours, and our regular submitters have told us that, once they've got used to the tool, it takes them less than 30 minutes to construct each submission (including time away from the desk to grab a coffee!), so it really isn't that tedious. We also provide a free biocuration service for every submitted data set to promote research reproducibility (more about this below).

Data-mine: for researchers who do large-scale data mining, we provide programmatic access via REST/JSON APIs. All the web interface-based search features (e.g. EFO ontology-based query extension) are also available via these services.
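As a rough illustration of what programmatic access could look like from a script, here is a minimal Python sketch that builds a keyword query URL and fetches the JSON response. Everything specific in it is an assumption: the base URL is a placeholder, and the parameter names `keywords` and `expandefo` are illustrative only, so please check the ArrayExpress programmatic access documentation for the real endpoint and parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint (an assumption): replace it with the address given
# in the current ArrayExpress programmatic access documentation.
BASE_URL = "https://example.org/api/experiments"


def build_search_url(keywords, expand_efo=True, base_url=BASE_URL):
    """Build a query URL; the parameter names here are assumptions."""
    params = {"keywords": keywords}
    if expand_efo:
        # Hypothetical switch asking the service to apply EFO query expansion.
        params["expandefo"] = "on"
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = build_search_url("diabetes mellitus")
# results = fetch_json(url)  # uncomment once BASE_URL points at a real service
```

Keeping the URL construction separate from the download makes it easy to inspect or log the exact query before sending it, which is handy when running many queries in a data-mining loop.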

What are the differences between directly submitted and imported experiments?

  Since most researchers deposit data with us or GEO to satisfy journals' requirements, there isn't much difference in the technology types or biological questions studied in the experiments in either repository. One key difference is that GEO has a much higher data volume (about 6 times that of ArrayExpress direct submissions). That's why we have been importing GEO experiments for some time, to make ArrayExpress a "one-stop shop" for functional genomics data. As of 6 Apr 2017, ArrayExpress contains 59285 GEO series (experiments), which is about 71% of the 83303 public series available from GEO on the same day. GEO currently does not mirror ArrayExpress experiments systematically.
  Another key difference lies in the way we handle submissions. At ArrayExpress, we pay a lot of attention to meta-data quality. We're passionate about capturing rich meta-data at the point of submission, when the experiments are still fresh in our depositors' minds. When data depositors prepare a submission with our web-form tool Annotare, they're guided by the tool's built-in term suggestions and validation service to "do the right thing" and adhere to the MIAME or MINSEQE guidelines. Once an experiment is submitted, we curate it and try to get the meta-data in good shape in consultation with the depositors. During curation, we have caught numerous cases where samples were mislabelled (e.g. a "treated" sample labelled as "control"), control samples were omitted, annotations were riddled with undecipherable acronyms, data files were in unrecognised formats and hence completely unusable, and so on. As people often say, there's "often just one way to be right but many ways to be wrong", which couldn't be more true!


Find out more about functional genomics resources at EMBL-EBI

ArrayExpress is only one of the three functional genomics resources/databases at EMBL-EBI:

To make the most out of ArrayExpress, e.g. how to harness the power of its search features, try this free EBI Train Online course: http://www.ebi.ac.uk/training/online/course/arrayexpress-discover-functional-genomics-data-qui .

If you're interested in submitting data to ArrayExpress via Annotare, we have this recorded webinar video that includes a demo of the tool: http://www.ebi.ac.uk/training/online/course/arrayexpress-why-and-how-submit-your-data

We also developed a three-part conceptual course on functional genomics last year that covers the basics of experiment design, common technologies, data analysis methods and data management, starting with part (I), Introduction and designing experiments: http://www.ebi.ac.uk/training/online/course/functional-genomics-i-introduction-and-designing-e

For information about the Expression Atlas, try http://www.ebi.ac.uk/gxa/help/index.html

Read more posts by Charlotte Whicher