Repositive aims to index existing genomic datasets from multiple data sources in a unified portal that helps users find and access data. In a previous blog post, we presented the release of the Repositive Gene Expression Data Collection. This collection includes datasets from the following repositories: SRA, ArrayExpress, InSilicoDB, Xpressomics, the Allen Institute, GEO, the Expression Atlas and dbGaP.
In this post, we take a closer look at the kind of data this collection offers, which also serves as a proxy for what all repositories of gene expression data currently offer. The Repositive Gene Expression Data Collection comprises 130,299 datasets. We looked into the assay types and the data sources represented in the collection. Similar assays that appear as separate entries in the usual Repositive filters, such as “RNA-seq” and “RNA-seq of non-Coding RNA”, have been merged for consistency into three major types: “RNA-seq”, “Arrays” and “N/A” (not available; believe it or not, some datasets do not state their assay type in their annotation!). Overall, RNA-seq is the most abundant type of dataset (82%), while array-based assays account for 12% of the total and the remaining 6% have an unidentified assay type (Figure 1).
Figure 1: Breakdown of the 130,299 datasets in the Repositive Gene Expression Data Collection by assay type. The vast majority are RNA-seq based assays.
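As an illustration of the merging step described above, here is a minimal Python sketch. It is not the code used to build the collection: the function name `merge_assay_label`, the substring rules and the sample labels (other than “RNA-seq” and “RNA-seq of non-Coding RNA”) are assumptions made for the example.

```python
from collections import Counter

def merge_assay_label(label):
    """Map a raw assay label onto one of three major types.

    Hypothetical rule set: any label mentioning "rna-seq" becomes
    "RNA-seq", any label mentioning "array" becomes "Arrays", and a
    missing or unrecognised label becomes "N/A".
    """
    if not label or not label.strip():
        return "N/A"
    text = label.strip().lower()
    if "rna-seq" in text:
        return "RNA-seq"
    if "array" in text:
        return "Arrays"
    return "N/A"

# Illustrative labels only; real annotations will be more varied.
raw_labels = [
    "RNA-seq",
    "RNA-seq of non-Coding RNA",
    "Transcription profiling by array",
    None,
    "",
]

counts = Counter(merge_assay_label(label) for label in raw_labels)
for assay_type, n in counts.most_common():
    print(f"{assay_type}: {n}")
```

Simple substring rules like these are easy to audit, but they would need extending for any labelling scheme that names an assay without using the words “RNA-seq” or “array”.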
We also looked into the number of datasets that each data source (i.e., repository) has contributed to this collection. The vast majority come from SRA (88,160 – 68%), followed by ArrayExpress (21,160 – 17%) and others (Figure 2). Of the 90,325 RNA-seq datasets in this collection, 98% come from SRA. SRA only provides RNA-seq, consistent with its mission to store raw sequencing data and alignment information from high-throughput sequencing platforms.
Figure 2: Breakdown of the Gene Expression Collection according to the source from which datasets were found. NCBI’s Sequence Read Archive (SRA) provides 68% of the total number of datasets in this collection.
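The SRA percentages quoted above follow directly from the dataset counts. The short Python sketch below reproduces them; the helper name `percent_share` is ours, and the three counts are the ones given in this post.

```python
def percent_share(part, total):
    """Return part/total as a percentage rounded to the nearest integer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * part / total)

TOTAL_DATASETS = 130_299   # whole collection
SRA_DATASETS = 88_160      # datasets contributed by SRA
RNA_SEQ_DATASETS = 90_325  # RNA-seq datasets in the collection

# Share of the whole collection that comes from SRA.
print(percent_share(SRA_DATASETS, TOTAL_DATASETS))    # 68

# Share of the RNA-seq datasets that comes from SRA
# (SRA only provides RNA-seq, so all of its datasets count here).
print(percent_share(SRA_DATASETS, RNA_SEQ_DATASETS))  # 98
```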
In addition, we wanted to know the type of tissue each of these datasets comes from. Figure 3 shows subsets of datasets broken down by tissue. Here we only look at the datasets whose descriptions match liver, brain or heart.
Figure 3: Subsets of datasets in the Repositive Gene Expression Data Collection matching the liver (3A – total 1,018 datasets), brain (3B – total 106 datasets) and heart (3C – total 36 datasets) tissues. All of them are open access, and most use array assays.
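To give an idea of what matching a tissue in a dataset description can look like, here is a minimal Python sketch. It is an assumption-laden illustration rather than the search Repositive actually runs: the record layout (`id` and `description` keys), the function names and the sample records are all hypothetical.

```python
import re

TISSUES = ("liver", "brain", "heart")

def tissues_in(description):
    """Return the tissues that appear as whole words in a description."""
    words = set(re.findall(r"[a-z]+", (description or "").lower()))
    return {tissue for tissue in TISSUES if tissue in words}

def group_by_tissue(datasets):
    """Group dataset ids by every tissue their description mentions."""
    groups = {tissue: [] for tissue in TISSUES}
    for dataset in datasets:
        for tissue in tissues_in(dataset.get("description")):
            groups[tissue].append(dataset["id"])
    return groups

# Hypothetical records, for illustration only.
sample = [
    {"id": "D1", "description": "Expression profiling of mouse liver"},
    {"id": "D2", "description": "Brain and heart samples from adult donors"},
    {"id": "D3", "description": "Whole blood RNA-seq time course"},
    {"id": "D4", "description": None},
]

groups = group_by_tissue(sample)
for tissue in TISSUES:
    print(tissue, groups[tissue])
```

A plain keyword match like this only finds datasets whose free-text description happens to name the tissue. A dataset described with a synonym such as “hepatic”, or with no description at all, is silently missed.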
We find that when looking for specific tissues, most of the datasets retrieved come from array technology and are therefore not stored in SRA. This may be because SRA has very few or no RNA-seq datasets for liver, brain and heart, or because its datasets lack the metadata needed to identify the tissue they come from. This issue is part of a much wider problem: metadata annotation is often missing or not standardised.