At the beginning of February, Repositive partnered with Xpressomics to index their collection of curated, high-quality expression data. A month on, I wanted to find out how this data has been received by the Repositive community and how researchers can use it in the future.
Why do researchers need expression data to validate their work?
Measuring gene expression enables one to quantify the level at which a particular gene is expressed within a cell, tissue or organism.
“Whenever anyone develops a new method for doing any kind of analysis, they need test data to validate their results”
An important step in developing a method for evaluating differential expression is demonstrating that it actually works. One way of doing this is to apply both your new method and the current alternatives to real biological expression data. Doing this helps you:
- Show that your method can recover results that are accepted as true.
- Argue that your method can reveal something others can't.
The more relevant and realistic test data you have, the better you can evaluate your method and find where it works well and where it doesn't. The alternative approach is to simulate data where you know the "real answer". But even if you take this approach, it's common practice to use real expression data as the basis of your simulations, so you still need access to lots of real biological data.
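To make the simulation idea concrete, here is a minimal Python sketch. It is an illustration only, not the method used by anyone quoted in this post: the Gaussian expression model, the effect size, the function names and the use of a Welch t-statistic as the "method under test" are all assumptions chosen for simplicity. It generates data where the truly differentially expressed genes are known, then checks how many of them the method ranks at the top.

```python
import random
import statistics


def simulate_expression(n_genes=200, n_true=10, n_per_group=10, effect=8.0, seed=0):
    """Simulate log-expression for two groups of samples.

    Assumption: values are Gaussian with unit variance, and the first
    n_true genes are shifted upwards by `effect` in group B.
    """
    rng = random.Random(seed)
    data = []
    for gene in range(n_genes):
        shift = effect if gene < n_true else 0.0
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(shift, 1.0) for _ in range(n_per_group)]
        data.append((group_a, group_b))
    truth = set(range(n_true))  # indices of genes that really differ
    return data, truth


def welch_t(a, b):
    """Welch t-statistic, standing in for the method being evaluated."""
    standard_error = (statistics.variance(a) / len(a)
                      + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / standard_error


def recall_at_k(data, truth):
    """Fraction of truly different genes among the top-k ranked genes."""
    scores = [abs(welch_t(a, b)) for a, b in data]
    k = len(truth)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return len(set(ranked[:k]) & truth) / k


data, truth = simulate_expression()
recall = recall_at_k(data, truth)
print(f"recall among top {len(truth)} genes: {recall:.2f}")
```

In practice the simulated values would be modelled on real expression data rather than on a plain Gaussian, which is exactly why access to real data still matters.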
“The issue of needing to find data with multiple similar assays performed on the same samples is a general problem”
The aim of our project was to test a method for correcting a statistical bias present in gene enrichment analysis of RNA-Seq data (e.g., GO term enrichment tests). As both RNA-Seq and expression microarrays are ways of measuring the same biological property (level of transcription), they should both produce roughly the same results in downstream analyses (such as enrichment analysis) if there are no statistical biases present. Therefore, we wanted data that had both microarray and RNA-Seq measurements for the same samples. In the end we used a prostate cancer data set, which met our requirements.
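One simple way to picture "roughly the same results" is to compare the sets of enriched terms reported for each platform. The Python sketch below is an assumption for illustration, not the analysis the researchers ran: the Jaccard index is just one possible concordance measure, and the term names are made-up placeholders rather than real results.

```python
def jaccard(terms_a, terms_b):
    """Jaccard index: shared terms divided by all terms seen in either set."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty result sets agree trivially
    return len(a & b) / len(a | b)


# Hypothetical enriched terms from the same samples on each platform.
rnaseq_terms = {"term_A", "term_B", "term_C", "term_D"}
microarray_terms = {"term_B", "term_C", "term_D", "term_E"}

overlap = jaccard(rnaseq_terms, microarray_terms)
print(f"Jaccard overlap: {overlap:.2f}")  # prints "Jaccard overlap: 0.60"
```

A value close to 1 would suggest the two platforms agree, while a low value could point to a bias in one of the analyses, or simply to noisy data.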
"Had it been easier to find suitable data quickly, I think we would have used multiple data sources to strengthen our conclusions."
How can the Xpressomics data help?
A major problem today is that it is difficult to find relevant information about genes from experiments that have already been conducted. Even though there is a lot of expression data available in public repositories such as GEO and ArrayExpress, vast amounts of information sit unanalysed, hidden in the files from large-scale gene expression profiling studies. Xpressomics is solving this problem by unlocking this hidden information through manual annotation of experiments and detailed differential expression analysis. This reveals relevant information from mountains of publicly available experimental data, allowing researchers like Matt and Hana to more easily find the data they need to validate their experiments.
The indexing of Xpressomics’ data on Repositive is already looking like a popular move; it's been one of our most successful email campaigns ever. Recipients used the email link to access the Xpressomics data immediately, and the email content has been shared and forwarded to members of the wider genomics community. Furthermore, ‘Xpress’ has become the second most searched term after ‘Cancer’ on the Repositive platform.
Top tips for finding high-quality expression data:
- Look for curated and annotated datasets.
- Read any discussions or publications that relate to that data.
- If you know someone who has used the data, ask them about it.
- Search for data via Repositive so you see available expression data from both public sources like ArrayExpress and other data providers like Xpressomics.
The level of interest in the Xpressomics data has been very exciting: researchers really want and need more high-quality expression data to validate their results. I will keep an eye on the performance of Xpressomics’ data on Repositive and try to understand in greater detail how useful this data is for the Repositive community.