Special thanks to Edoardo Giacopuzzi, a post-doc working on NGS and data analysis for the Genetic and Genomics Analysis Platform at UniBS *, for writing this guest blog post.
The data are out there! How Repositive can help you find and access human genomic data.
I’m part of a small Italian research group working on the genetic basis of psychiatric diseases, and we have a couple of bench-top sequencers to support small NGS-based projects for our institution. However, since the introduction of genomic technologies, the real business in studying the genetic basis of diseases has become GWAS/WES of thousands of samples, an effort well beyond our resources.
I have a magnet above my desk quoting E. Rutherford to remind me of the winning approach for us: “We haven’t got the money, so we’ve got to think!”.
Indeed, we soon realized that we couldn’t afford discovery experiments that compete with large consortia, but that we might be able to think outside the box and find new ways to aggregate genomic data to investigate specific biological mechanisms.
Luckily for us, and for researchers in general, the genomic field is quite open to data sharing, and publication of raw data is often mandatory for papers involving NGS. As a result, a huge amount of data has accumulated over the years in repositories like the SRA, ENA and GEO, and there is almost no platform, method or phenotype that doesn’t have some data ready for you to grab.
I was quite excited to attend the ESHG 2017 conference, looking for new smart ideas and possible collaborations to produce good science on a low budget.
During a coffee break, I was walking through the company stands hoping that the other attendees had left some pastries. Having found one and refuelled with sugar, I thought back to the day’s talks on complex diseases and multi-omics integration and started wondering whether we could apply these approaches to our study of the regulatory landscape in schizophrenia. I proposed the idea to my collaborators, wondering if we could put together a pilot study. We concluded that we might need some additional data and decided to explore the repositories once I was back home. So where to go to close the day? I looked at my program of evening corporate satellites and saw the Repositive.io meeting “Find the most suitable genomic data repository for your needs”.
This sounded like exactly what we needed, a way to search genomic information quickly and effectively! So I decided to stop by their stand to take a first look and talk about their service. After a short chat with their friendly staff and a brief overview of their web portal, I signed up for the T-shirt lottery and their evening meeting. From the beginning, Repositive came across as a friendly and open-minded group of people who believe in the power of data sharing and community effort to boost genomic science.
The evening talk was a fun and informative overview by Repositive leader Manuel Corpas, who outlined the main features of their search engine and the community-based approach they implemented to improve dataset information. Using some fun real-world examples, Corpas illustrated their different take on genomic data. With an incredible amount of data rapidly accumulating, they understood that the genomic field could take advantage of the indexing and searching approaches previously applied to world wide web data. Indeed, data are useless if you can’t find them, and effective search engines would improve sharing and avoid wasting resources on replicated efforts. Moreover, this fits the genomic field perfectly, allowing small research groups (like ours) to develop their ideas by leveraging large community-based efforts, maximizing the scientific value of each dataset.
So what’s Repositive? And what can it do for me? As Corpas summarized during his talk: Repositive wants to be the booking.com of genomic data. In my opinion, the Repositive web service is a bit Google search engine, a bit booking.com and a bit Wikipedia. The idea is to provide a centralized search engine where all genomic data repositories are indexed and can be searched with specific queries and filters. Each dataset comes with a description and attached metadata (like platform, approach, sample source…). A second aspect is the Wikipedia-like side of Repositive: registered users can contribute to datasets by providing additional details and metadata. Datasets can also be shared and watched to keep track of updates and changes, and users can comment on datasets like hotels on booking.com. Finally, users can register their own datasets or request that new repositories be added.
Since the amount of available data in genomics is rapidly growing and datasets are spread across several repositories, the Repositive search engine has the potential to make data retrieval simple, quick and effective. The platform is still in its infancy, but it already includes data from 43 repositories and a user-friendly web portal. The provided filters allow you to restrict searches by data accessibility, experimental approach and source repository. The community-based features have great potential to improve dataset descriptions, but their impact depends largely on how widely Repositive spreads across the scientific community.
Open questions and possible improvements
To provide an effective data search engine, it is essential to extract the relevant information and make it searchable and filterable. To meet this challenge, they have to deal with the non-standardized ways of representing metadata, and manual revision may be needed to actually provide all this information. This seems an enormous amount of manual work to deal with… As Corpas underlined, part of Repositive’s success will depend on how many researchers decide to use and contribute to the platform. Broad support from the scientific community would enrich the metadata and give Repositive the influence to push repositories toward standardized representations.
For now, datasets are listed sample by sample rather than aggregated by project, and the available filters are still limited. Personally, I often want to search for an entire project, like RNA-seq in a specific disease with cases and controls, rather than for single-sample data. Moreover, filters on the number of samples and on the platform used (e.g. HiSeq, bead array, Ion…) are needed to better pinpoint useful data.
They are also thinking about some kind of centralized management of data access requests and automation of data access, so that, for both open and some managed-access data, Repositive will become a search, click, download platform. That would really make the difference!
My personal experience with Repositive
Personally, I’m quite a fan of data and knowledge sharing, since it can really speed up research, avoiding redundant efforts and allowing quick and effective hypothesis testing.
Taking advantage of the increasing amount of accessible genomic data, we gradually shifted from production to analysis, trying to integrate our small-scale experiments with big data produced by others, to improve our hypotheses and investigate the aspects that were not addressed by the original authors. In this process, we soon discovered that even if the data are out there, they are far from simple to catch! Indeed, they are spread across multiple repositories, several of which are not easily searchable. Moreover, much of the data is under restricted access that requires complicated administrative procedures.
As soon as I got back to the lab after ESHG17, I gave Repositive a try to find out whether it could help me retrieve useful data. My first experiment was an easy one: search for additional genomic data on the NA12878 reference sample that I could use to develop variant-filtering algorithms. Using Repositive, I found about 40 WES and 1600 WGS datasets, all open access and spanning different platforms. Good! A lot of material to work with. The second experiment was to find some WGS and transcriptome data for major depressive disorder. Here the task was a little more difficult, since I wanted to retrieve case/control studies and, as I mentioned above, the present Repositive interface lists samples one by one rather than as aggregated project data. However, I discovered that the CONVERGE data are available for download (around 11k low-coverage WGS and variant calls), as well as some interesting gene expression data from the GEO repository. I’ve also noticed that data from the NIMH genetics repository are missing… I will request to have them added!
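For datasets that turn out to live in ENA, retrieval can also be scripted against ENA’s public Portal API once you know what you’re looking for. As a minimal sketch (the endpoint, `result` type and field names below are my assumptions about ENA’s API, not something Repositive provides), this is roughly how one could compose a query for NA12878 sequencing runs:

```python
from urllib.parse import urlencode

# Assumed ENA Portal API search endpoint (not part of Repositive).
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

def ena_search_url(query, fields, result="read_run", fmt="tsv"):
    """Compose a GET URL asking ENA for records matching `query`.

    `result` picks the record type (here: sequencing runs) and
    `fields` lists the metadata columns to return as TSV.
    """
    params = {
        "result": result,
        "query": query,
        "fields": ",".join(fields),
        "format": fmt,
    }
    return ENA_SEARCH + "?" + urlencode(params)

# Ask for runs whose sample alias is NA12878, with platform and
# FASTQ download locations in the response.
url = ena_search_url('sample_alias="NA12878"',
                     ["run_accession", "instrument_platform", "fastq_ftp"])
print(url)
```

Fetching that URL (e.g. with `urllib.request` or `curl`) would return a TSV table of matching runs; the same pattern extends to other `result` types such as studies or samples.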
Looking around their blog, I also found suggestions and walkthroughs for data access in managed systems that can help newcomers with their first application to dbGaP or similar platforms.
Based on my first quick experience, Repositive can be extremely useful for our analysis projects, making data retrieval easier. However, some improvements (project-based and platform-based filters, for example) are surely needed to allow effective retrieval of useful data.
Happy sharing to everybody!
*To keep up with the great work Edoardo is doing, see: 1. Genetic and Genomics Analysis Platform @ UniBS, 2. NGS-Brescia blog