Tools and methods in genomic data analysis

Introduction

In today’s genomic era, comprehensive analysis of genomic data is becoming increasingly popular in academic and clinical research contexts 1. This development increases the need for more sophisticated tools and methods for acquiring, distributing and analysing genomic data 2.

Within this scope, the comprehensive annotation and analysis of nucleotide polymorphisms has become a distinct discipline in the field of genomics. Efforts not only to associate risk variants with disease, but also to identify causal variants for pathological conditions and simply to better understand the genomic landscape in large datasets, already exist and will likely increase in the future.

At the same time, the range and number of software tools that help overcome these challenges is growing at a fast pace. Scientists find themselves in the fortunate position of being able to choose from highly specialised methods specific to their use cases. A solid ecosystem of tools has also evolved in the field of variant analysis, many of them with distinct strengths and weaknesses 2.

Besides the right choice of tools, how to find and choose relevant genomic data is also a popular question among scientists. Acquisition of data relevant to research questions is important and can have a large impact on the validity of any analysis. It is acknowledged that data reuse and reanalysis between genomics studies should reduce false positive results and increase reliability and the chances of making novel discoveries. However, the processes researchers use to find data to power their research are often ad-hoc and inefficient, and furthermore, many researchers still do not use external data sources to power their research at all 3.

“Searching for relevant data is haphazard and usually involves a general web search, visiting one or more databases, searching through a journal database and/or tracing the data referenced in a published article, or a word-of-mouth search” 3.

Aims of this survey

In this survey, we aimed to get a better understanding of the current software tools used by bioinformaticians and data scientists working in the field of genomics, as well as the scientific questions asked when analysing variant data.

Additionally, we were interested in the survey participants’ genomic data search and access habits and whether our respondents behave similarly to or differently from those surveyed in Van Schaik et al. 3.

We sent a short web questionnaire of nine questions, created with Typeform, via e-mail to a selected user base.

The preliminary results presented below are derived from 16 business professionals and researchers working in genomics, whose fields of work range from biology and bioinformatics to data science and software development.

The survey is still open - if you want to participate and have your say CLICK HERE. After completing the survey, enter your email address to get a summary on this survey and be entered into a prize draw to win a Repositive T-shirt!

Results

The raw results from this survey have been deposited on Figshare 7 under a publicly available CC-BY licence. If you distribute, remix, tweak, and build upon this work we would love to hear about it! Let us know in the comments section on this blog or on Figshare.

Life scientists use a wide range of different web and desktop applications to analyse genomic data

When asked which type of software they use to analyse data, respondents reported that both desktop applications and web tools are popular. A large majority of the scientists (73%) use both types of software tools to tackle genomic data analysis, with a smaller fraction using web tools alone (20%) and even fewer relying on desktop-based tools exclusively (7%).

The list of tools is strikingly broad and includes 39 different tools and web services that participants use to handle genomic data. The most popular are highly multifunctional tools like the University of California, Santa Cruz (UCSC) Genome Browser, Ensembl and the Integrative Genomics Viewer (IGV), but also the statistics tool R for creating and reusing data analysis packages. Further down the list we find a rich set of software tools made for different levels of data visualisation and manipulation, ranging from simple command line tools for data parsing to complex web applications orchestrating full data analysis workflows.

In sum, the ecosystem of tools present and in use in the life sciences community can be described as very diverse. This follows from the multitude of different research needs that survey participants are confronted with, but also from the fact that many of the tools mentioned are products of fairly closed academic project communities and are sometimes known only to a small user base.

Reference data for variant analysis is still kept simplistic

Looking at the scientific questions that drive survey participants’ research - specifically the analysis of variants - we were interested both in which kind of reference dataset is preferred and in which parameters drive variant comparison.

Most of the researchers (63%) investigate variants in comparison to a reference genome build, and only rarely against a control group (19%). Variant data derived from paired samples, as in cancer research for example, are preferred by only 13% of survey participants. It is still unclear whether these answers mostly reflect researchers’ interests or whether obstacles in obtaining more specific reference datasets influence the scientists’ choices.

Regarding the analysis of variant frequencies, we found that numbers are derived from either the overall population (37.5%) or from ethnic subpopulations (25%). The option of analysing variant frequencies in subpopulations defined by parameters other than ethnicity (e.g. blood type, genotype, etc.) was also offered in this survey but, interestingly, was not chosen by any of the survey participants. This indicates that research interest in more specific subpopulations is not very common yet. We also found that a substantial share of the participants do not investigate variant frequency in populations at all (37.5%), which implies that current genomic research does not focus on population studies alone.
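With such a small sample, it can help to read these percentages as head counts. The following is a minimal sketch of that conversion, assuming that all 16 respondents mentioned in the introduction answered this question; the category labels are shorthand chosen for illustration, not the wording used in the questionnaire, and the raw data on Figshare remain the authoritative source.

```python
# Assumption: all 16 respondents mentioned in the introduction answered this question.
N_RESPONDENTS = 16

# Percentages as reported in the text above.
# The labels are illustrative shorthand, not the survey's own wording.
reported_percent = {
    "overall population": 37.5,
    "ethnic subpopulations": 25.0,
    "other subpopulations": 0.0,
    "no population frequencies": 37.5,
}


def implied_counts(percentages, n):
    """Convert reported percentages into the nearest whole number of respondents."""
    return {label: round(pct / 100 * n) for label, pct in percentages.items()}


counts = implied_counts(reported_percent, N_RESPONDENTS)
for label, count in counts.items():
    print(f"{label}: {count} of {N_RESPONDENTS}")
```

Under this assumption each percentage corresponds to only a handful of people, which is worth keeping in mind when interpreting the differences between categories.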

Most researchers do, via various means, use external sources of genomic data to power their research

When asked if they used externally sourced data for their research, the majority of respondents (87%) said they did. This is higher than the 74% of researchers surveyed in Van Schaik et al. 3 who said they accessed data from repositories. However, it is important to note that the difference in the wording of these questions means that our survey would also include researchers who used externally sourced data from outside repositories (e.g. from collaborators).
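Because the sample is small, the uncertainty around the 87% figure is considerable. The sketch below estimates it with a Wilson score interval, a standard method for binomial proportions that we swap in here purely for illustration. It assumes 14 "yes" answers out of 16 respondents, a count inferred from the reported percentage rather than taken from the raw data.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 14 of 16 respondents answered "yes" (roughly the reported 87%).
low, high = wilson_interval(14, 16)
print(f"approximate 95% interval: {low:.2f} to {high:.2f}")
```

Under these assumed counts the interval is wide and includes the 74% reported by Van Schaik et al., so the difference between the two surveys should be read as indicative only.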

The predominant method by which the respondents find this data is via repositories/databases. However, searching for data through publications, using the NCBI publication search engine PubMed, and Google is also popular. The repositories mentioned by name are EBI, GEO, ArrayExpress and GenBank. Additionally, some respondents source external data by asking collaborators or forming collaborations. The steps in the process of finding external data that researchers struggle with most include: finding the right kind of data; sifting through lots of data or publications; dealing with poor data descriptions; associating phenotypic/functional data with the raw data; procuring the data; and not having enough time.

Of the 13% of respondents who said they did not use externally sourced data for their research, the only cited reason for not doing so was that they “did not know how to access it”. Furthermore, all of the ‘No’ responders said that if they could find external data more easily, then they would consider using it in their research. This suggests that one of the major blockers to these individuals using externally sourced data is that it is too hard to find.

These results reinforce our understanding that it can be very difficult for researchers to find and gain access to external sources of genomic data. There is a lot of data out there but researchers either do not know where to look, struggle to wade through the masses of inconsistently formatted data to find what they are looking for, or struggle to gain access to the data once they have found it. Therefore there is clearly a need to consolidate the metadata associated with genomic data into one format that can be searched through an internationally known portal. To address this pressing problem of a lack of data discoverability, Repositive has built an online platform (repositive.io) to provide a single-point entry to search through public genomic data repositories 4.

Conclusion

This survey has been a great opportunity to gain a deeper insight into the most popular tools and methods used in current genomics research. We found that besides a couple of tools with greater popularity, like Ensembl or the UCSC Genome Browser, there is a long list of utilities with a small and distinct user base, making the landscape of currently used software tools large and diverse. This might reflect a diverse user base in the bioinformatics community with a wide range of needs in software functionality.

In the specific scope of variant analysis, reference genomes are still the most used type of reference dataset, although other types of reference data (e.g. control group data in genome-wide association studies or paired tissue samples in cancer research) are receiving increasing attention in genomics research 5 6. Whether reference genomes are chosen in variant analysis mostly by preference, or by necessity because more insightful datasets are missing, remains unanswered and should be addressed in a future user survey.

Population studies are an aspect of the majority of survey participants’ research (62.5%), but fewer than half of those involve ethnic subpopulations (25% of all participants). An interesting direction for another survey would be to ask participants about their motivation for doing more specific subpopulation studies that use parameters other than ethnicity to create data subsets, assuming that the required data were available for their research.

Additionally, we gained greater insight into the usage of external data resources by genomics researchers in their everyday work. The respondents predominantly use external data to power their research, and in greater numbers than found by Van Schaik et al. 3 in 2014. They mainly find this data by searching in databases/repositories, on Google or by reading publications sourced through PubMed. However, there are many steps in the workflow of finding external data that researchers struggle with. All of the researchers who do not use external data to power their research said they would consider doing so if it were easier to find the data.

In the future it would be interesting to look in greater detail at what the perceived or experienced blockers are for researchers trying to source external data to power their research. Before we can overcome the hurdle of helping researchers find data, we first have to help them understand why it is important to use external data in their research. Furthermore, it will be important to gain a greater understanding of the struggles and blockers users face when trying to find data and how Repositive might be able to solve those problems.

The survey is still open - if you want to participate and have your say CLICK HERE. Your responses will give crucial insight into the requirements of a web-based tool that helps professionals like yourself analyse genomic data and supports their decision processes in a research context.

About the authors

Jessica is a web developer at the Earlham Institute, Norwich, UK and contributor to the BioJS project - the leading JavaScript based component library to manipulate and visualise biological data on the web. She wants to understand which needs drive life scientists in their choice of genomics data analysis tools to create better, more intuitive and user-friendly web experiences.

Charlotte is the product manager at Repositive and is interested in understanding why researchers currently use external sources of genomic data to power their research and how they go about finding this data. Repositive is a social enterprise that is building an online platform that indexes genomic data stored in repositories and thus enables researchers to search for and access a range of human genomic data sources through a single portal.

Citations and Sources:

  1. Conesa A, Mortazavi A. The common ground of genomics and systems biology. BMC Systems Biology. 2014;8(Suppl 2):S1. doi:10.1186/1752-0509-8-S2-S1

  2. De Brevern AG, Meyniel J-P, Fairhead C, Neuvéglise C, Malpertuy A. Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies. BioMed Research International. 2015;2015:904541. doi:10.1155/2015/904541.

  3. Van Schaik TA, Kovalevskaya NV, Protopapas E, Wahid H, Nielsen FGG. The need to redefine genomic data sharing: A focus on data accessibility. Applied and Translational Genomics. 2014. http://dx.doi.org/10.1016/j.atg.2014.09.013

  4. Kovalevskaya NV, Whicher C, Richardson TD, Smith C, Grajciarova J, Cardama X, et al. (2016) DNAdigest and Repositive: Connecting the World of Genomic Data. PLoS Biol 14(3): e1002418. doi:10.1371/journal.pbio.1002418

  5. Wu MC, Kraft P, Epstein MP, et al. Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies. American Journal of Human Genetics. 2010;86(6):929-942. doi:10.1016/j.ajhg.2010.05.002

  6. Beroukhim R, Lin M, Park Y, Hao K, Zhao X, Garraway LA, Fox EA, Hochberg EP, Mellinghoff IK, Hofer MD, et al. Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide SNP arrays. PLoS Comput Biol. 2006;2:e41. doi:10.1371/journal.pcbi.0020041

  7. Whicher C, Jordan J. Tools and methods in genomic data analysis: TGAC - Repositive Preliminary Survey Results. Figshare. 2016. doi:10.6084/m9.figshare.3503873

Read more posts by Charlotte Whicher