Final Results from 'Tools and methods in genomic data analysis'

For the community that wants to put genomics to use, it is fundamental that we understand which tools and methods are used to analyse genomic data, how they are used, and which blockers prevent researchers from carrying out their desired analyses. With that understanding, we can work toward removing these blockers and using more genomic data to answer more questions.

This survey has highlighted a handful of key findings:

  • Multifunctional, web-based software tools are the most popular for the analysis of genomic data.
  • 34% of researchers are not using their 'ideal' type of reference data in their research.
  • 95% of genomic researchers are using externally sourced data to power their research.
  • The main method by which researchers find this data is via repositories/databases.
  • There are still bottlenecks in the data access workflow.


Introduction

In July 2016, Jessica, a web developer at the Earlham Institute, and I ran a preliminary survey of 16 business professionals and researchers working in genomics. The aim of this survey was to get a better understanding of the software tools currently used by bioinformaticians and data scientists working in the field of genomics, as well as the scientific questions asked when analysing variant data.

The blog post discussing the basis for this survey and presenting our preliminary results can be found here. The raw data for both the preliminary and this final survey can be found on Figshare or data.world under a publicly available CC-BY licence.

After that preliminary survey we added some additional questions to gain further insights and then opened the survey up to a wider audience. 50 people responded, and in this blog post I will detail our findings from this survey and our final conclusions.


Results

Life scientists use a wide range of different web and desktop applications to analyse genomic data

When asked what type of software they use to analyse their data, both desktop applications and web tools proved popular, as in the preliminary survey. Again, most of the scientists (53%) use both types of tool to tackle genomic data analysis, while 23% predominantly use web tools and 20% rely on desktop-based tools. Interestingly, compared with the original cohort, 13% more respondents mostly use desktop-based applications rather than a mix of both.

This time the list of tools used was very broad, with multifunctional tools such as the University of California Santa Cruz (UCSC) genome browser, Ensembl and the Integrative Genomics Viewer (IGV) being the most popular; in fact, 20% of respondents said they used Ensembl. The statistical tool R was still popular, and a new addition, mentioned by 12% of respondents, was Samtools.
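
To give a flavour of what 'multifunctional' means in practice, many of these tools can also be queried programmatically. Below is a minimal sketch that uses the public Ensembl REST API (rest.ensembl.org) to look up a gene's coordinates; the gene symbol and the lack of error handling are illustrative simplifications, not details taken from the survey.

```python
import json
import urllib.request

# Minimal sketch: look up a gene (BRCA2 chosen purely as an example)
# via the public Ensembl REST API and print its genomic coordinates.
ENSEMBL_REST = "https://rest.ensembl.org"
gene_symbol = "BRCA2"  # illustrative choice, not from the survey

url = (f"{ENSEMBL_REST}/lookup/symbol/homo_sapiens/{gene_symbol}"
       "?content-type=application/json")
with urllib.request.urlopen(url) as response:
    gene = json.load(response)

print(f"{gene['display_name']}: chromosome {gene['seq_region_name']}, "
      f"{gene['start']}-{gene['end']} ({gene['assembly_name']})")
```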

Reference data for variant analysis is still kept simplistic

Variants are investigated in comparison to a reference genome build by most of the researchers (68%) and only rarely in contrast to a control group (12%). Variant data derived from paired samples, as in e.g. cancer research, are preferred by only 9% of survey participants. Interestingly, 11% of respondents answered "other" to this question, double the number in the previous survey. Those who responded with "other" were either using a specific project dataset as a reference dataset or using reference epigenomes/transcriptomes.

In the preliminary survey we noted that "It is still unclear if these answers are mostly related to researchers’ interests or if obstacles in obtaining more specific reference datasets influence the scientists’ choices". Therefore, for this final survey we also asked respondents what their 'ideal' reference dataset would be. We found a distribution of usage similar to that of the initial question, suggesting researchers are largely using their ideal reference datasets.

However, 34% of respondents did state an 'ideal' dataset that differs from the type of reference data they currently use in their research. Predominantly, these people were comparing against reference genomes when they would ideally want to compare against variant data from a control group.

Regarding the analysis of variant frequencies, we found that 26% of respondents derive frequencies from the overall population, 28% from ethnic subpopulations, and 17% from subpopulations defined not by ethnicity but by some other parameter (e.g. blood type, genotype, etc.). This is quite a different result from the preliminary survey, where deriving frequencies from the overall population was most popular and no one used subpopulations that were not ethnic in nature.

As we saw previously, a significant proportion of participants (29%) are not investigating variant frequencies in populations at all, which implies that current genomic research does not focus on population studies alone.

Most researchers do, via various means, use external sources of genomic data to power their research

When asked whether they used externally sourced data for their research, the majority of respondents (95%) said they did. This is 8% more than the preliminary survey demonstrated, and 21% more than reported by van Schaik et al.1 The caveats of this comparison are discussed in our original blog post. Nevertheless, we are seeing an increase in the number of researchers using externally sourced data for their research.

As we saw before, the predominant method by which respondents find this data is via repositories/databases. However, searching for data through publications, using the NCBI publication search engine PubMed, and via Google is also popular. Interestingly, the percentage of respondents getting external data from collaborations was less than a quarter of what we saw in the preliminary survey, and twice as many found data via word of mouth.
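
As an illustration of the repository/publication search route, here is a minimal sketch that queries PubMed programmatically through the public NCBI E-utilities (esearch) endpoint; the search term is a made-up example rather than one taken from the survey responses.

```python
import json
import urllib.parse
import urllib.request

# Minimal sketch: search PubMed via the public NCBI E-utilities
# (esearch) endpoint and print the matching PubMed IDs.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
query = "human exome variant dataset"  # made-up example query

params = urllib.parse.urlencode({
    "db": "pubmed",
    "term": query,
    "retmax": 10,
    "retmode": "json",
})
with urllib.request.urlopen(f"{EUTILS}?{params}") as response:
    result = json.load(response)

print(result["esearchresult"]["idlist"])
```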

When we asked which steps in the process of finding external data researchers struggle with most, the answers again mostly included: finding the right kind of data; sifting through lots of data or publications; dealing with poor data descriptions; associating phenotypic/functional data with the raw data; procuring the data; and not having enough time.

Eight of the nine respondents who did not use externally sourced data said they would consider using it in their research if it were easier to find.

We further investigated this area by asking respondents what major blockers they experience when trying to source external data to power their research. Predominantly they cited problems with: differences in data formatting; access, including restrictions on accessing the data and issues with programmatic access or downloading; and the need for multi-omics data.
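
To make the formatting blocker concrete, one common mismatch (offered here as an assumed example, not one named by respondents) is chromosome naming: an external VCF may use "chr1" while the local reference uses "1". The sketch below, using only the Python standard library (3.9+) and hypothetical file names, normalises the CHROM column before the external data are combined with local files.

```python
import gzip

# Minimal sketch: strip a leading "chr" prefix from the CHROM column of an
# externally sourced VCF so it matches a reference that uses "1", "2", ...
# File names are hypothetical placeholders.
def normalise_chrom_names(in_path: str, out_path: str) -> None:
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # header lines are written through untouched
                continue
            fields = line.split("\t", 1)
            fields[0] = fields[0].removeprefix("chr")
            dst.write("\t".join(fields))

normalise_chrom_names("external_variants.vcf.gz", "external_variants.normalised.vcf")
```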


Conclusions

When we started Repositive 2 years ago, we already knew about the problems with access to genomic data. It has been fundamental to us to stay connected to the genomics community and regularly receive their feedback, keeping our finger on the pulse, so to speak. If you feel you know something about problems with data access and re-use that we don't, get in touch! We will listen to you, share your frustration and, together, we are sure we will build solutions to some of these problems!

This survey has helped us to gain a deeper insight into the most popular tools and methods used in current genomics research. Most of the results we found in the preliminary survey were supported when we re-ran the survey on a larger cohort. We did, however, find some other interesting things that were not revealed in the first round.

We discovered that, alongside the previously noted popularity of Ensembl and R, Samtools is consistently used by a wide set of respondents. The bottom line, however, is still that there is a long list of tools and software packages, each with a small and distinct user base.

As before, reference genomes are still the most-used type of reference dataset for variant analysis, and most researchers are using their 'ideal' type of reference data. We did find in this second survey, however, that those respondents who are not using their 'ideal' datasets would mostly prefer to be comparing against variant data from control groups.

We confirmed our original finding that the genomics researchers involved in this survey predominantly use external data to power their research. They mainly find this data by searching databases/repositories, searching Google, or reading publications sourced through PubMed. However, there are many steps in the workflow of finding external data that researchers struggle with, and for some there are blockers that hold back their work. These blockers were mainly to do with the formatting of the data or the inability to gain access to it.



References

  1. van Schaik TA, Kovalevskaya NV, Protopapas E, Wahid H, Nielsen FGG. The need to redefine genomic data sharing: A focus on data accessibility. Applied and Translational Genomics. 2014. doi:10.1016/j.atg.2014.09.013
