Special thanks to Alex Mitchell, the content and curation coordinator for the InterPro and EBI Metagenomics databases, for writing this guest blog post.
The term ‘metagenomics’ describes the simultaneous analysis of the collective genomes of microbes present in a sample from a given environment, such as rainforest soil, seawater or a human body site. This comprehensive genomic analysis approach can provide powerful insights into microbial community composition and function. Underpinned by dramatically falling DNA sequencing costs, metagenomic analyses have become increasingly mainstream in recent years and have been applied to a wide variety of fields, including marine ecology, agriculture, food manufacture, bioenergy production and human health. Human health is an area of particularly keen interest, since the human microbiome appears to be important for a range of functions in health and disease. For example, dysbiosis of the gut microbiome has been linked to a myriad of disorders, including obesity 1, diabetes 2 3, cancer 4, bowel disease 5 6 and rheumatoid arthritis 7. Intriguingly, links between the microbiome and neuropsychiatric and neuropathological disorders, such as anxiety 8, depression 9, and even Parkinson’s disease 10, have also begun to emerge, potentially mediated by a microbiome–gut–brain axis 11.
The challenges of metagenomic data analysis
Despite the burgeoning interest in metagenomics, researchers often find themselves stymied, both by the sheer volume of sequencing data and by the diversity of tools with which to perform analyses. For example, a single whole genome shotgun sequencing run can yield more than 250 million sequences, representing over 100 Gb of uncompressed data. With many metagenomic experiments involving tens, or even hundreds, of such runs, data volumes can quickly overwhelm the storage capacities and analysis capabilities of individual researchers. At the same time, a quick survey of the scientific literature reveals a bewildering array of software designed for metagenomic data analysis, with over one hundred publicly available tools for researchers to choose from, but no commonly recognised standard analysis workflows to guide them.
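To give a rough sense of scale, the figures quoted above can be multiplied out in a few lines of Python. This is only a back-of-the-envelope sketch: it takes the "over 100 Gb per run" figure as a lower bound, keeps the unit exactly as quoted, and assumes every run is the same size. The function name experiment_volume_gb is hypothetical and used purely for illustration.

```python
# Lower-bound figure quoted in the text for one whole genome shotgun run.
GB_PER_RUN = 100  # "over 100 Gb of uncompressed data"


def experiment_volume_gb(n_runs: int, gb_per_run: float = GB_PER_RUN) -> float:
    """Return a lower-bound uncompressed data volume (in Gb) for n_runs runs.

    Assumes, for illustration only, that every run is the same size.
    """
    if n_runs < 0:
        raise ValueError("n_runs must be non-negative")
    return n_runs * gb_per_run


if __name__ == "__main__":
    # "Tens, or even hundreds, of such runs"
    for n_runs in (1, 10, 100):
        total = experiment_volume_gb(n_runs)
        print(f"{n_runs:>3} runs: at least {total:,.0f} Gb ({total / 1000:.1f} Tb)")
```

On these assumptions, an experiment of one hundred runs already reaches the order of ten thousand Gb of uncompressed data, well beyond what most individual researchers can comfortably store and process.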
Aims and scope of EBI Metagenomics
EBI Metagenomics 12 helps to resolve these issues by acting as a freely available hub for the analysis and exploration of metagenomic datasets. It allows functional and taxonomic analyses of user-submitted sequences, as well as analysis of publicly available metagenomic datasets held within the European Nucleotide Archive. First established in 2011, and supported by EMBL, BBSRC, ELIXIR-EXCELERATE and InnovateUK funding, EBI Metagenomics has grown to become one of the world’s largest metagenomic data repositories, with over 75,000 publicly available datasets. All of these are analysed using a standardised pipeline, which makes results easier to compare across datasets.
The resource contains data sampled from a wide range of environments (termed ‘biomes’), from insect digestive tracts to hydrothermal vents. A large proportion of the data (over 36,000 datasets) comprises microbiomes from human body sites, and this number is expected to grow significantly over the coming years. EBI Metagenomics already houses the American Gut project, an extensive citizen science endeavour that aims to analyse the microbiomes of thousands of individuals to shed light on the connections between microbiota and health. The analyses of over 8,000 sequencing runs from this project can be visualised or downloaded from the EBI Metagenomics website, either on an individual run-by-run basis or as results matrix files summarising the whole project.
Image from Spencer Phillips at EMBL-EBI
The EBI Metagenomics team constantly surveys new tools and resources that can improve or complement existing analyses. As a result, the analysis pipeline is updated at approximately six-month intervals, with pipeline versions indicated on the website. Datasets analysed using older versions of the pipeline can be updated to the latest version on user request. The team also has a watching brief to ensure that studies using emerging sequencing technologies, such as those from Oxford Nanopore Technologies, can be analysed appropriately.
Supporting data discovery
As the number of datasets continues to grow, one aim of EBI Metagenomics is to improve support for data exploration and discovery. To this end, an API is under development to allow access to analysis results and contextual metadata. The team is currently seeking feature requests from the user community, to ensure the API can best support their needs. Another exciting development is the establishment of a formal collaboration between EBI Metagenomics and the US metagenomics portal MG-RAST 13, which will help users identify and compare the analysis results for equivalent datasets in both resources. The ultimate aim is that a dataset submitted to either portal will be analysed in both, combining the strengths of the two analysis pipelines and websites. This approach will offer complementary insights and visualisations, and provide a standard baseline for all metagenomic data analyses.