Getting Google to Find Human Genome Datasets

Plummeting sequencing prices have triggered exponential growth in the generation of genomic datasets. Currently it is very difficult for search engines such as Google to index data in the biomedical domain. How do we make data from this domain findable by search engines? A group of developers and scientists gathered on March 6-7 at the European Bioinformatics Institute to come up with a standard specification for indexing data in the biomedical domain. This data includes training courses, materials, genomic datasets, databases and samples, among others. The meeting, centred around the BioSchemas project, had the overall goal of improving data interoperability in the life sciences. BioSchemas itself is an extension of Schema.org, a collaborative community activity with a mission to create, maintain, and promote schemas for structured data on the internet.
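To make the idea of structured data concrete, here is a minimal sketch of how a dataset page might describe itself to a search engine using the core Schema.org Dataset type. It uses only generic Schema.org properties, not any BioSchemas-specific terms, and the dataset name, description, and URL are hypothetical placeholders. The helper functions `dataset_jsonld` and `as_script_tag` are illustrative names, not part of any library.

```python
import json


def dataset_jsonld(name, description, url, keywords):
    """Build a minimal Schema.org Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
    }


def as_script_tag(record):
    """Wrap the JSON-LD in the script element that crawlers read from a page."""
    body = json.dumps(record, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical dataset: every value below is a placeholder for illustration.
example = dataset_jsonld(
    name="Example human genome dataset",
    description="Whole-genome sequencing reads from an illustrative cohort.",
    url="https://example.org/datasets/1",
    keywords=["genomics", "human", "whole-genome sequencing"],
)
print(as_script_tag(example))
```

The printed block would be embedded in the HTML of the dataset's landing page. Anything a repository needs to say beyond these generic properties is where an extension such as BioSchemas comes in.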

On the morning of March 7th, we had a session on the description of biological datasets. A number of projects were presented, including the Semantic Web Health Care and Life Sciences Interest Group and the DATS specification for the BD2K DataMed model, which is co-led by Susanna Sansone and Alejandra Gonzalez-Beltran, both from Oxford, and whose preprint is now available on bioRxiv. OmicsDI was presented by its lead developer, Yasset Perez-Riverol. DataCite, the leading provider of DOIs for research data, also presented their research on structured metadata for datasets.

I presented Repositive to the group, demoing what it looks like and how we expect users to interact with it (see picture below). I emphasised the importance of the community and social networking capabilities that Repositive provides. These capabilities are our strategy for improving the incomplete and inconsistent annotations that currently accompany deposited biomedical datasets. Philippe Rocca-Serra also presented his work on mapping the ISA-formatted metadata of the Nature Scientific Data journal to Schema.org. Philippe showed some important gaps in Schema.org, which currently lacks terms such as population, genotype, phenotype and consent. These terms are essential for characterising genomic datasets. Tony Burdett from the European Bioinformatics Institute, lead developer for BioSamples, said that ‘the more precise the community tries to map terms to a specification, the harder the task’. Phil Quinlan from UK Biobank argued that biobanks need to integrate data in a standard way. All attendees agreed with him that this is the only way for users to find sample data and beyond. Phil also thinks that it is unethical for samples in biobanks not to be findable.

All in all, the BioSchemas initiative is gaining momentum as an attempt to standardise the metadata for a great number of digital objects in the biomedical domain. With the backing of influential organisations such as Google, the European Bioinformatics Institute and Springer Nature, among many others, BioSchemas will surely enhance our ability to locate relevant biomedical datasets in the deep sea of data accessible from the World Wide Web.

Read more posts by Manuel Corpas