In September 2016, Repositive made a massive collective push to launch the discover.repositive.io platform, bringing it out of beta testing so that anyone could sign up at any time. It marked a significant shift in development as we began opening the doors to the world. In the process, we went from 42,000 datasets to almost 1 million. It's a time that is hard to forget, as we saw a surge in press, website visitors and user sign-ups.
Also memorable, I rather famously celebrated by collapsing on our terrace to make a 'rain angel'.
To celebrate 1 year of discover.repositive.io, I wanted to take a look at what has been trending on the Repositive platform since launch... What are the most searched terms? What are the most viewed datasets? The most popular collections? What success stories have we been privileged to witness? All very good questions...
Let's dive in!
Search and rescue
Since 1st September 2016, Repositive has seen 7,434 unique visitors arrive at Discover to take a look at our platform. With non-users able to query all our data and see what is available, many of these visitors come just to have a nose around what we are doing.
Speaking of 'just having a nose'... With any web analytics data, you have to filter out the background noise. Take, for example, the 649 unique searches for "Cancer". How do I know this is noise? Well, 1) that search returns 254,700 datasets, and I can't imagine anyone browsing all of them, and 2) for some time it was a suggested keyword to try on signup. So it's clear that this sort of search is just a way to try out the search bar and see how we represent datasets. It's not very reflective of what data is actually 'in-demand' or what people are actually looking for.
Looking at the list of popular Discover pages on web analytics, before any search term or phrase, came one particular dataset which received 314 unique views:
How can a dataset be viewed by 314 people, but not searched for by 314 people? Easy: for some time this dataset was featured on the Discover home page, meaning a lot of people who signed up saw it and began exploring the platform without even typing anything into the search bar.
This also happened for a particular data collection, the Population Data Collection, which is the second most looked at collection, with 151 visits.
After a few more collections, we finally get to some juicy search terms:
Returning 4,100 datasets.
4,100 is still a lot of datasets to go through, but thanks to some filters, users began breaking these datasets down further.
Performed by 21 users - returning 3,108 results
Performed by 15 users - returning 257 results
"Obesity assay:WGS access:Open"
Performed by 17 users - returning 257 results
The last search, for instance, returns 257 datasets, which is much more manageable. Interestingly, all of them are from SRA.
There were also some other broad search terms which followed similar patterns:
Performed by 57 users - returning 7,362 results
"lung cancer assay:RNA-Seq"
Performed by 6 users - returning 2,013 results
Performed by 21 users - returning 641 results
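The filtered searches above mix free-text keywords with key:value filters such as assay: and access:. As a rough illustration of how a query like that can be split apart, here is a minimal Python sketch. It is not how Discover's search is actually implemented; the function name parse_query and the rule that any token containing a colon is a filter are assumptions made purely for the example.

```python
def parse_query(query: str) -> tuple[list[str], dict[str, str]]:
    """Split a search string into free-text keywords and key:value filters.

    Hypothetical helper for illustration only, not Repositive's actual parser.
    Assumes a filter is any whitespace-separated token of the form key:value.
    """
    keywords: list[str] = []
    filters: dict[str, str] = {}
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and key and value:
            # Token looks like key:value, so treat it as a filter.
            filters[key.lower()] = value
        else:
            # Anything else is an ordinary keyword.
            keywords.append(token)
    return keywords, filters


keywords, filters = parse_query("lung cancer assay:RNA-Seq")
print(keywords)  # ['lung', 'cancer']
print(filters)   # {'assay': 'RNA-Seq'}
```

Each filter narrows the result set independently of the keywords, which is why adding assay: or access: to a broad term cuts thousands of results down to a few hundred.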
Some searchers were a little more specific:
Performed by 5 unique people - returning just 1 result
Performed by 3 unique people - returning 11 results
"idiopathic pulmonary hypertension"
Performed by 2 unique people - returning 21 results
Analysing the data made me realise that many users first search for something broad. Perhaps it is the first thing that comes to mind? Perhaps they want to make sure they capture everything first? These are the sort of assumptions that we, as a start-up, need to verify. However, we assume that after seeing just how much data was returned from the search, some users opt to refine their searches further with additional keywords or pre-defined filters.
It makes sense. With data so poorly annotated across the board, it is easier to start off with lots of results and cut away, because a specific term like "lapc4" only returns those datasets where lapc4 is mentioned in the metadata. This raises the question: just how many datasets are relevant to "lapc4" but do not appear in the search?
Repositive to the rescue... that, ladies and gentlemen, is why we have enabled users to add extra annotation to datasets. Every piece of extra annotation is another dataset brought to the surface.
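To make that point concrete, here is a small Python sketch of a keyword search over metadata plus user annotations. Everything in it is made up for illustration: the two toy records, the field names metadata and annotations, and the search function are assumptions, not Repositive's data model or search engine.

```python
# Two invented toy records. Only the first mentions "lapc4" in its metadata.
datasets = [
    {"id": "ds1", "metadata": "prostate cancer xenograft lapc4 RNA-Seq", "annotations": []},
    {"id": "ds2", "metadata": "prostate cancer xenograft expression profiling", "annotations": []},
]


def search(term, records):
    """Return ids of records whose metadata or annotations mention the term."""
    term = term.lower()
    return [
        r["id"]
        for r in records
        if term in r["metadata"].lower()
        or any(term in a.lower() for a in r["annotations"])
    ]


print(search("lapc4", datasets))  # ['ds1']

# A user adds an annotation to the second dataset...
datasets[1]["annotations"].append("LAPC4")

# ...and it now surfaces for the same search.
print(search("lapc4", datasets))  # ['ds1', 'ds2']
```

The second dataset was relevant all along; it only became findable once someone annotated it.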
Data Collections are an easy way for Repositive to serve key data requirements with little development time required. Right now, we are focused on search, but in our product roadmap we want to enable users to create collections of their own. To demonstrate their value, we drew on the entire data landscape, the data sources we had identified and our knowledge of the genomics industry to create several collections that allow users to quickly immerse themselves in a category of data.
So, which collections were the most popular?
1. 23andMe Collection v.10
With 172 unique views, our latest 23andMe collection has been our most successful yet! It contains 2,261 open access datasets from 5 data sources: OpenSNP, PGP, Open Humans, Corpasome and Steven Keating.
2. Population Data Collection
With 154 unique views, our Population Data Collection became our second most viewed collection. It is slightly larger, with 3,808 datasets from 9 different sources: Thousand Genomes, Simons Diversity Project, Estonian Biocentre, THL Biobank, dbGaP, Genome Asia, Genome of The Netherlands, Kadoorie Biobank and the Singapore Genome Variation Project.
3. Autism Related Data Collection
Taking bronze is our Autism Related Data Collection with 132 unique views. It is the largest of the 3 collections, with 6,983 datasets from 9 different sources: SRA, ArrayExpress, dbGaP, InSilicoDB, GEO, Xpressomics, Metagenomics, PGP and (this is awesome to see) Repositive.
GWAS data is very popular with our users. We see many people interested in the variations between different populations. The Population Data Collection also represents some of the most diverse data we have, with sources like Simons Diversity Project and Estonian Biocentre actively trying to address the European bias in genomics. It is no surprise to me that this collection took the silver medal in the most popular collection awards.
What does surprise me, though, is that our latest collection on 23andMe is the most popular of them all. Some collections have been on the platform for 12 months, so the fact that the 23andMe collection has only been live for 1/6 of that time really shows its popularity. We did write a press release highlighting how Repositive can utilise data from DTC genomic tests, and we also have some very passionate advocates of personal genomes on the team. Although this has no doubt contributed, we also see that many researchers and DTC genomic test pioneers like to measure their impact. They like to see that their data is online, available and being viewed and used. That, combined with an increasing interest and awareness around personal genomic tests like 23andMe, has led to a very popular collection indeed.
Who will claim gold as the most viewed data source on Repositive? Will it be a well-known source like ArrayExpress or dbGaP? Or will it be a new niche source that has everyone talking? Find out now: read about the most popular repositories and Repositive users in the second instalment, The discover.repositive.io launch - one year on (Part 2).