Rethinking how we deal with data


Guest blog post by Austin Tanney from Analytics Engines



Looking at disease from its molecular basis has pretty much become normal for many of us. Nowhere is this truer or more relevant than in cancer. Just last week I was talking to someone about this and we agreed on how far the field has come in the last few years. What better example than breast cancer? We all agree that breast cancer is not one disease but is made up of multiple subtypes. This was first published in 2000 by Chuck Perou and colleagues from Stanford, who identified four “intrinsic subtypes” in breast cancer. The same authors published another paper a year later stating that there were six subtypes. This basic intrinsic subtyping is now pretty much universally accepted. Right?





After that conversation I got to thinking and doing some browsing on the web. One of the first things I looked at was how much subtype is taken into consideration when treating breast cancer. It seems that there is definitely a move towards making treatments more subtype-specific, but it’s not all-pervasive right now.

What I started thinking next was how well established are the subtypes. Now at this stage I must confess that while I am aware of Perou and Sorlie’s work in this area, I wasn’t really hugely aware of any subsequent work by others. I know the PAM50 geneset was developed and this is now available as a test from Nanostring. This test classifies patients into one of 4 main subtypes based on the original work of Perou et al.



(image taken from Nanostring prosigna website)



My curiosity though was in regard to what else had been done in this space. Obviously, no one wants to be the second to discover something (I forget where I stole that quote from) but surely there had been more work done on this? Beyond this, the whole point of the so-called intrinsic subtypes is that they are intrinsic and should thus be fairly easy to identify.

For a while now I’ve been playing with the Repositive website. It had appeared to be an excellent resource, but anything I had done so far was really just idle curiosity, though several people I have recommended it to have already found it incredibly useful. This particular problem seemed to be a perfect opportunity to test out Repositive, along with the Analytics Engines PipelineArchitect analytics tools, on a little tester project.



First stop was to go to Repositive and have a quick search. The pipeline I had ready to run was for Affymetrix array data.

Now let me take a quick sidestep here… “Why Affy data Austin?” I hear you cry! “Why not RNASeq data?” Well fundamentally there are two reasons. Firstly, the original data used to discover the subtypes was array data, though it was on an array designed and fabricated in Stanford. When I realised / remembered this I was curious as to what had been done with higher density Affy arrays.

Secondly, although RNASeq clearly yields fantastic information, I really just wanted to run a quick check and see what was out there and there is a lot of array data available in the public domain to play with. I do have a rather nice RNASeq analysis pipeline in our system so I will be producing a future post based on that.

OK, so my choices defended, allow me to proceed.



I went to Repositive and ran a search for “breast cancer subtyping Affymetrix”. There were four datasets found, one of which looked interesting… E-GEOD-20685 – Microarray-based molecular subtyping of breast cancer using 327 breast cancer samples. OK... a good sample number to play with. A quick click on the “Link to dataset” link in Repositive gave me some more info on the paper and linked to the data directly. From there I was able to have a very quick scan through the paper.



The paper is from 2011 and uses the Affy plus 2 array to subtype breast cancer using semi-supervised analysis. This group found 6 main subtypes in their classification.



(Figure from Kao et al http://bmccancer.biomedcentral.com/articles/10.1186/1471-2407-11-143)



I was about to jump to the methods section and see how aligned it was with my pipeline, but then I stopped.

The whole idea I had was to see how intrinsic these subtypes really are. If the subtypes truly represent fundamental biological differences, the subtleties of the algorithms and approaches shouldn’t make a great deal of difference.

So I jumped from Repositive back to the interface for Analytics Engines PipelineArchitect, clicked the dataset tab and then “add dataset”. We have a direct connector to ArrayExpress so I just needed to cut and paste the accession number from Repositive there, name the dataset and click the upload button.





Then I went for a coffee and came back to run the analysis.



The pipeline that I had loaded used the Gap statistic for classification and, to use this in a semi-supervised manner, I needed to add the genelist. I went back to the paper and checked how they did this. It turns out they hand-picked a genelist of 23 genes and then associated these with 783 probesets. I decided to take the least effort approach to working with this, so I just took the list of 23 genes and let the pipeline work out the probesets for itself. I chose the pipeline I wanted to use, selected the dataset and clicked the run button, then went to a meeting. The total hands-on time to set the pipeline running was less than 5 minutes.
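For readers who haven't met the Gap statistic, here is a minimal sketch of the idea in plain Python. It is not the PipelineArchitect implementation, which isn't described in this post. It assumes k-means clustering with a simple farthest-first initialisation, a uniform reference distribution over the bounding box of the data, and a made-up two-dimensional toy dataset standing in for expression values. All function names are hypothetical. The Gap for a given k compares the log within-cluster dispersion of the real data with the average from reference data that has no cluster structure.

```python
import math
import random


def centroid(members):
    """Column-wise mean of a list of points."""
    return [sum(col) / len(members) for col in zip(*members)]


def kmeans_labels(points, k, rng, iters=25):
    """Lloyd's k-means with farthest-first initialisation (an assumption)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centers[c] = centroid(members)
    return labels


def within_dispersion(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        mu = centroid(members)
        total += sum(math.dist(p, mu) ** 2 for p in members)
    return total


def gap_statistic(points, max_k, n_refs=5, seed=0):
    """Return ({k: gap}, {k: error term}) for k = 1..max_k."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = {}, {}
    for k in range(1, max_k + 1):
        log_w = math.log(within_dispersion(points, kmeans_labels(points, k, rng)))
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform over the bounding box of the real data.
            ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in points]
            ref_logs.append(math.log(within_dispersion(ref, kmeans_labels(ref, k, rng))))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_refs)
        gaps[k] = mean_ref - log_w
        errs[k] = sd * math.sqrt(1 + 1 / n_refs)
    return gaps, errs


def choose_k(gaps, errs):
    """Smallest k with gap(k) >= gap(k+1) - err(k+1), else the largest k tried."""
    ks = sorted(gaps)
    for k in ks[:-1]:
        if gaps[k] >= gaps[k + 1] - errs[k + 1]:
            return k
    return ks[-1]


def toy_samples(seed=1):
    """Made-up data: three well-separated blobs of 20 points each."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)] for cx, cy in centres for _ in range(20)]


gaps, errs = gap_statistic(toy_samples(), max_k=4)
print(gaps, choose_k(gaps, errs))
```

In a semi-supervised setting like the one described above, the points would be samples restricted to the probesets mapped from the genelist, rather than every probeset on the array.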

Now this dataset is fairly big, much bigger in fact than the original sets used by Sorlie and Perou. It’s been many years since I was a “practising bioinformatician”. More years than I care to remember if I’m honest. Obviously, things have changed a lot since back in the dark ages, but if I were to try and analyse this dataset myself it would have taken (at a guess) … weeks.

Even just downloading the data, managing the files, figuring out the probesets for the semi-supervised analysis etc. would have taken an age. For this analysis the total hands-on time was about 5 minutes and the brain in use time was probably zero.



When I came back from my meeting I refreshed the page and saw that the pipeline which had commenced at 11:05 had finished at 11:53.

48 minutes’ total run time.

In this time the data had been normalised and the Gap statistic run. My first thought was “surely that can’t be right?”

I clicked the button for the results and found a heatmap that split the dataset into 6 subtypes.

The clinical information data had been automatically pulled from ArrayExpress (thanks to the ingenuity of our engineers) and had been used to run all comparative analysis. The clusters we had found correlated with the subtypes identified in the paper with a p-value of 6.4 × 10^-124.
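The post doesn't say which test the pipeline used to produce that p-value. One common way to compare two sets of labels is a chi-square test on their contingency table, and the sketch below shows that calculation in plain Python. The function names and the toy labels are made up for illustration. It returns the test statistic, the degrees of freedom and Cramér's V as an effect size. Turning the statistic into a p-value needs the chi-square distribution, for example `scipy.stats.chi2.sf(stat, dof)` if SciPy is available.

```python
import math
from collections import Counter


def chi_square(clusters, subtypes):
    """Chi-square statistic and degrees of freedom for two label lists."""
    n = len(clusters)
    observed = Counter(zip(clusters, subtypes))
    row_totals = Counter(clusters)
    col_totals = Counter(subtypes)
    stat = 0.0
    for r, r_n in row_totals.items():
        for c, c_n in col_totals.items():
            # Count expected in this cell if the labels were independent.
            expected = r_n * c_n / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def cramers_v(stat, n, n_rows, n_cols):
    """Effect size between 0 (no association) and 1 (perfect association)."""
    return math.sqrt(stat / (n * (min(n_rows, n_cols) - 1)))


# Toy labels (not from the dataset): clusters that match subtypes exactly.
toy_clusters = [0] * 10 + [1] * 10 + [2] * 10
toy_subtypes = ["A"] * 10 + ["B"] * 10 + ["C"] * 10

stat, dof = chi_square(toy_clusters, toy_subtypes)
v = cramers_v(stat, len(toy_clusters), 3, 3)
print(stat, dof, v)
```

With 327 samples and six groups on each side, even a moderately strong association gives a very large statistic, which is how p-values this small arise.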





The total time taken from me randomly wondering how much work had been done in breast cancer subtyping to wondering if I could replicate the results to actually doing so was probably 2 hours. The first hour I was flicking through papers, reminding myself of Sorlie and Perou’s original work and then looking through subsequent publications. The hands-on time taken to find a dataset, download it and run the analytics was around 15 minutes (with a cup of coffee in there while the data downloaded to our system).

Now obviously a great deal of thought, development and engineering went into developing the solution I used, on the part of both Analytics Engines and Repositive. Repositive have made it simple, easy and painless to find good datasets. It’s literally as easy as running a Google search, but with one key difference. A Google search using the terms I listed above generates around 300,000 results, and none of these tell me what the data is or how many samples there are, or link me directly to the dataset. Repositive does this and massively reduces the barrier to finding good datasets. Analytics Engines have made it easy to obtain datasets and run complex analytics with Analytics Engines XDP and PipelineArchitect with just a few mouse clicks. With both of these solutions in place, I was able to have an idea, run an analysis to test that idea and generate meaningful results with minimal time or effort.



I’m not claiming this analysis is perfect. I did little to ensure my approach matched the approach of the original authors. But what comes out of this does essentially validate the original findings. When I manually looked through the results, some of the clusters and subtypes didn’t match up, but with a different algorithm and a different set of probesets used in the semi-supervised analysis, this is to be expected.

On the whole, this shows several key things. It shows that this research is reproducible. It shows that with the right tools on hand, it’s very easy to test the reproducibility of research. Probably most importantly, it shows that we can take a different approach to hypothesis generation. Thanks to public databases and with the help of Repositive we can now access so much data. With analytics tools like Analytics Engines XDP and PipelineArchitect, there is no reason not to take advantage of this data, to explore it, play with it and ultimately come to a point of data-driven hypothesis generation. The potential advantages of being able to properly mine the data available are endless.



Back in the day when I worked in a lab, data generation was hard work and bioinformatics was harder.

These days, data generation has become commoditised and there is already a phenomenal amount of data available. What this means is that everyone’s first question before generating data should be “has someone already generated a dataset I can use?”

Bioinformatics is still not easy, but once a pipeline has been established, running it should be easy. The best bioinformaticians should be figuring out new ways to do things, new pipelines, new algorithms… Not running routine analysis time and time again. I think what I’ve shown here is that once a pipeline has been worked out, we can streamline this pipeline and make running it easy.







If you’ve any questions about the Repositive platform please contact Charlotte on charlotte@repositive.io

For any questions about the Analytics Engines platform, please feel free to contact me directly on a.tanney@analyticsengines.com

