A Guided Tour on Downloading and Accessing Repositive Methylome Data

The Repositive platform lets users discover datasets from many repositories through a single interface. Repositive not only encourages the reuse of existing datasets, it also makes them findable with little effort. Finding the appropriate datasets, however, is just one step in the workflow of knowledge discovery. Usually, the datasets found then need to be downloaded and saved in a suitable environment so that they can be processed and interrogated.

In this post, we discuss how to download a dataset from the Repositive Methylation Data Resources for future re-use. As of 19 September 2016, Repositive references more than 200,000 methylation datasets (Figure 1).

Figure 1: List of datasets retrieved from Repositive.io by searching for "methylation", returning a total of 202,864 results. These results can be filtered by "Assay type", by "Open" or "Restricted", and by data source type.

For this example, we are going to retrieve Human Bisulfite Sequencing (Bisulfite-Seq) data. Bisulfite treatment yields information about the methylation status of a segment of DNA at single-nucleotide resolution. By clicking on the "Bisulfite-Seq" Assay filter in the Repositive interface, we narrow the search down to 2,125 results, all of them from NCBI's Sequence Read Archive (SRA) [1] (Figure 2).

Figure 2: Screenshot of Repositive showing all methylation datasets that match "Assay type" Bisulfite-Seq, a total of 2,125.

Among the returned results, we identify a study of interest, for example: "Pooling-based genetic mapping of DNA methylation" [2]. We click on the "SRX684994" dataset card and a new page is opened displaying the annotations associated with that dataset (Figure 3).

Figure 3: By clicking on a Repositive dataset card a new page is opened showing the annotation associated with the dataset. At the bottom of the page, the button "Access Data" points to the original source from which this dataset is indexed. On the right hand side, comments can be added by users in the discussion box.

Clicking on the "Access Data" button at the bottom of the dataset page takes us to the source from which this dataset is indexed. This new webpage provides information about the study, its design and all related datasets (Figure 4).

Figure 4: The "SRX684994" dataset is part of a larger collection of data files from the "Pooling-based genetic mapping of DNA methylation" study.

We click on "All runs" to see all of the datasets that are part of this study. A new page is shown (Figure 5) presenting all datasets and much of the metadata associated with this project. Together, the datasets total 116.88 Gb across 192 runs.

Figure 5: A breakdown of all datasets included in the project (accession: PRJNA257890).

We see that Experiment SRX684994 is associated with Run SRR1555839. We click on it and a new window appears showing information about "Metadata", "Reads" and "Download" (Figure 6). Here we notice that this particular run is 117.9 Mb in size.

Figure 6: Screenshot of metadata, reads and download information for run SRR1555839 (experiment SRX684994).

To download the data for run SRR1555839, you first need to install the SRA Toolkit locally on your computer. A link to the SRA Toolkit can be found on the "Download" tab (Figure 6). Click on the link, select the appropriate binary and install it. Since I am using a Mac, I downloaded the macOS 64-bit binary.

Once installed, make sure that your SRA Toolkit directory is added to your $PATH, e.g.:

export PATH=$PATH:/path/to/my/sratoolkit.2.7.0-mac64/bin
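Before fetching anything, it can help to confirm that the shell actually finds the toolkit. The snippet below is a minimal sketch rather than part of the SRA Toolkit: `check_tool` is a hypothetical helper name, and it relies only on the standard `command -v` shell built-in.

```shell
# Hypothetical helper (not part of the SRA Toolkit): report whether a
# command can be found on the current PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found; check that the toolkit's bin directory is on your PATH"
    return 1
  fi
}

# Check for the prefetch command used in the next step.
check_tool prefetch || true
```

If `prefetch` is reported as not found, the most likely cause is that the `export PATH` line points at a different directory from the one where the toolkit was unpacked.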

Now you are able to fetch this dataset by typing:

prefetch -v SRR1555839

If the download succeeds, you should see:

'SRR1555839' was downloaded successfully
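The study contains 192 runs, so you may want to fetch more than one of them. The following is a rough sketch under stated assumptions: `fetch_runs` is a hypothetical helper, and `runs.txt` is an assumed file name for a list of run accessions (one per line) that you have compiled yourself from the "All runs" page. The function only prints the `prefetch` commands, so you can inspect them before running anything.

```shell
# Hypothetical helper: print one prefetch command per run accession.
# The input file (assumed name: runs.txt) holds one accession per line.
fetch_runs() {
  list_file=$1
  # The extra test after "read" also handles a last line with no newline.
  while IFS= read -r run || [ -n "$run" ]; do
    # Skip blank lines.
    [ -n "$run" ] || continue
    echo "prefetch -v $run"
  done < "$list_file"
}

# Example usage (dry run): fetch_runs runs.txt
# To actually download:    fetch_runs runs.txt | sh
```

Printing the commands instead of running them directly keeps the sketch safe to try: nothing is downloaded until the output is piped to `sh`.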

Happy data mining!


[1] Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E (Jan 2008). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research. 36 (Database issue): D13–21. doi:10.1093/nar/gkm1000. PMID 18045790.

[2] Kaplow IM, MacIsaac JL, Mah SM, McEwen LM, Kobor MS, Fraser HB (2015). "A pooling-based approach to mapping genetic variants associated with DNA methylation". Genome Research. 25 (6): 907–917. doi:10.1101/gr.183749.114.

Read more posts by Manuel Corpas