The Human Microbiome Project Datasets



Special thanks to Kemi Ifeonu for co-ordinating this blog post



The Human Microbiome Project (HMP) was established by the NIH Common Fund in 2008 to generate resources that would enable the comprehensive characterization of the human microbiome and help us understand its role in human health and disease.



The first phase of the HMP was designed to provide a “healthy” microbiome baseline. Around 300 healthy individuals were sampled at 15 (men) or 18 (women) body sites. Over the five years of the project, 16S rRNA sequence data were generated for ~10,000 samples and whole metagenome shotgun (WMS) sequence data were generated for ~2,400 samples. In addition, the whole genomes of about 2,800 reference strains isolated from the human body were sequenced.

The second phase of the HMP is known as the Integrative Human Microbiome Project (iHMP). Using multiple omics technologies, this project integrates longitudinal datasets from both the microbiome and host from three different studies of microbiome-associated conditions: Pregnancy & Preterm Birth, Onset of Inflammatory Bowel Disease (IBD), and Onset of Type 2 Diabetes.



All HMP datasets are freely available through the HMP Data Coordination Center (DCC) website.

Some of the major datasets generated by the HMP projects are presented here:


Reference Genomes

About 2,800 reference genomes were sequenced from strains isolated from human body sites.

Datasets available

  • HMP Reference Genomes Project Catalog. This resource provides access to the metadata collected for the reference genomes, such as the body site each organism was isolated from, taxonomy, and annotation information.

  • Sequence and annotation data are available through the HMP DACC website and at the NCBI.

How can you use these datasets?

The reference genome sequences can serve as a database for read mapping, and the information gained will aid in taxonomic assignment and functional annotation. Users can also build a BLAST database from a subset of the reference sequences (for example, by body site) and search their own sequences against it.
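As a rough illustration of the body-site subset idea, the sketch below filters a FASTA file down to the genomes listed for one body site and then assembles the BLAST+ commands to build and search a database. The catalog layout is an assumption: the column names `genome_id` and `body_site`, the toy records, and the file names are all hypothetical and will need to be adapted to whatever the real catalog export looks like. The commands are only constructed here, not run.

```python
import csv
import io


def ids_for_body_site(catalog_csv, body_site):
    """Return the genome IDs whose catalog row matches body_site (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(catalog_csv))
    return {
        row["genome_id"]
        for row in reader
        if row["body_site"].strip().lower() == body_site.lower()
    }


def subset_fasta(fasta_text, keep_ids):
    """Keep FASTA records whose header's first token is in keep_ids."""
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")


def blast_commands(subset_path, db_name, query_path):
    """Build (but do not run) BLAST+ commands for a nucleotide database and search."""
    return [
        ["makeblastdb", "-in", subset_path, "-dbtype", "nucl", "-out", db_name],
        ["blastn", "-query", query_path, "-db", db_name,
         "-outfmt", "6", "-out", db_name + "_hits.tsv"],
    ]


# Toy data standing in for a catalog export and a reference FASTA (hypothetical).
catalog = "genome_id,body_site\nG1,Oral\nG2,Gastrointestinal tract\nG3,Oral\n"
fasta = ">G1 strain A\nACGT\nGGCC\n>G2 strain B\nTTTT\n>G3 strain C\nCCCC\n"

oral_ids = ids_for_body_site(catalog, "oral")
oral_fasta = subset_fasta(fasta, oral_ids)
commands = blast_commands("oral_refs.fasta", "oral_refs_db", "my_sequences.fasta")
```

After writing `oral_fasta` to `oral_refs.fasta`, each command list could be passed to `subprocess.run` on a machine with BLAST+ installed. Tabular output (`-outfmt 6`) is just one convenient choice for downstream parsing.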


Metagenomic datasets for healthy individuals

We collected samples from 300 healthy adult men and women between the ages of 18 and 40, from five major body areas: oral cavity, nasal cavity, skin, gastrointestinal tract and urogenital tract. 16S rRNA and WMS sequencing were performed on the samples using the 454 and Illumina platforms, respectively.

Datasets available

How can you use this dataset?

Users can download the raw reads and perform independent analyses on either the full dataset or a subset. Subsets can be generated based on metadata available through the project catalog or clinical metadata available through the NCBI dbGaP.
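One way to define such a subset is to filter a metadata table on one or more fields and keep the matching sample identifiers. The sketch below assumes a CSV export with hypothetical columns `sample_id`, `body_site`, and `sex`; the real catalog fields may be named differently, and the rows shown are made up.

```python
import csv
import io


def select_samples(metadata_csv, **criteria):
    """Return sample IDs whose rows match every column=value criterion (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(metadata_csv))
    selected = []
    for row in reader:
        if all(
            row.get(column, "").strip().lower() == str(value).lower()
            for column, value in criteria.items()
        ):
            selected.append(row["sample_id"])
    return selected


# Toy metadata standing in for a project catalog export (hypothetical values).
metadata = (
    "sample_id,body_site,sex\n"
    "S1,stool,female\n"
    "S2,stool,male\n"
    "S3,anterior nares,female\n"
    "S4,stool,female\n"
)

stool_female = select_samples(metadata, body_site="stool", sex="female")
all_stool = select_samples(metadata, body_site="stool")
```

The resulting ID lists can then drive which raw read files are downloaded or analysed.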

The processed datasets (e.g., gene indices, community profiles, metabolic pathways) can also be mined for interesting trends.
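As one example of mining a community profile, the sketch below converts per-sample taxon counts to relative abundances and computes the Shannon diversity index (natural log) for each sample. The input format, a dictionary of counts per sample, and the counts themselves are assumptions made for illustration; the actual processed files would first need to be parsed into this shape.

```python
import math


def relative_abundance(counts):
    """Convert raw taxon counts to proportions that sum to 1 (or all zeros if empty)."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero proportion."""
    proportions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in proportions if p > 0)


# Toy community profiles keyed by sample (made-up counts, not HMP data).
profiles = {
    "stool_1": {"Bacteroides": 50, "Prevotella": 50},
    "nares_1": {"Staphylococcus": 100},
}

diversity = {sample: shannon_index(counts) for sample, counts in profiles.items()}
```

A sample dominated by a single taxon has an index of zero, while an even split across taxa gives a higher value, so comparing the index across body sites is a simple first look at how community structure differs.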


Multi’omic datasets

The iHMP is in progress and has started generating integrated longitudinal datasets of biological properties. These datasets include 16S rRNA gene surveys, host genome sequences, host transcriptomes, whole metagenome shotgun sequences, metatranscriptomes, proteomes, and metabolomes. Some of these data are already available, and more datasets will be made available soon.



Read more posts by Craig Smith