Population data: a valuable resource for the community

Population datasets on Repositive

Repositive is now indexing a wide variety of data from different population studies. These include sources of data dedicated to sequencing a single population (e.g. the Kadoorie Biobank and GoNL) or many diverse populations (e.g. SGDP), as well as studies that contain a large cohort of individuals from many different populations (e.g. the Estonian Biocentre Human Genome Diversity Panel). For more details about the population data we are indexing on Repositive, go to the end of this blog post.


Why is population data a valuable resource for the community?

In theory, whole-genome sequencing allows for the complete characterisation of genetic variation in humans. However, this is not possible without studying many individuals from a wide array of populations, because ethnic groups found in different geographic locations carry genetic variants at different frequencies. Therefore, to link genetic variants to the environment or to certain diseases, one must perform large studies on diverse populations.

You need to study many diverse populations if you want to:

  • Cluster rare alleles by geography - 'geographic clustering'.

  • Investigate the risk factors for common chronic diseases in a population.

  • Optimise the design of large-scale genetic association studies.

  • Study gene-environment interactions.

  • Gain insights into the dispersal of modern humans across the globe through history.

  • Gain a greater understanding of evolutionary genetics.

"Investigating the medical and evolutionary impact of structural variation requires that we understand the distribution of such variation within a species and the factors influencing that variation: in other words, the population genetics of structural variation."

Donald F Conrad & Matthew E Hurles 1

Finally, by researching genetic commonality across ethnic groups, researchers also hope to get a preliminary indication of whether genes involved in drug and enzyme metabolism are shared or differ between those groups. 2

Pikachu variants - all one species but different genetic variants!

You would have to sequence all these Pikachu individuals and many more to know what is a common or rare variation, and which variations are associated with disease.

Use Cases

Sarah, Postdoctoral researcher in Sheffield:

"I'm trying to find causal variants associated with rare neurological disorders. I have a large dataset from patients and healthy controls, and by analysing this I have come up with a list of potential variants. I then compare this list to datasets, like the 1000 genomes, to see if these variants are common or rare. If they're rare then they are more likely to cause these rare disorders."
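To make Sarah's comparison step a bit more concrete, here is a minimal Python sketch. It assumes you already have a lookup of allele frequencies from a reference population dataset. The variant IDs, the frequencies and the function name `classify_variants` are all made up for illustration, and the 1% cut-off is just one commonly used convention, not a rule from any particular dataset.

```python
# Hypothetical cut-off: variants below 1% frequency are treated as rare.
RARE_THRESHOLD = 0.01


def classify_variants(candidates, reference_frequencies, threshold=RARE_THRESHOLD):
    """Label each candidate variant as 'rare', 'common' or 'absent'.

    candidates: iterable of variant IDs from your own study.
    reference_frequencies: dict mapping variant ID -> allele frequency
        in a reference population dataset (illustrative values here).
    """
    labels = {}
    for variant in candidates:
        freq = reference_frequencies.get(variant)
        if freq is None:
            # Not seen in the reference data at all.
            labels[variant] = "absent"
        elif freq < threshold:
            labels[variant] = "rare"
        else:
            labels[variant] = "common"
    return labels


# Made-up example data, not taken from any real dataset.
reference = {"var_A": 0.25, "var_B": 0.002}
candidates = ["var_A", "var_B", "var_C"]
labels = classify_variants(candidates, reference)
```

In this toy example only `var_B` and `var_C` would stay on the shortlist, because `var_A` is too common in the reference population to plausibly explain a rare disorder.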

Adam, Principal Investigator in New Zealand:

"I was searching for genome wide association studies (which require large cohorts) or data banks where genotypic and phenotypic data might be available. I am using the genotypic and phenotypic data from the Kadoorie data bank for construction of a polygenic risk score for cardiovascular and respiratory illnesses."
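For readers who haven't met polygenic risk scores before: in its simplest form the score is a weighted sum of how many copies of each risk allele a person carries. The sketch below shows only that basic idea. The variant names, effect sizes and the function `polygenic_risk_score` are hypothetical, and treating a missing genotype as zero copies is a simplifying assumption, not how Adam's analysis necessarily works.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele counts for one individual.

    dosages: dict mapping variant ID -> number of risk alleles (0, 1 or 2).
    effect_sizes: dict mapping variant ID -> per-allele effect size,
        e.g. taken from a genome-wide association study.
    Variants missing from `dosages` are counted as 0 copies (an assumption).
    """
    return sum(
        effect * dosages.get(variant, 0)
        for variant, effect in effect_sizes.items()
    )


# Made-up effect sizes and genotypes, for illustration only.
effects = {"var_1": 0.5, "var_2": -0.25, "var_3": 1.0}
person = {"var_1": 2, "var_2": 1}
score = polygenic_risk_score(person, effects)
```

Real scores are usually built from many more variants and then compared across a whole cohort, which is why large population datasets with both genotypic and phenotypic data matter here.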

A bit more detail about the sources of population data on Repositive

The following sources on Repositive are dedicated to, or contain datasets dedicated to, the sequencing of one specific or many diverse human populations:

All of which I discuss in more detail in my Having trouble finding Chinese genomic data? blog post.

I talk about the value and importance of this dataset in my Simons Genome Diversity Project - Now Featured on Repositive blog post.

I explain how the Estonian Biocentre Human Genome Diversity Panel dataset, from the Estonian Biocentre, brings us One more step towards reducing the ‘European Bias’ in another blog post.

I haven't talked about these data sources before, so I will go into a bit more detail below :-)

1000 Genomes

The goal of the 1000 Genomes Project, which ironically consists of data for 2,504 individuals from 26 populations, was to find most genetic variants with frequencies of at least 1% in the populations studied. The 1000 Genomes samples have proved a popular resource for molecular phenotyping experiments and investigating the associations between genetic variation and expression or measurements of epigenetic state.


GoNL will also serve as a reference panel for imputation in the available genome-wide association studies in Dutch and other cohorts to refine association signals and uncover population-specific variants. The resource will be made available to the research and medical community to guide the interpretation of sequencing projects. Researchers must apply for access.

The THL Biobank

The Finnish National Institute for Health and Welfare's (THL) Biobank contains a unique resource of high-quality longitudinal samples from the Finnish population. It stores collections of human biological samples, together with the associated information, that have been collected for research. The purpose of the THL Biobank is to maintain population-based data for use in future research.


Cover Image credit: Biomedical Genomics & Evolution Lab, Genomics in Health and Disease

  1. Donald F Conrad & Matthew E Hurles. The population genetics of structural variation. Nature Genetics. 39, S30 - S36 (2007) | doi:10.1038/ng2042

  2. The Singapore Genome Variation Project

Read more posts by Charlotte Whicher