Having trouble finding Chinese genomic data?

Well, you’re not the only one! Many scientists, particularly in China, are looking for Chinese genome reference datasets and healthy control data for their patient studies. We discovered this last year when we partnered with GigaScience to organise a workshop in Hong Kong called “How do I find human genomic data to power my research?”

Scott Edmunds, Executive Editor of GigaScience:

All of the participants we surveyed in Hong Kong, Mainland China and Singapore raised the same critical problem holding back their research: a lack of control data for Chinese populations. We heard stories of data-rich academics hoarding data to control authorship, and it is very clear that regional policy makers, funders, and journal editors have not been holding them to account. And the people paying the price for this are patients, who have ineffective drugs and are making under-informed decisions on their health. We hope the access to these new Chinese Control data sources will help reduce the power of the data hoarders, empower researchers to make new discoveries, and patients to have better medications.


You might think…

...that as the Asian population represents as much as 40% of the global population, there would be a lot of Chinese healthy control genomic data. In reality, Asian ethnicities are currently highly under-represented in research into the biology of disease.

Mahantesh I. Biradar, PhD Candidate at the Academia Sinica in Taiwan:

"As researchers in Biomedical Sciences, we all know and face the problem of getting sufficient sample sizes to address our research questions and hypotheses. Often it’s very hard to get the required number of cases and controls to carry out research in public health and epidemiology.

The Chinese population is one of the biggest ethnic populations in the world with as many as 1.6 billion people, according to a 2014 estimate. It is very important to have the repository of population data to study and discover potential gene and gene regions involved in various major diseases in this particular population.

There is a need for data from the ethnic Chinese population – because we see there are many genotypes that are totally contrasting to western populations, such as American and European. The major differences in genetic and environmental aspects result in the need for genotype and phenotype data of Chinese individuals."



Compared with many Western populations, very few comprehensive studies have looked into the causes of rare and inherited diseases, or of complex diseases such as cancer, diabetes and neurological disorders, within the Asian population. Furthermore, most baseline genotyping studies have not been performed on Asian ethnicities, and as a result there is a serious lack of reference data for these genotypes. Therefore, in recent years there has been a huge push from within China and across Asian countries to genotype their populations.





Below is a brief outline of some of the major data sources that are now becoming available to researchers and that contain genomic data from healthy and diseased Chinese individuals. The datasets that are currently available from these sources can be found in the Chinese control data collection on the Repositive platform.


View the Chinese Control Data collection



The China Kadoorie Biobank (CKB) was set up by the University of Oxford’s Clinical Trial Service Unit & Epidemiological Studies Unit (CTSU) and the Chinese Academy of Medical Sciences (CAMS) to investigate the main environmental and genetic causes of common chronic conditions within the Chinese population. From 2004 to 2008, the CKB recruited over half a million individuals across 10 geographically distinct regions of China. The extensive ‘baseline’ data collection included a questionnaire, physical measurements, blood sampling and genotyping, and is now being followed up with close monitoring for death and other health-related outcomes. Furthermore, the ‘baseline’ data collection will be repeated every few years.

Singapore contains three major ethnic groups: Chinese, Malays and Indians. The Singapore Genome Variation Project (SGVP) aims to look at over 1 million SNPs from almost 300 DNA samples to characterise the genetic variation between these ethnic groups. This data is designed to supplement the data provided by the International HapMap Project, which surveys genomic variation internationally.

The first diploid genome of a Han Chinese individual was sequenced in 2007. Since then, the sequencing of Chinese nationals by the Chinese biobanking initiative has made much progress. The National Biobank Network in China now consists of over 20 biobanks based in Beijing and Shanghai alone. With the data collected in these biobanks, the network aims to develop disease maps for health management, references for translational research and resources for the medical industry.

The Beijing Genomics Institute (BGI) is one of the world’s largest genome sequencing centres, and has been involved in hundreds of national and international sequencing projects. The BGI sponsors the publishing of the open access journal GigaScience and the maintenance of the associated database GigaDB. Therefore, the majority of datasets on GigaDB are from the BGI, and include a handful of important studies containing Chinese reference data.



Within the last few weeks, the launch of a non-profit consortium made up of academic and industrial partners has been announced. GenomeAsia 100K aims to sequence 100,000 individuals from across 12 South Asian countries and at least 7 North and East Asian countries. The consortium will combine sequencing data with microbiome, clinical and phenotypic data to build a deeper understanding of diseased and healthy individuals.


There’s more!

These are only a few of the multiple resources now becoming available for researchers to gain access to genomic data from Asian ethnicities.

Stay tuned for more blog posts on what resources are out there for other ethnicities, minority groups, rare diseases and common diseases!

For more details about the resources discussed above and how to access their data, sign up to Repositive.



Read more posts by Charlotte Whicher