A repeat expansion is an increase in the number of consecutive copies of a short sequence of DNA bases in a genome, beyond what is typical in the population. Such repeat sequences are widespread in the human genome, and size variations in most of them are probably unrelated to human health. A number of repeat expansions, however, have been shown to cause severely debilitating and often fatal diseases that are typically 'late-onset', i.e. the early symptoms become apparent in adulthood rather than at birth or in childhood.
One well-known example is Huntington's Disease. In 1993, it was found that this condition occurs when someone has more than 36 consecutive repeats of the sequence 'CAG' in a coding region of the huntingtin gene. This DNA base triplet encodes the amino acid glutamine, and expansion of repeats of this coding sequence is the cause of a number of other neurodegenerative disorders, which are thus collectively known as 'polyglutamine diseases'. These include Spinal and Bulbar Muscular Atrophy (the first repeat expansion disorder to be identified as such) and at least five types of Spinocerebellar Ataxia.
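To make the idea of "consecutive repeats" concrete, here is a minimal Python sketch that finds the longest run of back-to-back copies of a motif in a DNA string and compares it with the 36-repeat figure quoted above. The function name `longest_tandem_run` and the constant `HD_REPEAT_THRESHOLD` are my own, chosen for illustration; the sketch assumes exact matching on a clean sequence and is not a diagnostic method.

```python
# Threshold taken from the text above ("more than 36" CAG repeats);
# used here purely as an illustration.
HD_REPEAT_THRESHOLD = 36


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive exact copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must be non-empty")
    seq = sequence.upper()
    unit = motif.upper()
    k = len(unit)
    # copies_from[i] = number of back-to-back copies of the motif starting at i.
    # The k extra zero entries let us look up position i + k without bounds checks.
    copies_from = [0] * (len(seq) + k)
    best = 0
    # Work from the end of the sequence so copies_from[i + k] is already known.
    for i in range(len(seq) - k, -1, -1):
        if seq.startswith(unit, i):
            copies_from[i] = 1 + copies_from[i + k]
            best = max(best, copies_from[i])
    return best


if __name__ == "__main__":
    # Toy sequence: 40 uninterrupted CAG copies, an interruption, then 3 more.
    toy = "TT" + "CAG" * 40 + "CAA" + "CAG" * 3 + "GG"
    copies = longest_tandem_run(toy, "CAG")
    print(copies, copies > HD_REPEAT_THRESHOLD)
```

Scanning every start position, rather than only every third base, means the repeat is found whatever its offset within the string.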
Repeat expansions do not have to occur in the coding sequence, however, to cause disease. Fragile X syndrome and amyotrophic lateral sclerosis (ALS) are two examples of repeat expansion diseases in which, although the repeat occurs within a gene, it is either in the untranslated region at the start of the gene or in an intron. Due to a visible (under the microscope) effect on the X chromosome, Fragile X syndrome was linked to the physical genome as early as 1969; in more detail, it is an expansion to >200 copies of a CGG repeat in the untranslated region at the start of the FMR1 gene. Although various mutations in various genes have been identified as the causes of small percentages of cases of ALS, it was only in 2011 that an expansion of a hexamer (GGGGCC) repeat in intron 1 of the C9orf72 gene was identified as another cause, accounting for ~10% of cases of ALS and thus the most prevalent cause of ALS found to date. These reports also implicated the same expansion in cases of Frontotemporal Dementia, providing the mechanism linking two neurodegenerative disorders long considered to be separate.
While a Huntington's repeat expansion of 40 copies of the CAG triplet (120 bases) can be spanned by a single 150 bp Next Generation Sequencing (NGS) read, the expansions typical of non-coding pathogenic repeats are often much longer than NGS reads. By definition, these are highly repetitive sequences with a very short period, so assembling the repeat sequence from NGS reads is not a promising approach. At Illumina in early 2014, I started working part-time with Mike Eberle on computational exploration of his insight that the count of reads matching a repeated sequence of interest could be used to detect repeat expansions. I wrote the initial versions of the expansionHunter tool, testing them on a small set of Fragile X genomes processed by the sequencing team at Illumina Chesterford; Mitch Bekritsky subsequently contributed the fuzzy repeat model.
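The read-counting idea can be illustrated with a short Python sketch. It is emphatically not the expansionHunter implementation: the function names, the mismatch-only tolerance and the 0.9 cutoff are all assumptions of mine, made only to show how one might count reads that consist almost entirely of a repeat motif, on either strand and in any phase.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement an upper-case DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def rotations(motif: str) -> set:
    """All cyclic shifts of the motif, so a read may start at any phase."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def repeat_match_fraction(read: str, motif: str) -> float:
    """Best fraction of read bases agreeing with a perfect repeat of the motif.

    Both strands and every phase are tried. Mismatches are tolerated,
    but insertions and deletions are not modelled (a simplification).
    """
    read = read.upper()
    motif = motif.upper()
    if not read or not motif:
        return 0.0
    units = rotations(motif) | rotations(reverse_complement(motif))
    best = 0
    for unit in units:
        # Build a perfect repeat of the same length as the read.
        expected = (unit * (len(read) // len(unit) + 1))[: len(read)]
        matches = sum(a == b for a, b in zip(read, expected))
        best = max(best, matches)
    return best / len(read)


def count_in_repeat_reads(reads, motif: str, min_fraction: float = 0.9) -> int:
    """Count reads that look like they lie wholly inside the repeat.

    min_fraction is an arbitrary illustrative cutoff, not a tuned value.
    """
    return sum(repeat_match_fraction(r, motif) >= min_fraction for r in reads)


if __name__ == "__main__":
    toy_reads = [
        "GGGGCC" * 25,                  # forward strand, in phase
        "GGCCCC" * 25,                  # reverse strand
        "GCCGGG" * 25,                  # forward strand, shifted phase
        "ACGTACGTTAGC" * 12 + "ACGTAC", # unrelated 150 bp sequence
    ]
    print(count_in_repeat_reads(toy_reads, "GGGGCC"))
```

In a sample carrying a long expansion, far more reads fall wholly inside the repeat than in a sample with a repeat of typical length, so a count like this, compared between cases and controls at similar sequencing depth, gives a signal even when no single read spans the repeat.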
A year later we began a collaboration with Jan Veldink and Joke van Vugt at the University Medical Centre Utrecht, under the auspices of Project MinE. Thanks to Joke's unflagging commitment to processing the genomes of ALS patients and controls with expansionHunter (and her patience with requests for reruns using bug-fixed or otherwise improved versions!), we began seeing promising results and identified and addressed issues that were not present or apparent in our Fragile X dataset. For org-chart reasons, I had to wind down my involvement in the project; within a few months I had found my way to Repositive and Mike had recruited Egor Dolzhenko to take over the development of expansionHunter. It is great that expansionHunter is now available as open source at https://github.com/Illumina/ExpansionHunter and that a description of its application to the Project MinE ALS samples (and to the Fragile X dataset) is available in a preprint. Hopefully this will unlock the potential of widespread next generation sequencing for investigation of repeat expansions in ALS and other neurodegenerative disorders in which this mechanism has been (or is yet to be!) identified as a cause. Here at Repositive, we look forward to making our contribution by adding this mode of genomic analysis to the many others for which we already index datasets for our users. These neurodegenerative disorders are a context in which we would be particularly delighted to hear of many more Data Eureka moments.
- The Huntington's Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell. 1993 Mar 26;72(6):971-83. PMID 8458085.
- La Spada, A. et al. Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature. 1991 Jul 4;352(6330):77-9. PMID 2062380.
- Liou, S. Trinucleotide repeat disorders. http://web.stanford.edu/group/hopes/cgi-bin/hopes_test/trinucleotide-repeat-disorders/. 2010 June 26 (accessed 2016 Dec 23).
- Lubs, H.A. A marker X chromosome. Am J Hum Genet. 1969 May;21(3):231-244. PMID 5794013. PMC1706424.
- DeJesus-Hernandez, M. et al. Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-linked FTD and ALS. Neuron. 2011 Oct 20;72(2):245-56. PMID 21944778. PMC3202986.
- Renton, A.E. et al. A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD. Neuron. 2011 Oct 20;72(2):257-268. PMID 21944779. PMC3200438.