Part of my job at Repositive is to think about the different kinds of specific queries that can be run on genomic data that are scientifically interesting. For example, calculate the allele frequency at a certain genomic location, count the number of C->T mutations in a data set, etc.

I was recently speaking to a friend who mentioned that knowing the ethnicity of the individuals in a study is important when trying to identify particular genetic variants associated with disease. The usual approach to trying to identify genomic locations associated with a disease is to sequence the genomes of a collection of healthy individuals (the controls) and some individuals with the disease you're interested in (the cases). You then look for genomic features that are present in the diseased cases but absent in the controls.

If you're not careful, you might end up with your cases and controls being drawn from different genetic populations (for example, maybe your controls are mainly of Asian heritage and your cases mainly European). This matters because the hope in doing this sort of case/control study is that the average genetic differences between the two groups (the cases and the controls) is because one group have the disease and the other doesn't. So if there are systematic genetic differences between the two groups other than the disease you won't be able to separate which differences are associated with the disease and which are associated with other traits.

There have also been several cool papers recently that use genetic data to cluster individuals into different populations. It's quite common to do a similar thing with the 1000 genomes project data (which should really be called the 2504 genomes project at the time of writing). Leading to pretty plots like this,
Source

So I decided it was time I joined the fun and wrote some genetic clustering functions of my own. In particular I thought I could use this clustering to estimate which background population is best suited as a control for a given sample or set of samples. There are a couple of tutorials on the internet that show how to cluster the 1000 genomes data, so if that's what you're after, you should check there.

## Clustering the 1,000 genomes data

To make one of these pretty clustering plots, the first key decision to make is to decide how to quantify the variation between the genomes of your samples. That is, Bob might have an A at position 10,232,111 on chromosome 17, while Alice has a T. Because computers don't operate on A,C,G & T directly, we have to turn the genetic differences into numerical ones. There are many ways to do this and I can't see any reason to think that one of them is the "right" choice. The standard way seems to be to put a 1 at locations that differ from the reference genome and a 0 everywhere else. This ignores completely that we have two copies of most chromosomes (one from each parent), so I used the incremental improvement of counting the number of differences from the reference (0,1 or 2) at each location.

Now we have a computer friendly version of everyone's genome we can make pretty pictures right? If you have some futuristic computer with Terabytes of RAM, then yes! But you probably don't, in which case you need to do something about the fact that representing 2504 genomes of 3 billion letters in a computer needs around 7,500 Gb of RAM (my laptop has 8Gb).

The first thing to do is to recognise that we're interested in finding differences between genomes. As such, we can throw away any location where everyone has the same sequence. This gets us down to about 30 million locations. Even this is way too many to be practical. There are many ways to choose a useful subset. I chose to consider only those that changes that fall within a gene and cause the expressed protein to change (i.e., nonsynonymous changes). There are roughly 15,000 of these in the 1000 genomes sample, a number that should make clustering manageable on any modern machine.

Having decided on a quantification strategy, I constructed a 15000x2504 matrix, where the rows represent the genomic locations where protein changing mutations occur, columns represent individual samples and the cells encode the number of non-reference alleles present in that sample at that location. Running this through a standard principal component analysis routine and plotting the results gave me this:

Which looks pretty good right? Especially considering we're only looking at a very small subset of the possible locations.

How does it look if we add a non 1000 genomes sample? To do this, we need to recognise that each sample's principal component value is just a special sum of the numeric representation of the genome we decided on earlier, where each location is given a different weighting. This is much easier to explain with mathematical notation. If the numerical representation of a sample $S$ is $(x_0,x_1,...,x_N)$ (where each $x_i$ is 0, 1 or 2 as described earlier) then the value of the first principal component for $S$ is,

$$\sum_{i=0}^N x_i w_i$$

where $w_i$ are the weights for the first principal component. Anyway, the point is that we can easily calculate the principal component values for any sample and place them on the plot. Here's an example using some of my test data,

Ugh, that doesn't look right! All my new samples are way off in the middle of nowhere. What's going on? The problem here is that I only have data for chromosome 15 for the samples I added. But when I constructed the numeric vector I put in 0 (for no difference from the reference) at all other locations. So the analysis sees a collection of weird samples that are identical to the reference everywhere but chromosome 15. No wonder they cluster separately! To get around this I constructed a new set of principal components where I only used information from chromosome 15. Which looked like this:

Much better right? Although as you'd expect (given we're using less information than before) the different populations don't separate as cleanly. Still, this can hopefully form the basis of a useful function for our users.

If all you're interested in is another vague tutorial on how to cluster genomic data using PCA, you can stop reading now1. Still here? Good for you. Now you get to join me as I try and work out why this type of analysis doesn't seem quite right. And I don't mean in the not quite right because it involves race and genetics which always feels vaguely icky.

## Alternatives to principal component analysis

Really the aim of the whole clustering analysis as I understand it is to identify which samples are most similar to one another at a genetic level. For someone with a background in astrophysics, this sounds a lot like trying to split stars into different clusters. As such, my first instinct is to try and define a way of calculating the distance between two objects. For stars, this is just the physical distance between them, but for genomic sequences it's obviously more complicated.

In everything I did above, I was able to cluster individuals into groups without explicitly defining how I was measuring the "genomic distance" between them. I guess that's what makes me feel uneasy. I think that really what I'm doing above is to use the principal component analysis to create a distance measure. The distance between two samples is the euclidean distance between them in principal component space. That is, two samples are close if they have similar principal components.

While this clearly works reasonably well in this case, I don't like that it makes the interpretation of the data difficult. Furthermore, while it worked in this case, all PCA does is guarantee that the principal components maximise the variance, not that they separate the data into clusters. It is in principal (ha-ha, get it!) possible that the dimension with the least variance is the one which separates the data into clusters the most clearly. If this were the case, PCA would work terribly.

Motivated by these reservations, I started thinking about alternative ways of clustering and visualising the data, with a clearer interpretation. I started by defining the distance between two genomes as the number of locations where they differ. Which is embarrassingly simple really, but no one can accuse it of being difficult to interpret. I then created a square matrix containing all the pair-wise distances between all the samples.

Finally, I used a technique called "friends of friends" clustering (at least that's what it's know as in the astrophysical community) to group them into clusters. Friends of friends work by defining every data point's "friends" to be those points that are less than some distance cut-off away from them (you choose the distance cut-off). Clusters are then defined as the set of points that can be reached within the same "friendship" network. That is, A and B are in the same cluster if they're friends, friends of friends, friends of friends of friends, etc.

Playing around with the cut-off I was able to split the data into two large groups. One contained all the individuals of African ancestry except 1, the other contained everyone else. Which I was rather pleased with. Of course, PCA also did a pretty good job of splitting the data into the same two groups, but I'm more comfortable with this method where the interpretation is clear.

##### Cluster 1:
 AFR 600
##### Cluster 2:
 AFR AMR EAS EUR SAS 1 347 504 503 489

Readers with some background in cluster analysis will recognise that what I've done is to implement my own crappy version of hierarchical clustering. There are many nice packages in R that will take a matrix of pairwise distances and compute a hierarchy of clusters, complete with tools to visualise those clusters. Here's one such example.

Which is a lot nicer than my ad-hoc friends of friends implementation (even if the labels have gone walk-about), but not all that different conceptually.

## Footnotes

1. Not to down-play the draw of my charming repartee, you're probably better off elsewhere if that's what you're after.