Dimensionality reduction of genotype data

interactive visualizations

In the project described in this paper, we've been exploring the use of convolutional autoencoders to perform dimensionality reduction of genotype data.

The idea behind dimensionality reduction is to transform data from a high-dimensional space to a lower-dimensional space in a way that retains some meaningful properties of the data. Here, each individual is represented by a subset of their genetic sequence, 160,858 positions, and this representation is reduced to just 2 dimensions. This allows the data to be visualized on a 2D scatter plot, which reveals different patterns of genetic variation depending on the method used.
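As a rough illustration of the idea, the sketch below reduces a small synthetic genotype matrix to 2 dimensions using PCA, one of the comparison methods listed on this page. It is not the GCAE model and does not use the Human Origins data: the toy populations, the allele frequencies, the 0/1/2 genotype coding and the helper name `pca_2d` are all assumptions made for the example.

```python
import numpy as np


def pca_2d(genotypes: np.ndarray) -> np.ndarray:
    """Project an (n_samples, n_snps) matrix onto its top 2 principal components."""
    # Center each SNP column so the components describe variation around the mean
    centered = genotypes - genotypes.mean(axis=0)
    # Thin SVD: rows of vt are principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 1000  # hypothetical sizes, far smaller than the real data

# Hypothetical toy data: two populations whose allele frequencies differ slightly
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)

# Assumed coding: each genotype is the count (0, 1 or 2) of one allele
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

coords = pca_2d(genotypes)
print(coords.shape)  # (100, 2): one 2D point per individual
```

Each row of `coords` is one individual's position on a 2D scatter plot. A nonlinear method such as an autoencoder, t-SNE or UMAP would replace the linear projection in `pca_2d` with its own mapping, which is why the resulting patterns differ between methods.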

Below are some interactive plots showing results for our model (GCAE) compared to some other methods: a variational autoencoder method described here (popvae), Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP).

All plots show a 2-dimensional representation of genotype data for 2068 samples from 166 populations in the highly diverse Affymetrix Human Origins SNP array data from Lazaridis et al. (2016).

Click on the legend items to hide/show populations.

Select model

GCAE popvae PCA t-SNE UMAP

(Note that the plots can take some time to load.)