Projects and interests

Data visualization

Bokeh is a Python-based visualization library that I've been trying out lately for applications that require interaction and higher levels of customization. It's nice because it's relatively fast and flexible, and can be embedded in web apps and notebooks.


I've found it useful for exploring large, complex data sets, for example in our project on deep learning for dimensionality reduction of genomes, and in this related visualization of hyperparameter tuning of t-SNE, UMAP and ISOMAP.
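To give a flavor of what this looks like, here's a minimal sketch of an interactive embedding plot with Bokeh; the data and column names are made up for illustration, not from the actual project:

```python
# Minimal sketch of an interactive embedding plot with Bokeh.
# The data and column names are made up for illustration.
import numpy as np
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool

# Pretend 2D embedding (e.g. from t-SNE or UMAP) with a label per sample
rng = np.random.default_rng(0)
source = ColumnDataSource(data=dict(
    x=rng.normal(size=200),
    y=rng.normal(size=200),
    population=rng.choice(["A", "B", "C"], size=200),
))

p = figure(title="Embedding", tools="pan,wheel_zoom,box_select,reset")
p.scatter("x", "y", source=source, size=6, alpha=0.7)
p.add_tools(HoverTool(tooltips=[("population", "@population")]))

show(p)  # opens in the browser; use output_notebook() to embed in a notebook instead
```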


GitHub

Deep learning for genetic data

A deep-learning framework for genetic data, based on convolutional autoencoders, that I developed during my PhD. This paper describes the model and how it can be used for dimensionality reduction as well as genetic clustering. We show how this nonlinear, data-driven approach can learn a useful representation of complex genetic data sets and provide new insights compared to traditional methods.
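For a rough idea of this kind of model, here's a minimal sketch of a 1D convolutional autoencoder in Keras; the input length, layer sizes and 2D bottleneck are placeholders for illustration, not the architecture from the paper:

```python
# Minimal sketch of a 1D convolutional autoencoder for genotype-like input.
# Input length, layer sizes and the 2D bottleneck are illustrative placeholders.
from tensorflow.keras import layers, Model

n_markers = 1024  # hypothetical number of genetic markers per sample

inputs = layers.Input(shape=(n_markers, 1))
x = layers.Conv1D(16, 5, padding="same", activation="relu")(inputs)
x = layers.MaxPooling1D(4)(x)
x = layers.Conv1D(8, 5, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(4)(x)
x = layers.Flatten()(x)
encoded = layers.Dense(2, name="embedding")(x)  # 2D representation for visualization

x = layers.Dense((n_markers // 16) * 8, activation="relu")(encoded)
x = layers.Reshape((n_markers // 16, 8))(x)
x = layers.UpSampling1D(4)(x)
x = layers.Conv1D(16, 5, padding="same", activation="relu")(x)
x = layers.UpSampling1D(4)(x)
outputs = layers.Conv1D(1, 5, padding="same")(x)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, encoded)  # used to extract the low-dimensional coordinates
autoencoder.compile(optimizer="adam", loss="mse")
```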


Here's an interactive version of the dimensionality reduction results presented in the paper.

GitHub

Serverless workflow execution engine

Developed in collaboration with researchers at the University of Washington, Seattle, the Serverless Workflow Enablement and Execution Platform (SWEEP) is a workflow management system based on the serverless paradigm. The goal of SWEEP is to simplify the creation of cloud-native workflows: users define tasks as functions or containers, set up rules for their orchestration, and execute them in the cloud without provisioning any virtual infrastructure.
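To illustrate the idea (this is not SWEEP's actual syntax, just a hypothetical example), a workflow definition could look something like this, with tasks backed by functions or containers and dependency rules between them:

```python
# Purely illustrative sketch of a serverless workflow definition: tasks as
# functions or containers, plus rules for how they depend on each other.
# NOT the actual SWEEP syntax; names and fields are hypothetical.
workflow = {
    "name": "satellite-image-analysis",
    "tasks": [
        {"id": "fetch",    "type": "function",  "handler": "fetch_scenes"},
        {"id": "mask",     "type": "container", "image": "example/cloud-mask:latest",
         "depends_on": ["fetch"]},
        {"id": "classify", "type": "function",  "handler": "classify_pixels",
         "depends_on": ["mask"]},
        {"id": "report",   "type": "function",  "handler": "summarize",
         "depends_on": ["classify"]},
    ],
}

# Each function task would map onto a cloud function, e.g. an AWS Lambda handler:
def classify_pixels(event, context):
    # 'event' would carry references to the masked imagery produced upstream
    scene_uri = event["scene_uri"]
    # ... run the classification and return a reference to the result ...
    return {"result_uri": scene_uri + ".classified"}
```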


Two interesting use cases of SWEEP come from environmental research, where it was used to analyze satellite imagery and study the effects of climate change on remote Arctic lakes and on wildflower communities in Mt. Rainier National Park.

Efficient imputation of missing data with hidden Markov models on GPUs

Turns out the increasingly trendy GPU can be leveraged for old-school (non-deep) methods too. In this paper, we consider a computationally intensive HMM-based method for imputing missing data in genomes, and present an adaptation of the algorithm that reduces memory consumption enough to allow for execution on GPUs. We show that this pays off for the particularly challenging application of imputing ancient DNA, giving improved accuracy with similar runtimes compared to alternative pipelines.
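As a toy illustration of running HMM recursions on a GPU (not the algorithm or memory layout from the paper), here's a batched forward pass in log-space with PyTorch:

```python
# Toy sketch of a batched HMM forward pass on the GPU with PyTorch, to give
# an idea of running HMM recursions on GPUs. Not the paper's algorithm.
import torch

def forward_loglik(log_trans, log_emit, log_init):
    """log_trans: (S, S), log_emit: (B, T, S), log_init: (S,).
    Returns per-sequence log-likelihoods of shape (B,)."""
    B, T, S = log_emit.shape
    alpha = log_init + log_emit[:, 0]  # (B, S)
    for t in range(1, T):
        # logsumexp over previous states: (B, S, 1) + (S, S) -> (B, S, S)
        alpha = torch.logsumexp(alpha.unsqueeze(2) + log_trans, dim=1) + log_emit[:, t]
    return torch.logsumexp(alpha, dim=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
S, T, B = 64, 500, 32
log_trans = torch.log_softmax(torch.randn(S, S, device=device), dim=1)
log_init = torch.log_softmax(torch.randn(S, device=device), dim=0)
log_emit = torch.log_softmax(torch.randn(B, T, S, device=device), dim=2)
print(forward_loglik(log_trans, log_emit, log_init))
```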



GitHub

Handling missing data in PCA

Working with PCA in the field of population genetics, I often came across the problem of missing values in the data. I found that most implementations didn't have a lot of support for handling this (usually implementing the simplest solution of filling in missing values with 0), and that the issue wasn't discussed much overall.


I implemented some alternative methods that I found in the literature, had a go at using them on some data sets, and found that they gave much more accurate results than zero-filling (even the really simple and cheap ones).
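As an illustration of the general idea, here's a generic, textbook-style approach that iteratively imputes the missing entries with a low-rank reconstruction before computing the PCA; it's not necessarily one of the methods from the report:

```python
# Sketch of one simple alternative to zero-filling: iteratively impute missing
# entries with a low-rank reconstruction before running PCA. A generic,
# textbook-style approach, not necessarily one of the methods in the report.
import numpy as np

def pca_with_missing(X, n_components=2, n_iter=20):
    """X: (n_samples, n_features) with np.nan for missing entries."""
    missing = np.isnan(X)
    X_filled = np.where(missing, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        mu = X_filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_filled - mu, full_matrices=False)
        approx = mu + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        X_filled[missing] = approx[missing]  # only overwrite the missing entries
    scores = (X_filled - X_filled.mean(axis=0)) @ Vt[:n_components].T
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out 10% of the entries
print(pca_with_missing(X)[:5])
```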


I wrote this technical report about it. The Python source is linked on GitHub, and publishing a package is on my TODO list. Let me know if you're interested in using it.




GitHub

Massively parallel analysis of genome data

BAMSI, the BAM Search Infrastructure, is a SaaS solution for cloud-based searching and filtering of massive genomic data sets. Based on Celery and RabbitMQ, it can be used to set up multi-cloud, distributed processing of alignment files that leverages multiple mirrors of public data sets.
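As a sketch of the kind of task such a setup is built on (illustrative only, not BAMSI's actual code), a Celery worker filtering a BAM file with pysam might look like this:

```python
# Illustrative sketch of a Celery task for BAM filtering: workers pull jobs
# from a RabbitMQ queue and process alignment files with pysam.
# Not BAMSI's actual code; paths and parameters are hypothetical.
from celery import Celery
import pysam

app = Celery("bam_tasks", broker="amqp://guest@localhost//")

@app.task
def filter_bam(bam_path, region, min_mapq, out_path):
    """Write reads in `region` with mapping quality >= min_mapq to out_path."""
    with pysam.AlignmentFile(bam_path, "rb") as infile:
        with pysam.AlignmentFile(out_path, "wb", template=infile) as outfile:
            for read in infile.fetch(region=region):
                if read.mapping_quality >= min_mapq:
                    outfile.write(read)
    return out_path

# A client queues jobs without touching the data itself, e.g.:
# filter_bam.delay("https://mirror.example.org/sample.bam",
#                  "chr20:1-5000000", 30, "/tmp/filtered.bam")
```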


The framework is presented in this paper, where we also describe an implementation for the 1000 Genomes data set with a worker ecosystem spanning AWS and SNIC Science Cloud.


GitHub