I work at the intersection of machine learning methods and computational biology problems. Many of the projects I am involved in have both components, resulting in overlaps in the descriptions below.
Computational Biology View
Biomarker discovery in genomics data
One of the main aims of genome-wide association studies (GWAS) is to identify DNA features which, if not causal, are at least significantly associated with increased risk of various diseases and traits, or with increased benefit from specific treatments. However, increased risk of developing complex diseases such as cancer or diabetes is caused by mutations in combinations of genes, rather than in any single gene. These genetic variations may reduce resistance to infections or trigger auto-immune reactions that lead to disease. We study these combinations of locations, called epistatic interactions, which are jointly strongly associated with disease. This can be cast in the framework of feature selection in machine learning, with challenges such as dependent features, high-dimensional low-sample-size data, and uncertain data.
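As a toy illustration of why epistatic interactions require joint rather than marginal testing, the following sketch (hypothetical simulated data, with a pure-XOR interaction between two loci) scores single loci and a locus pair by empirical mutual information with disease status:

```python
import random
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

random.seed(0)
n = 5000
snp_a = [random.randint(0, 1) for _ in range(n)]
snp_b = [random.randint(0, 1) for _ in range(n)]
# Purely epistatic toy model: disease status is the XOR of the two loci,
# so neither SNP carries information on its own.
disease = [a ^ b for a, b in zip(snp_a, snp_b)]

mi_a = mutual_information(snp_a, disease)                    # near zero
mi_pair = mutual_information(list(zip(snp_a, snp_b)), disease)  # near 1 bit
print(round(mi_a, 3), round(mi_pair, 3))
```

A single-locus scan would rank both SNPs as uninteresting; only scoring the pair jointly reveals the association, which is exactly what makes the feature-selection problem combinatorial.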
Medical Imaging for Cancer Pathology
We consider an automated processing pipeline for tissue microarray analysis of renal cell carcinoma. It consists of several consecutive tasks, each of which can be mapped to a machine learning challenge. We investigate tasks such as nuclei detection and segmentation, nuclei classification, and staining estimation. This work is done in collaboration with the Peter MacCallum Cancer Centre in Melbourne.
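The detection step can be illustrated in miniature: on a thresholded (binary) image, candidate nuclei correspond to connected foreground components. This is only a toy stand-in for the learned detectors used in the actual pipeline:

```python
from collections import deque

def count_blobs(image):
    """Count 4-connected foreground components in a binary image (list of lists)."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    blobs = 0
    for i in range(h):
        for j in range(w):
            if image[i][j] and not seen[i][j]:
                blobs += 1
                queue = deque([(i, j)])
                seen[i][j] = True
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and image[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return blobs

# Toy thresholded tissue image: 1 = stained pixel, 0 = background.
img = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
]
print(count_blobs(img))  # → 3
```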
Experimental Design for Systems Biology
Our long-term goal is to determine the nonlinear dynamical system underlying glucose signalling in yeast. To this end, we developed a computational method that proposes the biological experiments that would maximize the information gained. This work is done under the umbrella of the YeastX project.
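The principle behind such experiment proposal can be sketched with a small Bayesian example (the hypotheses, experiments, and likelihoods below are invented for illustration): among candidate experiments, choose the one whose outcome is expected to reduce our uncertainty over competing models the most.

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def expected_information_gain(prior, likelihoods):
    """likelihoods[h] = P(outcome=1 | hypothesis h) for one candidate experiment."""
    h0 = entropy(prior)
    gain = 0.0
    for outcome in (0, 1):
        # Probability of this outcome, and the posterior it would induce.
        joint = [p * (l if outcome else 1 - l) for p, l in zip(prior, likelihoods)]
        p_outcome = sum(joint)
        if p_outcome == 0:
            continue
        posterior = [j / p_outcome for j in joint]
        gain += p_outcome * (h0 - entropy(posterior))
    return gain

# Three competing (hypothetical) models of a signalling pathway, uniform prior.
prior = [1 / 3, 1 / 3, 1 / 3]
experiments = {
    "knockout_A": [0.9, 0.1, 0.5],   # outcome discriminates between models
    "knockout_B": [0.5, 0.5, 0.5],   # uninformative: same prediction everywhere
}
best = max(experiments, key=lambda e: expected_information_gain(prior, experiments[e]))
print(best)  # → knockout_A
```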
Machine Learning View
Active Learning
One of the key challenges facing modern machine learning is the difficulty of obtaining good annotations for data. For a particular task at hand, this may be due to limits of time, cost, or experiment. Active learning uses computational algorithms to suggest the most useful measurements. This is sometimes known as experimental design, and one possible theoretical framework balances exploration and exploitation in a function-maximization setting.
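The exploration–exploitation framing can be made concrete with an upper-confidence-bound rule over a discrete set of candidate measurement points (a generic bandit-style sketch, not the specific method of any of the projects above):

```python
import math
import random

def ucb_maximize(f, candidates, n_rounds=200, noise=0.1, beta=2.0, seed=0):
    """Pick measurement points by an upper-confidence-bound rule:
    empirical mean plus an exploration bonus that shrinks with visit count."""
    rng = random.Random(seed)
    counts = [0] * len(candidates)
    sums = [0.0] * len(candidates)
    for t in range(1, n_rounds + 1):
        def score(i):
            if counts[i] == 0:
                return float("inf")          # always try unseen candidates first
            mean = sums[i] / counts[i]
            return mean + beta * math.sqrt(math.log(t) / counts[i])
        i = max(range(len(candidates)), key=score)
        sums[i] += f(candidates[i]) + rng.gauss(0, noise)   # noisy measurement
        counts[i] += 1
    # The most frequently measured candidate is the inferred maximizer.
    best = max(range(len(candidates)), key=lambda i: counts[i])
    return candidates[best]

# Unknown response curve with a maximum at x = 2; only noisy evaluations are seen.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
best_x = ucb_maximize(lambda x: -(x - 2.0) ** 2, xs)
print(best_x)  # → 2.0
```

Early rounds are spent spreading measurements across candidates (exploration); as counts grow, the bonus shrinks and the budget concentrates on the apparent maximizer (exploitation).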
Learning the kernel
A kernel captures the similarity between objects, and its choice is highly important for the success of the classifier. When it is not known which kernel is best, one can attempt to learn the similarity from data.
Multiple kernel learning (MKL) optimizes the kernel weights while training the SVM: given a number of different kernels, it selects a small set of good ones for the task at hand. In addition to yielding good classification accuracy, MKL can also be useful for identifying relevant and meaningful features. We recently derived the structured output learning version of MKL in "Multiclass multiple kernel learning", and related several slightly different formulations. When no explicit finite set of candidate kernels is available, the idea of "hyperkernels" comes into play, i.e. a kernel on kernels. We have also had some success in learning the similarities between outputs (learning output kernels) by optimizing over the whole space of positive semidefinite matrices.
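A much simpler stand-in for full MKL optimization conveys the idea of weighting kernels by their usefulness: score each candidate kernel by its alignment with the label matrix yyᵀ and combine them with weights proportional to that score (toy data; this is kernel–target alignment, not the joint SVM training described above):

```python
def frobenius_inner(a, b):
    return sum(a[i][j] * b[i][j] for i in range(len(a)) for j in range(len(a)))

def alignment(k, y):
    """Kernel-target alignment between kernel matrix k and labels y in {-1, +1}."""
    yyT = [[yi * yj for yj in y] for yi in y]
    norm_k = frobenius_inner(k, k) ** 0.5
    norm_y = frobenius_inner(yyT, yyT) ** 0.5
    return frobenius_inner(k, yyT) / (norm_k * norm_y)

def combine(kernels, weights):
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]

# Toy data: the label is the sign of the first feature; the second is pure noise.
xs = [(1.0, 0.3), (2.0, -0.8), (-1.5, 0.5), (-0.5, -0.2)]
y = [1, 1, -1, -1]
k_feat0 = [[a[0] * b[0] for b in xs] for a in xs]   # linear kernel on feature 0
k_feat1 = [[a[1] * b[1] for b in xs] for a in xs]   # linear kernel on feature 1

scores = [max(alignment(k, y), 0.0) for k in (k_feat0, k_feat1)]
weights = [s / sum(scores) for s in scores]
combined = combine((k_feat0, k_feat1), weights)
print([round(w, 2) for w in weights])  # → [0.84, 0.16]
```

The informative kernel receives most of the weight, which is also how MKL points at relevant features: a near-zero weight flags a kernel (and the features behind it) as irrelevant.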
Structured Output Spaces
In many applications, the classifier has to predict something more complex than the single value produced in binary classification or regression. Further, the label is structured in the sense that there are dependencies and correlations between its individual parts. One approach is to use prior knowledge of the problem to define a structure on the set of labels, using tools such as graphical models. We have applied this to problems such as gene finding (mGene), spliced alignment (PALMA), and image denoising.
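For a chain-structured label (such as a label per position along a genomic sequence), prediction under the model reduces to dynamic programming. The following is a minimal Viterbi decoder with invented emission and transition scores, loosely evoking the gene-finding setting:

```python
def viterbi(emissions, transitions):
    """Most likely label sequence for a chain-structured model.
    emissions[t][s]: score of state s at position t;
    transitions[s][s2]: score of moving from state s to s2."""
    n_states = len(emissions[0])
    score = list(emissions[0])
    back = []
    for em in emissions[1:]:
        new_score, ptr = [], []
        for s2 in range(n_states):
            best_s = max(range(n_states), key=lambda s: score[s] + transitions[s][s2])
            new_score.append(score[best_s] + transitions[best_s][s2] + em[s2])
            ptr.append(best_s)
        score = new_score
        back.append(ptr)
    # Trace the best final state back through the stored pointers.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Two states (0 = intergenic, 1 = exon); transitions discourage rapid switching.
emissions = [[2, 0], [0, 3], [0, 3], [2, 0]]
transitions = [[1, -1], [-1, 1]]
print(viterbi(emissions, transitions))  # → [0, 1, 1, 0]
```

The transition scores are where the structural prior knowledge lives: they couple neighbouring labels, so the decoded sequence is smoother than position-wise classification would give.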
There are currently two main approaches to learning such models discriminatively: conditional random fields and structured output support vector machines. We have shown that they are essentially the same model under different regularizations (entropy versus margin maximization), and we have derived efficient approximations for training such models when the underlying graphical model is intractable (tree approximation).
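The contrast between the two training objectives is easiest to see on a single example with an (invented) enumerated set of candidate outputs: the CRF penalizes the full log-partition function, while the structured SVM penalizes only the margin violation against the cost-augmented best competitor.

```python
from math import exp, log

def crf_loss(scores, true_idx):
    """Negative log-likelihood: -score(y*) + log sum_y exp(score(y))."""
    z = sum(exp(s) for s in scores)
    return -scores[true_idx] + log(z)

def ssvm_loss(scores, true_idx, delta):
    """Structured hinge loss with label cost delta[y] (cost-augmented inference)."""
    augmented = [s + d for s, d in zip(scores, delta)]
    return max(augmented) - scores[true_idx]

# Scores of three candidate structured outputs; the first one is the truth.
scores = [2.0, 1.5, 0.5]
delta = [0.0, 1.0, 2.0]   # Hamming-style cost of predicting each wrong output
print(round(crf_loss(scores, 0), 3))
print(round(ssvm_loss(scores, 0, delta), 3))
```

The CRF loss is nonzero whenever any competitor has mass (entropy flavour), while the hinge loss is zero as soon as the truth beats every competitor by its cost (margin flavour).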
Dynamical Systems Modeling in Neuroscience
Measurements of signals in neural tissue present a particularly challenging scenario for traditional machine learning, since the data are high dimensional and there are very few subjects. We propose to model each subject by a corresponding dynamical system, called a dynamic causal model, which is a neurophysiologically motivated model of brain activity. Our generative embedding approach has resulted in interpretable models and significantly more accurate results in predicting a spectrum disorder (aphasia) in human subjects from fMRI data. We also analysed whisker stimulus and auditory oddball detection from electrical activity in the mouse model. This work is done in collaboration with the University of Zurich and the Wellcome Trust Centre for Neuroimaging, University College London.
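The generative-embedding idea can be shown in miniature with a deliberately simplified scalar AR(1) model in place of a dynamic causal model: fit a dynamical model per subject, then use the fitted parameters, rather than the raw time series, as the feature vector for classification (entirely synthetic data below):

```python
def fit_ar1(series):
    """Least-squares estimate of a in x[t+1] = a * x[t] (a scalar dynamical model)."""
    num = sum(series[t] * series[t + 1] for t in range(len(series) - 1))
    den = sum(x * x for x in series[:-1])
    return num / den

def simulate(a, x0=1.0, n=50):
    xs = [x0]
    for _ in range(n - 1):
        xs.append(a * xs[-1])
    return xs

# Two synthetic "subjects" generated by different dynamics; the fitted
# parameter, not the 50-dimensional raw series, becomes the feature.
subject_fast_decay = simulate(0.5)
subject_slow_decay = simulate(0.9)
features = [fit_ar1(subject_fast_decay), fit_ar1(subject_slow_decay)]
print([round(f, 2) for f in features])  # → [0.5, 0.9]
```

The payoff is twofold: the feature space is low dimensional (mitigating the few-subjects problem), and each feature has a mechanistic interpretation, which is what makes the resulting classifiers interpretable.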
Indefinite Kernels
One of the key requirements in Support Vector Machines (SVMs) is the positive definiteness of the kernel. It turns out that most of the theory still works when one relaxes positive definiteness to merely symmetric but "indefinite" kernels. The corresponding concept is a reproducing kernel Krein space.
Gene Finding
Today, next-generation technologies have rendered genome sequencing an almost routine process, allowing individual scientists to obtain the sequences of their favorite organisms. The task of annotating new genomes may therefore move partly into the domain of individual researchers or laboratories. Consequently, labor-intensive procedures like manual annotation by experts, albeit presumably most precise, are not always affordable, and highly automated computational methods are called upon to fill the gap. We developed mGene, a complete discriminative gene finder.
Alternative Splicing
Alternative splicing has been linked to the complexity of higher organisms. However, the incidence of alternative splicing is still unclear, and the mechanisms are not well understood. Using information from publicly available genome and EST databases, we constructed a database of transcription units, summarized as splicegraphs, for several model organisms. From the splicegraphs, we identify exon skipping, intron retention, and alternative 5' and 3' splicing, and use these transcript-confirmed events to train an SVM to predict novel events.
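One of these event types is easy to read off a splicegraph directly. Representing the graph as directed edges between exons (a toy graph below, not data from the actual database), exon skipping appears as a triangle: a path a→b→c together with a direct edge a→c.

```python
def exon_skipping_events(edges):
    """Find triples (a, b, c) where both a->b->c and the direct edge a->c exist,
    i.e. some transcripts include exon b and others skip it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    events = []
    for a in adj:
        for b in adj[a]:
            for c in adj.get(b, ()):
                if c in adj[a]:
                    events.append((a, b, c))
    return events

# Splicegraph over four exons: exon "e2" is skipped by some transcripts.
edges = [("e1", "e2"), ("e2", "e3"), ("e1", "e3"), ("e3", "e4")]
print(exon_skipping_events(edges))  # → [('e1', 'e2', 'e3')]
```

Intron retention and alternative 5'/3' events correspond to other small subgraph patterns in the same representation, which is what makes the splicegraph a convenient summary for event extraction.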
Protein Subcellular Localization
Protein subcellular localization is a crucial ingredient to many important inferences about cellular processes, including prediction of protein function and protein interactions. We investigate the problem of predicting the subcellular localization of a protein from its peptide sequence. We propose a general class of protein sequence kernels which considers all motifs, including motifs with gaps, and use multiple kernel learning to optimize over many kernels to obtain state of the art results.
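The simplest member of this kernel family, the contiguous k-mer (spectrum) kernel, already conveys the idea: represent each sequence by its motif counts and take the inner product, so no alignment is needed. The full class described above also handles motifs with gaps; the sequences below are made up for illustration.

```python
from collections import Counter

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    """Inner product of k-mer count vectors: sequence similarity without alignment."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1)

a = "MKLVVLGAGG"   # toy peptide sequences
b = "MKLVVLSSGG"
c = "PPPQQQRRRS"
print(spectrum_kernel(a, b), spectrum_kernel(a, c))  # → 4 0
```

With one such kernel per motif length (and per gap pattern), multiple kernel learning can then weight them, revealing which motif classes are most predictive of each localization compartment.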
Spliced Alignment
Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing, and micro-exons, the correct alignment of mRNA sequences to genomic DNA remains a challenging task. We present a novel approach based on large margin learning that combines accurate splice site predictions with common sequence alignment techniques. By solving a convex optimization problem, our algorithm -- called PALMA -- tunes the parameters of the model such that true alignments score higher than other alignments. We study the accuracy of alignments of mRNAs containing artificially generated micro-exons to genomic DNA.
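The underlying scoring machinery that such parameters feed into is classical dynamic-programming alignment. The sketch below shows plain Needleman-Wunsch global alignment with match/mismatch/gap parameters (PALMA itself additionally incorporates splice site predictions and learns these parameters by convex optimization; the sequences and parameter values are illustrative):

```python
def alignment_score(s1, s2, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score under a given parameter set."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # (mis)match
                           dp[i - 1][j] + gap,       # gap in s2
                           dp[i][j - 1] + gap)       # gap in s1
    return dp[n][m]

mrna = "ACGTAC"
genome_region = "ACGGTAC"   # contains one extra base, forcing a gap
print(alignment_score(mrna, genome_region))  # → 10
```

Large margin learning then adjusts parameters like these so that, for each training pair, the true alignment outscores all alternative alignments, which is what the convex optimization in PALMA enforces.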