Abstract:
Learning Representations for Molecular Sequences

Sequence comparison is a vital step in bioinformatics tasks such as annotation of molecules, phylogeny construction, and sequence retrieval. Methods for sequence comparison are broadly divided into alignment-based and alignment-free approaches. Alignment-free methods offer computational advantages, at some cost in sensitivity, and usually rely on high-dimensional vector representations based on the bag-of-words model. This representation discards contextual information from the sequences. Recent work on representation learning in Natural Language Processing (NLP) has gained wide popularity across a number of domains. These methods build on the “distributional hypothesis”, the idea that words occurring in similar contexts tend to be semantically related, and provide a way to generate low-dimensional distributed representations of words or sentences while taking contextual information into account. Motivated by this work in NLP, this thesis aims to address the limitations of alignment-based and alignment-free methods by developing representation learning methods for bioinformatics tasks. For many problems in bioinformatics, the available metadata also plays a role in biological inference. Representation learning frameworks allow metadata to be accommodated along with contextual information when generating sequence representations. More specifically, this research aims to develop scalable, computationally efficient and competitive alignment-free solutions for bioinformatics problems such as protein classification, retrieval, and protein-protein interaction prediction.
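To make the limitation of the bag-of-words model concrete, the following is a minimal sketch, not code from the thesis. The function name `kmer_counts` and the toy sequences are assumptions chosen for illustration. It treats overlapping k-mers as "words" and shows that two different sequences can receive identical count vectors, because the order in which the k-mers occur is discarded.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers ("words") in a sequence.

    Hypothetical helper for illustration; the result is a sparse
    bag-of-words vector keyed by k-mer.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two different toy sequences with the same multiset of 2-mers.
a = kmer_counts("AAGA", k=2)  # AA, AG, GA
b = kmer_counts("AGAA", k=2)  # AG, GA, AA

# The bag-of-words view cannot tell the two sequences apart.
print(a == b)  # True
```

The dense form of this vector has one dimension per possible k-mer, so for the 20 standard amino acids and k = 3 it already has 8000 dimensions, which is the high dimensionality referred to above.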
The main contributions of this thesis are as follows:
• Seq2Vec – a new unsupervised framework for learning useful low-dimensional representations of molecular sequences.
• SuperVec and SuperVecX – novel approaches for fusing metadata and sequence information to generate improved embeddings of molecular sequences.
• H-SuperVec(X) – a hierarchical algorithm utilising learned representations for sequence comparisons, achieving performance comparable to alignment-based approaches at substantially lower computational cost.
• A hybrid system that utilises these alignment-free approaches as a rapid pre-processing filter to reduce the candidate set for an alignment-based algorithm, yielding a substantial speed-up in the overall process.
• Demonstrated utility of these representation-learning-based approaches for a variety of bioinformatics problems, e.g., protein family prediction, protein-protein interaction prediction and homologous sequence retrieval.

The representation learning and task-specific approaches proposed in this thesis are generic and can be adapted for similar problems within bioinformatics and other domains.
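The filter-then-align idea behind the hybrid system can be sketched as follows. This is an illustrative outline under stated assumptions, not the thesis implementation. The `embed` function uses k-mer counts as a stand-in for a learned embedding such as Seq2Vec, and `expensive_score` uses Python's `difflib.SequenceMatcher` as a stand-in for an alignment-based scorer. All function names, the toy database and the `shortlist_size` parameter are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def embed(seq, k=2):
    # Stand-in for a learned embedding (assumption): sparse k-mer counts.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expensive_score(a, b):
    # Stand-in for an alignment-based scorer (assumption, not a real aligner).
    return SequenceMatcher(None, a, b).ratio()


def hybrid_search(query, database, shortlist_size=2):
    q = embed(query)
    # Stage 1: cheap vector comparison against every database sequence.
    shortlist = sorted(database, key=lambda s: cosine(q, embed(s)),
                       reverse=True)[:shortlist_size]
    # Stage 2: costly comparison only on the reduced candidate set.
    return sorted(shortlist, key=lambda s: expensive_score(query, s),
                  reverse=True)


# Toy database of made-up sequences.
database = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGSGGGGS", "PLVVWDERTA"]
hits = hybrid_search("MKTAYIAKQR", database, shortlist_size=2)
print(hits)
```

In this sketch the expensive scorer runs on `shortlist_size` candidates rather than on the whole database, which is where a speed-up of this kind would come from. In practice the database embeddings would be computed once and stored rather than recomputed for every query.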