next on phyloseminar.org
To attend a seminar, please visit our YouTube channel.
Machine learning 2
Phyloformer: towards fast and accurate phylogeny estimation with self-attention networks

An important problem in molecular evolution is phylogenetic reconstruction: given a set of sequences descending from a common ancestor, reconstruct the binary tree describing their evolution from that ancestor. State-of-the-art methods for the task, namely maximum likelihood and Bayesian inference, have a high computational cost, which limits their usability on large datasets. Recently, researchers have begun investigating deep learning approaches to the problem, but so far these attempts have been limited to the reconstruction of quartet tree topologies, addressing phylogenetic reconstruction as a classification problem. We present here a radically different approach with a transformer-based network architecture that, given a multiple sequence alignment, predicts all the pairwise evolutionary distances between the sequences, which in turn allow us to accurately reconstruct the tree topology with standard distance-based algorithms. The architecture and its high degree of parameter sharing allow us to apply the same network to alignments of arbitrary size, both in the number of sequences and in their length. We evaluate our network, Phyloformer, on two types of simulations and find that its accuracy matches that of a maximum likelihood method on datasets that resemble training data, while being significantly faster.
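To illustrate the final step described above, going from pairwise distances to a tree, here is a minimal sketch of neighbor joining, one standard distance-based algorithm. The abstract does not say which distance-based algorithm is used, so this choice is an assumption. The function name `neighbor_joining`, the taxon labels, and the small distance matrix are also illustrative: the matrix is hand-made, not a network prediction.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree (Newick string) from a symmetric distance matrix.

    names: list of at least three unique leaf labels.
    dist:  square list of lists, dist[i][j] = distance between names[i] and names[j].
    """
    nodes = list(names)
    # Dict-of-dicts keyed by node label, so merged nodes can be added freely.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}

    while len(nodes) > 3:
        n = len(nodes)
        row_sum = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair minimising the neighbor-joining Q criterion.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - row_sum[p[0]] - row_sum[p[1]],
        )
        # Branch lengths from the two joined nodes to their new parent.
        len_a = 0.5 * d[a][b] + (row_sum[a] - row_sum[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        parent = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distances from the new parent to every remaining node.
        d[parent] = {parent: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[parent][c] = d[c][parent] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [parent]

    # Connect the last three nodes to a single central node.
    a, b, c = nodes
    len_a = 0.5 * (d[a][b] + d[a][c] - d[b][c])
    len_b = 0.5 * (d[a][b] + d[b][c] - d[a][c])
    len_c = 0.5 * (d[a][c] + d[b][c] - d[a][b])
    return f"({a}:{len_a:.3f},{b}:{len_b:.3f},{c}:{len_c:.3f});"


# Toy input: five taxa with a hand-made distance matrix.
taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(taxa, matrix)
print(tree)
```

On this toy matrix, taxa `a` and `b` are joined first, with branch lengths 2 and 3. In the approach described in the abstract, the matrix would instead hold the distances predicted by the network from the alignment.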
The tree reconstruction game: phylogenetic reconstruction using reinforcement learning
We propose a reinforcement-learning algorithm to tackle the challenge of reconstructing phylogenetic trees. The search for the tree that best describes the data is algorithmically challenging, so all current algorithms for phylogeny reconstruction use various heuristics to make it feasible. In this study, we demonstrate that reinforcement learning can be used to learn an optimal search strategy, thus providing a novel paradigm for predicting the maximum-likelihood tree. Our proposed method does not require a likelihood calculation at every step, nor is it limited to greedy uphill moves in the likelihood space. We demonstrate the use of the developed deep-Q-learning agent on a set of unseen empirical data, namely, on unseen environments defined by nucleotide alignments of up to 20 sequences. Our results show that the likelihood scores of the inferred phylogenies are similar to those obtained from widely used software. This establishes a proof of concept that it is beneficial to optimize a sequence of moves in the search space, rather than only the progress made in each single move. This suggests that a reinforcement-learning-based method provides a promising direction for phylogenetic reconstruction.
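For readers unfamiliar with Q-learning, the contrast between optimizing a sequence of moves and optimizing each move in isolation can be made concrete with the textbook Q-learning update. The mapping to phylogenetics here is an assumption for illustration, not taken from the abstract: the state s_t is taken to be the current tree, the action a_t a tree-rearrangement move, and the reward r_{t+1} some measure of likelihood improvement. The learning rate \alpha and discount factor \gamma are the standard symbols.

```latex
% Discounted return: the quantity the agent tries to maximise over a whole sequence of moves.
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

% Standard Q-learning update for the action-value function.
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Bigl[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Bigr]
```

With \gamma = 0 the update only values the immediate reward, which corresponds to a greedy move-by-move search. With \gamma > 0 a move that lowers the immediate reward can still be preferred if it leads to better trees later. In deep Q-learning, Q is approximated by a neural network rather than stored in a table.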