next on phyloseminar.org
To attend a seminar, please visit the livestream portion of our YouTube channel.
Next-generation sequence evolution models
Deep Models of Protein Evolution

Models of protein evolution seek to quantify how proteins change over time while experiencing intricate constraints and adapting to new functions. These models are the engine of phylogenetics, enabling, among other applications, phylogenetic tree reconstruction and ancestral sequence inference. Classical and contemporary approaches to protein sequence modeling only partly address each other's shortcomings: the gold-standard classical models (e.g. WAG, LG) are limited by the need to treat sites in a protein sequence as evolving independently, while deep protein language models can account for interactions between sites but lack an explicit time component. Here, we tackle this challenge by introducing a framework for training deep evolutionary models on protein family trees. By constructing comprehensive training datasets, we are able to train a deep generative model that bridges this methodological gap by modeling evolutionary transitions on unaligned sequence pairs, capturing the full spectrum of evolutionary forces, including insertions and deletions. Our model, termed PEINT (Protein Evolution IN Time), significantly outperforms classical evolutionary approaches and enables realistic simulations of evolutionary trajectories. This advance opens new possibilities to understand and harness evolution for protein design, variant effect prediction, viral evolution forecasting, and statistical phylogenetics.
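To make the site-independence assumption of the classical models concrete, here is a minimal sketch of a reversible substitution model: a rate matrix Q is built from symmetric exchangeabilities and stationary frequencies, the transition probabilities are P(t) = exp(Qt), and the likelihood of an aligned pair is a product over sites. This is an illustration under stated assumptions, not the WAG or LG parameterization and not PEINT: the 4-letter alphabet, the random exchangeabilities, and all function names are hypothetical.

```python
import numpy as np

def make_rate_matrix(exchange, freqs):
    """Reversible rate matrix with Q[i, j] = exchange[i, j] * freqs[j] for i != j."""
    q = exchange * freqs[None, :]
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))  # each row sums to zero
    # Rescale so that one unit of time is one expected substitution per site.
    rate = -np.sum(freqs * np.diag(q))
    return q / rate

def transition_matrix(q, freqs, t):
    """P(t) = exp(Q t), computed through the symmetrized form of a reversible Q."""
    sqrt_pi = np.sqrt(freqs)
    sym = (sqrt_pi[:, None] * q) / sqrt_pi[None, :]  # D^(1/2) Q D^(-1/2)
    evals, evecs = np.linalg.eigh(sym)
    exp_sym = (evecs * np.exp(evals * t)) @ evecs.T
    return (exp_sym / sqrt_pi[:, None]) * sqrt_pi[None, :]

def pair_log_likelihood(x, y, q, freqs, t):
    """log P(y | x, t) for aligned sequences; sites contribute independently."""
    p = transition_matrix(q, freqs, t)
    return float(np.sum(np.log(p[x, y])))

# Toy 4-state example (hypothetical values, not taken from WAG or LG).
rng = np.random.default_rng(0)
upper = np.triu(rng.uniform(0.5, 2.0, size=(4, 4)), k=1)
exchange = upper + upper.T
freqs = np.array([0.1, 0.2, 0.3, 0.4])
Q = make_rate_matrix(exchange, freqs)

x = np.array([0, 1, 2, 3, 0])  # ancestral sequence as state indices
y = np.array([0, 1, 2, 0, 0])  # descendant sequence, one site changed
loglik = pair_log_likelihood(x, y, Q, freqs, t=0.3)
```

Because the log-likelihood is a plain sum over sites, a model of this form cannot express interactions between positions, and it requires the two sequences to be aligned beforehand.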
Modeling sequence evolution by learning epistatic terms from protein families

The use of coevolutionary information, i.e. the knowledge of amino acid or nucleotide interactions maintained through evolution in each protein family, has provided a framework to study structural biology in a predictive manner. Amino acid couplings, obtained from inference algorithms like Direct Coupling Analysis, have been fundamental to predicting non-local interactions in protein three-dimensional structure and are now at the core of state-of-the-art protein structure prediction algorithms. Coevolutionary information has also had an impact on understanding conformational plasticity, predicting complexes in molecular interactions, characterizing specificity in signal transduction, and predicting the effect of mutations. In this seminar I will discuss new directions in the study of global models of sequence evolution, primarily focused on the integration of theory and experiment. I will present recent advances in my lab integrating the statistical mechanics of sequences to develop a model of sequence evolution called Sequence Evolution with Epistatic Contributions (SEEC), which unifies several statistical features of independent models by integrating epistatic contributions and context dependence into the formulation of evolutionary landscapes. I will then show how this model can produce, in silico, evolved sequences that are functional in vivo, even after a large number of mutational events proposed by the dynamics of the model.
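The idea of context dependence can be illustrated with a generic Potts model of the kind inferred by Direct Coupling Analysis, where each site has a field h_i(a) and each pair of sites a coupling J_ij(a, b). The sketch below computes the probability of each residue at one site given the rest of the sequence, and uses it to propose mutations. It is a minimal illustration and not the SEEC implementation: the parameters are hand-set toy values, and `site_conditional` and `evolve` are hypothetical names.

```python
import numpy as np

def site_conditional(seq, i, fields, couplings):
    """P(residue a at site i | all other sites) under a Potts model.

    fields has shape (L, q); couplings has shape (L, L, q, q), with
    couplings[i, j, a, b] the coupling between a at site i and b at site j.
    """
    logits = fields[i].copy()
    for j, b in enumerate(seq):
        if j != i:
            logits += couplings[i, j, :, b]  # epistatic contribution of site j
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def evolve(seq, fields, couplings, n_steps, rng):
    """Repeatedly pick a random site and resample it given its context."""
    seq = np.array(seq, copy=True)
    n_states = fields.shape[1]
    for _ in range(n_steps):
        i = rng.integers(len(seq))
        seq[i] = rng.choice(n_states, p=site_conditional(seq, i, fields, couplings))
    return seq

# Toy model (hypothetical): 3 sites, 2 states, sites 0 and 1 prefer to match.
L, q = 3, 2
fields = np.zeros((L, q))
couplings = np.zeros((L, L, q, q))
couplings[0, 1] = 2.0 * np.eye(q)
couplings[1, 0] = couplings[0, 1].T

p_given_0 = site_conditional([0, 0, 0], 0, fields, couplings)  # site 1 holds state 0
p_given_1 = site_conditional([0, 1, 0], 0, fields, couplings)  # site 1 holds state 1
evolved = evolve([0, 0, 0], fields, couplings, n_steps=50, rng=np.random.default_rng(0))
```

With all couplings set to zero, the conditional reduces to a per-site distribution that ignores context, which is the independent-site limit.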
Reconstruction of ancestral protein sequences using autoregressive generative models

Ancestral sequence reconstruction (ASR) is an important tool to understand how protein structure and function changed over the course of evolution. It essentially relies on models of sequence evolution that can quantitatively describe changes in a sequence over time. Such models usually assume that sequence positions evolve independently of each other and neglect epistasis: the context-dependence of the effect of mutations. On the other hand, recent years have seen major developments in the field of generative protein models, which learn constraints associated with structure and function from large ensembles of evolutionarily related proteins. Here, we show that it is possible to extend a specific type of generative model to describe the evolution of sequences in time while taking epistasis into account. We apply the developed technique to the problem of ASR: given a protein family and its evolutionary tree, we try to infer the sequences of extinct ancestors. Using both simulations and data from experimental evolution, we show that our method outperforms state-of-the-art methods. Moreover, it allows for sampling a greater diversity of potential ancestors, allowing for a less biased characterization of ancestral sequences.
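As a point of comparison, the independent-site baseline for ASR can be sketched in its simplest setting: two observed leaf sequences joined at a single ancestor, with every site treated separately. The example assumes an equal-rates substitution model with a closed-form transition matrix and a uniform prior on the ancestral state. It is a hedged toy, not the autoregressive method described in the abstract, and the function names are hypothetical.

```python
import numpy as np

def equal_rates_transition(n_states, t):
    """Closed-form P(t) when all substitutions are equally likely.

    Time is measured in expected substitutions per site.
    """
    decay = np.exp(-n_states * t / (n_states - 1))
    p = np.full((n_states, n_states), (1.0 - decay) / n_states)
    p[np.diag_indices(n_states)] = 1.0 / n_states + (1.0 - 1.0 / n_states) * decay
    return p

def root_posterior(leaf_a, leaf_b, t_a, t_b, n_states):
    """Per-site posterior over the ancestral state of two leaves.

    Returns an array of shape (n_sites, n_states). The uniform prior matches
    the stationary distribution of the equal-rates model.
    """
    p_a = equal_rates_transition(n_states, t_a)
    p_b = equal_rates_transition(n_states, t_b)
    # joint[site, state] = prior(state) * P(leaf_a | state) * P(leaf_b | state)
    joint = p_a[:, leaf_a].T * p_b[:, leaf_b].T / n_states
    return joint / joint.sum(axis=1, keepdims=True)

# Toy data (hypothetical): 4 states, 3 aligned sites.
leaf_a = np.array([0, 2, 1])
leaf_b = np.array([0, 2, 3])
post = root_posterior(leaf_a, leaf_b, t_a=0.1, t_b=0.1, n_states=4)

# Point estimate: the most probable state at each site.
map_ancestor = post.argmax(axis=1)

# Alternative: sample ancestors from the posterior instead of taking the maximum.
rng = np.random.default_rng(0)
sampled = np.array([rng.choice(4, p=site_post) for site_post in post])
```

At the third site the leaves disagree and the branch lengths are equal, so the two observed states receive equal posterior weight; taking the argmax hides that ambiguity, whereas sampling from the posterior keeps it.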