# Previously recorded seminars

## Machine learning

Simultaneous protein coevolution and phylogeny prediction using a differentiable tree structure

Protein coevolution and phylogeny are useful properties of genes, and there exist well-known algorithms to infer both. However, phylogeny influences coevolution inference and vice versa. I propose a method of inferring both at the same time to make both inferences more accurate. In order to do so, I have created a differentiable tree structure that uses a hyperbolic space embedding to smoothly infer joint phylogeny and coevolution information. In this presentation, I will describe the structure and evaluate its efficacy.

Distance-based phylogenetic placement: from traditional distances to deep learning

Phylogenetic placement, the addition of a query sequence onto an existing backbone phylogeny, has been studied extensively using methods such as maximum likelihood. In this talk, we explore the distance-based approach to phylogenetic placement with two goals: (1) formulating placement as a least-squares problem that can be solved efficiently (in linear time), and (2) exploring both traditional and new exciting directions for computing distances. We show that the distance-based framework is versatile. For example, it enables placement with or without alignments, a feature that enables applications of placement to unassembled data. We also discuss how instead of using traditional Markov models of sequence evolution, we can use machine learning to train neural network models directly from the data in a way that enables distance calculation. When the sequence data have not evolved on the backbone tree under traditional models, the machine learning method has the potential to increase the accuracy of distances. In particular, we show that the deep learning approach paired with distance-based placement, implemented in a method called DEPP, enables insertion onto a species tree using data from one or a handful of genes rather than genome-wide data. We end by showing how DEPP can be used to combine 16S and metagenomic data by inserting them into a single tree.
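
As a rough illustration of the least-squares formulation mentioned in this abstract, a generic weighted least-squares placement criterion can be written as below. The notation is an assumption made for this sketch and is not taken from the talk: $\delta_{qi}$ is the sequence-based distance between the query $q$ and backbone leaf $i$, $d_{qi}(P)$ is the path length between them in the tree once the query is attached at placement $P$, and $w_i$ is a weight.

$$
\min_{P} \; \sum_{i=1}^{n} w_i \left(\delta_{qi} - d_{qi}(P)\right)^2, \qquad w_i = \delta_{qi}^{-k}
$$

Under these assumptions, $k = 0$ gives ordinary least squares and $k = 2$ gives the classic Fitch-Margoliash weighting, which down-weights long and therefore noisier distances.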

Exploring the potential of deep learning for faster and more accurate phylogenetic inference

Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.

## Birth-death identifiability

Phylogenies of extant species are widely used to study past diversification dynamics. Over the last three decades, new phylogenetic tools have been developed that allow accounting for variations in diversification rates in time and across lineages. Important efforts have also been made to extend these tools in order to incorporate information from the fossil record. Despite their utility, these tools are regularly given bad press, on the basis that they may not infer true diversification histories. I clarify here what we can expect from models, and argue that the birth-death models used to infer diversification dynamics behave as expected. I also argue that the current approach in the field, which models speciation and extinction rates, provides the proper framework to make progress in our understanding of past diversification histories.

Birth-death congruence classes can be collapsed using Bayesian shrinkage priors

Recently, the discovery of congruent sets of phylogenetic birth-death processes has raised a series of questions as to what diversification patterns we can infer from phylogenetic trees and which are indistinguishable. Many phylogenetic trees are estimated solely from extant samples, and thus yield ultrametric trees that lack critical information (i.e., extinction events) about the tempo of diversification histories. This is the crux of the problem, and makes models in the same congruence class statistically unidentifiable. However, the general behaviour for such classes is not well known. How similar are these classes, are they easy to construct, and is it possible to mimic any plausible diversification history within one single class? To answer this, we simulate a series of phylogenetic trees, both within and across congruence classes, and investigate their properties using state-of-the-art Bayesian inference methods. Our results show that inference with Bayesian shrinkage priors does not produce an arbitrary model from the congruence class. Instead, using Bayesian shrinkage priors collapses the congruence class, yielding a single, simplest model in accordance with the prior expectations. Thus, diversification rates can be inferred from molecular phylogenies when realistic priors are used.

Fundamental limits to diversification rate inferences

Time-calibrated phylogenies of extant species are widely used for estimating historical diversification dynamics. However, since the original development of methods to do so, there has been controversy over the validity of this approach. Over the years, there have been a number of empirical case studies and simulation experiments that have sought to characterize the reliability of diversification rate estimators but it has been difficult to make general claims about when we might be able to accurately distinguish between alternative historical scenarios and when we cannot. In a recent paper, we mathematically solved this problem and, in doing so, proved the existence of vast ‘congruent’ sets of alternative diversification scenarios that cannot be distinguished using extant timetrees alone. In this talk, we will present the derivations of our key results and provide some perspective on our statistical approach to the problem, which will highlight how and why our findings differ from previous work. We will show that far from being a mathematical curiosity, our results fundamentally challenge previous interpretations of empirical data and demonstrate this with some new analyses. We will end the talk by exploring whether our approach and our reasoning could be applied to related types of phylogenetic models.

## SARS-CoV-2

Efficient methods for the phylodynamic analysis of large SARS-CoV-2 datasets

The current SARS-CoV-2 pandemic has prompted an unprecedented, global sequencing effort, which has greatly shaped our understanding of how the virus has spread around the world. However, attempts to incorporate large-scale genetic data into epidemiological investigations remain a significant challenge. The relatively slow evolutionary rate of the virus combined with intense sampling makes estimating a single, resolved phylogenetic tree difficult if not impossible. Additionally, Bayesian approaches, which naturally account for topological uncertainty, are untenable due to the size of tree-space to be explored as well as the computational burden of calculating the likelihood of large candidate trees. In order to make complex phylodynamic models tractable in this setting, we have extended a classic Bayesian method for estimating time trees from phylogenies which simplifies the likelihood calculation and constrains tree-space. Combined with a number of computational optimisations recently implemented in BEAST, this approach allows for efficient, complex analyses of datasets with tens of thousands of genomes. We have applied these methods to the large dataset produced by the COG-UK consortium to investigate SARS-CoV-2 import into the UK as well as its subsequent spread.

Evolutionary origins of SARS-CoV-2 and tracking its spread using phylodynamic data integration

I will describe a collaborative effort to address evolutionary questions on the recent emergence of human coronavirus SARS-CoV-2 including the role of reservoir species, the role of recombination, and its time of divergence from animal viruses. Our findings indicate that sarbecoviruses – the viral subgenus containing SARS-CoV and SARS-CoV-2 – undergo frequent recombination and exhibit spatially structured genetic diversity on a regional scale in China. Contrary to other analyses, we find that SARS-CoV-2 itself is not a recombinant of any sarbecoviruses detected to date, and its receptor binding motif, important for specificity to human ACE2 receptors, appears to be an ancestral trait shared with bat viruses, not one acquired recently via recombination. Divergence dates between SARS-CoV-2 and the bat sarbecovirus reservoir indicate that the lineage giving rise to SARS-CoV-2 has been circulating unnoticed in bats for decades.

Following the emergence of the virus, unprecedented sequencing efforts have resulted in the accumulation of more than 100,000 genome sequences sampled globally. Despite this rich source of information, evolutionary reconstructions are hindered by the slow accumulation of sequence divergence over its relatively short history of transmission and by the spatiotemporal bias in genome sampling. I will describe a phylodynamic data integration approach in a Bayesian framework and demonstrate how it helps address questions about SARS-CoV-2 molecular epidemiology.

Real time tracking for real-life pandemics: Nextstrain and SARS-CoV-2

The emergence of SARS-CoV-2 has driven an enormous global effort to contribute and share genomic data in order to inform local authorities and the international community about key aspects of the outbreak. Analyses of these data have played an important role in tracking the epidemiology and evolution of the virus in real-time.

Nextstrain (nextstrain.org) is an open science initiative to harness the scientific and public health potential of pathogen genome data, and has previously provided key insight into outbreaks of Ebola and Zika, and longer-term pathogen spread of Influenza and Enterovirus. It provides a continually-updated view of publicly available data alongside powerful analytic and visualization tools for use by the community.

The Nextstrain team has been maintaining an up-to-date analysis of SARS-CoV-2 at nextstrain.org/ncov since 20 Jan 2020. In this talk, I'll discuss the realisation of 'real-time tracking' with SARS-CoV-2 and what genetic epidemiology has allowed us to uncover about the virus' spread. I'll also discuss some of the challenges Nextstrain has faced in processing and displaying large amounts of real-time data with unprecedented public attention, and how the move from 'global' to 'local' focus is presenting new challenges.

## Tree sequences

Inferred genealogies are a reconstruction of our shared genetic past; embedded in them is information about key evolutionary events that have shaped the genetic variation observed in present-day individuals. Stored in tree sequence format, tree-based inferences can be extremely powerful and computationally efficient, allowing for joint, genome-wide analyses of large sample sizes. This seminar will aim at giving an overview of the quickly growing set of tools that extract information about our evolutionary past from these trees. We will focus particularly on Relate and tsinfer and discuss tree-based inferences of genetic structure, demographic histories, and natural selection, among others.

Inferring evolutionary trees from genetic data is a classic and hard problem in biology. We have recently published an algorithm for estimating gene tree topologies along a genome which scales to millions of trees with millions of tips. We have also just released a fast method to date (that is, to infer branch lengths for) these trees.

Both methods are based on the concept of a "tree sequence": a recently-introduced efficient format to store the correlated evolutionary trees that describe the full genomic ancestry of a set of genomes. While initially developed to describe the genetic history of individuals within a species, tree sequences are also starting to be used to analyse relationships between individuals from multiple closely related species.

The topology estimation method, called "tsinfer" (tree sequence inference), is composed of two heuristic steps. The first step uses the distribution of mutations to estimate partial fragments of ancestral DNA sequence, roughly ordering them by age. The second step uses a highly optimised tree-based HMM algorithm to match fragments against each other and build up a tree-like structure at each position in the genome. The resulting structure can be thought of as a large network of nodes connected by edges, where each edge connects a child node to a parent node over a specific portion or "span" of the genome (this is similar to the Ancestral Recombination Graph, or ARG).
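
To make the "edges with spans" description concrete, here is a minimal sketch of how a set of edges can encode different trees at different genome positions. It is a toy illustration only, not tsinfer or tskit code: the `Edge` class, the `local_tree` helper, and the example node IDs and coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    left: float    # start of the genomic span (inclusive)
    right: float   # end of the genomic span (exclusive)
    parent: int    # ID of the parent (ancestral) node
    child: int     # ID of the child node


def local_tree(edges, position):
    """Return a {child: parent} map for the tree covering one genome position."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


# Hypothetical example: samples 0, 1, 2 on a genome of length 10.
# Sample 2 switches parent at position 5, so two distinct trees are encoded.
edges = [
    Edge(0, 10, parent=3, child=0),
    Edge(0, 10, parent=3, child=1),
    Edge(0, 5, parent=4, child=2),
    Edge(0, 5, parent=4, child=3),
    Edge(5, 10, parent=3, child=2),
]

left_tree = local_tree(edges, 2.0)   # tree covering positions [0, 5)
right_tree = local_tree(edges, 7.0)  # tree covering positions [5, 10)
```

The two edges for samples 0 and 1 are stored once but shared by both trees, which is the source of the compactness described in the abstract.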

The challenge of estimating branch lengths on trees in a tree sequence can be restated in terms of placing dates on tree sequence nodes. Our dating method, "tsdate", treats the date of each ancestral node as a hidden state, and estimates the probability of different node ages by considering the inferred tree sequence as a Bayesian network. In particular, each node is given a prior probability of being in one of a fixed number of time slices, then a two-pass algorithm iterates over all the edges in the tree sequence, updating these priors to take account of the mutations and span associated with each edge.

In this seminar I will describe the workings of both the tsinfer and tsdate algorithms, demonstrate their speed and accuracy, discuss their limitations, and show their application to a variety of genetic data sets, including ones with historically sampled individuals, and with multiple species present. I will also describe a new topology-based measure, based on genealogical nearest neighbours, which naturally summarises genome-wide relationships between individuals in a tree sequence.

The tree sequence data structure is a concise encoding of whole-genome ancestry and sequence data, with a rapidly maturing software ecosystem. Tree sequences encode the genealogical history of individuals subject to recombination and gene conversion, providing an efficient means of working with phylogenetic networks and other applications with a linear sequence of correlated trees. The tskit (tree sequence toolkit) library is a comprehensive framework for navigating and manipulating tree sequences using Python and C APIs and also hosts a unified interface for efficiently calculating a growing number of summary statistics. The ecosystem developing around this central technology now includes several genome simulators, a highly-scalable method for inferring ancestry from data, tsinfer, and an efficient method for dating ancestral nodes in a tree sequence topology, tsdate. In this primer session, I will introduce tskit and the tree sequence data structure as well as use downloadable Jupyter Notebooks to demonstrate the simulation of genomic datasets with msprime and the calculation of population genetic statistics with tskit.

## Transmission trees

Pairwise regression, phylogenetics, and epidemiologic methods for infectious disease transmission

Pairwise survival analysis handles dependent happenings in infectious disease transmission data by analyzing failure times in ordered pairs of individuals. The contact interval in the pair ij is the time from the onset of infectiousness in i to infectious contact from i to j, where an infectious contact is sufficient to infect j if he or she is susceptible. The survival function of the contact interval distribution determines transmission probabilities, and its hazard function determines the infectiousness profile of infected individuals. Many important questions in infectious disease epidemiology involve the effects of covariates (e.g., age or vaccination status) on the risk of transmission. Effects on infectiousness and susceptibility can be estimated simultaneously using parametric and semiparametric regression models that account for both person-to-person transmission and infection from external sources. Partial information on who-infected-whom from pathogen genetic sequences can be incorporated into these analyses, which improves precision and reduces bias. Finally, we will discuss how these methods suggest new approaches to the development of epidemiologic methods for infectious diseases.
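
The link between the contact interval's hazard, its survival function, and a transmission probability can be sketched as follows. This assumes a Weibull contact interval distribution purely for illustration; the function names and parameter values are hypothetical and not taken from the talk.

```python
import math


def weibull_hazard(t, rate, shape):
    """Hazard h(t) of infectious contact at time t > 0 since onset of infectiousness.

    With shape == 1 the hazard is constant (exponential contact interval).
    """
    return shape * rate * (rate * t) ** (shape - 1)


def weibull_survival(t, rate, shape):
    """Survival S(t) = exp(-(rate * t) ** shape): no infectious contact by time t."""
    return math.exp(-((rate * t) ** shape))


def transmission_probability(infectious_period, rate, shape):
    """Probability that i makes infectious contact with j before i's infectiousness ends."""
    return 1.0 - weibull_survival(infectious_period, rate, shape)


# Hypothetical values: constant hazard of 0.2 per day, 5-day infectious period.
p = transmission_probability(infectious_period=5.0, rate=0.2, shape=1.0)
```

In a regression setting, covariates would typically act on `rate`, which is how effects on infectiousness and susceptibility could enter such a model.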

Inferring phylogenetic and transmission trees from genetic sequence data with phybreak

Genetic sequencing of pathogens is becoming increasingly routine during infectious disease outbreaks. If an outbreak has been closely monitored and all (or most) cases have been sampled, these sequences provide the opportunity to infer who infected whom, which may help in identifying the index case, evaluating outbreak control, and characterising risk factors for infectiousness. However, there is no single path from genetic sequences through a phylogenetic tree to a timed transmission tree, describing who infected whom, and when. In this seminar I will explain how this problem is tackled with phybreak, a model and R package. Phybreak has four submodels, for infection, for sampling, for within-host pathogen dynamics, and for mutation, which together describe the relation between a transmission tree and observed genetic variation. Through an MCMC routine the possible phylogenetic and transmission trees are sampled while integrating over the uncertainty created by all submodels simultaneously. I will cover the original submodels and MCMC procedure (doi.org/10.1371/journal.pcbi.1005495), and show some results obtained with the package. I will finish with some new developments such as multiple samples per host and a wide transmission bottleneck.

Genomic Epidemiology with TransPhylo: methods, applications and limitations

I will describe a Bayesian approach, TransPhylo, and some of its recent extensions. TransPhylo reconstructs who infected whom and when, with the help of pathogen genetic data. It is a two-stage process, in which first, one or more timed phylogenetic trees are reconstructed from sequence data, and then these are augmented with transmission and timing information. As sequencing technologies have dramatically declined in cost, it is now feasible to sequence large numbers of viral or bacterial genomes in infectious disease outbreaks, and there have been high hopes that the resulting DNA or RNA sequences will tell the story of who infects whom and when, leading to both better infectious disease control and a better understanding of pathogen evolution. However, we find that having pathogen sequences does not directly reveal who infected whom -- considerable uncertainty remains. I will outline our main approach and its underlying mathematics, and then I will describe several extensions to include multiple datasets and to handle covariates. I will give some applications and their results, describe the limitations of the method, and discuss open challenges in this area.

## Variational inference

Stochastic Variational Inference for Bayesian Phylogenetics: A Case of CAT Model

The pattern of molecular evolution varies among gene sites and genes in a genome. By taking into account the complex heterogeneity of evolutionary processes among sites in a genome, Bayesian infinite mixture models of genomic evolution enable robust phylogenetic inference. With large modern data sets, however, the computational burden of Markov chain Monte Carlo sampling techniques becomes prohibitive. Here, we have developed a variational Bayesian procedure to speed up the widely used PhyloBayes MPI program, which deals with the heterogeneity of amino acid profiles. Rather than sampling from the posterior distribution, the procedure approximates the (unknown) posterior distribution using a manageable distribution called the variational distribution. The parameters in the variational distribution are estimated by minimizing the Kullback-Leibler divergence. To examine performance, we analyzed three empirical data sets consisting of mitochondrial, plastid-encoded, and nuclear proteins. Our variational method accurately approximated the Bayesian inference of the phylogenetic tree, mixture proportions, and the amino acid propensity of each component of the mixture while using orders of magnitude less computational time.
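
For readers new to the idea, the standard identity behind "minimizing the Kullback-Leibler divergence" is sketched below. The symbols are generic and assumed for this illustration ($D$ for the data, $\theta$ for all model parameters including the tree, $q_\phi$ for the variational distribution with parameters $\phi$); they are not the notation of the talk.

$$
\log p(D) = \underbrace{\mathbb{E}_{q_\phi(\theta)}\left[\log p(D, \theta) - \log q_\phi(\theta)\right]}_{\mathrm{ELBO}(\phi)} + \mathrm{KL}\left(q_\phi(\theta) \,\|\, p(\theta \mid D)\right)
$$

Because $\log p(D)$ does not depend on $\phi$, maximizing the evidence lower bound (ELBO) is equivalent to minimizing the KL divergence to the unknown posterior, and the ELBO can be evaluated without knowing the posterior's normalizing constant.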

Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo with simple mechanisms for proposing new states, which hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We approximate the true posterior using an expressive graphical model for tree distributions, called a subsplit Bayesian network, together with appropriate branch length distributions. We train the variational approximation via stochastic gradient ascent and adopt multi-sample based gradient estimators for different latent variables separately to handle the composite latent space of phylogenetic models. We show that our structured variational approximations are flexible enough to provide comparable posterior estimation to MCMC, while requiring less computation due to a more efficient tree exploration mechanism enabled by variational inference. Moreover, the variational approximations can be readily used for further statistical analysis such as marginal likelihood estimation for model comparison via importance sampling. Experiments on both synthetic data and real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.

An introduction to variational inference in phylogenetics

Markov chain Monte Carlo (MCMC) algorithms have become the workhorse of Bayesian phylogenetic inference since they were introduced in the late 1990s. One alternative to MCMC that has been proposed for Bayesian inference of model parameters is variational Bayes. The main idea behind variational inference is to transform the posterior approximation of an intractable model into an optimization problem using a family of tractable densities. Although variational inference is growing in popularity in the machine learning community, until recently it has received relatively little attention in the field of phylogenetics.

In this talk, I will describe the basic ideas behind variational inference. First, I will introduce variational inference from a statistical perspective. Second, I will show how variational inference can be applied to the phylogenetic problem using the Stan package, a probabilistic programming language. Finally, I will compare the performance and accuracy of variational inference to MCMC-based analyses using diverse phylogenetic models, including time trees and coalescent models.

## Branch-specific diversification inference

Population genetics of adaptation and ecological diversification on substitutable resources

Most mutations are subject to competitive exclusion, and will either come to dominate a population or go extinct. In special cases, a mutant may evade competitive exclusion by exploiting a different ecological niche. Both types of mutations can be found in large microbial populations, yet little is known about how they combine to determine the genealogical structure of a population. In this talk, I will describe some recent theoretical efforts to address this question, focusing on the dynamics that emerge in simple resource competition models. I’ll show how the competition between ecological diversification and fitness evolution leads to an emergent state of diversification-selection balance, in which semi-stable ecotypes are continuously generated and purged by natural selection. The ecological and genealogical structure of this non-equilibrium steady-state can be characterized analytically in simple asymptotic limits, revealing a crucial dependence on the range of genetically accessible phenotypes. I’ll conclude by discussing potential connections to empirical data, both from laboratory evolution experiments and natural populations of bacteria.

Coupling adaptive molecular evolution to phylodynamics using fitness-dependent birth-death models

Beneficial and deleterious mutations cause the fitness of lineages to vary across a phylogeny and thereby shape its branching structure. While standard phylogenetic models do not allow non-neutral mutations to feed back and shape trees, birth-death models can account for this feedback by letting the fitness of lineages depend on their type. To date, however, these multi-type birth-death models have only been applied to cases where a lineage's fitness is determined by a single evolving character state. We extend these models to track the fitness of a lineage and sequence evolution at multiple sites. This approach remains computationally tractable by tracking the fitness and ancestral genotype of lineages probabilistically in an approximate manner. Although approximate, we show that we can accurately estimate the fitness of lineages and even specific mutation effects from phylogenies. We apply this approach to estimate the population-level fitness effects of mutations previously identified to modulate the fitness of Ebola virus and human influenza viruses in the lab.

Population genetics of rapid adaptation and fitness inference from trees

Many large microbial populations are genetically diverse and consist of many strains that compete against each other. Competition for susceptible humans, for example, drives the immune escape dynamics of seasonal influenza viruses. I will discuss how such competitive non-neutral dynamics differ from typical population genetic models based on the Kingman coalescent. If many small-effect mutations contribute to fitness variation in the population, the evolutionary dynamics converge towards a different universal coalescent process known as the Bolthausen-Sznitman coalescent, which generates distinctly different tree ensembles and diversity patterns. I will then show how these insights into the population genetics of rapid adaptation can be used to infer relative fitness of individuals in a population from sequence data sampled at a single time point.
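
As a point of reference for the neutral baseline named in this abstract, the following sketch simulates merger times under the Kingman coalescent, where only pairs of lineages merge. It does not simulate the Bolthausen-Sznitman coalescent, which additionally allows many lineages to merge at once. The function names and the seed are hypothetical choices for this illustration.

```python
import random


def kingman_coalescence_times(n, rng):
    """Simulate the times of the n - 1 pairwise mergers among n sampled lineages.

    Times are in coalescent units: while k lineages remain, the waiting time
    to the next merger is exponential with rate k * (k - 1) / 2.
    """
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2)
        times.append(t)
    return times


def expected_tmrca(n):
    """Expected time to the most recent common ancestor, equal to 2 * (1 - 1/n)."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))


rng = random.Random(1)  # fixed seed so the sketch is reproducible
times = kingman_coalescence_times(10, rng)
```

Under this model most mergers happen quickly near the tips while the last two lineages take, on average, about half of the total tree height to merge, a shape that differs from trees generated under strong selection.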

A Bayesian approach for estimating branch-specific speciation and extinction rates

Species richness varies considerably across the tree of life, which can only be explained by heterogeneous rates of diversification (speciation and extinction). Previous approaches use phylogenetic trees to estimate branch-specific diversification rates. However, all previous approaches disregard diversification-rate shifts on extinct lineages although 99% of species that ever existed are now extinct. Here we describe a lineage-specific birth-death-shift process where lineages, both extant and extinct, may have heterogeneous rates of diversification. To facilitate probability computation we discretize the base distribution on speciation and extinction rates into k rate categories. The fixed number of rate categories allows us to extend the theory of state-dependent speciation and extinction models (e.g., BiSSE and MuSSE) to compute the probability of an observed phylogeny given the set of speciation and extinction rates. To estimate branch-specific diversification rates, we develop two independent and theoretically equivalent approaches: numerical integration with stochastic character mapping and data-augmentation with reversible-jump Markov chain Monte Carlo sampling. We validate the implementation of the two approaches in RevBayes using simulated data and an empirical example study of primates. In the empirical example, we show that estimates of the number of diversification-rate shifts are, unsurprisingly, very sensitive to the choice of prior distribution. In contrast, branch-specific diversification rate estimates are less sensitive to the assumed prior distribution on the number of diversification-rate shifts and consistently infer an increased rate of diversification for Old World Monkeys. Additionally, we observe that as few as 10 diversification-rate categories are sufficient to approximate a continuous base distribution on diversification rates. In conclusion, our implementation of the lineage-specific birth-death-shift model in RevBayes provides biologists with a method to estimate branch-specific diversification rates under a mathematically consistent model.
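
The discretization step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the RevBayes implementation: it assumes a lognormal base distribution and represents each of the k equal-probability categories by its midpoint quantile; the function name and parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def discretize_lognormal(k, log_mean, log_sd):
    """Return k rate values at the midpoint quantiles of a lognormal distribution.

    Category i (0-based) is represented by the quantile (i + 0.5) / k, so each
    category carries equal prior probability 1 / k.
    """
    normal = NormalDist(mu=log_mean, sigma=log_sd)
    return [math.exp(normal.inv_cdf((i + 0.5) / k)) for i in range(k)]


# Hypothetical base distribution: median rate 0.1, log-scale standard deviation 0.5.
rates = discretize_lognormal(k=10, log_mean=math.log(0.1), log_sd=0.5)
```

With a finite set of rate values, each lineage's rate can be treated as a discrete state, which is what makes machinery from state-dependent models such as BiSSE and MuSSE applicable.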

## Non-traditional data

Cellular ‘phylogenetics’ - decoding the developmental history and relationships among individual cells

Multicellular organisms develop by way of a lineage tree, a series of cell divisions that give rise to cell types, tissues, and organs. This pattern mirrors the evolutionary relationships between species, though our knowledge of the cell lineage and its determinants remains extremely fragmentary for nearly all species. This includes all vertebrates and arthropods such as Drosophila, wherein cell lineage varies between individuals. Embryos and organs are often visually inaccessible, and progenitor cells disperse by long-distance migration. We recently pioneered a new paradigm for recording cell lineage and other aspects of developmental history that has the potential to enhance our understanding of vertebrate biology. In brief, we engineer cells to stochastically introduce mutations at specific locations in the genome during development. The resulting patterns of mutations, which can be efficiently queried by massively parallel sequencing, can be used to reconstruct lineage using methods adapted from phylogenetics. We demonstrate our technique by tracing the lineage of tens of thousands of cells within individual zebrafish and Drosophila, relating the lineage of numerous emerging tissue and organ systems.

Reconstructing probabilistic trees of cellular differentiation from single-cell RNA-seq data

Recent advances in single-cell methods have made tangible how individual cell profiles can reflect the imprint of ephemeral or dynamic processes. However, synthesizing this information to reconstruct dynamic biological phenomena – from data that are noisy, heterogeneous, and sparse, and from processes that may unfold asynchronously – poses a computational and statistical challenge.

We develop a full generative model and inference for reconstructing a dynamic process (cellular differentiation) from many static snapshots (single-cell RNA-seq profiles), with calibrated uncertainties. Specifically, we define cell state by the latent parameterization of a distribution over gene expression space, and model these latent vectors as arising from bifurcating, self-reinforcing paths along a probabilistic tree — necessitating the design of a new class of Bayesian tree models for data that arise from a latent branching spectrum.

In this talk, I explore how our model fills a hole in the existing literature on probabilistic trees, and what having an explicit generative model buys us in the context of reconstructing trajectories to understand cell fate decisions in differentiation.

Advances in computational Bayesian methods and their use in large-scale single-cell tree reconstruction

I will describe a Bayesian method to reconstruct single cell phylogenetic trees from copy number events such as those that arise in cancers with high genomic instability. The method is motivated by low-depth genome-wide data which can be obtained for increasingly large numbers of cells thanks to technologies such as Direct Library Preparation or 10x Single Cell Genomics.

Computing the posterior distribution in this model at scale is challenging. I will describe how recent advances in the field of Bayesian computational statistics can be used to parallelize the posterior inference computation to an arbitrary number of cores, touching on topics such as non-reversible methods and change of measure approaches.

The posterior inference methods described are available through an open source Bayesian modelling language called Blang, which can be used for a range of phylogenetic problems including more traditional phylogenetic models, as well as other Bayesian analysis problems. The motivating copy-number-based phylogenetic model is implemented in Blang and available in a cancer Bayesian phylogenetics and population genetics library we are actively developing. This library has been used to infer phylogenetic trees on >4000 cells using >60 cores.

## Model adequacy

Model adequacy of experimentally informed site-specific substitution models

Phylogenetic substitution models are hypotheses about evolutionary process and, like all models, they contain simplifying assumptions. One common assumption is that all sites in a gene evolve identically. However, even a cursory analysis of a multiple sequence alignment will show that this assumption is violated in natural protein evolution. Relaxing this assumption greatly increases the number of model parameters to account for the effect of every amino acid at every site in the protein. We have developed a family of models, called Experimentally Informed Codon Models (ExpCMs), which describe the site-specific constraints on a protein using empirical measurements from a high-throughput functional assay in the lab. Even though the vast majority of the parameters are determined empirically rather than fit to the data, we have found that ExpCMs are generally better descriptors of natural sequence evolution than site-uniform codon models, as evaluated by model comparison techniques such as AIC. Now, we are turning to model adequacy tests as a more quantitative and comprehensive way to evaluate ExpCMs on a site-by-site basis. We believe that sites which are inadequate descriptors of natural sequence evolution may indicate sites where the selective pressure differs between the lab and nature and point to interesting biological mechanisms. Model adequacy tests will also allow us to compare how well experiments performed under different conditions are able to capture natural constraint. Overall, the site-specific ExpCMs can be used as a tool to bridge the gap between what we know about selection in the lab and in nature.

Statistical models are widely used in phylogenetics to infer the evolutionary history of groups of organisms. In the context of rapidly evolving pathogens, phylogenetic analyses can be used to make inferences about epidemiological processes, a field known as infectious disease phylodynamics. A key component of phylodynamic analyses is a branching model to describe transmission. For example, coalescent and birth-death models can estimate the average number of secondary infections using phylogenetic trees. However, the resulting inferences are contingent on the extent to which models describe key aspects of the data. For example, the simplest models in phylodynamics assume that transmission rates are constant over time and lineages, which is not necessarily the case for many empirical data sets. In this talk I will discuss model adequacy methods in phylodynamics. In contrast to model selection, where models are ranked according to their statistical fit, the goal of model adequacy is to determine whether key aspects of the data at hand could have been generated by the model in question. That is, to assess the absolute, rather than relative, model fit. Model adequacy typically consists of simulating data from the model and comparing them to the empirical data. The crux of such comparisons is to develop summary statistics that represent the expectation under the model. Using examples from different virus data sets I will present several approaches to assess phylodynamic models to reveal the importance of modelling population dynamics, such as population structure and variation in transmission rates, in epidemiological estimates. Finally, I will illustrate ways in which an uptake of these approaches can improve our understanding of infectious disease evolution and motivate the development of models in phylodynamics.
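The simulate-and-compare procedure described in this abstract can be sketched in a few lines. This is a toy illustration only, not the speaker's method: the constant-rate waiting-time model, the coefficient-of-variation statistic, and all function names are assumptions chosen to keep the example self-contained.

```python
import random
import statistics

def coefficient_of_variation(xs):
    """Test statistic: standard deviation divided by mean."""
    return statistics.pstdev(xs) / statistics.mean(xs)

def simulate_constant_rate(n, rate, rng):
    """Hypothetical model under test: n waiting times with one constant rate."""
    return [rng.expovariate(rate) for _ in range(n)]

def adequacy_p_value(observed, n_sims=2000, seed=1):
    """Fraction of simulated data sets whose statistic is at least as
    large as the observed one (a one-sided predictive p-value)."""
    rng = random.Random(seed)
    rate = 1.0 / statistics.mean(observed)  # plug-in estimate of the rate
    obs_stat = coefficient_of_variation(observed)
    hits = 0
    for _ in range(n_sims):
        sim = simulate_constant_rate(len(observed), rate, rng)
        if coefficient_of_variation(sim) >= obs_stat:
            hits += 1
    return hits / n_sims

# Toy data with far more variation than a constant-rate model produces.
bursty = [0.01] * 45 + [10.0] * 5
p = adequacy_p_value(bursty)
```

A small `p` indicates that the model rarely generates data as variable as the observed data, so the model is inadequate for that aspect of the data even if it ranks well against alternatives.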

More data alone will not resolve the Tree of Life. That statement encapsulates perhaps the most striking lesson of phylogenomics. While genome sequences provide us with an invaluably rich source of information about evolutionary history, our ability to properly interpret this information is sometimes flawed, which has led to protracted debates about some of the most interesting and enigmatic relationships across the Tree. However, phylogenetic inference now has a robust grounding in statistical inference. This grounding gives us tools to at least recognize the existence, and hopefully resolve the source, of errors when they occur. These tools are important and broadly applied in other areas of statistical inference, but have been slow to be adopted in phylogenetics. In this talk, I will cover some of the strategies that have been proposed for assessing model fit, some of the reasons for the slow adoption, and the challenges that remain.

## Primer

This series of 4 talks will be an introduction to phylogenetics in 4 parts from master expositor Paul Lewis.

Part 3b continues part 3a with proposals (updating model parameters or trees during MCMC), prior distributions, hierarchical models, and Bayes factors.

Part 3 is an introduction to Bayesian statistics and how it is used in phylogenetics. This part is divided into parts 3a and 3b. Part 3a explains Bayes Rule, the difference between probabilities and probability densities, the difference between joint, conditional and marginal probabilities, and illustrates how MCMC is used to approximate posterior probability distributions.
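As a rough illustration of how MCMC approximates a posterior distribution, the sketch below runs a Metropolis sampler on a coin-flip problem. The example is not taken from the talk: the data (7 heads, 3 tails), the uniform prior, and the proposal width are assumptions chosen so that the answer can be checked against the known Beta(8, 4) posterior.

```python
import math
import random

def log_posterior(theta, heads, tails):
    """Unnormalised log posterior for a coin's heads probability theta
    under a uniform prior (the prior adds only a constant)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(heads, tails, n_iter=20000, step=0.2, seed=42):
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.uniform(-step, step)  # symmetric proposal
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples

samples = metropolis(heads=7, tails=3)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

The sample mean should land close to the analytical posterior mean of 8/12, which is the sense in which the chain approximates the posterior without ever computing the normalising constant in Bayes' rule.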

Part 2 explains how the likelihood is calculated for trees, how parameters of models are estimated, and how missing data and unknown ancestral states are accommodated. This part ends with an introduction to three major ways of modeling among site rate heterogeneity (site-specific rates and the +I and +G models).
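For readers who want a concrete picture of the likelihood calculation, here is a minimal sketch of the standard pruning recursion for a single site under the JC69 model. It is not material from the talk: the nested-list tree encoding, the function names, and the use of `?` for missing data are assumptions made for this example.

```python
import math

STATES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of state i changing to state j along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partials(node):
    """Conditional likelihoods at a node. A leaf is a one-character string
    ('?' marks missing data); an internal node is a list of
    (child, branch_length) pairs."""
    if isinstance(node, str):
        if node == "?":
            return [1.0] * 4  # missing data: every state is compatible
        return [1.0 if s == node else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:
        child_partials = partials(child)
        for a, i in enumerate(STATES):
            # Sum over the unknown state at the child end of the branch.
            result[a] *= sum(jc69_prob(i, j, t) * child_partials[b]
                             for b, j in enumerate(STATES))
    return result

def site_likelihood(root):
    """Average root partials over the uniform JC69 stationary distribution."""
    return sum(0.25 * p for p in partials(root))

# ((A:0.1, A:0.2):0.05, G:0.3) for a single site
tree = [([("A", 0.1), ("A", 0.2)], 0.05), ("G", 0.3)]
lik = site_likelihood(tree)
```

Unknown ancestral states are handled by the sum inside `partials`, and a missing tip contributes a vector of ones, so it drops out of the product.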

Part 1 covers terminology used in describing phylogenetic trees and the basic features of substitution models, including a survey of the common GTR family of models as well as codon and amino acid models.

## Continuous traits

One of the most apparent features of life is its diversity of forms. Biologists are especially driven to learn what evolutionary processes generated which components of life's variation. Central to this puzzle is whether or not phenotypic change tends to accumulate by slow but steady increments or by rare but sudden pulses. Although many phylogenetic models and methods are known for describing incremental change, such as Brownian motion and the Ornstein-Uhlenbeck process, models of pulsed change have received less study, making it difficult to measure the prevalence of competing evolutionary modes. My talk provides an overview of evidence, theory, and methods that have advanced our understanding of evolutionary pulses of trait change. I share some of my contributions on the exploration of this topic, including recent findings that phylogenetic models of pulsed evolution explain a major component of vertebrate body size evolution. I conclude with some remarks regarding the potential for models of pulsed evolution to aid in the study of macroevolution.

The availability of phylogenetic trees based on molecular sequence data has revolutionized evolutionary biology by providing a map from which we can understand divergence and diversification across the tree of life. Numerous phylogenetic comparative approaches have opened up new avenues for testing macroevolutionary hypotheses regarding the drivers of the tempo and mode of trait evolution and lineage diversification. However, recent crises in the field have suggested that many of the methods we commonly use don't tell us what we would like them to. Specifically, I will show that unreplicated evolutionary events can break nearly every comparative method for testing associations from phylogenetically structured data. I will argue that one solution to this problem is to unite hypothesis testing with data-driven approaches, which I term "phylogenetic natural history", to disentangle the impact of singular shifts from replicated patterns of association. More generally, I will argue that we should abandon thinking of phylogenetic comparative methods as "corrections for statistical non-independence" and more meaningfully confront how our causal hypotheses map on to phylogenetically structured data.

Applications of phylogenetic continuous trait models to gene expression

Historically, quantitative genetics was developed to understand macroscopic phenotypes, such as body mass. However, with the advent of high throughput genomics technology, we now have access to extremely high dimensional molecular phenotypes. One of the most common traits analyzed is gene expression, measured via RNA sequencing in modern applications. I will attempt to put this work in historical context, highlighting some early work on building models of neutral gene expression evolution, which poses unique challenges in a phylogenetic comparative framework. I will also discuss some work, including my own, that attempts to leverage the dimensionality of gene expression data to increase power. Finally, I will conclude with some perspectives on future directions for analysis of molecular phenotypes in a phylogenetic context.

## Networks

I will first highlight why network reconstruction is worth the effort, and then explain some of the challenges of network reconstruction and network interpretation. These challenges include identifiability issues, difficulties in summarizing network uncertainty, and interpretation issues related to network-thinking. Finally, I will describe new phylogenetic comparative methods that can be applied to phylogenetic networks and that are implemented in the PhyloNetworks Julia package.

Several parsimony-based methods aiming at reconstructing explicit phylogenetic networks have been developed in the last two decades. In the first part of this talk I will review several of these methods that share the same underlying approach: First, combinatorial objects such as phylogenetic trees, hierarchical clusters or trinets are constructed from the data of the species under study; Second, these combinatorial objects are combined into an explicit phylogenetic network. The way they are combined and the parameters to optimise (e.g. minimising the hybridisation number, i.e. the number of reticulations of the network, or the level, i.e. the maximum number of reticulations in each biconnected component) give a large range of different problems, each of biological interest. In the second part of the talk I will discuss different definitions of maximum parsimony for phylogenetic networks, as well as the pros and cons of each of them. Then I will introduce several algorithmic results to lay the foundations for new parsimony-based methods for phylogenetic network reconstruction.

Phylogenetic Networks: From Displayed Trees to a Distribution of Gene Trees

Phylogenetic networks are leaf-labeled, rooted, directed acyclic graphs that are used to represent and model reticulate, or non-treelike, evolutionary histories. Phylogenetic networks have received significant attention in the last two or three decades and the computational phylogenetics community has developed a wide array of mathematical results and algorithmic techniques for their inference. A fundamental observation that guided much of these developments was that a network is a summary of a set of trees. This observation gave rise to the parsimonious formulation of inferring a network with the smallest number of non-tree events that displays a given set of trees.

More recently, though, efforts have been dedicated to statistical inference of these networks from data of multiple, unlinked loci. This formulation is based on extending the multi-species coalescent to species phylogenies whose topologies are networks. With this extension, inferences simultaneously account for reticulation events, such as hybridization, in the presence of incomplete lineage sorting, thus not interpreting all heterogeneity in the data as caused solely by reticulation.

In this seminar, I will introduce the phylogenetic network model, and give a brief survey of the results based on the parsimonious formulation. I will then introduce the multispecies network coalescent and describe recent results on statistical inference of phylogenetic networks from multi-locus data under this model.

## Invariants

Developing a statistically powerful measure for quartet tree inference using phylogenetic and Markov invariants

Recently there has been renewed interest in phylogenetic inference methods based on phylogenetic invariants, alongside the related Markov invariants. Broadly speaking, both these approaches give rise to polynomial functions of sequence site patterns that, in expectation value, either vanish for particular evolutionary trees (in the case of phylogenetic invariants) or have well understood transformation properties (in the case of Markov invariants).

While both approaches have been valued for their intrinsic mathematical interest, it is not clear how they relate to each other, and to what extent they can be used as practical tools for inference of phylogenetic trees. By focusing on the special case of binary sequence data and quartets of taxa, we are able to view these two different polynomial-based approaches within a common framework.

We present three desirable statistical properties that we argue any invariant-based phylogenetic method should satisfy: (1) sensible behaviour under reordering of input sequences; (2) stability as the taxa evolve independently according to a Markov process; and (3) explicit dependence on the assumption of a continuous-time process. Motivated by these statistical properties, we develop and explore several new phylogenetic inference methods. In particular, we develop a statistically bias-corrected version of the Markov invariants approach which satisfies all three properties. We also extend previous work by showing that the phylogenetic invariants can be implemented in such a way as to satisfy property (3). A simulation study shows that, in comparison to other methods, our new proposed approach based on bias-corrected Markov invariants is extremely powerful for phylogenetic inference.

The advent of rapid and inexpensive sequencing technologies has necessitated the development of computationally efficient methods for analyzing sequence data for many genes simultaneously in a phylogenetic framework. The coalescent process is the most commonly used model for linking the underlying genealogies of individual genes with the global species-level phylogeny, but inference under the coalescent model is computationally daunting in the typical inference frameworks (e.g., the likelihood and Bayesian frameworks) due to the dimensionality of the space of both gene trees and species trees. By viewing the data arising under the phylogenetic coalescent model as a collection of site patterns, the algebraic structure associated with the probability distribution on the site patterns can be used to develop computationally efficient methods for inference via phylogenetic invariants.

In this talk, I will discuss three problems that can be addressed using invariants. First, I will describe how identifiability results for four-taxon species trees based on site pattern probabilities can be used to build a quartet-based inference algorithm for trees of arbitrary size. Second, methods for rooting phylogenetic species trees inferred under the coalescent model will be discussed. Finally, the use of invariants to detect species that arose via hybridization will be described. The methods presented will be demonstrated on several phylogenomic-scale datasets. Because the methods are derived in a fully model-based framework (i.e., the coalescent process is used to model the relationship between gene trees and the species tree, and standard nucleotide substitution models (GTR+I+G and all submodels) are used for sequence-level evolution), these methods are promising approaches for computationally efficient, model-based inference for the large-scale sequence data available today.

Phylogenetic invariants: what are they and why should we care

It has now been thirty years since the introduction of phylogenetic invariants by Lake, Cavender, and Felsenstein. However, the use of phylogenetic invariants as a method of phylogenetic reconstruction has been in a dormant state for about 20 years; quoting J. Felsenstein in his 2004 book, "invariants are worth attention, not for what they do for us now, but what they might lead to in the future".

During the last decade, mathematicians have made many efforts to completely understand the structure and use of phylogenetic invariants. This has led to the characterization of different types of invariants for many different models: from the simplest Jukes-Cantor model to the general Markov model, and even mixtures of them and the coalescent. Most importantly, this has produced new and efficient methods of phylogenetic reconstruction for complex models. Invariants have also been used in model selection and have been crucial in proving the identifiability of parameters for certain models.

In this talk we shall introduce phylogenetic invariants, explain the main ideas that underlie the methods of phylogenetic reconstruction based on invariants, and discuss their advantages and drawbacks.

## Philosophy

Talks in this series have largely focused on population genetic and phylogenetic methods for reconstructing micro- and macroevolutionary patterns consequent from microevolutionary processes. When natural selection is invoked, it is generally assumed to operate through the differential reproduction of favored variants among populations of physical entities, be they genes, cells, organisms or (rarely) species. The Gaia hypothesis of James Lovelock, co-developed and vigorously promoted by Lynn Margulis in the 1970s, has been very popular with the lay public. But most mainstream Darwinists scorned the notion and still do not accept it. They cannot imagine global biospheric stability being selected for at any of the above levels, and do not see the Earth's biosphere as part of a population of comparable global entities engaged in reproductive competition. Most philosophers of biology would similarly argue that any global homeostatic systems (if they exist) can be only "fortuitous byproducts" of lower-level selection. I will suggest that we look at the biogeochemical cycles and other homeostatic processes that might confer stability, rather than the individual organisms or "species" (mostly microbial) that implement them, as the relevant units of selection. By thus focusing our attention on the "song", not the "singers," a Darwinized Gaia might be developed. Our understanding of evolution by natural selection would, however, need to be stretched to accommodate differential persistence, and our definition of reproduction would need to be reworked.

## Archaea and Bacteria

Joint Bayesian inference of bacterial ancestral recombination graphs

Homologous recombination is a central feature of bacterial evolution, yet confounds traditional phylogenetic methods. In this seminar I will present a novel approach to inferring bacterial evolution based on the ClonalOrigin model (Didelot et al., Genetics, 2010). This method permits joint Bayesian inference of the entire bacterial recombination graph and associated model parameters. The method is implemented in the BEAST 2 phylogenetic inference package. It can be easily combined with a variety of substitution models accounting for site-to-site clock rate heterogeneity as well as parametric and non-parametric models of effective population size dynamics. I will also present work on summarizing posterior distributions over the space of tree-based recombination graphs which, together with the joint inference method, aims to bridge the technological gap between recombination-aware phylogenetic inference and traditional methods.

Recombination happens frequently in most bacterial and archaeal species. Traditional phylogenetic techniques do not account for this, which can greatly limit their usefulness for the analysis of genomic data. The coalescent with gene conversion accurately models the ancestry process of prokaryotes, and this can be used to simulate realistic data, but it is too complex to use in an inferential setting. Approximations have therefore been introduced, which are centred around the concept of the clonal genealogy, that is the phylogeny obtained by following the line of ancestry of the recipient of each recombination event. I will review these mathematical models and ongoing efforts to develop statistical software to perform phylogenomic analysis in recombining prokaryotes.

Total community approaches (omics) provide a blueprint of the microbial functions and community diversity within an environment. With genome-resolved metagenomics, this view can be refined, identifying an organism's specific contributions to pathways and processes as well as their interactions with other community members. This approach has led to a recent explosion of genome sequences for uncultured and uncharacterized microbial lineages, many with previously-unknown roles in biogeochemical cycles. My work explores the environmental importance of these novel organisms and the emerging view of the Tree of Life that stems from our new understanding of microbial diversity.

## Heterogeneous substitution

Systematic errors in phylogenomic studies: on the importance of modeling pattern-heterogeneity across sites.

While all models now used in phylogenetic analyses account for rate-heterogeneity across sites, the case of pattern-heterogeneity (i.e. qualitative variation in substitution processes across nucleotide or amino-acid positions) is much less clear and has recently been the subject of some controversy. One main question is whether pattern-heterogeneity should be modelled at the level of genes (or groups of genes), or at the level of sites. Both approaches have been used in recent phylogenomic analyses of metazoans---sometimes leading to radically different conclusions---in particular concerning the early patterns of diversification within this group.

In this talk, I will first explore the empirical evidence concerning the presence, and the relative importance, of either type of heterogeneity in empirical sequence alignments. Then, I will introduce Dirichlet process mixture models accounting for site-specific amino-acid preferences. The statistical meaning of Dirichlet processes, as a non-parametric method for estimating arbitrary distributions of site-specific effects, will be explained and illustrated through simulation experiments. Finally, based on simulations implementing pattern heterogeneity simultaneously at both the gene and the site levels, I will show the importance of using models explicitly accounting for pattern-heterogeneity across sites for reconstructing accurate phylogenies.

Modeling substitutional heterogeneity and its impact on inferring relationships

Heterogeneity in amino acid substitution is an inherent feature of most phylogenomic-scale datasets, and modeling such heterogeneity is now widely seen as important for phylogenomic inference. Site-heterogeneous substitution models such as CAT-F81 and CAT-GTR, as implemented in PhyloBayes, have been forcefully advocated for use on large datasets because they may reduce long-branch attraction artifacts that could result from not adequately modeling amino acid substitutional heterogeneity. However, site-heterogeneous models arguably became popular not because of a deep appreciation for how well they modeled substitutional heterogeneity, but rather because analyses with CAT models often resulted in trees that matched preconceived notions of animal phylogeny (e.g., sponges as the sister lineage to all other extant animals). Importantly, site-heterogeneous models have not been thoroughly compared to other methods for modeling substitutional heterogeneity such as coarse modeling of heterogeneity with data partitioning coupled with site-homogeneous models such as WAG or LG. Here, I show through analyses of simulated and empirical data that data partitioning often performs as well as, or better than, site-heterogeneous CAT models. In contrast to past claims, I demonstrate that partitioning with site-homogeneous models suppresses long-branch attraction artifacts as well as CAT-GTR and much better than CAT-F81. Analyses with data partitioning and site-homogeneous models can require orders of magnitude less computational time than popular site-heterogeneous models, while still resulting in reasonably accurate trees. Although site-heterogeneous models may describe the amino acid substitutional process much better than data partitioning with site-homogeneous models, current implementations of the most popular site-heterogeneous models do not appear to result in more accurate phylogenetic hypotheses than those inferred with partitioning. 
Thus, the need to model fine-scale site-heterogeneity in phylogenetic inference is called into question.

Combating phylogenetic artefacts by modeling site-specific substitution processes with mixture models and approximations

The most widely used phylogenetic models of amino acid substitution involve a single reversible empirical substitution matrix (e.g. LG, WAG, JTT etc.) and a mixture model of rate heterogeneity across sites, such as a discretized gamma distribution. However, these models fail to capture important constraints on protein sequence evolution, heterogeneity in the substitution process across the tree, and heterogeneity across multiple proteins in a concatenated data matrix. Failure to model these features of the data can lead to artefacts in phylogenetic reconstructions, especially for "deep" phylogenetic problems. Here I focus on the importance of modeling site-specific heterogeneity in the substitution process.

The structural and functional roles of residues in proteins lead to constraints on the kinds of amino acids that may be substituted at positions over time, a feature that is not captured by the single-matrix models. Site-heterogeneous mixture models have been developed to address this issue. For example, the "CAT" mixture models (CAT-Poisson or CAT-GTR), implemented in the PhyloBayes program, have been shown to successfully avoid long branch attraction problems associated with single-matrix analyses in a number of published cases. However, the utility of these and other mixture models is severely limited for very large phylogenomic analyses because of their computational time cost and memory usage. I will discuss several simple, rapid, and efficient approximations to these full profile mixture models. Our simulation and empirical data analyses demonstrate that these approximations ameliorate long branch attraction artefacts and, in several cases, provide more accurate estimates of phylogenies than the mixture models from which they derive.

## History

I will discuss the history of the use of computers to infer phylogenies, starting in the late 1950s and giving particular emphasis to the introduction of the major methods in the 1960s. Much of this history I watched happen, from 1965 on. In particular I will explain the way that work in biological systematics, in population genetics, and in molecular evolution of multiple species gave rise to the early methods. I will touch on the controversies that developed in the 1970s and 1980s, a period of intense conflict over what should be the logical foundation of the reconstruction of phylogenies. Computational phylogenetics is becoming continually more statistical and continually less connected to the separable task of erecting a biological classification of organisms. Recent Twitter controversies show that arguments that were dominant and vehement in the 1980s are now taken seriously by few.

## Structure and molecular evolution

Evolutionary and phylogenetic analyses are the basis of understanding the origins and properties of all living systems. Darwin noted that the manner in which any organism evolves is largely determined by its interactions with other organisms and the environments they produce, on the "tangled bank" of plants, birds, insects, and worms, all "dependent upon each other in so complex a manner." This is also true at the protein level, where the selection acting on a protein for traits such as function, structure, and stability depends on the manner in which the amino acids interact, so the substitutions that occur at one site are affected by the amino acids at other sites in the protein (as well as by other proteins and biomolecules). Capturing and characterising these networks is central to developing new mechanistic models of the substitution process grounded in the underlying molecular biophysics and population biology. The simulated evolution of proteins under selection for thermodynamic stability suggests connections between substitutions and other processes described by statistical physics. By using the language of statistical physics, we can develop deeper insights into the evolutionary process. By using the tools of statistical physics, we can move towards calculating substitution rates from first principles.

Structural and functional constraints on protein evolution

Proteins are under selective constraints to fold stably into their native conformation and to carry out their biological function. These selective constraints shape how proteins evolve, and they cause variation in substitution rates among the sites within a given protein. In particular, sites in the core of a protein, with many residue-residue contacts, tend to be more conserved than sites on the protein surface. Further, catalytic residues in enzymes are highly conserved, and they impart a measurable increase in conservation to much of the enzyme structure, in a distance-dependent manner. (The further a site is from a catalytic residue, the less extra conservation it experiences.) Finally, protein-protein interfaces show a surprising ability for evolutionary divergence, even if they are strongly selected for function.

Computational algorithms to infer phylogenetic relationships or detect sites of positive selection are widely used in diverse branches of biology. However, anyone with a passing knowledge of modern biochemistry can recognize that the quantitative models of the evolutionary process used by these algorithms are woefully oversimplified. I will discuss prospects for making these models more realistic while keeping them computationally tractable. In particular, I will discuss how new sources of high-throughput experimental data can be leveraged to improve algorithms for the analysis of gene sequences.

## Biased sampling

New routes to phylogeography: a Bayesian structured coalescent approximation

Phylogeographic methods aim to infer migration trends and the history of sampled lineages from genetic data. Applications of phylogeography are broad, and in the context of pathogens include the reconstruction of transmission histories and the origin and emergence of outbreaks. Phylogeographic inference based on bottom-up population genetics models is computationally expensive, and as a result faster alternatives based on the evolution of discrete traits have become popular. In this seminar I will discuss the advantages and disadvantages of different phylogeographic methods. In particular, I will address the sensitivity of discrete trait methods to the sampling strategy. I will also present a new method called BASTA (BAyesian STructured coalescent Approximation), implemented in BEAST2, that combines the accuracy of methods based on the structured coalescent with the computational efficiency required to handle more than just a few populations. I will illustrate the potentially severe implications of model choice for phylogeographic analyses by investigating the zoonotic transmission of Ebola virus and the between-species transmission of the Avian Influenza Virus.

Preferential sampling through time when estimating changes in effective population size

Phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from a population of interest. However, when analyzing sequences sampled serially through time, current methods implicitly assume either that sampling times are fixed deterministically by the data collection protocol or that their distribution does not depend on the size of the population. Through simulation, we first show that, when sampling times do probabilistically depend on effective population size, estimation methods may be systematically biased. To correct for this deficiency, we propose a new model that explicitly accounts for preferential sampling by modeling the sampling times as an inhomogeneous Poisson process dependent on effective population size. We demonstrate that in the presence of preferential sampling our new model not only reduces bias, but also improves estimation precision. Finally, we compare the performance of the currently used phylodynamic methods with our proposed model through seasonal human influenza examples. Our analysis demonstrates that influenza data sets constructed by mining sequence databases do contain strong preferential sampling signal. Accounting for this preferential sampling produces a markedly cleaner picture of influenza population dynamics.
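
To make the sampling model in the abstract above concrete, here is a minimal sketch that simulates sampling times from an inhomogeneous Poisson process whose intensity is proportional to effective population size, using the standard thinning method. The function names, the piecewise-constant trajectory `ne`, and the proportionality constant `beta` are illustrative assumptions and are not taken from the talk.

```python
import random


def simulate_sampling_times(ne, beta, t_end, ne_max, rng):
    """Draw sampling times on [0, t_end] with intensity beta * ne(t).

    Uses thinning: propose events at the constant rate beta * ne_max and
    keep each one with probability ne(t) / ne_max. Requires ne(t) <= ne_max.
    """
    rate_max = beta * ne_max
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_max)  # waiting time to the next proposal
        if t > t_end:
            return times
        if rng.random() < ne(t) / ne_max:
            times.append(t)


def ne(t):
    """Hypothetical trajectory: a large population that crashes at t = 5."""
    return 100.0 if t < 5.0 else 10.0


rng = random.Random(1)
times = simulate_sampling_times(ne, beta=1.0, t_end=10.0, ne_max=100.0, rng=rng)
early = sum(1 for t in times if t < 5.0)
late = len(times) - early
print(early, late)  # far more samples fall in the high-Ne period
```

Under this assumed trajectory the sampling times cluster where the population is large, which is the dependence between sampling times and population size that the proposed model is designed to account for.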

As was recently shown, variation in speciation rates among lineages results in substantial biases when estimating diversification rates from phylogenetic trees. Consequently, confidence in many phylogenetic estimates for trait-dependent models of diversification from trees on extant species alone may well exceed what is possible. From a mathematical point of view a fair amount is known about the probability distribution of ancestral trees derived from a single-type birth and death process, while much less is known about ancestral trees derived from multi-type branching processes with type-dependent rates. In this talk I will present a few results in this direction. First, there is an algorithmic way to construct an ancestral tree of the standing population of a multi-type branching process in terms of a Markov chain (of vectors of types and multiplicities). This construction allows one to get explicit formulae for calculating: (a) statistical features that describe the shape of the tree (the law of coalescence times together with types on the ancestral lineages), and (b) statistical features that link types in the standing population with the shape of the tree (the law of same-type coalescence times). Second, explicit calculations can be used to compare the effect that different branching mechanisms have on the distributions of ancestral trees. I will illustrate this in a simple example of a two-type process with completely asymmetrical vs symmetrical probabilities of offspring types.

## Phylo-genetic conservation

Comparing patterns in phylogenetic and trait diversity

Studying phylogeny led to the emergence of interdisciplinary approaches combining ecology, evolutionary biology and biogeography. The analysis of the phylogenetic relatedness among species complemented the analysis of the functional (trait-based) similarities among species, and even sometimes replaced it when phylogenetic relatedness was considered as a proxy for functional similarity. The use of phylogenetic diversity as a proxy for functional diversity has been questioned due to the observation of moderate phylogenetic signal in many field studies. From a methodological viewpoint, a fundamental difference between phylogenetic and functional analyses is that phylogeny is intrinsically dependent on a tree-like structure whereas trait data can, most of the time, only be forced to adhere to a tree structure, not without some loss of information. I will discuss the ways phylogenetic and functional diversity patterns can be compared and the consequences of their simultaneous analyses for conservation and community ecology.

Phylogenetic beta-diversity: a means to understand, map and conserve spatial patterns of biological diversity

Beta-diversity has long been recognized as an instrumental diversity measure providing insight as to how and why diversity varies across space. Beta-diversity also underlies most complementarity-based reserve design algorithms, which quantify the extent to which an area contributes unrepresented features to an existing area or set of areas. In the early 2000s, researchers started to recognize that beta-diversity could be extended to include phylogenetic information. By accounting for shared evolutionary history among assemblages/regions, phylogenetic beta-diversity can provide insights into both the ecological and evolutionary mechanisms influencing variation in species diversity and the best way to conserve phylogenetic diversity in a reserve system. In this seminar I will begin by briefly reviewing various definitions and approaches to measuring and mapping beta-diversity. Then I will use a series of examples to show some of the new insights phylogenetic beta-diversity has provided to both basic science and conservation.

Conserving phylogenetic information: indices, approaches and gaps

There seems to be increased interest in the notion that evolutionary history is worthy of management and conservation (see, e.g. Frishkoff et al. 2014; Diniz-Filho et al. 2013). The basic quantity seems to be “phylogenetic diversity” (PD) or the sum of the edge lengths connecting a candidate set of species (Faith 1992). Given a tree or network, one can produce many measures of current (or expected) (contributions to) PD, and these can be modified by other axes of value and expected costs and benefits of interventions. The technical side of the field seems to me to be in some disarray; there are overlapping terms and definitions, weak connections to other literatures (particularly community ecology), and under-tested assumptions. My presentation will offer little or no new data, but I will draw on the work of others in an attempt to partially organize the technical side of the field as I see it. Key issues concerning mapping traits and geographic scale are taken up in the following two presentations in this series.
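
As an illustration of the basic quantity in the abstract above, the following sketch computes PD as the sum of edge lengths connecting a set of species. It assumes the rooted convention, in which the path down to the root is included, and a tree stored as a child-to-parent dictionary; the tree, the function name, and the data structure are hypothetical choices made for this example only.

```python
def phylogenetic_diversity(parent, species):
    """Sum the lengths of all edges on the paths from `species` to the root.

    `parent` maps each non-root node to a (parent_node, edge_length) pair.
    Each edge is counted once, however many of the species lie below it.
    """
    visited = set()
    total = 0.0
    for node in species:
        # Walk rootward until we reach the root or an edge already counted.
        while node in parent and node not in visited:
            visited.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total


# Hypothetical tree in Newick form: ((A:1,B:1)X:2,C:3)root;
tree = {"A": ("X", 1.0), "B": ("X", 1.0), "X": ("root", 2.0), "C": ("root", 3.0)}

pd_ab = phylogenetic_diversity(tree, {"A", "B"})        # edges A, B, X
pd_ac = phylogenetic_diversity(tree, {"A", "C"})        # edges A, X, C
pd_all = phylogenetic_diversity(tree, {"A", "B", "C"})  # every edge
print(pd_ab, pd_ac, pd_all)
```

In this toy tree the pair A and C captures more PD than the sister pair A and B, because A and B share most of their history. Other conventions, such as summing only the minimal subtree spanning the set, would give different values for some sets.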

## Ebola

Ebola virus epidemiology, transmission, and viral evolution from four months of sequencing in Sierra Leone (Analysis and Methods)

Adding to the work reported in Gire et al. (Science, 2014), which sequenced Ebola viruses from the first three weeks of the epidemic in Sierra Leone, we here present analyses of 150 additional viral genomes sampled from EVD cases at Kenema Government Hospital between June and September 2014. We describe continued evidence for sustained human-to-human transmission with no additional zoonotic events, and preliminary results concerning new lineages from Guinea. We also characterize the epidemiological history of the limited number of exported viruses from the country. We also observe a slowing of the viral substitution rate over the course of the outbreak, consistent with the increased effect of purifying selection as the outbreak continues over time. These findings allow a closer view of viral evolution during its extended time in human populations and provide critical insights into the movement of the virus through the region.

This is the second talk in a pair of talks from collaborators Daniel Park and Gytis Dudas concerning their analysis of Ebola virus sequences.

Ebola virus epidemiology, transmission, and viral evolution from four months of sequencing in Sierra Leone (Overview)

Adding to the work reported in Gire et al. (Science, 2014), which sequenced Ebola viruses from the first three weeks of the epidemic in Sierra Leone, we here present analyses of 150 additional viral genomes sampled from EVD cases at Kenema Government Hospital between June and September 2014. We describe continued evidence for sustained human-to-human transmission with no additional zoonotic events, and preliminary results concerning new lineages from Guinea. We also characterize the epidemiological history of the limited number of exported viruses from the country. We also observe a slowing of the viral substitution rate over the course of the outbreak, consistent with the increased effect of purifying selection as the outbreak continues over time. These findings allow a closer view of viral evolution during its extended time in human populations and provide critical insights into the movement of the virus through the region.

This is the first talk in a pair of talks from collaborators Daniel Park and Gytis Dudas concerning their analysis of Ebola virus sequences.

## Ancestral recombination graphs

A demography-aware conditional sampling distribution for inferring ancient demography and detecting introgression patterns

Complex demographic histories shape the genealogies of contemporary individuals and thus have a substantial impact on the genetic variation observed today. These genealogies are commonly modeled by the ancestral recombination graph (ARG), and we developed a novel demography-aware conditional sampling distribution (CSD) to approximate these ARGs under general demographic models. We apply this CSD in an expectation-maximization framework for demographic inference. We show that this method can accurately recover biologically relevant demographic parameters like population divergence times, migration rates, or ancestral population sizes from simulated datasets. Furthermore, we apply the CSD to detect tracts of genetic material that introgressed from Neanderthal into modern humans. Our results are in general agreement with previously published results, and we will discuss the similarities and differences, and their biological implications.

Often, the summary statistics of population genetics are framed in the setting of Kingman's coalescent or related models. These statistics can be alternatively thought of as descriptive statistics of the realized population pedigree-with-recombination, in a way that has become much more useful in the era of whole-genome sequencing. For instance, pairwise number of nucleotide differences is proportional to "effective population size", which is sometimes more usefully thought of as an estimate of the average length of the path through the pedigree to the most recent common ancestor at a randomly chosen locus (with an explicit standard error). Another example is the pairwise distribution of long tracts of IBD, which provides an estimate of a functional of the entire distribution of such paths.
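
The pairwise-difference statistic mentioned above can be sketched in a few lines. The example below computes the mean number of pairwise nucleotide differences for a toy alignment and, as a separate step, converts a per-site diversity into an effective population size using the textbook relation π = 4·Ne·μ, which assumes a diploid, neutral, constant-size population. The alignment, the function names, and the mutation rate used in the usage note are invented for illustration.

```python
from itertools import combinations


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(s, t)) for s, t in pairs]
    return sum(diffs) / len(pairs)


def ne_from_diversity(pi_per_site, mu):
    """Solve pi = 4 * Ne * mu for Ne (diploid, neutral, constant-size model)."""
    return pi_per_site / (4.0 * mu)


# Toy alignment of three sequences, invented for illustration.
seqs = ["AAAA", "AAAT", "AATT"]
pi = mean_pairwise_differences(seqs)  # differences per pair of sequences
pi_per_site = pi / len(seqs[0])       # differences per pair, per site
print(pi, pi_per_site)
```

For instance, `ne_from_diversity(0.001, 1e-8)` gives a value of about 25,000 under these assumptions. The pedigree-based reading in the abstract interprets the same number as a statement about average path lengths to common ancestors rather than about a model parameter.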

Mathematical and visualization tools for working with ancestral recombination graphs

The fields of phylogenetics and population genetics share several important models including gene trees, species trees, ancestral recombination graphs (ARGs), and pedigrees. These models are all closely related and can be viewed as subgraphs of one another. Amongst them, the ARG is particularly central and if inferred efficiently can enable many applications such as inference of selection and demography. Here, I will review various helpful mathematical tools for working with ARGs, including what we call the threading algorithm, the branch graph, and the leaf trace visualization.

## Viral phylodynamics

Phylodynamic methods are widely used to estimate demographic parameters and historical population dynamics from genealogies of individuals sampled from a population. In this phyloseminar, I will describe how we can understand genealogies in terms of basic demographic or ecological processes, and how these concepts can be used to develop statistical models for inference. In particular, I will discuss some similarities and differences between the two main modeling frameworks in phylodynamics: the coalescent and birth-death models. I will also briefly introduce some of the latest statistical methods currently used to fit these models to genealogies. I will end by discussing one of the main challenges facing the field: adequately representing the structure of complex, heterogeneous populations in phylodynamic models.

Major recent advances in genome sequencing technology make it feasible that in future epidemics, a sequence will be available for every clinical case that can be identified. In some scenarios, such as agricultural epidemics (where farm-to-farm spread is of more interest than animal-to-animal), diseases such as HIV (where most infected individuals will eventually present themselves to clinicians), and epidemics occurring in well-monitored populations such as hospital inpatients, we will as a consequence be able to acquire a set of sequences representing the pathogens infecting most or all cases in the transmission chain. Genetic data therefore provides an important new tool for the investigation of epidemics, in particular the determination of the epidemic's transmission tree, which describes which case infected which others. As the genetic diversity in a set of sequences taken from the same epidemic will not be enormous even for fast-evolving RNA viruses, the best approach would be to combine both genetic and epidemiological data. I present here a new method for transmission tree reconstruction which is integrated into the Bayesian phylogenetics framework available in BEAST. It is based on the observation that if the phylogeny is known, there is a one-to-one correspondence between possible transmission trees and partitions of the internal nodes of the tree into connected subgraphs. The MCMC procedure in BEAST has been modified to sample from the space of trees with nodes partitioned in this way, simultaneously estimating both phylogenetic tree and transmission tree. Rather than assuming that the entire tree is generated by a single coalescent process, the posterior probability of a phylogeny is now calculated based on an individual-based model of disease transmission, which can take into account epidemiological characteristics of the host cases, such as spatial location. I will outline results using simulated data and sequences from the 2003 Dutch epidemic of H7N7 avian influenza.

The genetic diversity of many pathogens is shaped by epidemiological history. But the dynamics of infectious disease epidemics differ in important ways from demographic processes that have traditionally been studied by population geneticists. In many epidemics, the population size and birth rate change rapidly in a nonlinear fashion through time. Mathematical models for describing infectious disease dynamics have a long history that has run parallel to the development of modern population genetics, but until recently, there has been little communication between these fields. Interest has grown in developing a new set of mathematical models for genealogies generated by epidemic processes. These methods reveal how the effective population size of a pathogen depends on transmission rates, the number of infected hosts, and the size of the bottleneck at the time of transmission. These mathematical models have also enabled new applications of pathogen genetic data to public health. Pathogen genetic data can be informative about epidemic processes in ways that standard surveillance data are not, especially regarding the source of infections and risk factors for transmission. I will review several approaches to mathematical modeling of pathogen genealogies and present applications of these methods to HIV-1 and the recent Ebola virus epidemic in Western Africa.

## Phylogenetics of cancer

Tumour heterogeneity, i.e. the genomic diversity of cancer cells within a single tumour, is thought to be the source of chemotherapy resistance. In many cancers, this heterogeneity is not limited to point mutations but includes large-scale genomic rearrangements and endoreduplications that lead to aberrant copy number (CN) profiles. Reconstruction of the evolutionary tree of cancer within the patient allows us to quantify and understand the aetiology of tumour heterogeneity. In some cancers, such as high-grade serous ovarian cancer (HGSOC), CN profiles predominate. However, tree inference is hindered by unknown phasing of major and minor CNs, horizontal dependencies between adjacent genomic loci and the lack of curated CN profile databases to use as a reference for probabilistic inference.

We recently developed MEDICC (Minimum Event Distance for Intra-tumour Copy number Comparisons), an algorithm for phylogenetic reconstruction based on CN profiles. MEDICC uses finite-state transducers (FSTs) to encode a minimum evolution criterion that determines pairwise evolutionary distances between CN profiles. This minimum-event distance is the smallest number of amplifications and deletions of arbitrary length that are necessary to transform one genomic profile into another. The FST-based approach thereby allows us to model dependencies between sites, similar to the problem of modelling indels on trees in traditional phylogenetics. Using this approach we are able to phase major and minor CN profiles to the parental alleles and infer trees and ancestral genomes, while minimizing the overall tree length. The distance measure is formulated such that the resulting matrix of pairwise distances has a direct mapping to a positive semi-definite kernel matrix. This allows us to perform principal component analysis in evolutionary space and use this embedding to numerically quantify tumour heterogeneity and other quantities of interest, such as the degree of clonal expansion, using spatial statistics.
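
To give a feel for what a minimum-event count measures, the sketch below solves a deliberately simplified version of the problem: the fewest unit amplifications or deletions of contiguous segments needed to turn one integer copy-number vector into another. It is not MEDICC's FST-based computation. It ignores allele phasing and, as a loudly stated assumption, allows a segment at copy number zero to be re-amplified. The function name and the example profiles are hypothetical.

```python
def min_interval_events(source, target):
    """Fewest +1/-1 events on contiguous segments turning source into target.

    Each event raises or lowers the copy number of one contiguous run of
    segments by exactly one. Simplifying assumption: a copy number of zero
    can be re-amplified, so segment loss is not treated as irreversible.
    """
    if len(source) != len(target):
        raise ValueError("profiles must cover the same segments")
    # Pad the difference vector with zeros so that every event boundary
    # appears as a step between neighbouring entries.
    diff = [0] + [t - s for s, t in zip(source, target)] + [0]
    # One event changes two steps of the padded vector by one unit each, so
    # it can cancel at most one unit of upward step. A sequence of events
    # achieving this bound always exists, so the total upward step size is
    # the minimum number of events.
    return sum(max(0, b - a) for a, b in zip(diff, diff[1:]))


# Hypothetical profiles over four genomic segments.
normal = [1, 1, 1, 1]
tumour = [2, 2, 1, 0]
events = min_interval_events(normal, tumour)
print(events)  # one amplification of segments 0-1 plus one deletion of segment 3
```

Because a single event can span several adjacent segments, the count is smaller than a site-by-site comparison would suggest, which is the kind of horizontal dependency between loci that the abstract refers to.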

I will talk about the basics of FST-based phylogenetic inference and explain how FSTs can be used to model genomic rearrangement events with horizontal dependencies. I will explain how this approach implicitly maps genomes into a feature space in which we can quantify heterogeneity. Finally, I will present clinical results that show how this quantification of intra-tumour heterogeneity (ITH) can predict resistance development in the hospital.

Phylogenetic analysis of metastatic colon cancer in humans

Metastasis is the main cause of cancer morbidity and mortality. Despite its clinical significance, several fundamental questions about the metastatic process in humans remain unsolved. Does metastasis occur early or late in cancer progression? Do metastases emanate directly from the primary tumor or give rise to each other? How does heterogeneity in the primary tumor relate to the genetic composition of secondary lesions? Addressing these questions – ideally by examining the genetic makeup of tumor cells in distinct anatomic locations and reconstructing their evolutionary relationships – is crucial to improving our understanding of metastasis. I will give an overview of a simple PCR-based assay that enables the tracing of tumor lineage in patient tissue specimens. The methodology relies on somatic variation in highly mutable polyguanine (poly-G) repeats located in non-coding genomic regions. Poly-G mutations are present in a variety of human cancers. In colon carcinoma, an association exists between patient age at diagnosis and tumor mutational burden, suggesting that poly-G variants accumulate during normal division in colonic stem cells. Poorly differentiated colon carcinomas (which have a worse prognosis) have fewer mutations than well-differentiated tumors, possibly indicating a shorter mitotic history of the founder cell in these cancers. By presenting several patient case studies, I will describe how poly-G fingerprints can be used to construct phylogenetic trees that reflect the evolution of metastatic colon cancer, with an emphasis on how biological considerations inform analysis strategies.

## Mini-course on genome-scale phylogeny

Genome rearrangements were discovered and used to build molecular phylogenies in the 1930s. They are implicated in many cancers and their evolutionary role might be of primary importance. But the mathematical and computational tools to model rearrangements are still not as efficient as the ones developed later for local mutations such as nucleotide or amino-acid substitutions. In this seminar I will report on attempts to integrate genome organisation into the usual models of genome evolution. I will explain how this can improve the inference of phylogenies, as well as ancestral genomes.

In this second talk of our series on genome-scale phylogeny, I build upon Gergely's introduction and present the modelling assumptions and algorithmic details behind some of the methods we and others have developed. There will be two parts to this talk. I start with the model of gene duplications and losses implemented in PHYLDOG. I present the assumptions we make and the shortcuts we take to improve the program's efficiency, and show some results on real and simulated sequence data. I notably show problems that arise when the program is confronted with data generated with a model of incomplete lineage sorting (Rasmussen and Kellis, 2012), and present avenues of research to find solutions to these problems. In the second part, I present our current efforts to use our model of gene duplication, loss, and transfer (Szöllosi et al., 2013) to infer a species tree in which speciation nodes are ordered in time. I briefly remind the forgetful viewer of what this model does and how it works, and I then explain how we devised a new MCMC algorithm to use it on data sets containing dozens of species and thousands of gene families. I finish with some perspectives on our plans to unite gene tree-species tree models and databases of gene families and phylogenetic trees.

Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees.

I introduce models that describe the relationship between gene trees and species trees. I begin with models that account for gene duplication and loss, and subsequently introduce models that account for the horizontal transfer of genes. I review results from simulations as well as empirical studies on genomic data that show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. I also discuss the possibility of extracting information on the timing of speciation events from ancient horizontal transfer events.

## Open Tree of Life

The emergence of graph databases has presented a potential alternative for ways of storing and querying phylogenetic trees. The Open Tree of Life has been exploring these options and ways that trees from multiple datasets or within a single dataset can be placed in a graph database. I will go over some of the ways that we do this and how we can query and synthesize trees as an alternative to supertrees and consensus trees. While still a work in progress, these methods show great promise for further development.

Technical and social challenges of synthesizing phylogenetic data across the tree of life

Open Tree of Life aims to synthesize published phylogenetic data into a comprehensive tree of life. The challenges associated with the collection, curation and synthesis of both phylogenetic and taxonomic input data are both technical and social. We present the first draft of the Open Tree of Life, as well as the workflow and software tools for curating, annotating and viewing phylogenetic data. In a subsequent Phyloseminar, Stephen Smith will present details of the phylogenetic synthesis methods.

## Integrating fossils into phylogenies

Phylogenetic Paleobiology: What do we stand to gain from integrating fossils and phylogenies in macroevolutionary analyses?

The aim of macroevolutionary science is to understand the patterns and processes responsible for generating organismal diversity in space and time. Although macroevolutionary change typically occurs over geologic timescales and has traditionally been studied by paleobiologists, comparative biologists have become increasingly interested in macroevolutionary questions, utilizing time-calibrated molecular phylogenies of extant taxa as a framework for testing hypotheses about rates of evolution. In this seminar, I’ll examine how integrating fossils and phylogenies can increase our power to test and answer fundamental questions about tempo and mode in phenotypic evolution. Integrating fossil taxa into phylogenies of extant taxa is worth the effort: on a per taxon basis, fossils contribute more information about macroevolutionary pattern and process and increase our ability to distinguish processes that leave similar signals in extant species datasets. I’ll discuss some recent work, and highlight how fossil information can be used to inform macroevolutionary inference when a combined phylogeny is lacking. One theme emerges from all of this work; we stand to gain a better understanding of macroevolution not when we approach it as biologists or paleontologists but, as G.G. Simpson recommended 60 years ago, as practitioners of both.

The fossil record offers a rich source of macroevolutionary data. Fossils can reveal transitional forms that could not be predicted from extant taxa alone, reveal unexpected biogeographic patterns, and provide temporal information crucial for inferring rates of evolution and correlations between evolution and abiotic events. At the same time, including fossil taxa in phylogenetic analyses presents many challenges. Currently, there is a wide variety of methods for including fossil data in phylogenetic analyses, ranging from indirect use of fossil ages to inform divergence dates to simultaneous analyses of fossil and extant taxa under various optimality criteria and with varying levels of constraints. One important consideration remains that fossils typically provide only morphological data, which can lead to problems related to missing data and potential violation of common assumptions for model-based phylogeny inference methods designed primarily for molecular sequence data. Morphological character data are typically harvested from fossil taxa not at random, but with an intentional bias towards parsimony-informative characters (with apomorphies omitted from matrices). Combined with issues related to sparse codings in large combined matrices, care must be taken to avoid spurious inferences.

The Fossilized Birth-Death Process: A Coherent Model of Fossil Calibration for Divergence Time Estimation

Accurate estimates of absolute node ages are critical for addressing a wide range of questions in evolutionary biology. Because molecular sequence data are not informative on absolute time, external data – most commonly fossil age estimates – are required to calibrate estimates of species divergence times. For Bayesian divergence-time methods, the common practice for calibration using fossil information involves placing arbitrarily chosen and parameterized parametric distributions on internal nodes, often disregarding most of the information in the fossil record. The 'fossilized birth-death' (FBD) process is a model for calibrating divergence-time estimates in a Bayesian framework, explicitly acknowledging that extant species and fossils are observations from the same macroevolutionary process. Under this model, absolute node age estimates are calibrated by a single diversification model and arbitrary calibration densities are not necessary. Moreover, the FBD model allows for inclusion of all available fossils. We performed analyses of simulated data and show that node-age estimation under the FBD model results in accurate estimates of species divergence times with realistic measures of statistical uncertainty, overcoming major limitations of standard divergence time estimation methods.

## In honor of Carl Woese

Carl Woese's grand view of life that just keeps getting grander

Most microorganisms cannot be grown in pure culture (or at least not easily). This has been apparent for decades by comparing the number of cells seen under a microscope to the fraction of those cells that will grow into colony forming units (typically <1%). The objective classification of cellular life by comparative rRNA analysis pioneered by Carl Woese provided the first grand view of the tree of life and also provided the reference framework upon which his friend and colleague, Norman Pace, developed ways to directly survey microbial communities via their rRNA sequences without the need to grow them. This put our degree of ignorance of the microbial world into perspective: dozens of major microbial lineages have emerged over the last 20 years that lack even a single cultured representative. New approaches, such as deep metagenomics and single cell genomics, are now transforming the rRNA-based phylogenetic outlines of the tree of life into a fully-fledged genome-based view of the tree. I will present a recent snapshot overview of the genome tree of the bacterial and archaeal domains and examples of functional insights in the context of a more complete view of microbial evolution.

How Carl Woese transformed the field of microbial ecology

The challenges of dissecting naturally occurring microbial assemblages, with respect to their community composition, interspecies interactions, functional attributes, and activities, are numerous and daunting. For many years, these challenges impeded our understanding of the properties and dynamics of microbial communities, and thus hindered development of the field of microbial ecology. Enter Carl Woese: the theory and application of molecular phylogenetics and genomics in studies of microbial evolution and ecology can be traced directly to Woese and one of his primary collaborators, Norman Pace. This lecture will trace the logic and roots of the application of molecular phylogenetics and genomics to the study of microbial ecology, through a historical review and examination of its past and current applications.

Following Carl Woese into the Natural Microbial World – The Beginnings of Metagenomics

Carl Woese, one of the great scientists of all time, died in December, 2012. Among other important contributions, he used primitive sequencing technology to compare small subunit (16S) ribosomal RNA sequences from different organisms and thereby establish the outlines of a universal tree of life. His results also put in place a sequence-based reference framework within which to understand and articulate biological diversity. Since this perspective is based on molecular sequences and not properties of organisms, it opened the door to begin to understand the kinds of organisms that make up the natural microbial world. Prior to Woese’s sequence-based reference framework, microbial ecologists had to culture organisms to study them, but not many environmental organisms, <<1%, are cultured using standard methods. Sequence surveys of environmental microbial genes and genomes – “metagenomics” – have now revolutionized understanding of microbial ecology, including its influence on human health. The seminar will discuss how metagenomics developed and the impact it has had on our understanding of environmental microbial diversity and the structure of the molecular tree of life.

## Phylogenetics and language

Bobbins, Borrowing, and Bayesian Inference: Horizontal Transfer and the application of Phylogenetic Methods in Cultural Evolution studies

Researchers have applied quantitative phylogenetic methods to study human cultural and linguistic evolution. However, a common critique of this approach is that cultural evolution and biological evolution differ in important ways that make phylogenetic analyses unsuitable for cultural data. Principally, horizontal transmission (or borrowing) of cultural and linguistic traits is argued to be so pervasive as to invalidate the approach. In this talk I will address this issue by asking two questions: how much horizontal transfer occurs, and does it matter if it does? Contra the skeptics, I will discuss studies that demonstrate that 1) many biological systems also show non-tree-like patterns of evolution, 2) cultural systems vary in the degree to which horizontal transfer occurs, and 3) borrowing does not necessarily cause big problems. Rather than being a reason to give up on the whole project, borrowing can be productively investigated using phylogenetic techniques to yield deeper insights into cultural and linguistic evolution.

Anthropologists had a name for the non-independence-of-species problem way back in the 1880s. Solving "Galton's Problem", together with the promise of comparative methods for testing hypotheses about cultural adaptation and correlated evolution, was a major catalyst for the field of cultural phylogenetics. In this talk I will show how linguistic, cultural, and archaeological data are used in comparative phylogenetic analyses. The "treasure trove of anthropology" - our vast ethnographic record of cultures - is now being put to good use answering questions about cross-cultural similarities and differences in human social and cultural norms in a rigorous evolutionary framework.

Charles Darwin famously noted that there were many curious parallels between the evolution of species and languages. Since then evolutionary biology and historical linguistics have used trees to conceptualise evolution. However, whilst evolutionary biology developed the vast discipline of phylogenetic methods, linguistics dabbled with computational methods before rejecting them. The last decade or so has seen the introduction of phylogenetic methods into linguistics, often with some startling results. In this talk I will present some of these studies, and discuss how phylogenetics can help us grapple with the problems of linguistic and cultural evolution. These problems range from testing population dispersal hypotheses, to investigating the shape of cultural evolution, to inferring the rates at which languages change.

## Rates and Dates

Species richness results from past and current speciation, extinction and dispersal events, themselves influenced by various ecological and evolutionary processes. Estimating rates of diversification, and understanding how and why they vary over evolutionary time, geographical space, and species groups, is thus key to understanding how ecological and evolutionary processes generate biological diversity. Phylogenetic approaches are critical for making such inferences, especially in groups or regions lacking fossil data. I will illustrate how phylogenies, coupled with models of cladogenesis, can be used to test the role of ecological limits, boom-then-bust diversity dynamics, the paleoenvironment, and population dynamics on the biodiversity patterns that we observe today.

Phylogenetic trees of present-day species allow inference of the rate of speciation and extinction which led to the present-day diversity. Classically, inference methods assume a constant rate of diversification, or neglect extinction. I will discuss major limitations of this null model and will present a new framework which allows speciation and extinction rates to change through time (environmental-dependent diversification), with the number of species (density-dependent diversification), and with a trait of a species (trait-dependent diversification). For the latter model, particular focus is given to the trait being the age of a species. Issues arising in empirical data analysis, such as incomplete taxon sampling, model selection, and confidence interval estimation, will be discussed. The methods reveal interesting macroevolutionary dynamics for mammals, birds and ants, and can easily be applied to other datasets using the R packages TreePar and TreeSim available on CRAN.
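As a rough illustration of the constant-rate null model mentioned above, the following Python sketch simulates the number of lineages through time under a birth-death process with fixed speciation and extinction rates. It is not the TreePar or TreeSim interface; the function name `simulate_birth_death` and its parameters are hypothetical, and it tracks only lineage counts rather than building a tree.

```python
import random


def simulate_birth_death(speciation, extinction, t_max, seed=None, n_start=1):
    """Simulate lineage counts under a constant-rate birth-death process.

    Hypothetical helper for illustration only. Returns a list of
    (time, n_lineages) pairs, starting with (0.0, n_start).
    """
    rng = random.Random(seed)
    t, n = 0.0, n_start
    history = [(t, n)]
    while n > 0:
        # Each lineage speciates or goes extinct independently, so the
        # waiting time to the next event is exponential in the total rate.
        total_rate = n * (speciation + extinction)
        if total_rate <= 0:
            break
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # The event is a speciation with probability lambda / (lambda + mu).
        if rng.random() < speciation / (speciation + extinction):
            n += 1
        else:
            n -= 1
        history.append((t, n))
    return history


if __name__ == "__main__":
    for time, lineages in simulate_birth_death(1.0, 0.5, t_max=3.0, seed=42):
        print(f"{time:.3f}\t{lineages}")
```

Letting `speciation` and `extinction` depend on time, on the current number of lineages, or on lineage age would move this sketch toward the richer models the abstract describes, but that extension is not shown here.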

## Structure and molecular evolution

Adaptation, coevolution, and convergence in the context of protein thermodynamics

Interactions within and between proteins are a fundamentally important part of how they evolve and adapt. We have been considering how and why proteins adapt, coevolve, and converge, and working to understand these concepts in the context of protein thermostability and function. We will expand from the previous talk of our collaborator, Dr. Goldstein, and discuss how and why coevolution is and should be detected, and how thermostability affects reconstruction of ancestral functions. Further, we will discuss our work on adaptive redesign in mitochondrial proteins, perhaps the largest known case of an adaptive burst in multiple metabolic proteins. The convergence between ancestral snakes and ancestral acrodont lizards is also perhaps the largest known case of adaptive convergence. We will consider what these examples tell us about the theory of how proteins appear to evolve in the context of nearly neutral versus cases of adaptive change. Further, we will discuss the impact on understanding phylogenetic relationships, and we will also discuss a unified theory of nearly neutral and adaptive evolution in the context of structure and function.

Simulating evolution with in silico models of protein thermodynamics

Many of the most basic issues of protein evolution are difficult to determine from the relationship between extant protein sequences. We would ideally like to analyse the complete evolutionary record: what mutations were attempted when in what lineage, which ones were deleterious or advantageous and by how much, which ones were accepted, and how these substitutions affected further mutations and the overall evolution of protein properties. In the absence of available biological data, we can create our own - simulate protein evolution in silico, such as in our work modelling how proteins would evolve given their need to be thermodynamically stable. These simulations allow us to explore a range of phenomena and develop a conceptual framework that tells us which questions may be interesting and important to consider in real proteins. Such simulations can also illuminate which conditions are necessary and/or sufficient to explain observed protein characteristics. We consider how evolution of protein thermostability explains why proteins are generally marginally stable, why eukaryotes may have more disordered proteins than prokaryotes, and what the consequences of this are for biochemical networks. We also consider how various locations in a protein can co-evolve, and how this can inform the next generation of substitution models.

Protein Structural, Biophysical, and Genomic Underpinnings of Protein Sequence Evolution

Common models for amino acid substitution assume that each site evolves independently according to average properties in the absence of a genomic, protein structural or functional context. Two characterizations of amino acid substitution will be presented. One approach extends a population genetic model to inter-specific genomic data and a second approach evaluates the effects of selection for protein folding and protein-protein interaction on sequence evolution. Several take home lessons include the importance of considering linkage independent of protein structure, the importance of negative pleiotropy (or not statements in folding and binding), and the nature of the co-evolution of sites and how it links standard substitution models with covarion models when binding function is conserved and when it changes.

## Software

RevBayes: An R-like Environment for Bayesian phylogenetic inference

RevBayes is a computer program that uses directed acyclic graphs (DAG's) to specify any type of model, to hold the model and data in memory, and to compute the likelihood of the parameters of the model. DAG's provide a framework for the construction of modular models. Models can easily be extended and/or parts of the model exchanged (e.g., the substitution process and clock model) and several models can be combined. The design of RevBayes should allow the implementation of any extension to existing models. RevBayes is mainly developed for Bayesian phylogenetic analyses, but it can be extended to any inference on probabilistic models.
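To make the DAG idea above concrete, here is a minimal sketch in Python rather than in the RevLanguage. It is not RevBayes code and does not reflect its API: the `Node` class, `build_model`, `simulate`, and `log_joint` are hypothetical names, and the toy model (an exponential prior on a rate with exponentially distributed observations) was chosen only for brevity.

```python
import math
import random


class Node:
    """One stochastic node of a toy DAG (hypothetical, not the RevBayes API)."""

    def __init__(self, name, parents, sample, logpdf):
        self.name = name        # unique node name
        self.parents = parents  # names of the parent nodes
        self.sample = sample    # sample(rng, parent_values) -> value
        self.logpdf = logpdf    # logpdf(value, parent_values) -> float


def build_model(n_obs):
    """rate ~ Exponential(1); x_i ~ Exponential(rate), listed parents first."""
    nodes = [Node("rate", [],
                  lambda rng, p: rng.expovariate(1.0),
                  lambda x, p: -x)]
    for i in range(n_obs):
        nodes.append(Node(f"x{i}", ["rate"],
                          lambda rng, p: rng.expovariate(p["rate"]),
                          lambda x, p: math.log(p["rate"]) - p["rate"] * x))
    return nodes


def simulate(nodes, seed=None):
    """Ancestral sampling: visit nodes in topological order."""
    rng = random.Random(seed)
    values = {}
    for node in nodes:
        parent_values = {name: values[name] for name in node.parents}
        values[node.name] = node.sample(rng, parent_values)
    return values


def log_joint(nodes, values):
    """Log joint density: the sum of each node's log density given its parents."""
    return sum(
        node.logpdf(values[node.name],
                    {name: values[name] for name in node.parents})
        for node in nodes
    )


if __name__ == "__main__":
    model = build_model(3)
    draw = simulate(model, seed=1)
    print(draw, log_joint(model, draw))
```

Because each node only knows its parents, swapping one component (for example, a different distribution for the observations) leaves the rest of the graph untouched, which is the modularity the paragraph above refers to.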

In this talk, I will give a brief introduction to the concept of DAG's and how they are used to construct a model. Once the model is specified, I will show how to simulate new observations under the model and how to estimate its parameters. I will demonstrate this in the RevLanguage, which is an R-like language for building DAG's for phylogenetic problems. The RevLanguage is used interactively to specify the model, as done with R. I will show how a full phylogenetic model is specified, step-by-step. I will mainly focus on various standard substitution models, relaxed clock models, and divergence time priors. Specifically, I will show a new birth-death model with speciation and extinction rates varying over time and use this in an integrative analysis. In the integrative analysis I condition only on the alignment (only the alignment is considered to be known) and estimate the tree and divergence times simultaneously as well as the speciation and extinction rates.

Example files for the demonstration are available here.

HyPhy is an open-source software package for the analysis of genetic sequences using techniques in phylogenetics, molecular evolution, and machine learning. It features a complete graphical user interface (GUI) and a rich scripting language for limitless customization of analyses. Additionally, HyPhy features support for parallel computing environments (via message passing interface) and it can be compiled as a shared library and called from other programming environments such as Python or R.

Introduction to phytools and phangorn: phylogenetics tools for R

phytools is a new multifunctional phylogenetics package for the R statistical computing environment. The focus of the package is on methods for phylogenetic comparative biology; however it also includes tools for simulation, phylogeny input/output, manipulation, and even inference. The phytools library is designed for maximum interoperability with other important R phylogenetics packages such as ape, geiger, and phangorn.

phangorn is a package for phylogenetic reconstruction and analysis in the R language. Previously it was only possible to estimate phylogenetic trees with distance methods in R. phangorn now offers the possibility of reconstructing phylogenies with distance-based methods, maximum parsimony or maximum likelihood (ML) and performing Hadamard conjugation. Extending the general ML framework, this package provides the possibility of estimating mixture and partition models. Furthermore, phangorn offers several functions for comparing trees, phylogenetic models or splits, simulating character data and performing congruence analyses.

## Beyond IID

The key component of a probabilistic joint approach to tree and alignment inference is a Continuous Time Markov Chain (CTMC) over strings. Ideally, this CTMC should support tractable inference algorithms and should be easily extensible to support a wide range of evolutionary models. The classical string-valued CTMC, the TKF91 model (Thorne et al., 1991), is limited in both of these axes. Previous work has focussed on increasing the complexity of the TKF91 model, making the inference problem computationally more difficult (Miklos et al., 2004).

In this work, we present a new stochastic process, the Poisson Indel Process (PIP), which allows simple and practical inference algorithms. Efficient computations are based on an exchangeable representation and on Poisson processes. This representation gives a natural way of extending the capacity of the model while keeping inference computationally practical.

We used this process to design a joint Bayesian estimator over alignments and trees. We evaluated both consensus trees and alignments against standard baselines on synthetic and real data. These experiments demonstrate that competitive trees and alignments can be inferred using a Bayesian model equipped with a PIP prior.

Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

The "multiple sequence alignment" is a computational artifact. In nature there is no such thing; rather, an alignment represents a partial summary either of indel history, or of structural similarity. Here we show, via evolutionary simulation tests, that all currently-available multiple alignment tools introduce systematic biases into downstream evolutionary analysis - particularly when used to reconstruct histories of insertions and deletions.

I will present our unification of Felsenstein's "pruning" algorithm and "progressive alignment" to build a fast, linearly-scaling approximate-maximum-likelihood phylogenetic alignment/reconstruction algorithm. Inference of evolutionary history in this framework displays a clear improvement in accuracy over non-statistical phylogenetic reconstructions and a massive improvement in performance over slow-running MCMC statistical reconstructions.
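For readers unfamiliar with Felsenstein's pruning algorithm, the following Python sketch computes the likelihood of a single aligned site on a small fixed tree. It covers only the pruning recursion, not the unification with progressive alignment described above. The Jukes-Cantor (JC69) substitution model, the nested-tuple tree encoding, and all function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def jc69_prob(same, t):
    """JC69 transition probability over a branch of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return, for each base, the likelihood of the data below `node`
    given that `node` is in that state.

    A leaf is a one-letter string such as "A"; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return [1.0 if base == node else 0.0 for base in BASES]
    left, t_left, right, t_right = node
    below_left = partial_likelihoods(left)
    below_right = partial_likelihoods(right)
    result = []
    for i in range(4):
        # Sum over the unknown state at the far end of each child branch.
        from_left = sum(jc69_prob(i == j, t_left) * below_left[j]
                        for j in range(4))
        from_right = sum(jc69_prob(i == j, t_right) * below_right[j]
                         for j in range(4))
        result.append(from_left * from_right)
    return result


def site_likelihood(tree):
    """Average the root partial likelihoods over the uniform JC69
    stationary distribution."""
    return sum(0.25 * value for value in partial_likelihoods(tree))


if __name__ == "__main__":
    tree = (("A", 0.1, "A", 0.2), 0.05, "G", 0.3)
    print(site_likelihood(tree))
```

Each node is visited once, so the cost per site is linear in the number of tips, which is the property that makes the recursion attractive as a building block for larger reconstruction methods.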

## Evolutionary genomics

Fungi occupy diverse ecological niches in roles from nutrient cycling in rainforest floors to aggressive plant and animal pathogens. Molecular phylogenetics has helped resolve many of the branches on the fungal tree of life, enabling studies of evolution across this diverse kingdom. The genome sequences from hundreds of fungi now permit the study of change in genes and gene content in this phylogenetic context and to connect molecular evolution with adaptation to ecological niches or changes in lifestyles. I will describe our work in studies contrasting pathogenic and non-pathogenic fungi and efforts to unravel the evolution of multicellularity in fungi comparing unicellular basal fungi with multicellular mushrooms and molds.

The development of tools for data mining and use of fungal genomics is also driving the pace of molecular biology and genetics of fungi. I will highlight new approaches to make this easier and the ways data integration can inform and transform studies of functional biology of fungi.

Besides their value for biomedicine, individual genome sequences represent a rich source of information about human evolution. I will describe an effort to estimate key evolutionary parameters from the genome sequences of six individuals from diverse human populations. We have used a Bayesian approach based on coalescent theory to extract information about ancestral population sizes, divergence times, and migration rates from inferred genealogies at many neutrally evolving loci from across the genome. We introduce new methods for accounting for gene flow between populations and integrating over possible phasings of diploid genotypes. I will also describe a custom pipeline for genotype inference to mitigate possible biases from heterogeneous sequencing technologies, coverage levels, and read lengths. Our analysis indicates that the San of Southern Africa diverged from other human populations 108--157 thousand years ago (kya), that Eurasian populations diverged 38--64 kya, and that the effective population size of the ancestors of all modern humans was ~9,000.

Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes

The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes -- especially at synonymous sites. We developed a method to systematically locate short regions within known ORFs that show conspicuously low estimated rates of synonymous substitution, based on phylogenetic codon rate models and likelihood ratio tests.

We applied this method to genome alignments of 29 placental mammals, resulting in more than 10,000 “synonymous constraint elements” (SCEs) with resolution down to nine-codon windows. These are found within more than a quarter of all human protein-coding genes and contain ~2% of their synonymous sites. We collected numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. We also ruled out certain alternative explanations such as codon usage bias and neutral rate variation.

Our initial results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape. Furthermore, anticipating the future availability of additional mammalian and vertebrate genomes, we are currently developing Bayesian codon modeling methods to measure synonymous rates at even higher resolutions, perhaps eventually allowing the detection of individual regulator binding sites embedded in protein-coding ORFs.

## Macroevolution

For decades, biologists have addressed evolutionary and ecological questions using measurements of species traits, phylogenies, and an assortment of comparative methods. Unfortunately, while there is a large assortment of these methods, they are still fairly limited and development of new methods is slow. It took seven years between the introduction of using a simple Brownian motion model for looking at trait evolution (Felsenstein, 1985) and the use of this same model for looking at rates of trait evolution (Garland, 1992), and an additional 14 years to arrive at more powerful tests using a small modification of the basic model (O'Meara et al., 2006). Still other promising methods are described and even tested but remain unavailable to empiricists because they are not put into software. As a result, the questions empiricists can ask about the world are limited by the research productivity of the few dozen scientists who develop and implement new methods in phylogenetics. We describe a new approach based on Approximate Bayesian Computation and implemented in R that will allow researchers to easily develop their own models for trait evolution without requiring them to have specialized mathematical or computational knowledge.
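The general idea of Approximate Bayesian Computation can be sketched with a simple rejection sampler for a Brownian motion rate. This is a Python illustration under strong simplifying assumptions, not the R implementation the abstract refers to. Trait changes are treated as independent along branches of known length, the prior on the rate is uniform, and the summary statistic, tolerance, and all function names are hypothetical.

```python
import math
import random


def simulate_bm_changes(sigma2, branch_lengths, rng):
    """Under Brownian motion, the change along a branch of length t is
    Normal(0, sigma2 * t)."""
    return [rng.gauss(0.0, math.sqrt(sigma2 * t)) for t in branch_lengths]


def summary(changes, branch_lengths):
    """Mean squared change per unit branch length (an assumed summary
    statistic; its expectation under the model equals sigma2)."""
    return sum(c * c / t for c, t in zip(changes, branch_lengths)) / len(changes)


def abc_rejection(observed, branch_lengths, prior_max, tolerance,
                  n_draws, seed=None):
    """Keep the prior draws of sigma2 whose simulated summary statistic
    falls within `tolerance` of the observed one."""
    rng = random.Random(seed)
    s_obs = summary(observed, branch_lengths)
    accepted = []
    for _ in range(n_draws):
        sigma2 = rng.uniform(0.0, prior_max)  # assumed uniform prior
        simulated = simulate_bm_changes(sigma2, branch_lengths, rng)
        if abs(summary(simulated, branch_lengths) - s_obs) <= tolerance:
            accepted.append(sigma2)
    return accepted


if __name__ == "__main__":
    lengths = [1.0, 2.0, 0.5, 1.5, 1.0, 3.0]
    data = simulate_bm_changes(2.0, lengths, random.Random(7))
    posterior = abc_rejection(data, lengths, prior_max=10.0,
                              tolerance=0.2, n_draws=20000, seed=11)
    print(len(posterior), "draws accepted")
```

The appeal of this style of inference is that only a simulator and a summary statistic are needed, so changing the trait model means changing `simulate_bm_changes` while the rest of the sampler stays the same.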

We're building the tree of life, but what can we do with it? It seems clear that there is a wealth of information about evolution in the structure of this tree. There are some methods that can use phylogenetic trees to test macroevolutionary models, but the range of models that we can test is still severely limited. In some cases, such as the estimation of extinction rates from phylogenetic trees, current methods have proven controversial. We are now beginning to develop and implement methods that use tree-of-life scale data to answer key questions in evolution. I will review three new approaches developed in my lab for analyzing comparative datasets: MECCA, fossil-Medusa, and reversible-jump MCMC. I argue that these methods represent the next generation of comparative methods that will open the door to analyzing a much broader range of models with large datasets.

What poultry breeders and guinea pigs have to tell us about statistical nonmolecular phylogenetics

We are far from having an understanding of the determination of morphological characters at the genome level, so most evolutionary biologists working on them still need to use phenotypic approaches. I will discuss the prospects for using the tools of quantitative genetics, which has faced the same dilemma for the past century. I will use as examples three projects of my own. One, which is joint work with Fred Bookstein, adapts the tools of morphometrics, of which he is a chief developer, to modeling change of morphological forms on phylogenies. The second is a similar project that asks how to best place fossil forms into a phylogeny of present-day species when there is molecular data enabling us to get a good estimate of the phylogeny for those species. The third models discrete 0/1 characters using the Threshold Model developed by Sewall Wright for his work on guinea pigs. All of these lead to asking whether we can connect Brownian Motion models with quantitative genetics models. In all such cases we will have limits on what we can infer, and need to be aware of the need to carry that uncertainty through any subsequent inference using these results.

## Infectious disease

Accurate estimation of evolutionary attributes of coding sequences and evolutionary fingerprinting

Codon substitution models have facilitated the interpretation of evolutionary forces operating on genomes. Most of these models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have different rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation or the adoption of a particular residue exchangeability scale. We present an alternative procedure in which substitution rates between amino acid pairs are subdivided into a few rate classes, dependent on the information content of the alignment. This procedure permits us to infer generalizable models for specific genes, organisms and taxonomic clades.

The representation of all virus families within a single phylogenetic tree may be a misleading description of their evolutionary history. First, it is unlikely that all viruses originated from a unique common ancestor. Second, viruses (retroviruses in particular) can integrate into the host genome and be transmitted vertically as well as horizontally. Third, different viral genera can evolve according to dramatically different molecular clocks. Three paradigmatic examples from the Retroviridae family will be considered here: the simian foamy viruses (SFVs); the primate T-lymphotropic viruses (PTLVs), which include HTLV and STLV; and the primate lentiviruses (PLVs), which include SIV, HIV-1 and HIV-2. SFV is an example of an ancient virus that has been co-evolving with its primate hosts over the last 30 million years. PTLVs emerged around 300 thousand years ago and are characterized by frequent interspecies transmissions and multiple introductions into human populations since prehistoric times. PLVs have a much more recent origin and only within the last 200 years have been able to spread successfully within the human population. The complex relationship between population dynamics and evolutionary time-scale of these retroviruses, as well as the challenge of their integration within the tree of life will be discussed.

Emerging infectious diseases continue to appear all over the world, and importantly, they have also risen significantly over time. Having the potential to quickly adapt to new hosts and environments, RNA viruses are prime candidates to emerge as global threats to human health. Their rapid rate of evolution, however, also turns viral genomes into valuable resources to reconstruct the spatial and temporal processes that are shaping epidemic or endemic dynamics.

In this seminar, I will highlight recent developments in phylogenetic diffusion models that tie together sequence evolution and geographic history in a coherent statistical framework. Both discrete and continuous phylogeographic models have recently been implemented in a Bayesian statistical approach. I will position this approach among other popular phylogeographic methods, and then focus on applications in viral molecular epidemiology to demonstrate their use. Finally, I will hint at future extensions that may provide entirely new opportunities for phylogeographic hypothesis testing.

The influenza A virus infects approximately 500 million individuals each year. Owing to its RNA makeup, influenza mutates extremely rapidly, allowing the virus population to escape the pull of the human immune system. A single individual may be infected year after year by antigenically novel strains. As a result of this rate of mutation, the timescale of influenza evolution is a human timescale. We get the chance to observe the process of evolution in action. However, the rapid pace of evolution also causes an intrinsic link between evolutionary and ecological dynamics in the virus population. The availability of temporally spaced sequence data allows estimation of details of these dynamics unavailable in other systems. Through analysis of this data, I address open questions regarding patterns of adaptation and the effects of seasonality in the human influenza virus.

## Gene-tree species-tree

Probabilistic Analysis of gene families with respect to gene duplication, loss, and transfer

Consistency properties of species tree inference algorithms under the multispecies coalescent