previously recorded seminars


University of Tasmania

Developing a statistically powerful measure for quartet tree inference using phylogenetic and Markov invariants

Recently there has been renewed interest in phylogenetic inference methods based on phylogenetic invariants, alongside the related Markov invariants. Broadly speaking, both these approaches give rise to polynomial functions of sequence site patterns that, in expectation value, either vanish for particular evolutionary trees (in the case of phylogenetic invariants) or have well understood transformation properties (in the case of Markov invariants).

While both approaches have been valued for their intrinsic mathematical interest, it is not clear how they relate to each other, and to what extent they can be used as practical tools for inference of phylogenetic trees. By focusing on the special case of binary sequence data and quartets of taxa, we are able to view these two different polynomial-based approaches within a common framework.

We present three desirable statistical properties that we argue any invariant-based phylogenetic method should satisfy: (1) sensible behaviour under reordering of input sequences; (2) stability as the taxa evolve independently according to a Markov process; and (3) explicit dependence on the assumption of a continuous-time process. Motivated by these statistical properties, we develop and explore several new phylogenetic inference methods. In particular, we develop a statistically bias-corrected version of the Markov invariants approach which satisfies all three properties. We also extend previous work by showing that the phylogenetic invariants can be implemented in such a way as to satisfy property (3). A simulation study shows that, in comparison to other methods, our new proposed approach based on bias-corrected Markov invariants is extremely powerful for phylogenetic inference.

Ohio State University

Using invariants for coalescent-based phylogenetic inference

The advent of rapid and inexpensive sequencing technologies has necessitated the development of computationally efficient methods for analyzing sequence data for many genes simultaneously in a phylogenetic framework. The coalescent process is the most commonly used model for linking the underlying genealogies of individual genes with the global species-level phylogeny, but inference under the coalescent model is computationally daunting in the typical inference frameworks (e.g., the likelihood and Bayesian frameworks) due to the dimensionality of the space of both gene trees and species trees. By viewing the data arising under the phylogenetic coalescent model as a collection of site patterns, the algebraic structure associated with the probability distribution on the site patterns can be used to develop computationally efficient methods for inference via phylogenetic invariants.

In this talk, I will discuss three problems that can be addressed using invariants. First, I will describe how identifiability results for four-taxon species trees based on site pattern probabilities can be used to build a quartet-based inference algorithm for trees of arbitrary size. Second, methods for rooting phylogenetic species trees inferred under the coalescent model will be discussed. Finally, the use of invariants to detect species that arose via hybridization will be described. The methods presented will be demonstrated on several phylogenomic-scale datasets. Because the methods are derived in a fully model-based framework (i.e., the coalescent process is used to model the relationship between gene trees and the species tree, and standard nucleotide substitution models (GTR+I+G and all submodels) are used for sequence-level evolution), these methods are promising approaches for computationally efficient, model-based inference for the large-scale sequence data available today.
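
As a rough illustration of the "collection of site patterns" view described in this abstract, the sketch below tallies site-pattern counts for a four-taxon alignment. It is not the speaker's method: the function name `site_pattern_counts`, the toy alignment, and the choice to skip columns with gaps or ambiguity codes are all assumptions made for this example. Invariant-based methods would then evaluate polynomials in the resulting pattern frequencies.

```python
from collections import Counter


def site_pattern_counts(alignment):
    """Count site patterns (alignment columns) across equal-length sequences.

    `alignment` maps taxon name -> sequence; taxa are read in sorted order.
    Columns containing anything other than A, C, G, T are skipped
    (an assumption of this sketch, not a requirement of any method).
    """
    taxa = sorted(alignment)
    seqs = [alignment[t].upper() for t in taxa]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must have equal length")
    counts = Counter()
    for column in zip(*seqs):
        if all(base in "ACGT" for base in column):
            counts["".join(column)] += 1
    return taxa, counts


# Hypothetical toy alignment for four taxa.
toy = {
    "A": "ACGTACGTAA",
    "B": "ACGTACGTAC",
    "C": "ACGAACGTCA",
    "D": "ACGAACG-CC",
}
taxa, counts = site_pattern_counts(toy)
total = sum(counts.values())
# Empirical site-pattern frequencies, the raw input to invariant-based methods.
frequencies = {pattern: n / total for pattern, n in counts.items()}
```

For four taxa and four nucleotides there are at most 256 distinct patterns, so the data reduce to a small count vector regardless of alignment length.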

Universitat Politècnica de Catalunya

Phylogenetic invariants: what are they and why should we care

It has now been thirty years since the introduction of phylogenetic invariants by Lake, Cavender, and Felsenstein. However, the use of phylogenetic invariants as a method of phylogenetic reconstruction has been in a dormant state for about 20 years; quoting J. Felsenstein in his 2004 book, "invariants are worth attention, not for what they do for us now, but what they might lead to in the future".

During the last decade many efforts have been made by mathematicians to completely understand the structure and use of phylogenetic invariants. This has led to the characterization of different types of invariants for many different models: from the simplest Jukes-Cantor model to the general Markov model, and even mixtures of them and the coalescent. Most importantly, this has produced new and efficient methods of phylogenetic reconstruction for complex models. Invariants have also been used in model selection and have been crucial in proving the identifiability of parameters for certain models.

In this talk we shall introduce phylogenetic invariants, explain the main ideas that underlie the methods of phylogenetic reconstruction based on invariants, and discuss their advantages and drawbacks.


Dalhousie University

Darwinizing Gaia

Talks in this series have largely focused on population genetic and phylogenetic methods for reconstructing micro- and macroevolutionary patterns consequent from microevolutionary processes. When natural selection is invoked, it is generally assumed to operate through the differential reproduction of favored variants among populations of physical entities, be they genes, cells, organisms or (rarely) species. The Gaia hypothesis of James Lovelock, co-developed and vigorously promoted by Lynn Margulis in the 1970s, has been very popular with the lay public. But most mainstream Darwinists scorned and still do not accept the notion. They cannot imagine global biospheric stability being selected for at any of the above levels, and do not see the Earth's biosphere as part of a population of comparable global entities engaged in reproductive competition. Most philosophers of biology would similarly argue that any global homeostatic systems (if they exist) can be only "fortuitous byproducts" of lower-level selection. I will suggest that we look at the biogeochemical cycles and other homeostatic processes that might confer stability-- rather than the individual organisms or "species" (mostly microbial) that implement them-- as the relevant units of selection. By thus focusing our attentions on the "song", not the "singers," a Darwinized Gaia might be developed. Our understanding of evolution by natural selection would however need to be stretched to accommodate differential persistence, and our definition of reproduction would need to be reworked.

Archaea and Bacteria

Auckland University

Joint Bayesian inference of bacterial ancestral recombination graphs

Homologous recombination is a central feature of bacterial evolution, yet confounds traditional phylogenetic methods. In this seminar I will present a novel approach to inferring bacterial evolution based on the ClonalOrigin model (Didelot et al., Genetics, 2010). This method permits joint Bayesian inference of the entire bacterial recombination graph and associated model parameters. The method is implemented in the BEAST 2 phylogenetic inference package. It can be easily combined with a variety of substitution models accounting for site-to-site clock rate heterogeneity as well as parametric and non-parametric models of effective population size dynamics. I will also present work on summarizing posterior distributions over the space of tree-based recombination graphs which, together with the joint inference method, aims to bridge the technological gap between recombination-aware phylogenetic inference and traditional methods.

Imperial College London

Modelling recombination in prokaryote phylogenomics

Recombination happens frequently in most bacterial and archaeal species. Traditional phylogenetic techniques do not account for this, which can greatly limit their usefulness for the analysis of genomic data. The coalescent with gene conversion accurately models the ancestry process of prokaryotes, and this can be used to simulate realistic data, but it is too complex to use in an inferential setting. Approximations have therefore been introduced, which are centred around the concept of the clonal genealogy, that is the phylogeny obtained by following the line of ancestry of the recipient of each recombination event. I will review these mathematical models and ongoing efforts to develop statistical software to perform phylogenomic analysis in recombining prokaryotes.

University of Waterloo

Microbial diversity through a total community sequencing lens

Total community approaches (omics) provide a blueprint of the microbial functions and community diversity within an environment. With genome-resolved metagenomics, this view can be refined, identifying an organism's specific contributions to pathways and processes as well as their interactions with other community members. This approach has led to a recent explosion of genome sequences for uncultured and uncharacterized microbial lineages, many with previously-unknown roles in biogeochemical cycles. My work explores the environmental importance of these novel organisms and the emerging view of the Tree of Life that stems from our new understanding of microbial diversity.

Heterogeneous substitution

Université de Lyon

Systematic errors in phylogenomic studies: on the importance of modeling pattern-heterogeneity across sites.

While all models now used in phylogenetic analyses account for rate-heterogeneity across sites, the case of pattern-heterogeneity (i.e. qualitative variation in substitution processes across nucleotide or amino-acid positions) is much less clear and has recently been the subject of some controversy. One main question is whether pattern-heterogeneity should be modelled at the level of genes (or groups of genes), or at the level of sites. Both approaches have been used in recent phylogenomic analyses of metazoans---sometimes leading to radically different conclusions---in particular concerning the early patterns of diversification within this group.

In this talk, I will first explore the empirical evidence concerning the presence, and the relative importance, of either type of heterogeneity in empirical sequence alignments. Then, I will introduce Dirichlet process mixture models accounting for site-specific amino-acid preferences. The statistical meaning of Dirichlet processes, as a non-parametric method for estimating arbitrary distributions of site-specific effects, will be explained and illustrated through simulation experiments. Finally, based on simulations implementing pattern heterogeneity simultaneously at both the gene and the site levels, I will show the importance of using models explicitly accounting for pattern-heterogeneity across sites for reconstructing accurate phylogenies.

US Fish and Wildlife Service

Modeling substitutional heterogeneity and its impact on inferring relationships

Heterogeneity in amino acid substitution is an inherent feature of most phylogenomic-scale datasets, and modeling such heterogeneity is now widely seen as important for phylogenomic inference. Site-heterogeneous substitution models such as CAT-F81 and CAT-GTR, as implemented in PhyloBayes, have been forcefully advocated for use on large datasets because they may reduce long-branch attraction artifacts that could result from not adequately modeling amino acid substitutional heterogeneity. However, site-heterogeneous models arguably became popular not because of a deep appreciation for how well they modeled substitutional heterogeneity, but rather because analyses with CAT models often resulted in trees that matched preconceived notions of animal phylogeny (e.g., sponges as the sister lineage to all other extant animals). Importantly, site-heterogeneous models have not been thoroughly compared to other methods for modeling substitutional heterogeneity such as coarse modeling of heterogeneity with data partitioning coupled with site-homogeneous models such as WAG or LG. Here, I show through analyses of simulated and empirical data that data partitioning often performs as well as, or better than, site-heterogeneous CAT models. In contrast to past claims, I demonstrate that partitioning with site-homogeneous models suppresses long-branch attraction artifacts as well as CAT-GTR and much better than CAT-F81. Analyses with data partitioning and site-homogeneous models can require orders of magnitude less computational time than popular site-heterogeneous models, while still resulting in reasonably accurate trees. Although site-heterogeneous models may describe the amino acid substitutional process much better than data partitioning with site-homogeneous models, current implementations of the most popular site-heterogeneous models do not appear to result in more accurate phylogenetic hypotheses than those inferred with partitioning. 
Thus, the need to model fine-scale site-heterogeneity in phylogenetic inference is called into question.

Dalhousie University

Combating phylogenetic artefacts by modeling site-specific substitution processes with mixture models and approximations

The most widely used phylogenetic models of amino acid substitution involve a single reversible empirical substitution matrix (e.g. LG, WAG, JTT etc.) and a mixture model of rate heterogeneity across sites, such as a discretized gamma distribution. However, these models fail to capture important constraints on protein sequence evolution, heterogeneity in the substitution process across the tree, and heterogeneity across multiple proteins in a concatenated data matrix. Failure to model these features of the data can lead to artefacts in phylogenetic reconstructions, especially for "deep" phylogenetic problems. Here I focus on the importance of modeling site-specific heterogeneity in the substitution process.

The structural and functional roles of residues in proteins lead to constraints on the kinds of amino acids that may be substituted at positions over time, a feature that is not captured by the single-matrix models. Site-heterogeneous mixture models have been developed to address this issue. For example, the "CAT" mixture models (CAT-Poisson or CAT-GTR), implemented in the PhyloBayes program, have been shown to successfully avoid long branch attraction problems associated with single-matrix analyses in a number of published cases. However, the utility of these and other mixture models is severely limited for very large phylogenomic analyses because of their computational time cost and memory usage. I will discuss several simple, rapid and efficient approximations to these full profile mixture models. Our simulation and empirical data analyses demonstrate that these approximations ameliorate long branch attraction artefacts and, in several cases, provide more accurate estimates of phylogenies than the mixture models from which they derive.


University of Washington

A brief history of computational phylogenetics

I will discuss the history of the use of computers to infer phylogenies, starting in the late 1950s and giving particular emphasis to the introduction of the major methods in the 1960s. Much of this history I watched happen, from 1965 on. In particular I will explain the way that work in biological systematics, in population genetics, and in molecular evolution of multiple species gave rise to the early methods. I will touch on the controversies that developed in the 1970s and 1980s, a period of intense conflict over what should be the logical foundation of the reconstruction of phylogenies. Computational phylogenetics is becoming continually more statistical and continually less connected to the separable task of erecting a biological classification of organisms. Recent Twitter controversies show that arguments that were dominant and vehement in the 1980s are now taken seriously by few.

Structure and molecular evolution

University College London

What determines amino acid substitution rates?

Evolutionary and phylogenetic analyses are the basis of understanding the origins and properties of all living systems. Darwin noted that the manner in which any organism evolves is largely determined by its interactions with other organisms and the environments they produce, on the "tangled bank" of plants, birds, insects, and worms, all "dependent upon each other in so complex a manner." This is also true at a protein level, where the selection acting on a protein for traits such as function, structure, and stability depends on the manner in which the amino acids interact, so the substitutions that occur at one site are affected by the amino acids at other sites in the protein (as well as other proteins and biomolecules). Capturing and characterising these networks is central to developing new mechanistic models of the substitution process grounded on the underlying molecular biophysics and population biology. The simulated evolution of proteins under selection for thermodynamic stability suggests connections between substitutions and other processes described by statistical physics. By using the language of statistical physics, we can develop deeper insights into the evolutionary process. By using the tools of statistical physics, we can move towards calculating substitution rates from first principles.

University of Texas at Austin

Structural and functional constraints on protein evolution

Proteins are under selective constraints to fold stably into their native conformation and to carry out their biological function. These selective constraints shape how proteins evolve, and they cause variation in substitution rates among the sites within a given protein. In particular, sites in the core of a protein, with many residue-residue contacts, tend to be more conserved than sites on the protein surface. Further, catalytic residues in enzymes are highly conserved, and they impart a measurable increase in conservation to much of the enzyme structure, in a distance-dependent manner. (The further a site is from a catalytic residue, the less extra conservation it experiences.) Finally, protein-protein interfaces show a surprising ability for evolutionary divergence, even if they are strongly selected for function.

Fred Hutch

Using experiments to inform phylogenetic models of substitution

Computational algorithms to infer phylogenetic relationships or detect sites of positive selection are widely used in diverse branches of biology. However, anyone with a passing knowledge of modern biochemistry can recognize that the quantitative models of the evolutionary process used by these algorithms are woefully oversimplified. I will discuss prospects for making these models more realistic while keeping them computationally tractable. In particular, I will discuss how new sources of high-throughput experimental data can be leveraged to improve algorithms for the analysis of gene sequences.

Biased sampling

University of Oxford

New routes to phylogeography: a Bayesian structured coalescent approximation

Phylogeographic methods aim to infer migration trends and the history of sampled lineages from genetic data. Applications of phylogeography are broad, and in the context of pathogens include the reconstruction of transmission histories and the origin and emergence of outbreaks. Phylogeographic inference based on bottom-up population genetics models is computationally expensive, and as a result faster alternatives based on the evolution of discrete traits have become popular. In this seminar I will discuss the advantages and disadvantages of different phylogeographic methods; in particular, I will address the issue of the sensitivity of discrete trait methods to the sampling strategy. I will also present a new method called BASTA (BAyesian STructured coalescent Approximation), implemented in BEAST2, that combines the accuracy of methods based on the structured coalescent with the computational efficiency required to handle more than just a few populations. I will illustrate the potentially severe implications of model choice for phylogeographic analyses by investigating the zoonotic transmission of Ebola virus and the between-species transmission of the Avian Influenza Virus.

University of Washington

Preferential sampling through time when estimating changes in effective population size

Phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from a population of interest. However, when analyzing sequences sampled serially through time, current methods implicitly assume either that sampling times are fixed deterministically by the data collection protocol or that their distribution does not depend on the size of the population. Through simulation, we first show that, when sampling times do probabilistically depend on effective population size, estimation methods may be systematically biased. To correct for this deficiency, we propose a new model that explicitly accounts for preferential sampling by modeling the sampling times as an inhomogeneous Poisson process dependent on effective population size. We demonstrate that in the presence of preferential sampling our new model not only reduces bias, but also improves estimation precision. Finally, we compare the performance of the currently used phylodynamic methods with our proposed model through seasonal human influenza examples. Our analysis demonstrates that influenza data sets constructed by mining sequence databases do contain strong preferential sampling signal. Accounting for this preferential sampling produces a markedly cleaner picture of influenza population dynamics.
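
To make the idea of preferential sampling concrete, the following sketch simulates sampling times from an inhomogeneous Poisson process whose intensity tracks effective population size, using the standard thinning (Lewis-Shedler) algorithm. The linear intensity `beta * ne(t)`, the seasonal trajectory `seasonal_ne`, and all parameter values are assumptions chosen for illustration; the abstract does not specify the functional form used in the talk.

```python
import math
import random


def simulate_sampling_times(ne, beta, t_max, rate_max, rng):
    """Simulate event times on (0, t_max] from an inhomogeneous Poisson
    process with intensity beta * ne(t), by thinning a homogeneous process.

    `rate_max` must be an upper bound on beta * ne(t) over the interval.
    """
    times = []
    t = 0.0
    while True:
        # Candidate event from a homogeneous process with rate rate_max.
        t += rng.expovariate(rate_max)
        if t > t_max:
            return times
        intensity = beta * ne(t)
        if intensity > rate_max:
            raise ValueError("rate_max does not bound the intensity")
        # Keep the candidate with probability intensity / rate_max.
        if rng.random() < intensity / rate_max:
            times.append(t)


def seasonal_ne(t):
    """Hypothetical seasonal effective population size (period 1 time unit)."""
    return 100.0 * (1.0 + 0.9 * math.sin(2.0 * math.pi * t))


rng = random.Random(1)
times = simulate_sampling_times(
    seasonal_ne, beta=0.5, t_max=5.0, rate_max=95.0, rng=rng
)
```

Under this toy model, samples cluster in periods of high effective population size, which is the kind of dependence that methods assuming fixed or population-independent sampling times would ignore.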

Concordia University

Ancestral features in trees with trait-dependent diversification

As was recently shown, variation in speciation rates among lineages results in substantial biases when estimating diversification rates from phylogenetic trees. Consequently, confidence in many phylogenetic estimates for trait-dependent models of diversification from trees on extant species alone may well exceed what is possible. From a mathematical point of view a fair amount is known about the probability distribution of ancestral trees derived from a single-type birth and death process, while much less is known about ancestral trees derived from multi-type branching processes with type-dependent rates. In this talk I will present a few results in this direction. First, there is an algorithmic way to construct an ancestral tree of the standing population of a multi-type branching process in terms of a Markov chain (of vectors of types and multiplicities). This construction allows one to get explicit formulae for calculating: (a) statistical features that describe the shape of the tree (the law of coalescence times together with types on the ancestral lineages), and (b) statistical features that link types in the standing population with the shape of the tree (the law of same-type coalescence times). Second, explicit calculations can be used to compare the effect that different branching mechanisms have on the distributions of ancestral trees. I will illustrate this in a simple example of a two-type process with completely asymmetrical vs symmetrical probabilities of offspring types.

Phylogenetic conservation

Muséum National d'Histoire Naturelle, Paris

Comparing patterns in phylogenetic and trait diversity

Studying the phylogeny led to the emergence of interdisciplinary approaches combining ecology, evolutionary biology and biogeography. The analysis of the phylogenetic relatedness among species complemented the analysis of the functional (trait-based) similarities among species, and even sometimes replaced it when phylogenetic relatedness was considered as a proxy for functional similarity. The use of phylogenetic diversity as a proxy for functional diversity has been questioned due to the observation of moderate phylogenetic signal in many field studies. From a methodological viewpoint, a fundamental difference between phylogenetic and functional analyses is that phylogeny is intrinsically dependent on a tree-like structure whereas trait data can, most of the time, only be forced to adhere to a tree structure, not without some loss of information. I will discuss the ways phylogenetic and functional diversity patterns can be compared and the consequences of their simultaneous analyses for conservation and community ecology.

Stony Brook University

Phylogenetic beta-diversity: a means to understand, map and conserve spatial patterns of biological diversity

Beta-diversity has long been recognized as an instrumental diversity measure providing insight as to how and why diversity varies across space. Beta-diversity also underlies most complementarity-based reserve design algorithms which quantify the extent to which an area contributes unrepresented features to an existing area or set of areas. In the early 2000s, researchers started to recognize that beta-diversity could be extended to include phylogenetic information. By accounting for shared evolutionary history among assemblages/regions, phylogenetic beta-diversity can provide insights into both the ecological and evolutionary mechanisms influencing variation in species diversity and the best way to conserve phylogenetic diversity in a reserve system. In this seminar I will begin by briefly reviewing various definitions and approaches to measuring and mapping beta-diversity. Then I will use a series of examples to show some of the new insights phylogenetic beta-diversity has provided to both basic science and conservation.

Simon Fraser University

Conserving phylogenetic information: indices, approaches and gaps

There seems to be increased interest in the notion that evolutionary history is worthy of management and conservation (see, e.g. Frishkoff et al. 2014; Diniz-Filho et al. 2013). The basic quantity seems to be “phylogenetic diversity” (PD) or the sum of the edge lengths connecting a candidate set of species (Faith 1992). Given a tree or network, one can produce many measures of current (or expected) (contributions to) PD, and these can be modified by other axes of value and expected costs and benefits of interventions. The technical side of the field seems to me to be in some disarray; there are overlapping terms and definitions, weak connections to other literatures (particularly community ecology), and under-tested assumptions. My presentation will offer little or no new data, but I will draw on the work of others in an attempt to partially organize the technical side of the field as I see it. Key issues concerning mapping traits and geographic scale are taken up in the following two presentations in this series.
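
The definition of PD quoted above (the sum of the edge lengths connecting a candidate set of species) can be sketched in a few lines. The tree encoding (a `parent` dictionary mapping each non-root node to its parent and branch length), the function name, and the toy tree are assumptions made for this example. Conventions differ on whether the path back to the root is counted, so the sketch exposes that choice as the `include_root` flag.

```python
def phylogenetic_diversity(parent, species, include_root=True):
    """Sum of branch lengths spanned by `species` on a rooted tree.

    `parent` maps each non-root node to a (parent_node, branch_length) pair;
    each edge is identified by its child node.
    With include_root=True, edges on the path to the root are counted.
    With include_root=False, only the minimal subtree connecting the
    species is counted (edges ancestral to every chosen species are dropped).
    """
    chosen = set(species)
    # For every edge, count how many chosen species descend from it.
    tips_below = {}
    for tip in chosen:
        node = tip
        while node in parent:
            tips_below[node] = tips_below.get(node, 0) + 1
            node = parent[node][0]
    k = len(chosen)
    return sum(
        parent[node][1]
        for node, count in tips_below.items()
        if include_root or count < k
    )


# Hypothetical ultrametric tree: ((A:1,B:1)Y:1,C:2)X:1 joined with D:3 at the root.
toy_tree = {
    "X": ("root", 1.0),
    "D": ("root", 3.0),
    "Y": ("X", 1.0),
    "C": ("X", 2.0),
    "A": ("Y", 1.0),
    "B": ("Y", 1.0),
}
pd_ab = phylogenetic_diversity(toy_tree, ["A", "B"])
pd_ab_unrooted = phylogenetic_diversity(toy_tree, ["A", "B"], include_root=False)
pd_all = phylogenetic_diversity(toy_tree, ["A", "B", "C", "D"])
```

The gap between `pd_ab` and `pd_ab_unrooted` on this toy tree shows how much the choice of convention can matter for closely related species, which is one small instance of the overlapping definitions the abstract mentions.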


U Edinburgh

Ebola virus epidemiology, transmission, and viral evolution from four months of sequencing in Sierra Leone (Analysis and Methods)

Adding to the work reported in Gire et al. (Science, 2014), which sequenced Ebola viruses from the first three weeks of the epidemic in Sierra Leone, we here present analyses of 150 additional viral genomes sampled from EVD cases at Kenema Government Hospital between June and September 2014. We describe continued evidence for sustained human-to-human transmission with no additional zoonotic events, and preliminary results concerning new lineages from Guinea. We also characterize the epidemiological history of the limited number of exported viruses from the country. We also observe a slowing of the viral substitution rate over the course of the outbreak, consistent with the increased effect of purifying selection as the outbreak continues over time. These findings allow a closer view of viral evolution during its extended time in human populations and provide critical insights into the movement of the virus through the region.

This is the second talk in a pair of talks from collaborators Daniel Park and Gytis Dudas concerning their analysis of Ebola virus sequences.

Broad Institute

Ebola virus epidemiology, transmission, and viral evolution from four months of sequencing in Sierra Leone (Overview)

Adding to the work reported in Gire et al. (Science, 2014), which sequenced Ebola viruses from the first three weeks of the epidemic in Sierra Leone, we here present analyses of 150 additional viral genomes sampled from EVD cases at Kenema Government Hospital between June and September 2014. We describe continued evidence for sustained human-to-human transmission with no additional zoonotic events, and preliminary results concerning new lineages from Guinea. We also characterize the epidemiological history of the limited number of exported viruses from the country. We also observe a slowing of the viral substitution rate over the course of the outbreak, consistent with the increased effect of purifying selection as the outbreak continues over time. These findings allow a closer view of viral evolution during its extended time in human populations and provide critical insights into the movement of the virus through the region.

This is the first talk in a pair of talks from collaborators Daniel Park and Gytis Dudas concerning their analysis of Ebola virus sequences.

Ancestral recombination graphs

A demography-aware conditional sampling distribution for inferring ancient demography and detecting introgression patterns

Complex demographic histories shape the genealogies of contemporary individuals and thus have a substantial impact on the genetic variation observed today. These genealogies are commonly modeled by the ancestral recombination graph (ARG), and we developed a novel demography-aware conditional sampling distribution (CSD) to approximate these ARGs under general demographic models. We apply this CSD in an expectation-maximization framework for demographic inference. We show that this method can accurately recover biologically relevant demographic parameters like population divergence times, migration rates, or ancestral population sizes from simulated datasets. Furthermore, we apply the CSD to detect tracts of genetic material that introgressed from Neanderthal into modern humans. Our results are in general agreement with previously published results, and we will discuss the similarities and differences, and their biological implications.

University of Southern California

An empirical view of the population pedigree

Often, the summary statistics of population genetics are framed in the setting of Kingman's coalescent or related models. These statistics can be alternatively thought of as descriptive statistics of the realized population pedigree-with-recombination, in a way that has become much more useful in the era of whole-genome sequencing. For instance, pairwise number of nucleotide differences is proportional to "effective population size", which is sometimes more usefully thought of as an estimate of the average length of the path through the pedigree to the most recent common ancestor at a randomly chosen locus (with an explicit standard error). Another example is the pairwise distribution of long tracts of IBD, which provides an estimate of a functional of the entire distribution of such paths.
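
As a small worked example of the first statistic mentioned in this abstract, the sketch below computes the mean pairwise number of nucleotide differences and converts it to a rough estimate of the average time to the most recent common ancestor. The conversion uses the standard relation that two sequences separated by a common ancestor T generations ago differ at about 2 * mu * T sites per base under an infinite-sites approximation. The sequences and the mutation rate `mu` are made-up values, and the function name is an assumption of this example.

```python
from itertools import combinations


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of aligned sequences."""
    total = 0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        if len(a) != len(b):
            raise ValueError("sequences must be aligned to equal length")
        total += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least two sequences")
    return total / n_pairs


# Hypothetical toy sample of four aligned haplotypes.
sample = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
pi = mean_pairwise_differences(sample)
pi_per_site = pi / len(sample[0])

# Assumed per-site, per-generation mutation rate (illustrative value only).
mu = 1e-8
# Two lineages accumulate differences at rate 2 * mu per site per generation,
# so the mean path length to the common ancestor is roughly pi / (2 * mu).
mean_tmrca_generations = pi_per_site / (2.0 * mu)
```

Read this way, the statistic describes the realized pedigree directly (an average path length to a common ancestor) without requiring a coalescent model to be true.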

Mathematical and visualization tools for working with ancestral recombination graphs

The fields of phylogenetics and population genetics share several important models including gene trees, species trees, ancestral recombination graphs (ARGs), and pedigrees. These models are all closely related and can be viewed as subgraphs of one another. Amongst them, the ARG is particularly central and if inferred efficiently can enable many applications such as inference of selection and demography. Here, I will review various helpful mathematical tools for working with ARGs, including what we call the threading algorithm, the branch graph, and the leaf trace visualization.

Viral phylodynamics

Statistical inference for phylodynamics

Phylodynamic methods are widely used to estimate demographic parameters and historical population dynamics from genealogies of individuals sampled from a population. In this phyloseminar, I will describe how we can understand genealogies in terms of basic demographic or ecological processes, and how these concepts can be used to develop statistical models for inference. In particular, I will discuss some similarities and differences between the two main modeling frameworks in phylodynamics: the coalescent and birth-death models. I will also briefly introduce some of the latest statistical methods currently used to fit these models to genealogies. I will end by discussing one of the main challenges facing the field: adequately representing the structure of complex, heterogeneous populations in phylodynamic models.
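To make the coalescent side of this comparison concrete, here is a minimal Python sketch of Kingman's coalescent for a sample of n lineages. It assumes a haploid population of constant size, with time measured in generations and each pair of lineages coalescing at rate 1/N; the function names are hypothetical and the code is only an illustration, not one of the inference methods discussed in the talk.

```python
import random


def expected_tmrca(n_samples, pop_size):
    """Expected time to the most recent common ancestor of the sample.

    While k lineages remain, the total coalescence rate is
    k * (k - 1) / 2 / pop_size, so the expected waiting time is its inverse.
    """
    return sum(pop_size / (k * (k - 1) / 2.0) for k in range(2, n_samples + 1))


def simulate_coalescent_times(n_samples, pop_size, seed=None):
    """Draw the n - 1 waiting times between successive coalescence events."""
    rng = random.Random(seed)
    times = []
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2.0 / pop_size
        times.append(rng.expovariate(rate))
    return times
```

The sum in `expected_tmrca` simplifies to 2N(1 - 1/n), so the expected depth of the genealogy approaches 2N generations as the sample grows.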

Epidemic reconstruction in a phylogenetics framework

Major recent advances in genome sequencing technology make it feasible that in future epidemics, a sequence will be available for every clinical case that can be identified. In some scenarios, such as agricultural epidemics (where farm-to-farm spread is of more interest than animal-to-animal), diseases such as HIV (where most infected individuals will eventually present themselves to clinicians), and epidemics occurring in well-monitored populations such as hospital inpatients, we will as a consequence be able to acquire a set of sequences representing the pathogens infecting most or all cases in the transmission chain. Genetic data therefore provides an important new tool for the investigation of epidemics, in particular the determination of the epidemic's transmission tree, which describes which case infected which others. As the genetic diversity in a set of sequences taken from the same epidemic will not be enormous even for fast-evolving RNA viruses, the best approach would be to combine both genetic and epidemiological data. I present here a new method for transmission tree reconstruction which is integrated into the Bayesian phylogenetics framework available in BEAST. It is based on the observation that if the phylogeny is known, there is a one-to-one correspondence between possible transmission trees and partitions of the internal nodes of the tree into connected subgraphs. The MCMC procedure in BEAST has been modified to sample from the space of trees with nodes partitioned in this way, simultaneously estimating both phylogenetic tree and transmission tree. Rather than assuming that the entire tree is generated by a single coalescent process, the posterior probability of a phylogeny is now calculated based on an individual-based model of disease transmission, which can take into account epidemiological characteristics of the host cases, such as spatial location. I will outline results using simulated data and sequences from the 2003 Dutch epidemic of H7N7 avian influenza.

Imperial College London

Phylodynamics of infectious disease epidemics

The genetic diversity of many pathogens is shaped by epidemiological history. However, the dynamics of infectious disease epidemics differ in important ways from demographic processes that have traditionally been studied by population geneticists. In many epidemics, the population size and birth rate change rapidly in a nonlinear fashion through time. Mathematical models for describing infectious disease dynamics have a long history that has run parallel to the development of modern population genetics, but until recently, there has been little communication between these fields. Interest has grown in developing a new set of mathematical models for genealogies generated by epidemic processes. These methods reveal how the effective population size of a pathogen depends on transmission rates, the number of infected hosts, and the size of the bottleneck at the time of transmission. These mathematical models have also enabled new applications of pathogen genetic data to public health. Pathogen genetic data can be informative about epidemic processes in ways that standard surveillance data are not, especially regarding the source of infections and risk factors for transmission. I will review several approaches to mathematical modeling of pathogen genealogies and present applications of these methods to HIV-1 and the recent Ebola virus epidemic in Western Africa.

Phylogenetics of cancer

Phylogenetic quantification of intra-tumour heterogeneity

Tumour heterogeneity, i.e. the genomic diversity of cancer cells within a single tumour, is thought to be the source of chemotherapy resistance. In many cancers, this heterogeneity is not limited to point mutations but includes large scale genomic rearrangements and endoreduplications that lead to aberrant copy number (CN) profiles. Reconstruction of the evolutionary tree of cancer within the patient allows us to quantify and understand the aetiology of tumour heterogeneity. In some cancers, such as high-grade serous ovarian cancer (HGSOC), CN profiles predominate. However, tree inference is hindered by unknown phasing of major and minor CNs, horizontal dependencies between adjacent genomic loci and the lack of curated CN profile databases to use as a reference for probabilistic inference.

We recently developed MEDICC (Minimum Event Distance for Intra-tumour Copy number Comparisons), an algorithm for phylogenetic reconstruction based on CN profiles. MEDICC uses finite-state transducers (FSTs) to encode a minimum evolution criterion that determines pairwise evolutionary distances between CN profiles. This minimum-event distance computes the smallest number of amplifications and deletions of arbitrary length that are necessary to transform one genomic profile into another. The FST-based approach thereby allows us to model dependencies between sites, similar to the problem of modelling indels on trees in traditional phylogenetics. Using this approach we are able to phase major and minor CN profiles to the parental alleles and infer trees and ancestral genomes, while minimizing the overall tree length. The distance measure is formulated such that the resulting matrix of pairwise distances has a direct mapping to a positive semi-definite kernel matrix. This allows us to perform principal component analysis in evolutionary space and use this embedding to numerically quantify tumour heterogeneity and other quantities of interest, such as the degree of clonal expansion, using spatial statistics.

I will talk about the basics of FST-based phylogenetic inference and explain how they can be used to model genomic rearrangement events with horizontal dependencies. I will explain how this approach implicitly maps genomes into a feature space in which we can quantify heterogeneity. Finally, I will present clinical results that show how this quantification of ITH can predict resistance development in the hospital.
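The idea of counting amplifications and deletions of arbitrary length can be illustrated with a deliberately simplified Python sketch. It is not MEDICC and does not use finite-state transducers: it assumes a single unphased profile, assumes each event changes the copy number of one contiguous run of segments by exactly one, and imposes no biological constraints (for example, it lets a segment with zero copies be amplified again, which is biologically implausible). The function name is hypothetical.

```python
def interval_event_count(profile_a, profile_b):
    """Smallest number of +1/-1 events on contiguous segment runs that
    turns profile_a into profile_b, under the simplifying assumptions above.

    Each event changes the difference profile at exactly two boundaries,
    so the minimum equals the total size of the upward steps in the
    zero-padded difference profile.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same segments")
    diff = [b - a for a, b in zip(profile_a, profile_b)]
    padded = [0] + diff + [0]
    steps = [padded[i] - padded[i - 1] for i in range(1, len(padded))]
    return sum(s for s in steps if s > 0)
```

For instance, turning `[1, 1, 1, 1]` into `[1, 2, 2, 1]` takes one event here (a single amplification spanning the two middle segments), whereas a site-by-site count would report two changes. This is the sense in which modelling events of arbitrary length captures dependencies between adjacent loci.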

Massachusetts General Hospital

Phylogenetic analysis of metastatic colon cancer in humans

Metastasis is the main cause of cancer morbidity and mortality. Despite its clinical significance, several fundamental questions about the metastatic process in humans remain unsolved. Does metastasis occur early or late in cancer progression? Do metastases emanate directly from the primary tumor or give rise to each other? How does heterogeneity in the primary tumor relate to the genetic composition of secondary lesions? Addressing these questions – ideally by examining the genetic makeup of tumor cells in distinct anatomic locations and reconstructing their evolutionary relationships – is crucial to improving our understanding of metastasis. I will give an overview of a simple PCR-based assay that enables the tracing of tumor lineage in patient tissue specimens. The methodology relies on somatic variation in highly mutable polyguanine (poly-G) repeats located in non-coding genomic regions. Poly-G mutations are present in a variety of human cancers. In colon carcinoma, an association exists between patient age at diagnosis and tumor mutational burden, suggesting that poly-G variants accumulate during normal division in colonic stem cells. Poorly differentiated colon carcinomas (which have a worse prognosis) have fewer mutations than well-differentiated tumors, possibly indicating a shorter mitotic history of the founder cell in these cancers. By presenting several patient case studies, I will describe how poly-G fingerprints can be used to construct phylogenetic trees that reflect the evolution of metastatic colon cancer, with an emphasis on how biological considerations inform analysis strategies.

Mini-course on genome-scale phylogeny

INRIA, Université de Lyon

Evolution of genome organization

Genome rearrangements were discovered and used to build molecular phylogenies in the 1930s. They are implicated in many cancers and their evolutionary role might be of primary importance. But the mathematical and computational tools to model rearrangements are still not as efficient as the ones developed later for local mutations such as nucleotide or amino-acid substitutions. In this seminar I will report the attempts to integrate genome organisations in the usual models of genome evolution. I will explain how this can improve the inference of phylogenies, as well as ancestral genomes.

Université de Lyon

Gene tree-species tree methods for comparative genomics

In this second talk of our series on genome-scale phylogeny, I build upon Gergely's introduction and present the modelling assumptions and algorithmic details behind some of the methods we and others have developed. There will be two parts to this talk. I start with the model of gene duplications and losses implemented in PHYLDOG. I present the assumptions we make and the shortcuts we take to improve the program's efficiency, and show some results on real and simulated sequence data. I notably show problems that arise when the program is confronted with data generated with a model of incomplete lineage sorting (Rasmussen and Kellis, 2012), and present avenues of research to find solutions to these problems. In the second part, I present our current efforts to use our model of gene duplication, loss, and transfer (Szöllősi et al., 2013) to infer a species tree in which speciation nodes are ordered in time. I briefly remind the forgetful viewer of what this model does and how it works, and I then explain how we devise a new MCMC algorithm to use it on data sets containing dozens of species and thousands of gene families. I finish with some perspectives on our plans to unite gene tree-species tree models with databases of gene families and phylogenetic trees.

Eötvös Loránd Tudományegyetem

Inferring gene trees with species trees

Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees.

I introduce models that describe the relationship between gene trees and species trees. I begin with models that account for gene duplication and loss, and subsequently introduce models that account for the horizontal transfer of genes. I review results from simulations as well as empirical studies on genomic data that show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. I also discuss the possibility of extracting information on the timing of speciation events from ancient horizontal transfer events.

Open Tree of Life

University of Michigan

Exploring graphs for mapping and synthesizing phylogenies

The emergence of graph databases has presented a potential alternative for ways of storing and querying phylogenetic trees. The Open Tree of Life has been exploring these options and ways that trees from multiple datasets or within a single dataset can be placed in a graph database. I will go over some of the ways that we do this and how we can query and synthesize trees as an alternative to supertrees and consensus trees. While still a work in progress, these methods show great promise for further development.

National Evolutionary Synthesis Center

Technical and social challenges of synthesizing phylogenetic data across the tree of life

Open Tree of Life aims to synthesize published phylogenetic data into a comprehensive tree of life. The challenges associated with the collection, curation and synthesis of both phylogenetic and taxonomic input data are both technical and social. We present the first draft of the Open Tree of Life, as well as the workflow and software tools for curating, annotating and viewing phylogenetic data. In a subsequent Phyloseminar, Stephen Smith will present details of the phylogenetic synthesis methods.

Integrating fossils into phylogenies

National Museum of Natural History

Phylogenetic Paleobiology: What do we stand to gain from integrating fossils and phylogenies in macroevolutionary analyses?

The aim of macroevolutionary science is to understand the patterns and processes responsible for generating organismal diversity in space and time. Although macroevolutionary change typically occurs over geologic timescales and has traditionally been studied by paleobiologists, comparative biologists have become increasingly interested in macroevolutionary questions, utilizing time-calibrated molecular phylogenies of extant taxa as a framework for testing hypotheses about rates of evolution. In this seminar, I’ll examine how integrating fossils and phylogenies can increase our power to test and answer fundamental questions about tempo and mode in phenotypic evolution. Integrating fossil taxa into phylogenies of extant taxa is worth the effort: on a per taxon basis, fossils contribute more information about macroevolutionary pattern and process and increase our ability to distinguish processes that leave similar signals in extant species datasets. I’ll discuss some recent work, and highlight how fossil information can be used to inform macroevolutionary inference when a combined phylogeny is lacking. One theme emerges from all of this work; we stand to gain a better understanding of macroevolution not when we approach it as biologists or paleontologists but, as G.G. Simpson recommended 60 years ago, as practitioners of both.

Including Fossil Taxa in Phylogenies: Advances and Issues

The fossil record offers a rich source of macroevolutionary data. Fossils can reveal transitional forms that could not be predicted from extant taxa alone, reveal unexpected biogeographic patterns, and provide temporal information crucial for inferring rates of evolution and correlations between evolution and abiotic events. At the same time, including fossil taxa in phylogenetic analyses presents many challenges. Currently, there are a wide variety of methods for including fossil data in phylogenetic analyses, ranging from indirect use of fossil ages to inform divergence dates to simultaneous analyses of fossil and extant taxa under various optimality criteria and with varying levels of constraints. One important consideration remains that fossils typically provide only morphological data, which can lead to problems related to missing data and potential violation of common assumptions for model-based phylogeny inference methods designed primarily for molecular sequence data. Morphological character data are typically harvested from fossil taxa not at random, but with an intentional bias towards parsimony-informative characters (with apomorphies omitted from matrices). Combined with issues related to sparse codings in large combined matrices, care must be taken to avoid spurious inferences.

UC Berkeley

The Fossilized Birth-Death Process: A Coherent Model of Fossil Calibration for Divergence Time Estimation

Accurate estimates of absolute node ages are critical for addressing a wide range of questions in evolutionary biology. Because molecular sequence data are not informative on absolute time, external data, most commonly fossil age estimates, are required to calibrate estimates of species divergence times. For Bayesian divergence-time methods, the common practice for calibration using fossil information involves placing arbitrarily-chosen and parameterized parametric distributions on internal nodes, often disregarding most of the information in the fossil record. The 'fossilized birth-death' (FBD) process is a model for calibrating divergence-time estimates in a Bayesian framework, explicitly acknowledging that extant species and fossils are observations from the same macroevolutionary process. Under this model, absolute node age estimates are calibrated by a single diversification model and arbitrary calibration densities are not necessary. Moreover, the FBD model allows for inclusion of all available fossils. We performed analyses of simulated data and show that node-age estimation under the FBD model results in accurate estimates of species divergence times with realistic measures of statistical uncertainty, overcoming major limitations of standard divergence time estimation methods.

In honor of Carl Woese

University of Queensland

Carl Woese's grand view of life that just keeps getting grander

Most microorganisms cannot be grown in pure culture (or at least not easily). This has been apparent for decades by comparing the number of cells seen under a microscope to the fraction of those cells that will grow into colony forming units (typically <1%). The objective classification of cellular life by comparative rRNA analysis pioneered by Carl Woese provided the first grand view of the tree of life and also provided the reference framework upon which his friend and colleague, Norman Pace, developed ways to directly survey microbial communities via their rRNA sequences without the need to grow them. This put our degree of ignorance of the microbial world into perspective: dozens of major microbial lineages have emerged over the last 20 years that lack even a single cultured representative. New approaches, such as deep metagenomics and single cell genomics, are now transforming the rRNA-based phylogenetic outlines of the tree of life into a fully-fledged genome-based view of the tree. I will present a recent snapshot overview of the genome tree of the bacterial and archaeal domains and examples of functional insights in the context of a more complete view of microbial evolution.

Ed DeLong

How Carl Woese transformed the field of microbial ecology

The challenges of dissecting naturally occurring microbial assemblages, with respect to their community composition, interspecies interactions, functional attributes, and activities, are numerous and daunting. For many years, these challenges impeded our understanding of the properties and dynamics of microbial communities, and thus hindered development of the field of microbial ecology. Enter Carl Woese: the theory and application of molecular phylogenetics and genomics in studies of microbial evolution and ecology can be traced directly to Woese and one of his primary collaborators, Norman Pace. This lecture will trace the logic and roots of the application of molecular phylogenetics and genomics to the study of microbial ecology, through a historical review and examination of its past and current applications.

University of Colorado – Boulder

Following Carl Woese into the Natural Microbial World – The Beginnings of Metagenomics

Carl Woese, one of the great scientists of all time, died in December, 2012. Among other important contributions, he used primitive sequencing technology to compare small subunit (16S) ribosomal RNA sequences from different organisms and thereby establish the outlines of a universal tree of life. His results also put in place a sequence-based reference framework within which to understand and articulate biological diversity. Since this perspective is based on molecular sequences and not properties of organisms, it opened the door to begin to understand the kinds of organisms that make up the natural microbial world. Prior to Woese’s sequence-based reference framework, microbial ecologists had to culture organisms to study them, but not many environmental organisms, <<1%, are cultured using standard methods. Sequence surveys of environmental microbial genes and genomes – “metagenomics” - have now revolutionized understanding of microbial ecology, including its influence on human health. The seminar will discuss how metagenomics developed and the impact it has had on our understanding of environmental microbial diversity and the structure of the molecular tree of life.

Phylogenetics and language

University College London

Bobbins, Borrowing, and Bayesian Inference: Horizontal Transfer and the application of Phylogenetic Methods in Cultural Evolution studies

Researchers have applied quantitative phylogenetic methods to study human cultural and linguistic evolution. However, a common critique of this approach is that cultural evolution and biological evolution differ in important ways that make phylogenetic analyses unsuitable for cultural data. Principally, horizontal transmission (or borrowing) of cultural and linguistic traits is argued to be so pervasive as to invalidate the approach. In this talk I will address this issue by asking how much horizontal transfer occurs, and whether it matters if it does. Contra the skeptics, I will discuss studies that demonstrate that 1) many biological systems also show non-tree-like patterns of evolution, 2) cultural systems vary in the degree to which horizontal transfer occurs, and 3) borrowing does not necessarily cause big problems. Rather than being a reason to give up on the whole project, borrowing can be productively investigated using phylogenetic techniques to yield deeper insights into cultural and linguistic evolution.

University of Bristol

Testing hypotheses about cultural evolution

Anthropologists had a name for the non-independence-of-species problem way back in the 1880s. Solving "Galton's Problem", and the promise of comparative methods for testing hypotheses about cultural adaptation and correlated evolution, were major catalysts for the field of cultural phylogenetics. In this talk I will show how linguistic, cultural, and archaeological data are used in comparative phylogenetic analyses. The "treasure trove of anthropology" - our vast ethnographic record of cultures - is now being put to good use answering questions about cross-cultural similarities and differences in human social and cultural norms in a rigorous evolutionary framework.

Australian National University

Language phylogenies and cultural evolution

Charles Darwin famously noted that there were many curious parallels between the evolution of species and languages. Since then evolutionary biology and historical linguistics have used trees to conceptualise evolution. However, whilst evolutionary biology developed the vast discipline of phylogenetic methods, linguistics dabbled with computational methods before rejecting them. The last decade or so has seen the introduction of phylogenetic methods into linguistics, often with some startling results. In this talk I will present some of these studies, and discuss how phylogenetics can help us grapple with the problems of linguistic and cultural evolution. These problems range from testing population dispersal hypotheses, to investigating the shape of cultural evolution, to inferring the rates at which languages change.

Rates and Dates

Ecole Polytechnique

Understanding biodiversity patterns using the Tree of Life

Species richness results from past and current speciation, extinction and dispersal events, themselves influenced by various ecological and evolutionary processes. Estimating rates of diversification, and understanding how and why they vary over evolutionary time, geographical space, and species groups, is thus key to understanding how ecological and evolutionary processes generate biological diversity. Phylogenetic approaches are critical for making such inferences, especially in groups or regions lacking fossil data. I will illustrate how phylogenies, coupled with models of cladogenesis, can be used to test the role of ecological limits, boom-then-bust diversity dynamics, the paleoenvironment, and population dynamics on the biodiversity patterns that we observe today.

Inferring macroevolutionary processes based on phylogenetic trees

Phylogenetic trees of present-day species allow inference of the rates of speciation and extinction that led to the present-day diversity. Classically, inference methods assume a constant rate of diversification, or neglect extinction. I will discuss major limitations of this null model and will present a new framework which allows speciation and extinction rates to change through time (environment-dependent diversification), with the number of species (density-dependent diversification), and with a trait of a species (trait-dependent diversification). For the latter model, particular focus is given to the trait being the age of a species. Issues arising in empirical data analysis, such as incomplete taxon sampling, model selection, and confidence interval estimation, will be discussed. The methods reveal interesting macroevolutionary dynamics for mammals, birds and ants, and can easily be applied to other datasets using the R packages TreePar and TreeSim available on CRAN.
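To give a feel for what rates that change through time imply, the Python sketch below computes the expected number of lineages under a birth-death process with time-dependent speciation and extinction rates, using E[N(t)] = N(0) · exp(∫ (λ(s) − μ(s)) ds). This is a forward-in-time expectation only, not the likelihood machinery of TreePar or TreeSim, and it does not cover the density-dependent or trait-dependent cases. The function name, the trapezoid integration, and the example rate functions are all assumptions made for illustration.

```python
import math


def expected_lineages(speciation, extinction, t_end, n_start=1, n_steps=1000):
    """Expected lineage count at time t_end for rates that vary with time.

    speciation and extinction are functions of time returning per-lineage
    rates; the net rate is integrated with the trapezoid rule.
    """
    if t_end < 0 or n_steps < 1:
        raise ValueError("t_end must be >= 0 and n_steps >= 1")
    dt = t_end / n_steps
    integral = 0.0
    for i in range(n_steps):
        t0, t1 = i * dt, (i + 1) * dt
        net0 = speciation(t0) - extinction(t0)
        net1 = speciation(t1) - extinction(t1)
        integral += 0.5 * (net0 + net1) * dt
    return n_start * math.exp(integral)
```

With constant rates, as in the classical null model, this reduces to exponential growth at the net diversification rate: `expected_lineages(lambda t: 0.3, lambda t: 0.1, 10.0, n_start=2)` is 2·exp(2).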

Structure and molecular evolution

University of Colorado School of Medicine

Adaptation, coevolution, and convergence in the context of protein thermodynamics

Interactions within and between proteins are a fundamentally important part of how they evolve and adapt. We have been considering how and why proteins adapt, coevolve, and converge, and working to understand these concepts in the context of protein thermostability and function. We will expand from the previous talk of our collaborator, Dr. Goldstein, and discuss how and why coevolution is and should be detected, and how thermostability affects reconstruction of ancestral functions. Further, we will discuss our work on adaptive redesign in mitochondrial proteins, perhaps the largest known case of an adaptive burst in multiple metabolic proteins. The convergence between ancestral snakes and ancestral acrodont lizards is also perhaps the largest known case of adaptive convergence. We will consider what these examples tell us about the theory of how proteins appear to evolve in the context of nearly neutral versus cases of adaptive change. Further, we will discuss the impact on understanding phylogenetic relationships, and we will also discuss a unified theory of nearly neutral and adaptive evolution in the context of structure and function.

Richard Goldstein
National Institute for Medical Research, London

Simulating evolution with in silico models of protein thermodynamics

Many of the most basic issues of protein evolution are difficult to determine from the relationship between existent protein sequences. We would ideally like to analyse the complete evolutionary record: what mutations were attempted when in what lineage, which ones were deleterious or advantageous and by how much, which ones were accepted, and how these substitutions affected further mutations and the overall evolution of protein properties. In the absence of available biological data, we can create our own - simulate protein evolution in silico, such as in our work modelling how proteins would evolve given their need to be thermodynamically stable. These simulations allow us to explore a range of phenomena and develop a conceptual framework that tells us which questions may be interesting and important to consider in real proteins. Such simulations can also illuminate which conditions are necessary and/or sufficient to explain observed protein characteristics. We consider how evolution of protein thermostability explains why proteins are generally marginally stable, why eukaryotes may have more disordered proteins than prokaryotes, and what the consequences of this are for biochemical networks. We also consider how various locations in a protein can co-evolve, and how this can inform the next generation of substitution models.

Protein Structural, Biophysical, and Genomic Underpinnings of Protein Sequence Evolution

Common models for amino acid substitution assume that each site evolves independently according to average properties in the absence of a genomic, protein structural or functional context. Two characterizations of amino acid substitution will be presented. One approach extends a population genetic model to inter-specific genomic data and a second approach evaluates the effects of selection for protein folding and protein-protein interaction on sequence evolution. Several take home lessons include the importance of considering linkage independent of protein structure, the importance of negative pleiotropy (or not statements in folding and binding), and the nature of the co-evolution of sites and how it links standard substitution models with covarion models when binding function is conserved and when it changes.


John P. Huelsenbeck and Sebastian Höhna
UC Berkeley and Stockholm University

RevBayes: An R-like Environment for Bayesian phylogenetic inference

RevBayes is a computer program that uses directed acyclic graphs (DAGs) to specify any type of model, to hold the model and data in memory, and to compute the likelihood of the parameters of the model. DAGs provide a framework for the construction of modular models. Models can easily be extended and/or parts of the model exchanged (e.g., the substitution process and clock model) and several models can be combined. The design of RevBayes should allow the implementation of any extension to existing models. RevBayes is mainly developed for Bayesian phylogenetic analyses, but it can be extended to any inference on probabilistic models.

In this talk, I will give a brief introduction to the concept of DAGs and how they are used to construct a model. Once the model is specified, I will show how to simulate new observations under the model and how to estimate its parameters. I will demonstrate this in the RevLanguage, which is an R-like language for building DAGs for phylogenetic problems. The RevLanguage is used interactively to specify the model, as done with R. I will show how a full phylogenetic model is specified, step-by-step. I will mainly focus on various standard substitution models, relaxed clock models, and divergence times priors. Specifically, I will show a new birth-death model with speciation and extinction rates varying over time and use this in an integrative analysis. In the integrative analysis I condition only on the alignment (only the alignment is considered to be known) and estimate the tree and divergence times simultaneously as well as the speciation and extinction rates.

Example files for the demonstration are available here.

Introduction to HyPhy: Hypothesis testing using Phylogenies

HyPhy is an open-source software package for the analysis of genetic sequences using techniques in phylogenetics, molecular evolution, and machine learning. It features a complete graphical user interface (GUI) and a rich scripting language for limitless customization of analyses. Additionally, HyPhy features support for parallel computing environments (via message passing interface) and it can be compiled as a shared library and called from other programming environments such as Python or R.

UMass Boston and University of Paris

Introduction to phytools and phangorn: phylogenetics tools for R

phytools is a new multifunctional phylogenetics package for the R statistical computing environment. The focus of the package is on methods for phylogenetic comparative biology; however, it also includes tools for simulation, phylogeny input/output, manipulation, and even inference. The phytools library is designed for maximum interoperability with other important R phylogenetics packages such as ape, geiger, and phangorn.

phangorn is a package for phylogenetic reconstruction and analysis in the R language. Previously it was only possible to estimate phylogenetic trees with distance methods in R. phangorn now offers the possibility of reconstructing phylogenies with distance-based methods, maximum parsimony or maximum likelihood (ML) and performing Hadamard conjugation. Extending the general ML framework, this package provides the possibility of estimating mixture and partition models. Furthermore, phangorn offers several functions for comparing trees, phylogenetic models or splits, simulating character data and performing congruence analyses.

Beyond IID

The Poisson Indel Process

The key component of a probabilistic joint approach to tree and alignment inference is a Continuous Time Markov Chain (CTMC) over strings. Ideally, this CTMC should support tractable inference algorithms and should be easily extensible to support a wide range of evolutionary models. The classical string-valued CTMC, the TKF91 model (Thorne et al., 1991), is limited along both of these axes. Previous work has focussed on increasing the complexity of the TKF91 model, making the inference problem computationally more difficult (Miklos et al., 2004).

In this work, we present a new stochastic process, the Poisson Indel Process (PIP), which allows simple and practical inference algorithms. Efficient computations are based on an exchangeable representation and on Poisson processes. This representation gives a natural way of extending the capacity of the model while keeping inference computationally practical.

We used this process to design a joint Bayesian estimator over alignments and trees. We evaluated both consensus trees and alignments against standard baselines on synthetic and real data. These experiments demonstrate that competitive trees and alignments can be inferred using a Bayesian model equipped with a PIP prior.

Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

The "multiple sequence alignment" is a computational artifact. In nature there is no such thing; rather, an alignment represents a partial summary either of indel history, or of structural similarity. Here we show, via evolutionary simulation tests, that all currently-available multiple alignment tools introduce systematic biases into downstream evolutionary analysis - particularly when used to reconstruct histories of insertions and deletions.

I will present our unification of Felsenstein's "pruning" algorithm and "progressive alignment" to build a fast, linearly-scaling approximate-maximum-likelihood phylogenetic alignment/reconstruction algorithm. Inference of evolutionary history in this framework displays a clear improvement in accuracy over non-statistical phylogenetic reconstructions and a massive improvement in performance over slow-running MCMC statistical reconstructions.
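
For readers unfamiliar with the pruning algorithm named above, the following Python sketch shows only the classical recursion for a single, already-aligned site. It is not the combined alignment/reconstruction algorithm of the talk. The Jukes-Cantor substitution model and the nested-tuple tree encoding are assumptions chosen to keep the example short.

```python
import math

BASES = "ACGT"


def jc69_prob(same, t):
    """Jukes-Cantor transition probability over a branch of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return the four partial likelihoods for the subtree rooted at node.

    Hypothetical encoding: a leaf is a one-character string; an internal
    node is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if base == node else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial_likelihoods(child)
        for i in range(4):
            # Sum over the child's possible states, weighted by the
            # probability of state i changing to state j along the branch.
            result[i] *= sum(
                jc69_prob(i == j, t) * child_l[j] for j in range(4)
            )
    return result


def site_likelihood(root):
    """Average the root partial likelihoods over the uniform JC69 stationary
    distribution."""
    return sum(0.25 * value for value in partial_likelihoods(root))


inner = (("A", 0.1), ("A", 0.1))
root = ((inner, 0.05), ("C", 0.2))
print(site_likelihood(root))
```

Each subtree is visited once, so the cost for one site grows linearly with the number of branches.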

Evolutionary genomics

UC Riverside

Fungal phylogenomics: Getting lost in the moldy forest

Fungi occupy diverse ecological niches in roles from nutrient cycling in rainforest floors to aggressive plant and animal pathogens. Molecular phylogenetics has helped resolve many of the branches on the fungal tree of life, enabling studies of evolution across this diverse kingdom. The genome sequences from hundreds of fungi now permit the study of change in genes and gene content in this phylogenetic context and to connect molecular evolution with adaptation to ecological niches or changes in lifestyles. I will describe our work in studies contrasting pathogenic and non-pathogenic fungi and efforts to unravel the evolution of multicellularity in fungi comparing unicellular basal fungi with multicellular mushrooms and molds.

The development of tools for data mining and use of fungal genomics is also driving the pace of molecular biology and genetics of fungi. I will highlight new approaches to make this easier and the ways data integration can inform and transform studies of functional biology of fungi.

Bayesian inference of ancient human demography from individual genome sequences

Besides their value for biomedicine, individual genome sequences represent a rich source of information about human evolution. I will describe an effort to estimate key evolutionary parameters from the genome sequences of six individuals from diverse human populations. We have used a Bayesian approach based on coalescent theory to extract information about ancestral population sizes, divergence times, and migration rates from inferred genealogies at many neutrally evolving loci from across the genome. We introduce new methods for accounting for gene flow between populations and integrating over possible phasings of diploid genotypes. I will also describe a custom pipeline for genotype inference to mitigate possible biases from heterogeneous sequencing technologies, coverage levels, and read lengths. Our analysis indicates that the San of Southern Africa diverged from other human populations 108--157 thousand years ago (kya), that Eurasian populations diverged 38--64 kya, and that the effective population size of the ancestors of all modern humans was ~9,000.

Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes

The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes -- especially at synonymous sites. We developed a method to systematically locate short regions within known ORFs that show conspicuously low estimated rates of synonymous substitution, based on phylogenetic codon rate models and likelihood ratio tests.

We applied this method to genome alignments of 29 placental mammals, resulting in more than 10,000 “synonymous constraint elements” (SCEs) with resolution down to nine-codon windows. These are found within more than a quarter of all human protein-coding genes and contain ~2% of their synonymous sites. We collected numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. We also ruled out certain alternative explanations such as codon usage bias and neutral rate variation.

Our initial results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape. Furthermore, anticipating the future availability of additional mammalian and vertebrate genomes, we are currently developing Bayesian codon modeling methods to measure synonymous rates at even higher resolutions, perhaps eventually allowing the detection of individual regulator binding sites embedded in protein-coding ORFs.


U Tennessee

Making comparative methods as easy as ABC

For decades, biologists have addressed evolutionary and ecological questions using measurements of species traits, phylogenies, and an assortment of comparative methods. Unfortunately, while there is a large assortment of these methods, they are still fairly limited and development of new methods is slow. It took seven years between the introduction of using a simple Brownian motion model for looking at trait evolution (Felsenstein, 1985) and the use of this same model for looking at rates of trait evolution (Garland, 1992), and an additional 14 years to reach more powerful tests using a small modification of the basic model (O'Meara et al., 2006). Still other promising methods are described and even tested but remain unavailable to empiricists because they are not put into software. As a result, the questions empiricists can ask about the world are limited by the research productivity of the few dozen scientists who develop and implement new methods in phylogenetics. We describe a new approach based on Approximate Bayesian Computation and implemented in R that will allow researchers to easily develop their own models for trait evolution without requiring them to have specialized mathematical or computational knowledge.
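
To make the idea of Approximate Bayesian Computation concrete, here is a minimal rejection-sampling sketch in Python. It is not the R implementation described in the talk. It assumes trait changes observed along independent branches of known length (rather than a full phylogeny), a uniform prior on the Brownian motion rate, a single summary statistic, and a fixed tolerance; all function names and numerical settings are hypothetical.

```python
import random
import statistics


def simulate_changes(sigma2, branch_lengths, rng):
    """Brownian motion: the change along a branch of length t is N(0, sigma2 * t)."""
    return [rng.gauss(0.0, (sigma2 * t) ** 0.5) for t in branch_lengths]


def summary(changes, branch_lengths):
    """Summary statistic: mean squared change per unit branch length."""
    return statistics.fmean(
        [c * c / t for c, t in zip(changes, branch_lengths)]
    )


def abc_rejection(observed, branch_lengths, n_draws, tolerance, rng,
                  prior_max=5.0):
    """Keep prior draws whose simulated summary falls close to the observed one."""
    target = summary(observed, branch_lengths)
    accepted = []
    for _ in range(n_draws):
        sigma2 = rng.uniform(0.0, prior_max)          # draw from the prior
        simulated = simulate_changes(sigma2, branch_lengths, rng)
        if abs(summary(simulated, branch_lengths) - target) < tolerance:
            accepted.append(sigma2)                   # approximate posterior draw
    return accepted


rng = random.Random(1)
branch_lengths = [rng.uniform(0.5, 2.0) for _ in range(200)]
observed = simulate_changes(1.5, branch_lengths, rng)   # pseudo-data, true rate 1.5
posterior = abc_rejection(observed, branch_lengths, 5000, 0.1, rng)
print(len(posterior), statistics.fmean(posterior))
```

Only `simulate_changes` encodes the evolutionary model, so replacing it with a different simulator changes the model being fitted without requiring a likelihood function.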

New Frontiers for the Comparative Analysis of Diversification

We're building the tree of life, but what can we do with it? It seems clear that there is a wealth of information about evolution in the structure of this tree. There are some methods that can use phylogenetic trees to test macroevolutionary models, but the range of models that we can test is still severely limited. In some cases, such as the estimation of extinction rates from phylogenetic trees, current methods have proven controversial. We are now beginning to develop and implement methods that use tree-of-life scale data to answer key questions in evolution. I will review three new approaches developed in my lab for analyzing comparative datasets: MECCA, fossil-Medusa, and reversible-jump MCMC. I argue that these methods represent the next generation of comparative methods that will open the door to analyzing a much broader range of models with large datasets.

What poultry breeders and guinea pigs have to tell us about statistical nonmolecular phylogenetics

We are far from having an understanding of the determination of morphological characters at the genome level, so most evolutionary biologists working on them still need to use phenotypic approaches. I will discuss the prospects for using the tools of quantitative genetics, which has faced the same dilemma for the past century. I will use as examples three projects of my own. One, which is joint work with Fred Bookstein, adapts the tools of morphometrics, of which he is a chief developer, to modeling change of morphological forms on phylogenies. The second is a similar project that asks how to best place fossil forms into a phylogeny of present-day species when there is molecular data enabling us to get a good estimate of the phylogeny for those species. The third models discrete 0/1 characters using the Threshold Model developed by Sewall Wright for his work on guinea pigs. All of these lead to asking whether we can connect Brownian Motion models with quantitative genetics models. In all such cases we will have limits on what we can infer, and need to be aware of the need to carry that uncertainty through any subsequent inference using these results.
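
The threshold model mentioned above can be sketched in a few lines of Python: an unobserved continuous liability evolves by Brownian motion, and the observed 0/1 character records whether the liability exceeds a threshold. The function names, the threshold of zero, and the star-shaped set of ten descendant lineages are assumptions for illustration only.

```python
import random


def evolve_liability(start, sigma2, time, rng):
    """Brownian motion: liability after `time` is N(start, sigma2 * time)."""
    return start + rng.gauss(0.0, (sigma2 * time) ** 0.5)


def observed_state(liability, threshold=0.0):
    """Discrete character: 1 if the liability is above the threshold, else 0."""
    return 1 if liability > threshold else 0


rng = random.Random(42)
ancestor = 0.3   # hypothetical ancestral liability
# Ten lineages evolving independently from the same ancestor.
tips = [evolve_liability(ancestor, 1.0, 0.5, rng) for _ in range(10)]
states = [observed_state(x) for x in tips]
print(states)
```

Only `states` would be observed in practice; the liabilities in `tips` are latent, which is one source of the inferential limits the abstract mentions.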

Infectious disease

Accurate estimation of evolutionary attributes of coding sequences and evolutionary fingerprinting

Codon substitution models have facilitated the interpretation of evolutionary forces operating on genomes. Most of these models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have different rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation or the adoption of a particular residue exchangeability scale. We present an alternative procedure which assigns substitution rates between amino acid pairs to a few rate classes, dependent on the information content of the alignment. This procedure permits us to infer generalizable models for specific genes, organisms and taxonomic clades.

Phylogenetic challenges in the retroviridae branch of the tree of life

The representation of all virus families within a single phylogenetic tree may be a misleading description of their evolutionary history. First, it is unlikely that all viruses originated from a unique common ancestor. Second, viruses (retroviruses in particular) can integrate into the host genome and be transmitted vertically as well as horizontally. Third, different viral genera can evolve according to dramatically different molecular clocks. Three paradigmatic examples from the retroviridae family will be considered here: the simian foamy viruses (SFVs); the primate T-lymphotropic viruses (PTLVs), which include HTLV and STLV; and the primate lentiviruses (PLVs), which include SIV, HIV-1 and HIV-2. SFV is an example of an ancient virus that has been co-evolving with its primate hosts over the last 30 million years. PTLVs emerged around 300 thousand years ago and are characterized by frequent interspecies transmissions and multiple introductions into human populations since prehistoric times. PLVs have a much more recent origin and only within the last 200 years have been able to spread successfully within the human population. The complex relationship between population dynamics and evolutionary time-scale of these retroviruses, as well as the challenge of their integration within the tree of life, will be discussed.

Phylogenetic diffusion models and their applications in viral epidemiology

Emerging infectious diseases continue to appear all over the world and, importantly, their number has also risen significantly over time. Having the potential to quickly adapt to new hosts and environments, RNA viruses are prime candidates to emerge as global threats to human health. Their rapid rate of evolution, however, also turns viral genomes into valuable resources to reconstruct the spatial and temporal processes that are shaping epidemic or endemic dynamics.

In this seminar, I will highlight recent developments in phylogenetic diffusion models that tie together sequence evolution and geographic history in a coherent statistical framework. Both discrete and continuous phylogeographic models have recently been implemented in a Bayesian statistical approach. I will position this approach among other popular phylogeographic methods, and then focus on applications in viral molecular epidemiology to demonstrate their use. Finally, I will hint at future extensions that may provide entirely new opportunities for phylogeographic hypothesis testing.

Adaptation and migration in the human influenza virus

The influenza A virus infects approximately 500 million individuals each year. Owing to its RNA makeup, influenza mutates extremely rapidly, allowing the virus population to escape the pull of the human immune system. A single individual may be infected year after year by antigenically novel strains. As a result of this rate of mutation, the timescale of influenza evolution is a human timescale. We get the chance to observe the process of evolution in action. However, the rapid pace of evolution also causes an intrinsic link between evolutionary and ecological dynamics in the virus population. The availability of temporally spaced sequence data allows estimation of details of these dynamics unavailable in other systems. Through analysis of this data, I address open questions regarding patterns of adaptation and the effects of seasonality in the human influenza virus.

Gene-tree species-tree

Probabilistic Analysis of gene families with respect to gene duplication, loss, and transfer

Consistency properties of species tree inference algorithms under the multispecies coalescent

The end of lineage sorting: inferring species trees using *BEAST


Dynamic homology and phylogenetic systematics

A Bayesian perspective on alignment