Previously recorded seminars
Unfortunately we were unable to convert a few of the old seminar videos for YouTube; please accept our apologies.
Consistency properties of species tree inference algorithms under the multispecies coalescent
Probabilistic Analysis of gene families with respect to gene duplication, loss, and transfer
The influenza A virus infects approximately 500 million individuals each year. Owing to its RNA makeup, influenza mutates extremely rapidly, allowing the virus population to escape the pull of the human immune system. A single individual may be infected year after year by antigenically novel strains. As a result of this rate of mutation, the timescale of influenza evolution is a human timescale. We get the chance to observe the process of evolution in action. However, the rapid pace of evolution also causes an intrinsic link between evolutionary and ecological dynamics in the virus population. The availability of temporally spaced sequence data allows estimation of details of these dynamics unavailable in other systems. Through analysis of this data, I address open questions regarding patterns of adaptation and the effects of seasonality in the human influenza virus.
Phylogenetic diffusion models and their applications in viral epidemiology
Emerging infectious diseases continue to appear all over the world and, importantly, their incidence has also risen significantly over time. Having the potential to quickly adapt to new hosts and environments, RNA viruses are prime candidates to emerge as global threats to human health. Their rapid rate of evolution, however, also turns viral genomes into valuable resources to reconstruct the spatial and temporal processes that are shaping epidemic or endemic dynamics.
In this seminar, I will highlight recent developments in phylogenetic diffusion models that tie together sequence evolution and geographic history in a coherent statistical framework. Both discrete and continuous phylogeographic models have recently been implemented in a Bayesian statistical approach. I will position this approach among other popular phylogeographic methods, and then focus on applications in viral molecular epidemiology to demonstrate their use. Finally, I will hint at future extensions that may provide entirely new opportunities for phylogeographic hypothesis testing.
Phylogenetic challenges in the retroviridae branch of the tree of life
The representation of all virus families within a single phylogenetic tree may be a misleading description of their evolutionary history. First, it is unlikely that all viruses originated from a unique common ancestor. Second, viruses (retroviruses in particular) can integrate into the host genome and be transmitted vertically as well as horizontally. Third, different viral genera can evolve according to dramatically different molecular clocks. Three paradigmatic examples from the retroviridae family will be considered here: the simian foamy viruses (SFVs); the primate T-lymphotropic viruses (PTLVs), which include HTLV and STLV; and the primate lentiviruses (PLVs), which include SIV, HIV-1 and HIV-2. SFV is an example of an ancient virus that has been co-evolving with its primate hosts over the last 30 million years. PTLVs emerged around 300 thousand years ago and are characterized by frequent interspecies transmissions and multiple introductions into human populations since prehistoric times. PLVs have a much more recent origin and only within the last 200 years have been able to spread successfully within the human population. The complex relationship between population dynamics and evolutionary time-scale of these retroviruses, as well as the challenge of their integration within the tree of life, will be discussed.
Accurate estimation of evolutionary attributes of coding sequences and evolutionary fingerprinting
Codon substitution models have facilitated the interpretation of evolutionary forces operating on genomes. Most of these models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have different rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation or the adoption of a particular residue exchangeability scale. We present an alternative procedure which assigns the substitution rates between amino acid pairs to a few rate classes, dependent on the information content of the alignment. This procedure permits us to infer generalizable models for specific genes, organisms and taxonomic clades.
What poultry breeders and guinea pigs have to tell us about statistical nonmolecular phylogenetics
We are far from having an understanding of the determination of morphological characters at the genome level, so most evolutionary biologists working on them still need to use phenotypic approaches. I will discuss the prospects for using the tools of quantitative genetics, which has faced the same dilemma for the past century. I will use as examples three projects of my own. One, which is joint work with Fred Bookstein, adapts the tools of morphometrics, of which he is a chief developer, to modeling change of morphological forms on phylogenies. The second is a similar project that asks how to best place fossil forms into a phylogeny of present-day species when there is molecular data enabling us to get a good estimate of the phylogeny for those species. The third models discrete 0/1 characters using the Threshold Model developed by Sewall Wright for his work on guinea pigs. All of these lead to asking whether we can connect Brownian Motion models with quantitative genetics models. In all such cases we will have limits on what we can infer, and need to be aware of the need to carry that uncertainty through any subsequent inference using these results.
We're building the tree of life, but what can we do with it? It seems clear that there is a wealth of information about evolution in the structure of this tree. There are some methods that can use phylogenetic trees to test macroevolutionary models, but the range of models that we can test is still severely limited. In some cases, such as the estimation of extinction rates from phylogenetic trees, current methods have proven controversial. We are now beginning to develop and implement methods that use tree-of-life scale data to answer key questions in evolution. I will review three new approaches developed in my lab for analyzing comparative datasets: MECCA, fossil-Medusa, and reversible-jump MCMC. I argue that these methods represent the next generation of comparative methods that will open the door to analyzing a much broader range of models with large datasets.
For decades, biologists have addressed evolutionary and ecological questions using measurements of species traits, phylogenies, and an assortment of comparative methods. Unfortunately, while there is a large assortment of these methods, they are still fairly limited and development of new methods is slow. It took seven years between the introduction of a simple Brownian motion model for looking at trait evolution (Felsenstein, 1985) and the use of this same model for looking at rates of trait evolution (Garland, 1992), and an additional 14 years to arrive at more powerful tests using a small modification of the basic model (O'Meara et al., 2006). Still other promising methods are described and even tested but remain unavailable to empiricists because they are not put into software. As a result, the questions empiricists can ask about the world are limited by the research productivity of the few dozen scientists who develop and implement new methods in phylogenetics. We describe a new approach based on Approximate Bayesian Computation and implemented in R that will allow researchers to easily develop their own models for trait evolution without requiring them to have specialized mathematical or computational knowledge.
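As a rough illustration of the idea behind Approximate Bayesian Computation, the following Python sketch estimates a Brownian motion rate by rejection sampling. It is not the speakers' R implementation. The star-shaped phylogeny, the uniform prior, the sample variance as summary statistic, and all function names and numeric settings are assumptions made for this example only.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance, used here as the ABC summary statistic."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def simulate_tips(sigma2, n_tips, branch_length, rng):
    """Simulate tip trait values under Brownian motion on a star phylogeny.

    Assumption: every tip descends independently from a root with trait
    value 0, so each tip value is Normal(0, sigma2 * branch_length).
    """
    sd = math.sqrt(sigma2 * branch_length)
    return [rng.gauss(0.0, sd) for _ in range(n_tips)]


def abc_rejection(observed, branch_length, n_draws, tolerance, prior_max, rng):
    """Keep prior draws of sigma2 whose simulated data resemble the observed data."""
    target = sample_variance(observed)
    accepted = []
    for _ in range(n_draws):
        sigma2 = rng.uniform(0.0, prior_max)  # draw a rate from a uniform prior
        simulated = simulate_tips(sigma2, len(observed), branch_length, rng)
        if abs(sample_variance(simulated) - target) < tolerance:
            accepted.append(sigma2)
    return accepted


rng = random.Random(1)  # fixed seed so the run is reproducible
observed = simulate_tips(1.0, 50, 1.0, rng)  # stand-in "data" with true rate 1.0
accepted = abc_rejection(observed, 1.0, 5000, 0.1, 5.0, rng)
estimate = sum(accepted) / len(accepted)  # approximate posterior mean of the rate
print(f"accepted {len(accepted)} draws, posterior mean rate ~ {estimate:.2f}")
```

Only the simulator needs to change to try a different model of trait evolution, which is what makes this style of inference attractive when a likelihood is hard to write down.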
Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes
The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes -- especially at synonymous sites. We developed a method to systematically locate short regions within known ORFs that show conspicuously low estimated rates of synonymous substitution, based on phylogenetic codon rate models and likelihood ratio tests.
We applied this method to genome alignments of 29 placental mammals, resulting in more than 10,000 “synonymous constraint elements” (SCEs) with resolution down to nine-codon windows. These are found within more than a quarter of all human protein-coding genes and contain ~2% of their synonymous sites. We collected numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. We also ruled out certain alternative explanations such as codon usage bias and neutral rate variation.
Our initial results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape. Furthermore, anticipating the future availability of additional mammalian and vertebrate genomes, we are currently developing Bayesian codon modeling methods to measure synonymous rates at even higher resolutions, perhaps eventually allowing the detection of individual regulator binding sites embedded in protein-coding ORFs.
Bayesian inference of ancient human demography from individual genome sequences
Besides their value for biomedicine, individual genome sequences represent a rich source of information about human evolution. I will describe an effort to estimate key evolutionary parameters from the genome sequences of six individuals from diverse human populations. We have used a Bayesian approach based on coalescent theory to extract information about ancestral population sizes, divergence times, and migration rates from inferred genealogies at many neutrally evolving loci from across the genome. We introduce new methods for accounting for gene flow between populations and integrating over possible phasings of diploid genotypes. I will also describe a custom pipeline for genotype inference to mitigate possible biases from heterogeneous sequencing technologies, coverage levels, and read lengths. Our analysis indicates that the San of Southern Africa diverged from other human populations 108--157 thousand years ago (kya), that Eurasian populations diverged 38--64 kya, and that the effective population size of the ancestors of all modern humans was ~9,000.
Fungi occupy diverse ecological niches in roles from nutrient cycling in rainforest floors to aggressive plant and animal pathogens. Molecular phylogenetics has helped resolve many of the branches of the fungal tree of life, enabling studies of evolution across this diverse kingdom. The genome sequences from hundreds of fungi now permit us to study change in genes and gene content in this phylogenetic context and to connect molecular evolution with adaptation to ecological niches or changes in lifestyles. I will describe our work in studies contrasting pathogenic and non-pathogenic fungi and efforts to unravel the evolution of multicellularity in fungi by comparing unicellular basal fungi with multicellular mushrooms and molds.
The development of tools for data mining and use of fungal genomics is also driving the pace of molecular biology and genetics of fungi. I will highlight new approaches to make this easier and the ways data integration can inform and transform studies of functional biology of fungi.
Accurate reconstruction of insertion-deletion histories by statistical phylogenetics
The "multiple sequence alignment" is a computational artifact. In nature there is no such thing; rather, an alignment represents a partial summary either of indel history, or of structural similarity. Here we show, via evolutionary simulation tests, that all currently-available multiple alignment tools introduce systematic biases into downstream evolutionary analysis - particularly when used to reconstruct histories of insertions and deletions.
I will present our unification of Felsenstein's "pruning" algorithm and "progressive alignment" to build a fast, linearly-scaling approximate-maximum-likelihood phylogenetic alignment/reconstruction algorithm. Inference of evolutionary history in this framework displays a clear improvement in accuracy over non-statistical phylogenetic reconstructions and a massive improvement in performance over slow-running MCMC statistical reconstructions.
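For readers unfamiliar with the pruning algorithm mentioned above, here is a minimal Python sketch of it for a single alignment column. It is not the speaker's alignment and reconstruction method. The Jukes-Cantor (JC69) substitution model, the uniform root frequencies, the nested-list tree encoding, and the example tree with its branch lengths are all assumptions chosen for illustration.

```python
import math
from itertools import product

STATES = "ACGT"


def jc69_prob(same, t):
    """JC69 probability of ending in the same or a different state after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node, tip_states):
    """Return, for each state at this node, the probability of the tips below it.

    A leaf is a tip name (str); an internal node is a list of
    (child, branch_length) pairs. This encoding is hypothetical.
    """
    if isinstance(node, str):
        return [1.0 if s == tip_states[node] else 0.0 for s in STATES]
    result = [1.0] * len(STATES)
    for child, length in node:
        child_partials = partial_likelihoods(child, tip_states)
        for i, parent_state in enumerate(STATES):
            # Sum over the child's possible states, then multiply across children.
            result[i] *= sum(
                jc69_prob(parent_state == child_state, length) * child_partials[j]
                for j, child_state in enumerate(STATES)
            )
    return result


def site_likelihood(tree, tip_states):
    """Likelihood of one column, averaging over uniform root state frequencies."""
    return sum(0.25 * p for p in partial_likelihoods(tree, tip_states))


# Hypothetical rooted tree ((A:0.1, B:0.2):0.05, C:0.3)
TREE = [([("A", 0.1), ("B", 0.2)], 0.05), ("C", 0.3)]

print(site_likelihood(TREE, {"A": "G", "B": "G", "C": "T"}))

# Sanity check: the likelihoods of all 64 possible columns sum to 1.
total = sum(
    site_likelihood(TREE, dict(zip("ABC", pattern)))
    for pattern in product(STATES, repeat=3)
)
print(total)
```

Each node is visited once, so the cost grows linearly with the number of tips rather than with the number of possible ancestral state assignments.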
The key component of a probabilistic joint approach to tree and alignment inference is a Continuous Time Markov Chain (CTMC) over strings. Ideally, this CTMC should support tractable inference algorithms and should be easily extensible to support a wide range of evolutionary models. The classical string-valued CTMC, the TKF91 model (Thorne et al., 1991), is limited in both of these axes. Previous work has focussed on increasing the complexity of the TKF91 model, making the inference problem computationally more difficult (Miklos et al., 2004).
In this work, we present a new stochastic process, the Poisson Indel Process (PIP), which allows simple and practical inference algorithms. Efficient computations are based on an exchangeable representation and on Poisson processes. This representation gives a natural way of extending the capacity of the model while keeping inference computationally practical.
We used this process to design a joint Bayesian estimator over alignments and trees. We evaluated both consensus trees and alignments against standard baselines on synthetic and real data. These experiments demonstrate that competitive trees and alignments can be inferred using a Bayesian model equipped with a PIP prior.
Introduction to phytools and phangorn: phylogenetics tools for R
phytools is a new multifunctional phylogenetics package for the R statistical computing environment. The focus of the package is on methods for phylogenetic comparative biology; however it also includes tools for simulation, phylogeny input/output, manipulation, and even inference. The phytools library is designed for maximum interoperability with other important R phylogenetics packages such as ape, geiger, and phangorn.
phangorn is a package for phylogenetic reconstruction and analysis in the R language. Previously it was only possible to estimate phylogenetic trees with distance methods in R. phangorn now offers the possibility of reconstructing phylogenies with distance-based methods, maximum parsimony or maximum likelihood (ML) and performing Hadamard conjugation. Extending the general ML framework, this package provides the possibility of estimating mixture and partition models. Furthermore, phangorn offers several functions for comparing trees, phylogenetic models or splits, simulating character data and performing congruence analyses.
Introduction to HyPhy: Hypothesis testing using Phylogenies
HyPhy is an open-source software package for the analysis of genetic sequences using techniques in phylogenetics, molecular evolution, and machine learning. It features a complete graphical user interface (GUI) and a rich scripting language for limitless customization of analyses. Additionally, HyPhy features support for parallel computing environments (via message passing interface) and it can be compiled as a shared library and called from other programming environments such as Python or R.
RevBayes: An R like Environment for Bayesian phylogenetic inference
RevBayes is a computer program that uses directed acyclic graphs (DAG's) to specify any type of model, to hold the model and data in memory, and to compute the likelihood of the parameters of the model. DAG's provide a framework for the construction of modular models. Models can easily be extended and/or parts of the model exchanged (e.g., the substitution process and clock model) and several models can be combined. The design of RevBayes should allow the implementation of any extension to existing models. RevBayes is mainly developed for Bayesian phylogenetic analyses, but it can be extended to any inference on probabilistic models.
In this talk, I will give a brief introduction to the concept of DAG's and how they are used to construct a model. Once the model is specified, I will show how to simulate new observations under the model and how to estimate its parameters. I will demonstrate this in the RevLanguage, which is an R-like language for building DAG's for phylogenetic problems. The RevLanguage is used interactively to specify the model, as done with R. I will show how a full phylogenetic model is specified, step-by-step. I will mainly focus on various standard substitution models, relaxed clock models, and divergence time priors. Specifically, I will show a new birth-death model with speciation and extinction rates varying over time and use this in an integrative analysis. In the integrative analysis I condition only on the alignment (only the alignment is considered to be known) and estimate the tree and divergence times simultaneously as well as the speciation and extinction rates.
Example files for the demonstration are available here.
Structure and molecular evolution
Protein Structural, Biophysical, and Genomic Underpinnings of Protein Sequence Evolution
Common models for amino acid substitution assume that each site evolves independently according to average properties in the absence of a genomic, protein structural or functional context. Two characterizations of amino acid substitution will be presented. One approach extends a population genetic model to inter-specific genomic data and a second approach evaluates the effects of selection for protein folding and protein-protein interaction on sequence evolution. Several take home lessons include the importance of considering linkage independent of protein structure, the importance of negative pleiotropy (or not statements in folding and binding), and the nature of the co-evolution of sites and how it links standard substitution models with covarion models when binding function is conserved and when it changes.
Simulating evolution with in silico models of protein thermodynamics
Many of the most basic issues of protein evolution are difficult to determine from the relationship between existent protein sequences. We would ideally like to analyse the complete evolutionary record: what mutations were attempted when in what lineage, which ones were deleterious or advantageous and by how much, which ones were accepted, and how these substitutions affected further mutations and the overall evolution of protein properties. In the absence of available biological data, we can create our own - simulate protein evolution in silico, such as in our work modelling how proteins would evolve given their need to be thermodynamically stable. These simulations allow us to explore a range of phenomena and develop a conceptual framework that tells us which questions may be interesting and important to consider in real proteins. Such simulations can also illuminate which conditions are necessary and/or sufficient to explain observed protein characteristics. We consider how evolution of protein thermostability explains why proteins are generally marginally stable, why eukaryotes may have more disordered proteins than prokaryotes, and what the consequences of this are for biochemical networks. We also consider how various locations in a protein can co-evolve, and how this can inform the next generation of substitution models.
Adaptation, coevolution, and convergence in the context of protein thermodynamics
Interactions within and between proteins are a fundamentally important part of how they evolve and adapt. We have been considering how and why proteins adapt, coevolve, and converge, and working to understand these concepts in the context of protein thermostability and function. We will expand from the previous talk of our collaborator, Dr. Goldstein, and discuss how and why coevolution is and should be detected, and how thermostability affects reconstruction of ancestral functions. Further, we will discuss our work on adaptive redesign in mitochondrial proteins, perhaps the largest known case of an adaptive burst in multiple metabolic proteins. The convergence between ancestral snakes and ancestral acrodont lizards is also perhaps the largest known case of adaptive convergence. We will consider what these examples tell us about the theory of how proteins appear to evolve in the context of nearly neutral versus cases of adaptive change. Further, we will discuss the impact on understanding phylogenetic relationships, and we will also discuss a unified theory of nearly neutral and adaptive evolution in the context of structure and function.
Rates and Dates
Inferring macroevolutionary processes based on phylogenetic trees
Phylogenetic trees of present-day species allow inference of the rate of speciation and extinction which led to the present-day diversity. Classically, inference methods assume a constant rate of diversification, or neglect extinction. I will discuss major limitations of this null model and will present a new framework which allows speciation and extinction rates to change through time (environmental-dependent diversification), with the number of species (density-dependent diversification), and with a trait of a species (trait-dependent diversification). For the latter model, particular focus is given to the trait being the age of a species. Issues arising in empirical data analysis, such as incomplete taxon sampling, model selection, and confidence interval estimation, will be discussed. The methods reveal interesting macroevolutionary dynamics for mammals, birds and ants, and can easily be applied to other datasets using the R packages TreePar and TreeSim available on CRAN.
Understanding biodiversity patterns using the Tree of Life
Species richness results from past and current speciation, extinction and dispersal events, themselves influenced by various ecological and evolutionary processes. Estimating rates of diversification, and understanding how and why they vary over evolutionary time, geographical space, and species groups, is thus key to understanding how ecological and evolutionary processes generate biological diversity. Phylogenetic approaches are critical for making such inferences, especially in groups or regions lacking fossil data. I will illustrate how phylogenies, coupled with models of cladogenesis, can be used to test the role of ecological limits, boom-then-bust diversity dynamics, the paleoenvironment, and population dynamics on the biodiversity patterns that we observe today.
Phylogenetics and language
Language phylogenies and cultural evolution
Charles Darwin famously noted that there were many curious parallels between the evolution of species and languages. Since then evolutionary biology and historical linguistics have used trees to conceptualise evolution. However, whilst evolutionary biology developed the vast discipline of phylogenetic methods, linguistics dabbled with computational methods before rejecting them. The last decade or so has seen the introduction of phylogenetic methods into linguistics, often with some startling results. In this talk I will present some of these studies, and discuss how phylogenetics can help us grapple with the problems of linguistic and cultural evolution. These problems range from testing population dispersal hypotheses, to investigating the shape of cultural evolution, to inferring the rates at which languages change.
Anthropologists had a name for the non-independence-of-species-problem way back in the 1880s. Solving "Galton's Problem", and the promise of comparative methods for testing hypotheses about cultural adaptation and correlated evolution was a major catalyst for the field of cultural phylogenetics. In this talk I will show how linguistic, cultural, and archaeological data is used in comparative phylogenetic analyses. The "treasure trove of anthropology" - our vast ethnographic record of cultures - is now being put to good use answering questions about cross-cultural similarities and differences in human social and cultural norms in a rigorous evolutionary framework.
Bobbins, Borrowing, and Bayesian Inference: Horizontal Transfer and the application of Phylogenetic Methods in Cultural Evolution studies
Researchers have applied quantitative phylogenetic methods to study human cultural and linguistic evolution. However, a common critique of this approach is that cultural evolution and biological evolution differ in important ways that make phylogenetic analyses unsuitable for cultural data. Principally, horizontal transmission (or borrowing) of cultural and linguistic traits is argued to be so pervasive as to invalidate the approach. In this talk I will address this issue by asking how much horizontal transfer occurs, and whether it matters if it does. Contra the skeptics, I will discuss studies that demonstrate that 1) many biological systems also show non-tree-like patterns of evolution, 2) cultural systems vary in the degree to which horizontal transfer occurs, and 3) borrowing does not necessarily cause big problems. Rather than being a reason to give up on the whole project, borrowing can be productively investigated using phylogenetic techniques to yield deeper insights into cultural and linguistic evolution.
In honor of Carl Woese
Following Carl Woese into the Natural Microbial World – The Beginnings of Metagenomics
Carl Woese, one of the great scientists of all time, died in December, 2012. Among other important contributions, he used primitive sequencing technology to compare small subunit (16S) ribosomal RNA sequences from different organisms and thereby establish the outlines of a universal tree of life. His results also put in place a sequence-based reference framework within which to understand and articulate biological diversity. Since this perspective is based on molecular sequences and not properties of organisms, it opened the door to begin to understand the kinds of organisms that make up the natural microbial world. Prior to Woese’s sequence-based reference framework, microbial ecologists had to culture organisms to study them, but not many environmental organisms, <<1%, are cultured using standard methods. Sequence surveys of environmental microbial genes and genomes – “metagenomics” - have now revolutionized understanding of microbial ecology, including its influence on human health. The seminar will discuss how metagenomics developed and the impact it has had on our understanding of environmental microbial diversity and the structure of the molecular tree of life.
How Carl Woese transformed the field of microbial ecology
The challenges of dissecting naturally occurring microbial assemblages, with respect to their community composition, interspecies interactions, functional attributes, and activities, are numerous and daunting. For many years, these challenges impeded our understanding of the properties and dynamics of microbial communities, and thus hindered development of the field of microbial ecology. Enter Carl Woese: the theory and application of molecular phylogenetics and genomics in studies of microbial evolution and ecology can be traced directly to Woese and one of his primary collaborators, Norman Pace. This lecture will trace the logic and roots of the application of molecular phylogenetics and genomics to the study of microbial ecology, through a historical review and examination of its past and current applications.
Carl Woese's grand view of life that just keeps getting grander
Most microorganisms cannot be grown in pure culture (or at least not easily). This has been apparent for decades by comparing the number of cells seen under a microscope to the fraction of those cells that will grow into colony forming units (typically <1%). The objective classification of cellular life by comparative rRNA analysis pioneered by Carl Woese provided the first grand view of the tree of life and also provided the reference framework upon which his friend and colleague, Norman Pace, developed ways to directly survey microbial communities via their rRNA sequences without the need to grow them. This put our degree of ignorance of the microbial world into perspective: dozens of major microbial lineages have emerged over the last 20 years that lack even a single cultured representative. New approaches, such as deep metagenomics and single cell genomics, are now transforming the rRNA-based phylogenetic outlines of the tree of life into a fully-fledged genome-based view of the tree. I will present a recent snapshot overview of the genome tree of the bacterial and archaeal domains and examples of functional insights in the context of a more complete view of microbial evolution.
Integrating fossils into phylogenies
The Fossilized Birth-Death Process: A Coherent Model of Fossil Calibration for Divergence Time Estimation
Accurate estimates of absolute node ages are critical for addressing a wide range of questions in evolutionary biology. Because molecular sequence data are not informative on absolute time, external data–most commonly fossil age estimates–are required to calibrate estimates of species divergence times. For Bayesian divergence-time methods, the common practice for calibration using fossil information involves placing arbitrarily-chosen and parameterized parametric distributions on internal nodes, often disregarding most of the information in the fossil record. The `fossilized birth-death' (FBD) process is a model for calibrating divergence-time estimates in a Bayesian framework, explicitly acknowledging that extant species and fossils are observations from the same macroevolutionary process. Under this model, absolute node age estimates are calibrated by a single diversification model and arbitrary calibration densities are not necessary. Moreover, the FBD model allows for inclusion of all available fossils. We performed analyses of simulated data and show that node-age estimation under the FBD model results in accurate estimates of species divergence times with realistic measures of statistical uncertainty, overcoming major limitations of standard divergence time estimation methods.
The fossil record offers a rich source of macroevolutionary data. Fossils can reveal transitional forms that could not be predicted from extant taxa alone, reveal unexpected biogeographic patterns, and provide temporal information crucial for inferring rates of evolution and correlations between evolution and abiotic events. At the same time, including fossil taxa in phylogenetic analyses presents many challenges. Currently, there are a wide variety of methods for including fossil data in phylogenetic analyses, ranging from indirect use of fossil ages to inform divergence dates to simultaneous analyses of fossil and extant taxa under various optimality criteria and with varying levels of constraints. One important consideration remains that fossils typically provide only morphological data, which can lead to problems related to missing data and potential violation of common assumptions for model-based phylogeny inference methods designed primarily for molecular sequence data. Morphological character data are typically harvested from fossil taxa not at random, but with an intentional bias towards parsimony-informative characters (with apomorphies omitted from matrices). Combined with issues related to sparse codings in large combined matrices, care must be taken to avoid spurious inferences.
Phylogenetic Paleobiology: What do we stand to gain from integrating fossils and phylogenies in macroevolutionary analyses?
The aim of macroevolutionary science is to understand the patterns and processes responsible for generating organismal diversity in space and time. Although macroevolutionary change typically occurs over geologic timescales and has traditionally been studied by paleobiologists, comparative biologists have become increasingly interested in macroevolutionary questions, utilizing time-calibrated molecular phylogenies of extant taxa as a framework for testing hypotheses about rates of evolution. In this seminar, I’ll examine how integrating fossils and phylogenies can increase our power to test and answer fundamental questions about tempo and mode in phenotypic evolution. Integrating fossil taxa into phylogenies of extant taxa is worth the effort: on a per-taxon basis, fossils contribute more information about macroevolutionary pattern and process and increase our ability to distinguish processes that leave similar signals in extant species datasets. I’ll discuss some recent work, and highlight how fossil information can be used to inform macroevolutionary inference when a combined phylogeny is lacking. One theme emerges from all of this work: we stand to gain a better understanding of macroevolution not when we approach it as biologists or paleontologists but, as G.G. Simpson recommended 60 years ago, as practitioners of both.
Open Tree of Life
Technical and social challenges of synthesizing phylogenetic data across the tree of life
Open Tree of Life aims to synthesize published phylogenetic data into a comprehensive tree of life. The challenges associated with the collection, curation and synthesis of both phylogenetic and taxonomic input data are both technical and social. We present the first draft of the Open Tree of Life, as well as the workflow and software tools for curating, annotating and viewing phylogenetic data. In a subsequent Phyloseminar, Stephen Smith will present details of the phylogenetic synthesis methods.
Exploring graphs for mapping and synthesizing phylogenies
The emergence of graph databases has presented a potential alternative for ways of storing and querying phylogenetic trees. The Open Tree of Life has been exploring these options and ways that trees from multiple datasets or within a single dataset can be placed in a graph database. I will go over some of the ways that we do this and how we can query and synthesize trees as an alternative to supertrees and consensus trees. While still a work in progress, these methods show great promise for further development.
Mini-course on genome-scale phylogeny
Inferring gene trees with species trees
Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or are horizontally transferred. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees.
I introduce models that describe the relationship between gene trees and species trees. I begin with models that account for gene duplication and loss, and subsequently introduce models that account for the horizontal transfer of genes. I review results from simulations as well as empirical studies on genomic data that show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. I also discuss the possibility of extracting information on the timing of speciation events from ancient horizontal transfer events.
Gene tree-species tree methods for comparative genomics
In this second talk of our series on genome-scale phylogeny, I build upon Gergely's introduction and present the modelling assumptions and algorithmic details behind some of the methods we and others have developed. There will be two parts to this talk. I start with the model of gene duplications and losses implemented in PHYLDOG. I present the assumptions we make and the shortcuts we take to improve the program's efficiency, and show some results on real and simulated sequence data. I notably show problems that arise when the program is confronted with data generated with a model of incomplete lineage sorting (Rasmussen and Kellis, 2012), and present avenues of research to find solutions to these problems. In the second part, I present our current efforts to use our model of gene duplication, loss, and transfer (Szöllősi et al., 2013) to infer a species tree in which speciation nodes are ordered in time. I briefly remind the forgetful viewer of what this model does and how it works, and I then explain how we devised a new MCMC algorithm to use it on data sets containing dozens of species and thousands of gene families. I finish with some perspectives on our plans to unite gene tree-species tree models with databases of gene families and phylogenetic trees.
Genome rearrangements were discovered and used to build molecular phylogenies in the 1930s. They are implicated in many cancers and their evolutionary role might be of primary importance. But the mathematical and computational tools to model rearrangements are still not as efficient as the ones developed later for local mutations such as nucleotide or amino-acid substitutions. In this seminar I will report on attempts to integrate genome organisation into the usual models of genome evolution. I will explain how this can improve the inference of phylogenies, as well as ancestral genomes.
Phylogenetics of cancer
Phylogenetic analysis of metastatic colon cancer in humans
Metastasis is the main cause of cancer morbidity and mortality. Despite its clinical significance, several fundamental questions about the metastatic process in humans remain unsolved. Does metastasis occur early or late in cancer progression? Do metastases emanate directly from the primary tumor or give rise to each other? How does heterogeneity in the primary tumor relate to the genetic composition of secondary lesions? Addressing these questions – ideally by examining the genetic makeup of tumor cells in distinct anatomic locations and reconstructing their evolutionary relationships – is crucial to improving our understanding of metastasis. I will give an overview of a simple PCR-based assay that enables the tracing of tumor lineage in patient tissue specimens. The methodology relies on somatic variation in highly mutable polyguanine (poly-G) repeats located in non-coding genomic regions. Poly-G mutations are present in a variety of human cancers. In colon carcinoma, an association exists between patient age at diagnosis and tumor mutational burden, suggesting that poly-G variants accumulate during normal division in colonic stem cells. Poorly differentiated colon carcinomas (which have a worse prognosis) have fewer mutations than well-differentiated tumors, possibly indicating a shorter mitotic history of the founder cell in these cancers. By presenting several patient case studies, I will describe how poly-G fingerprints can be used to construct phylogenetic trees that reflect the evolution of metastatic colon cancer, with an emphasis on how biological considerations inform analysis strategies.
Tumour heterogeneity, i.e. the genomic diversity of cancer cells within a single tumour, is thought to be the source of chemotherapy resistance. In many cancers, this heterogeneity is not limited to point mutations but includes large-scale genomic rearrangements and endoreduplications that lead to aberrant copy number (CN) profiles. Reconstruction of the evolutionary tree of cancer within the patient allows us to quantify and understand the aetiology of tumour heterogeneity. In some cancers, such as high-grade serous ovarian cancer (HGSOC), CN profiles predominate. However, tree inference is hindered by unknown phasing of major and minor CNs, horizontal dependencies between adjacent genomic loci and the lack of curated CN profile databases to use as a reference for probabilistic inference.
We recently developed MEDICC (Minimum Event Distance for Intra-tumour Copy number Comparisons), an algorithm for phylogenetic reconstruction based on CN profiles. MEDICC uses finite-state transducers (FSTs) to encode a minimum evolution criterion that determines pairwise evolutionary distances between CN profiles. This minimum-event distance is the smallest number of amplifications and deletions of arbitrary length that are necessary to transform one genomic profile into another. The FST-based approach thereby allows us to model dependencies between sites, similar to the problem of modelling indels on trees in traditional phylogenetics. Using this approach we are able to phase major and minor CN profiles to the parental alleles and infer trees and ancestral genomes, while minimizing the overall tree length. The distance measure is formulated such that the resulting matrix of pairwise distances has a direct mapping to a positive semi-definite kernel matrix. This allows us to perform principal component analysis in evolutionary space and use this embedding to numerically quantify tumour heterogeneity and other quantities of interest, such as the degree of clonal expansion, using spatial statistics.
I will talk about the basics of FST-based phylogenetic inference and explain how FSTs can be used to model genomic rearrangement events with horizontal dependencies. I will explain how this approach implicitly maps genomes into a feature space in which we can quantify heterogeneity. Finally, I will present clinical results that show how this quantification of intra-tumour heterogeneity (ITH) can predict resistance development in the hospital.
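To make the idea of a minimum-event distance concrete, here is a minimal sketch in Python. It is not MEDICC's FST-based algorithm: it assumes a single unphased CN profile per sample, treats every event as a +1 or -1 change on a contiguous run of loci, and ignores the biological constraint that a segment lost entirely cannot be regained. The function names are illustrative only.

```python
from typing import List, Sequence


def min_event_distance(source: Sequence[int], target: Sequence[int]) -> int:
    """Fewest +1/-1 events on contiguous segments that turn source into target.

    Simplifying assumption (not part of MEDICC): intermediate copy numbers
    may take any integer value, so a copy number of zero is not absorbing.
    """
    if len(source) != len(target):
        raise ValueError("profiles must cover the same loci")
    # Required change per locus, padded with 0 on both sides.
    diff = [0] + [t - s for s, t in zip(source, target)] + [0]
    # Each segment event adds one unit step up and one unit step down
    # between neighbouring loci, so the total height of the upward steps
    # is the minimum number of events.
    return sum(max(diff[i] - diff[i - 1], 0) for i in range(1, len(diff)))


def distance_matrix(profiles: Sequence[Sequence[int]]) -> List[List[int]]:
    """Pairwise minimum-event distances between all profiles."""
    return [[min_event_distance(a, b) for b in profiles] for a in profiles]


# Hypothetical toy profiles over four loci.
normal = [2, 2, 2, 2]
sample_1 = [2, 3, 3, 2]   # one amplification of the two middle loci
sample_2 = [1, 3, 3, 1]   # needs three events from the normal profile
print(distance_matrix([normal, sample_1, sample_2]))
# → [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
```

A matrix like this could then feed a distance-based tree method or an embedding; the kernel construction and phasing described in the abstract are beyond this sketch.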
The genetic diversity of many pathogens is shaped by epidemiological history. However, the dynamics of infectious disease epidemics differ in important ways from demographic processes that have traditionally been studied by population geneticists. In many epidemics, the population size and birth rate change rapidly and nonlinearly through time. Mathematical models for describing infectious disease dynamics have a long history that has run parallel to the development of modern population genetics, but until recently, there has been little communication between these fields. Interest has grown in developing a new set of mathematical models for genealogies generated by epidemic processes. These methods reveal how the effective population size of a pathogen depends on transmission rates, the number of infected hosts, and the size of the bottleneck at the time of transmission. These mathematical models have also enabled new applications of pathogen genetic data to public health. Pathogen genetic data can be informative about epidemic processes in ways that standard surveillance data are not, especially regarding the source of infections and risk factors for transmission. I will review several approaches to mathematical modeling of pathogen genealogies and present applications of these methods to HIV-1 and the recent Ebola virus epidemic in Western Africa.
Major recent advances in genome sequencing technology make it feasible that in future epidemics, a sequence will be available for every clinical case that can be identified. In some scenarios, such as agricultural epidemics (where farm-to-farm spread is of more interest than animal-to-animal), diseases such as HIV (where most infected individuals will eventually present themselves to clinicians), and epidemics occurring in well-monitored populations such as hospital inpatients, we will as a consequence be able to acquire a set of sequences representing the pathogens infecting most or all cases in the transmission chain. Genetic data therefore provide an important new tool for the investigation of epidemics, in particular the determination of the epidemic's transmission tree, which describes which case infected which others. As the genetic diversity in a set of sequences taken from the same epidemic will not be enormous even for fast-evolving RNA viruses, the best approach would be to combine both genetic and epidemiological data. I present here a new method for transmission tree reconstruction which is integrated into the Bayesian phylogenetics framework available in BEAST. It is based on the observation that if the phylogeny is known, there is a one-to-one correspondence between possible transmission trees and partitions of the internal nodes of the tree into connected subgraphs. The MCMC procedure in BEAST has been modified to sample from the space of trees with nodes partitioned in this way, simultaneously estimating both phylogenetic tree and transmission tree. Rather than assuming that the entire tree is generated by a single coalescent process, the posterior probability of a phylogeny is now calculated based on an individual-based model of disease transmission, which can take into account epidemiological characteristics of the host cases, such as spatial location.
I will outline results using simulated data and sequences from the 2003 Dutch epidemic of H7N7 avian influenza.
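The correspondence between node partitions and transmission trees can be illustrated with a small Python sketch. This is not the BEAST implementation; it assumes (my reading, not stated in the abstract) that each host contributes exactly one sampled tip and that every node of the phylogeny carries the label of the host it is assigned to. The function name and data layout are hypothetical.

```python
from typing import Dict, Optional


def transmission_tree(
    parent: Dict[str, Optional[str]],
    host: Dict[str, str],
) -> Dict[str, Optional[str]]:
    """Map each host to its infector (None for the index case).

    parent: node -> parent node in the phylogeny (None for the root).
    host:   node -> host whose infection that node is assigned to.

    In a rooted tree, a host's nodes form a connected subgraph exactly when
    only one of them has a parent outside the block (or no parent at all).
    The host labelling that parent is the infector.
    """
    infector: Dict[str, Optional[str]] = {}
    for node, h in host.items():
        up = parent[node]
        if up is not None and host[up] == h:
            continue  # interior of the block, not its entry point
        if h in infector:
            raise ValueError(f"nodes assigned to host {h!r} are not connected")
        infector[h] = None if up is None else host[up]
    return infector


# Toy phylogeny: ((tipA, tipB)X, tipC)R, with one tip per host A, B, C.
parent = {"R": None, "X": "R", "tipA": "X", "tipB": "X", "tipC": "R"}
host = {"R": "A", "X": "A", "tipA": "A", "tipB": "B", "tipC": "C"}
print(transmission_tree(parent, host))
# → {'A': None, 'B': 'A', 'C': 'A'}
```

Relabelling the internal nodes (for example assigning R to host C) yields a different transmission tree on the same phylogeny, which is the space the modified MCMC explores.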
Phylodynamic methods are widely used to estimate demographic parameters and historical population dynamics from genealogies of individuals sampled from a population. In this phyloseminar, I will describe how we can understand genealogies in terms of basic demographic or ecological processes, and how these concepts can be used to develop statistical models for inference. In particular, I will discuss some similarities and differences between the two main modeling frameworks in phylodynamics: the coalescent and birth-death models. I will also briefly introduce some of the latest statistical methods currently used to fit these models to genealogies. I will end by discussing one of the main challenges facing the field: adequately representing the structure of complex, heterogeneous populations in phylodynamic models.
Ancestral recombination graphs
Mathematical and visualization tools for working with ancestral recombination graphs
The fields of phylogenetics and population genetics share several important models including gene trees, species trees, ancestral recombination graphs (ARGs), and pedigrees. These models are all closely related and can be viewed as subgraphs of one another. Amongst them, the ARG is particularly central and if inferred efficiently can enable many applications such as inference of selection and demography. Here, I will review various helpful mathematical tools for working with ARGs, including what we call the threading algorithm, the branch graph, and the leaf trace visualization.
An empirical view of the population pedigree
Often, the summary statistics of population genetics are framed in the setting of Kingman's coalescent or related models. These statistics can be alternatively thought of as descriptive statistics of the realized population pedigree-with-recombination, in a way that has become much more useful in the era of whole-genome sequencing. For instance, pairwise number of nucleotide differences is proportional to "effective population size", which is sometimes more usefully thought of as an estimate of the average length of the path through the pedigree to the most recent common ancestor at a randomly chosen locus (with an explicit standard error). Another example is the pairwise distribution of long tracts of IBD, which provides an estimate of a functional of the entire distribution of such paths.
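As a rough illustration of reading pairwise differences as path lengths, the Python sketch below computes the mean number of pairwise differences from aligned sequences and converts a per-site value into an average time to the most recent common ancestor. It assumes an infinite-sites model with a known per-site, per-generation mutation rate mu, so that expected per-site differences equal 2 * mu * T. The function names and the mutation rate are illustrative assumptions, not taken from the talk.

```python
from itertools import combinations
from typing import Sequence


def mean_pairwise_differences(seqs: Sequence[str]) -> float:
    """Average number of differing sites over all pairs of aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def mean_coalescence_time(pi_per_site: float, mu: float) -> float:
    """Average generations back to a pair's common ancestor at a random locus.

    Two lineages separated for T generations accumulate 2 * mu * T expected
    differences per site, hence T = pi / (2 * mu).
    """
    return pi_per_site / (2.0 * mu)


# Hypothetical toy alignment of three sequences over four sites.
alignment = ["AAAA", "AAAT", "AATT"]
pi_per_site = mean_pairwise_differences(alignment) / len(alignment[0])
print(pi_per_site)
print(mean_coalescence_time(pi_per_site, mu=1e-8))  # mu is an assumed value
```

Under a diploid Wright-Fisher model the expected pairwise coalescence time is 2 * Ne generations, which recovers the familiar relation of per-site diversity to 4 * Ne * mu.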
A demography-aware conditional sampling distribution for inferring ancient demography and detecting introgression patterns
Complex demographic histories shape the genealogies of contemporary individuals and thus have a substantial impact on the genetic variation observed today. These genealogies are commonly modeled by the ancestral recombination graph (ARG), and we developed a novel demography-aware conditional sampling distribution (CSD) to approximate these ARGs under general demographic models. We apply this CSD in an expectation-maximization framework for demographic inference. We show that this method can accurately recover biologically relevant demographic parameters like population divergence times, migration rates, or ancestral population sizes from simulated datasets. Furthermore, we apply the CSD to detect tracts of genetic material that introgressed from Neanderthal into modern humans. Our results are in general agreement with previously published results, and we will discuss the similarities and differences, and their biological implications.
Ebola virus epidemiology, transmission, and viral evolution from four months of sequencing in Sierra Leone (Overview)
Adding to the work reported in Gire et al. (Science, 2014), which sequenced Ebola viruses from the first three weeks of the epidemic in Sierra Leone, we here present analyses of 150 additional viral genomes sampled from EVD cases at Kenema Government Hospital between June and September 2014. We describe continued evidence for sustained human-to-human transmission with no additional zoonotic events, and preliminary results concerning new lineages from Guinea. We also characterize the epidemiological history of the limited number of viruses exported from the country. In addition, we observe a slowing of the viral substitution rate over the course of the outbreak, consistent with the increased effect of purifying selection as the outbreak continues over time. These findings allow a closer view of viral evolution during its extended time in human populations and provide critical insights into the movement of the virus through the region.
This is the first talk in a pair of talks from collaborators Daniel Park and Gytis Dudas concerning their analysis of Ebola virus sequences.