next on phyloseminar.org
Combating phylogenetic artefacts by modeling site-specific substitution processes with mixture models and approximations
The most widely used phylogenetic models of amino acid substitution involve a single reversible empirical substitution matrix (e.g. LG, WAG, or JTT) and a mixture model of rate heterogeneity across sites, such as a discretized gamma distribution. However, these models fail to capture important constraints on protein sequence evolution, heterogeneity in the substitution process across the tree, and heterogeneity across multiple proteins in a concatenated data matrix. Failure to model these features of the data can lead to artefacts in phylogenetic reconstructions, especially for "deep" phylogenetic problems. Here I focus on the importance of modeling site-specific heterogeneity in the substitution process.
The structural and functional roles of residues in proteins constrain the kinds of amino acids that may be substituted at each position over time, a feature that is not captured by single-matrix models. Site-heterogeneous mixture models have been developed to address this issue. For example, the "CAT" mixture models (CAT-Poisson or CAT-GTR), implemented in the PhyloBayes program, have been shown to successfully avoid long branch attraction problems associated with single-matrix analyses in a number of published cases. However, the utility of these and other mixture models is severely limited for very large phylogenomic analyses because of their computational time cost and memory usage. I will discuss several simple, rapid, and efficient approximations to these full profile mixture models. Our simulation and empirical data analyses demonstrate that these approximations ameliorate long branch attraction artefacts and, in several cases, provide more accurate estimates of phylogenies than the mixture models from which they derive.
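To make the idea of a profile mixture concrete, the following minimal sketch computes the likelihood of one alignment site for a pair of sequences as a weighted sum over frequency profiles, each driving a Poisson (F81-style) substitution process. It illustrates the general principle only and is not the PhyloBayes implementation or the approximations discussed in the talk. The four-state toy alphabet, the two profiles, the weights, and all function names are assumptions chosen for illustration.

```python
import math

def f81_transition(profile, i, j, t):
    """P(i -> j | t) for a Poisson/F81-style process with stationary frequencies `profile`."""
    # Scale the rate so that one unit of branch length equals one expected substitution.
    mu = 1.0 / (1.0 - sum(p * p for p in profile))
    stay = math.exp(-mu * t)
    return stay * (1.0 if i == j else 0.0) + (1.0 - stay) * profile[j]

def pair_site_likelihood(profile, i, j, t):
    """Likelihood of observing states i and j at one site, separated by branch length t."""
    return profile[i] * f81_transition(profile, i, j, t)

def mixture_site_likelihood(weights, profiles, i, j, t):
    """Profile mixture: weighted sum of the per-profile site likelihoods."""
    return sum(w * pair_site_likelihood(p, i, j, t)
               for w, p in zip(weights, profiles))

# Hypothetical toy setup: 4 states instead of 20 amino acids.
constrained = [0.7, 0.1, 0.1, 0.1]   # site class that strongly prefers state 0
flat = [0.25, 0.25, 0.25, 0.25]      # unconstrained site class
weights = [0.5, 0.5]
profiles = [constrained, flat]
site_lik = mixture_site_likelihood(weights, profiles, 0, 0, 0.5)
```

A site showing the preferred state in both sequences is far more probable under the constrained profile than under the flat one, which is the kind of signal a single-matrix model cannot represent.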
Modeling substitutional heterogeneity and its impact on inferring relationships
Heterogeneity in amino acid substitution is an inherent feature of most phylogenomic-scale datasets, and modeling such heterogeneity is now widely seen as important for phylogenomic inference. Site-heterogeneous substitution models such as CAT-F81 and CAT-GTR, as implemented in PhyloBayes, have been forcefully advocated for use on large datasets because they may reduce long-branch attraction artifacts that could result from not adequately modeling amino acid substitutional heterogeneity. However, site-heterogeneous models arguably became popular not because of a deep appreciation for how well they modeled substitutional heterogeneity, but rather because analyses with CAT models often resulted in trees that matched preconceived notions of animal phylogeny (e.g., sponges as the sister lineage to all other extant animals). Importantly, site-heterogeneous models have not been thoroughly compared to other methods for modeling substitutional heterogeneity such as coarse modeling of heterogeneity with data partitioning coupled with site-homogeneous models such as WAG or LG. Here, I show through analyses of simulated and empirical data that data partitioning often performs as well as, or better than, site-heterogeneous CAT models. In contrast to past claims, I demonstrate that partitioning with site-homogeneous models suppresses long-branch attraction artifacts as well as CAT-GTR and much better than CAT-F81. Analyses with data partitioning and site-homogeneous models can require orders of magnitude less computational time than popular site-heterogeneous models, while still resulting in reasonably accurate trees. Although site-heterogeneous models may describe the amino acid substitutional process much better than data partitioning with site-homogeneous models, current implementations of the most popular site-heterogeneous models do not appear to result in more accurate phylogenetic hypotheses than those inferred with partitioning. 
Thus, the need to model fine-scale site-heterogeneity in phylogenetic inference is called into question.
Systematic errors in phylogenomic studies: on the importance of modeling pattern-heterogeneity across sites.
While all models now used in phylogenetic analyses account for rate-heterogeneity across sites, the case of pattern-heterogeneity (i.e. qualitative variation in substitution processes across nucleotide or amino-acid positions) is much less clear and has recently been the subject of some controversy. One main question is whether pattern-heterogeneity should be modelled at the level of genes (or groups of genes), or at the level of sites. Both approaches have been used in recent phylogenomic analyses of metazoans, sometimes leading to radically different conclusions, in particular concerning the early patterns of diversification within this group.
In this talk, I will first explore the empirical evidence concerning the presence, and the relative importance, of either type of heterogeneity in empirical sequence alignments. Then, I will introduce Dirichlet process mixture models accounting for site-specific amino-acid preferences. The statistical meaning of Dirichlet processes, as a non-parametric method for estimating arbitrary distributions of site-specific effects, will be explained and illustrated through simulation experiments. Finally, based on simulations implementing pattern heterogeneity simultaneously at both the gene and the site levels, I will show the importance of using models explicitly accounting for pattern-heterogeneity across sites for reconstructing accurate phylogenies.
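As a rough illustration of how a Dirichlet process lets the number of site categories grow with the data, the sketch below draws site-to-category allocations from its Chinese restaurant process representation. It samples from the prior only and does not perform the posterior inference a phylogenetic program would carry out. The function name, the concentration value, and the number of sites are assumptions made for this example.

```python
import random

def crp_assignments(n_sites, alpha, rng):
    """Allocate sites to categories under a Chinese restaurant process prior.

    Each new site joins an existing category with probability proportional to
    that category's current size, or opens a new category with probability
    proportional to the concentration parameter `alpha` (must be > 0).
    """
    counts = []   # counts[k] = number of sites currently in category k
    labels = []   # labels[s] = category index of site s
    for n_seen in range(n_sites):
        # Total weight: n_seen (existing categories) + alpha (new category).
        r = rng.uniform(0.0, n_seen + alpha)
        chosen = len(counts)          # default: open a new category
        cumulative = 0.0
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        labels.append(chosen)
    return labels, counts

# Hypothetical example: 200 alignment sites, concentration 1.0, fixed seed.
labels, counts = crp_assignments(200, 1.0, random.Random(0))
```

In a full site-heterogeneous model, each category would carry its own amino-acid frequency profile, so the number of distinct profiles is inferred rather than fixed in advance.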