Accounting for exposure and secondary structure in protein evolution: models and gains
Abstract
It has long been recognized that substitution processes vary depending on structural configurations. However, this information is rarely, if ever, used in phylogenetic studies, even though the structures of tens of thousands of proteins have been elucidated. Here we revisit the question in order to fill this gap. We used a very large dataset comprising 4,389 protein alignments with structural annotations to estimate new amino-acid substitution matrices for various structural configurations. Moreover, we used an independent sample of 500 alignments to evaluate the gain in tree likelihood brought by these new matrices. We considered several ways to combine these models (matrices), namely, separate analysis based on available annotations, mixtures (assuming no structural information), and a combination of both based on an estimated parameter that reflects the reliability of structural annotations. Our results show that separate analysis and mixtures are nearly equivalent on average, while our confidence-based approach is best thanks to its ability to detect poorly annotated proteins. The highest likelihood values are obtained with six structural categories combining the exposed/buried and alpha/beta/other status of the sites; the average gain is as high as 1.16 AIC points per site, compared to the standard WAG model. This six-category model is closely followed by the two-category exposed/buried model, while the three-category model based on secondary structure is worse, but still better than WAG. All these likelihood gains induce significant topological changes in the inferred trees, indicating that our models should be used routinely by phylogeneticists.
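To make the three combination schemes and the reported "AIC points per site" concrete, the following minimal Python sketch works on a toy table of per-site likelihoods, one value per structural category. It is illustrative only. The function names, the toy numbers, and in particular the exact form of the confidence-based likelihood (a weight `rho` on the annotated category and `1 - rho` on the mixture) are assumptions made for this example, not the formulation used in the study. The AIC is computed with its standard definition, 2k - 2 ln L.

```python
import math


def separate_loglik(site_liks, annotations):
    """Separate analysis: each site uses only the matrix of its annotated category."""
    return sum(math.log(liks[a]) for liks, a in zip(site_liks, annotations))


def mixture_loglik(site_liks, weights):
    """Mixture: each site averages over all categories, ignoring annotations."""
    return sum(
        math.log(sum(w * l for w, l in zip(weights, liks))) for liks in site_liks
    )


def confidence_loglik(site_liks, annotations, weights, rho):
    """Assumed confidence-based form: rho * annotated + (1 - rho) * mixture.

    rho is a hypothetical per-protein parameter standing for annotation reliability.
    """
    total = 0.0
    for liks, a in zip(site_liks, annotations):
        mix = sum(w * l for w, l in zip(weights, liks))
        total += math.log(rho * liks[a] + (1.0 - rho) * mix)
    return total


def aic(loglik, n_params):
    """Standard Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * loglik


def aic_gain_per_site(loglik_ref, k_ref, loglik_new, k_new, n_sites):
    """AIC improvement of the new model over the reference, divided by alignment length."""
    return (aic(loglik_ref, k_ref) - aic(loglik_new, k_new)) / n_sites


# Toy data (made up): 3 sites, 2 categories (index 0 = exposed, 1 = buried).
site_liks = [[0.02, 0.01], [0.005, 0.03], [0.01, 0.01]]
annotations = [0, 1, 0]
weights = [0.5, 0.5]

ll_sep = separate_loglik(site_liks, annotations)
ll_mix = mixture_loglik(site_liks, weights)
ll_conf = confidence_loglik(site_liks, annotations, weights, rho=0.8)
print(ll_sep, ll_mix, ll_conf)
```

Under this assumed form, setting `rho` to 1 recovers the separate analysis and setting it to 0 recovers the mixture, which is one way to see how a single estimated parameter could down-weight unreliable annotations.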