A new type of Hidden Markov Models to predict complex motif organization in protein sequences

Raluca Uricaru 1 Laurent Brehelin 1 Eric Rivals 1
1 MAB - Méthodes et Algorithmes pour la Bioinformatique
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
Abstract: Proteins are composed of structural units, called domains, which can be detected using computational methods. In many families, proteins contain several domains, or several conserved sequence parts, called motifs. During evolution, rearrangements (swaps or circular permutations) may alter the relative order of these units (domains or motifs) [Bornberg et al. 05], while tandem duplications can change their number. Within a family, this results in a variable organization of the units along the chain. Profile HMMs (pHMMs) are the preferred models to represent motifs or domains, and they serve to recognize new members of a protein family. A pHMM has a linear structure that models a single unit; if iterated, it can detect repeats of this unit. To perform a "multiple tagging", that is, to identify the sequence regions corresponding to each possible motif, one needs to combine the results of the individual pHMMs. So far, this has been done with heuristic criteria (e.g., choosing the best-scoring motif when detections overlap), and no method was available to perform a multiple tagging automatically and optimally according to a global criterion. To solve this problem, we generalize pHMMs to a novel structure called Cyclic Profile HMMs (CpHMMs), which can predict the most probable motif organization of a protein. CpHMMs can cope with repeated units whose number varies and with changes in the relative order of the units. We used CpHMMs to tag multiple motifs in the pentatricopeptide repeat (PPR) protein family of plants. PPR proteins contain tandem repeats of PPR motifs (named P, L, S) as well as other motifs (E, E+, Dyw). Half of the PPR proteins form the PPRP subfamily, and the other half form the PCMP subfamily. The organization of PPRPs can be described by a PROSITE-like regular expression in which letters represent motifs: (P*-S*)*. The organization of PCMPs is described by: (P-L-S*)*-[E-[E+ -[Dyw]]] [Lurin et al. 04].
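The two PROSITE-like expressions can be checked mechanically. The following Python sketch is only an illustration and not part of the tool described here. It assumes that square brackets mark optional parts, and it introduces hypothetical one-letter codes (F for E+, D for Dyw) so that a motif organization becomes a plain string. Since (P*-S*)* accepts any sequence of P and S motifs, it is written as the equivalent [PS]*.

```python
import re

# Hypothetical one-letter codes, introduced only for this sketch.
CODES = {"P": "P", "L": "L", "S": "S", "E": "E", "E+": "F", "Dyw": "D"}

# (P*-S*)* accepts every sequence made of P and S motifs.
PPRP = re.compile(r"[PS]*")
# (P-L-S*)*-[E-[E+ -[Dyw]]], reading [...] as "optional" (an assumption).
PCMP = re.compile(r"(?:PLS*)*(?:E(?:FD?)?)?")


def matching_subfamilies(motifs):
    """Return the names of the expressions that the motif list matches."""
    encoded = "".join(CODES[m] for m in motifs)
    names = []
    if PPRP.fullmatch(encoded):
        names.append("PPRP")
    if PCMP.fullmatch(encoded):
        names.append("PCMP")
    return names
```

For example, `matching_subfamilies(["P", "L", "S", "E", "E+", "Dyw"])` returns `["PCMP"]`, while a list such as `["P", "L", "E+"]` matches neither expression because E+ is only allowed after E.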
We used the pHMMs of the individual motifs as input to build a CpHMM that can process the entire motif sequence. The CpHMM yields the globally optimal "multiple tagging" solution and also measures its statistical significance. We thus obtained an automatic "multiple tagging" tool and applied it first to the Arabidopsis PCMP subfamily. We validated our results against the manual motif annotation available for the PCMPs of Arabidopsis and found less than 10% discordance. We were also able to retrieve the expected distribution of PCMP classes [Rivals et al. 06]. After this validation, we used our tool to annotate the PPRP subfamily in Arabidopsis and to perform the first motif annotation of the complete PPR family of rice (522 proteins). This makes it possible to compare the PPR motif organization between the model plants of the monocotyledons and the dicotyledons. In conclusion, we demonstrated the capacity of the CpHMM structure to solve the "multiple tagging" problem and verified its efficiency on a large number of sequences. CpHMMs are a versatile structure that can be adapted to new situations, such as the PPR family of other species, or to other protein families with complex unit organization (e.g., leucine-rich repeats). In future work, we will concentrate on the fine identification of new subfamilies.
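To illustrate what decoding a globally optimal tagging over a model with cycles can look like, here is a minimal Python sketch. It is a toy and not the CpHMM architecture of this work: each motif is collapsed to a single hypothetical state (A, B) next to a background state (bg), the three-letter alphabet and all probabilities are invented, and standard Viterbi decoding is used as a stand-in for the actual algorithm. Because every state can reach every other one, motifs may repeat and appear in any order.

```python
import math

# Toy states: background plus two hypothetical motifs (all values invented).
STATES = ["bg", "A", "B"]
PREFERRED_SYMBOL = {"bg": "x", "A": "a", "B": "b"}


def log_emit(state, symbol):
    # A state emits its preferred symbol with 0.8, any other with 0.1.
    return math.log(0.8 if PREFERRED_SYMBOL[state] == symbol else 0.1)


def log_trans(prev, cur):
    # Stay with 0.8, move to each other state with 0.1 (cycles allowed).
    return math.log(0.8 if prev == cur else 0.1)


def viterbi(sequence):
    """Return the most probable state path (one tag per symbol)."""
    if not sequence:
        return []
    # Uniform start distribution over the states.
    score = {s: math.log(1.0 / len(STATES)) + log_emit(s, sequence[0])
             for s in STATES}
    back = []  # back[i][cur] = best predecessor of cur at position i + 1
    for symbol in sequence[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, cur))
            new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                              + log_emit(cur, symbol))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these toy parameters, `viterbi("aabbbaa")` tags the sequence as A, A, B, B, B, A, A: the path leaves motif A, passes through B, and returns to A, which a single linear model could not express. The whole path is chosen at once by one global score rather than by resolving overlapping detections one at a time.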

Contributor: Eric Rivals
Submitted on: Wednesday, December 13, 2006 - 14:57:48
Last modified on: Thursday, January 11, 2018 - 06:26:13


  • HAL Id : lirmm-00120171, version 1



Raluca Uricaru, Laurent Brehelin, Eric Rivals. A new type of Hidden Markov Models to predict complex motif organization in protein sequences. Integrative Post-Genomics, 2006, Lyon, p. 43. 〈lirmm-00120171〉


