Bias and Benefit Induced by Intra-Species Paralogy in Guilt by Association Methods to Predict Protein Function
Abstract
Most genomes contain a large number of orphan genes. For example, 60\% of the genes of P. falciparum (the main causal agent of malaria) lack functional annotation. New approaches, commonly referred to as Guilt by Association (GBA) methods, have been proposed to functionally annotate orphan genes. These methods apply unsupervised or supervised classification to post-genomic data. They typically use transcriptome and protein interaction data, but do not exploit paralogy, even though homologous genes from the same species (i.e. paralogues) tend to share similar function(s).

Results: This article focuses on the effects of paralogy on GBA methods. We illustrate the strength of the dependence between functional annotations and paralogy, and show that applying any GBA method without accounting for this form of homology has two opposite effects: it leads to over-estimating the accuracy of the method on orphan genes, and it forfeits the benefit of paralogy. We present and discuss a resampling algorithm that correctly estimates the performance of any GBA method, as well as a simple, general scheme that can be combined with any GBA method in order to benefit from paralogy. Both procedures are used to measure the bias and the benefit on transcriptomic data from yeast and P. falciparum, two organisms that are well and poorly annotated, respectively. Our results show that both the bias and the benefit induced by paralogy may be substantial, depending on the GBA method considered, the data, and the organism.

Conclusions: The search for annotated paralogues should be incorporated into the design of any GBA method. Moreover, our resampling procedure should be used routinely to obtain predictions with unbiased performance estimates, which is of prime importance, e.g., when choosing among contradictory predictions or selecting target genes for wet-lab experiments.
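The abstract does not spell out the resampling algorithm, so the following is only a minimal sketch of one generic way to keep paralogy from inflating a performance estimate: build cross-validation folds in which all members of a paralogous family fall into the same fold, so that a test gene never has an annotated paralogue in the training set. It should not be read as the authors' procedure. The function name `family_aware_folds`, the input mapping `gene_to_family`, and the greedy balancing heuristic are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def family_aware_folds(gene_to_family, n_folds=5, seed=0):
    """Split genes into folds so that each paralogous family stays in one fold.

    gene_to_family: dict mapping a gene id to a family id (hypothetical input;
    a gene without paralogues can simply use a family id of its own).
    Returns a list of n_folds lists of gene ids.
    """
    # Group the genes by family.
    families = defaultdict(list)
    for gene, family in gene_to_family.items():
        families[family].append(gene)

    # Shuffle families reproducibly, then order them largest first.
    # The sort is stable, so ties keep their shuffled order.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)
    family_ids.sort(key=lambda f: -len(families[f]))

    # Greedy balancing: put each family into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for family in family_ids:
        smallest = min(folds, key=len)
        smallest.extend(families[family])
    return folds


if __name__ == "__main__":
    # Toy example with made-up gene and family identifiers.
    toy = {"g1": "A", "g2": "A", "g3": "B", "g4": "C", "g5": "C", "g6": "C"}
    for i, fold in enumerate(family_aware_folds(toy, n_folds=3)):
        print(i, sorted(fold))
```

Evaluating a GBA method on such folds, and comparing the result with an ordinary random split of the same genes, gives a rough idea of how much of the apparent accuracy depends on paralogues being visible during training.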