Tabach et al., 2013 Correlated conservation
The phylogenetic profiles of approximately 20,000 C. elegans proteins were determined in 85 genomes representing diverse taxa of the eukaryotic tree of life: 33 animals, 6 land plants, 1 alga, 31 Ascomycota fungi, 3 Basidiomycota fungi and 12 protists. A Bayesian approach was used to integrate the phylogenetic profile analysis with predictions from diverse transcriptional coregulation and proteome interaction data sets, assigning each protein a probability of having a role in a small RNA pathway. A non-binary method of phylogenetic profiling was developed and used to cluster all protein sequences encoded by C. elegans genes. BLAST scores were normalized for the length of the query sequence and for the relative phylogenetic distance between C. elegans and the queried organism. The matrix of 864,644 conservation scores for the 10,054 C. elegans proteins in the 86 genomes was queried in one of two ways: with a single protein, to generate a ranking of the other C. elegans proteins with the most similar pattern of conservation values, or with a more global hierarchical clustering method. Correlation coefficients were calculated from the normalized phylogenetic profile (NPP) matrix, and genes were rank-ordered.
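The normalization and the single-protein query described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The summary does not give the exact normalization formula, so the sketch assumes that bit scores are divided by query length and that a per-genome z-score stands in for the phylogenetic-distance correction. The protein names, scores, and function names (normalize_profiles, pearson, rank_by_profile) are hypothetical.

```python
from math import sqrt


def normalize_profiles(bit_scores, query_lengths):
    """Build a toy normalized phylogenetic profile (NPP) matrix.

    bit_scores: dict mapping protein -> list of best-hit BLAST bit scores,
                one entry per genome.
    query_lengths: dict mapping protein -> query length in residues.
    ASSUMPTION: the per-genome z-score below is only a stand-in for the
    phylogenetic-distance correction, whose formula is not given here.
    """
    proteins = list(bit_scores)
    n_genomes = len(bit_scores[proteins[0]])
    # Step 1: correct for query length.
    per_length = {
        p: [s / query_lengths[p] for s in bit_scores[p]] for p in proteins
    }
    # Step 2: z-score each genome (column) across all proteins.
    npp = {p: [0.0] * n_genomes for p in proteins}
    for g in range(n_genomes):
        column = [per_length[p][g] for p in proteins]
        mean = sum(column) / len(column)
        sd = sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        for p in proteins:
            npp[p][g] = (per_length[p][g] - mean) / sd if sd > 0 else 0.0
    return npp


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(vx * vy)


def rank_by_profile(npp, query):
    """Rank all other proteins by profile correlation with the query."""
    scores = [
        (p, pearson(npp[query], profile))
        for p, profile in npp.items()
        if p != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 4 proteins scored against 4 genomes.
bit_scores = {
    "protA": [200, 180, 10, 5],
    "protB": [400, 350, 30, 20],
    "protC": [10, 20, 300, 280],
    "protD": [150, 150, 150, 150],
}
query_lengths = {"protA": 100, "protB": 200, "protC": 150, "protD": 100}

npp = normalize_profiles(bit_scores, query_lengths)
ranking = rank_by_profile(npp, "protA")
for protein, r in ranking:
    print(f"{protein}\t{r:.3f}")
```

In this toy data protB is conserved in the same two genomes as protA, so it ranks first even though its raw scores are roughly twice as large; the length normalization removes that difference. Because the profiles are continuous values rather than presence/absence calls, the correlation reflects the degree of conservation in each genome, which is the point of a non-binary profile.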