Protein Family Expansions and Biological Complexity

Christine Vogel# [1],[2], Cyrus Chothia [2]
[1] MRC Laboratory of Molecular Biology, Cambridge, UK; [2] Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin TX, USA; # Correspondence to: cvogel at mail AND utexas AND edu

Introduction


Scope: We analyse domain superfamily expansions across a variety of eukaryotic genomes.

Summary:
During the course of evolution, new proteins are produced very largely as the result of gene duplication, divergence and, in many cases, combination. This means that proteins, or protein domains, belong to families or, in cases where their relationships can only be recognised on the basis of structure, superfamilies whose members are descended from a common ancestor. The size of superfamilies can vary greatly. Also, during the course of evolution, organisms of increasing complexity have arisen. In this paper we determine the identity of those superfamilies whose relative sizes in different organisms are highly correlated with the complexity of the organisms.
As a measure of the complexity of 38 uni- and multicellular eukaryotes we took the number of different cell types of which they are composed. Of 1219 different superfamilies, there are 194 whose sizes in the 38 organisms have a high correlation with the number of cell types in the organisms. We give outline descriptions of these superfamilies. Half are involved in extra-cellular processes or regulation and smaller proportions in other types of activity. Half of all superfamilies have no significant correlation with complexity. We also determined whether the expansions of large superfamilies correlate with each other. We found three large clusters of correlated expansions: one involves vertebrate and plant superfamilies, one those in vertebrates and one those in plants.
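
The correlation between superfamily size and organism complexity described above can be illustrated with a minimal sketch. It is not the code used for the publication: the function names, the toy abundance values, the cell type counts and the cutoff of 0.8 are all hypothetical and chosen only to show the idea of correlating one abundance profile with the number of cell types per genome.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        # Constant profile: treated as uncorrelated (an assumption of this sketch).
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_superfamilies(abundance, cell_types, r_cutoff):
    """Names of superfamilies whose abundance profile has r >= r_cutoff
    with the number of cell types across the same genomes."""
    return sorted(
        name
        for name, profile in abundance.items()
        if pearson_r(profile, cell_types) >= r_cutoff
    )


# Hypothetical toy data for five genomes (not taken from the paper).
cell_types = [1, 2, 10, 50, 100]
abundance = {
    "sf_expanding": [3, 6, 30, 150, 300],  # grows with complexity
    "sf_flat": [10, 11, 9, 10, 11],        # roughly constant size
}

# The cutoff 0.8 is illustrative only.
selected = correlated_superfamilies(abundance, cell_types, r_cutoff=0.8)
print(selected)
```

In this toy example only the superfamily that grows with the number of cell types passes the cutoff; the roughly constant one does not.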


Analysis of 38 eukaryotic genomes

Data files

The following raw files display clusters of similar abundance profiles. All abundance profiles have been hierarchically clustered and split into groups according to different Pearson correlation coefficient (R-value) cutoffs. The files show either all 1219 superfamilies, with abundance >= 1 in at least one of the genomes, or the 299 superfamilies with abundance >= 25 in at least one of the genomes.
Each file lists the parameters, the genomes and their average number of cell types, the distribution of all superfamilies across the seven major function categories, and the clusters of similar abundance profiles.
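
The filtering and grouping steps described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the files: the linkage method is not given here, so the sketch assumes single linkage, for which cutting the tree at an R-value cutoff yields the same groups as joining every pair of profiles with r >= cutoff. All names and toy values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treated as uncorrelated (assumption)
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(var_x * var_y)


def filter_by_abundance(profiles, min_abundance):
    """Keep superfamilies with abundance >= min_abundance in at least one genome."""
    return {name: p for name, p in profiles.items() if max(p) >= min_abundance}


def cluster_profiles(profiles, r_cutoff):
    """Group profiles by single linkage (assumed): two profiles share a cluster
    if they are connected by a chain of pairs with r >= r_cutoff."""
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson_r(profiles[a], profiles[b]) >= r_cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical abundance profiles across five genomes (not from the data files).
profiles = {
    "sf_a": [1, 2, 10, 50, 100],
    "sf_b": [2, 4, 20, 100, 200],  # same shape as sf_a
    "sf_c": [100, 50, 10, 2, 1],   # opposite trend
    "sf_d": [1, 1, 2, 1, 1],       # never reaches the abundance threshold
}

large = filter_by_abundance(profiles, min_abundance=25)
clusters = cluster_profiles(large, r_cutoff=0.9)  # cutoff chosen for illustration
print(clusters)
```

Raising or lowering r_cutoff splits or merges the groups, which mirrors how the files present the same profiles at different R-value cutoffs.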

Files from the publication:


Links


C. Vogel cvogel at mail utexas edu -- May 2006