The characterization of network community structure has profound implications in several scientific areas. Therefore, testing the algorithms developed to establish the optimal division of a network into communities is a fundamental problem in the field. Here we performed a highly detailed evaluation of community detection algorithms, which has two main novelties: 1) the use of complex closed benchmarks, which provide precise ways to assess whether the solutions generated by the algorithms are optimal; and 2) a novel type of analysis, based on hierarchically clustering the solutions suggested by multiple community detection algorithms, which makes it easy to visualize how different those solutions are. Surprise, a global parameter that evaluates the quality of a partition, confirms the power of these analyses. We show that none of the community detection algorithms tested provides consistently optimal results in all networks and that Surprise maximization, obtained by combining multiple algorithms, achieves quasi-optimal performance in these difficult benchmarks.
HECT ubiquitin ligases are key components of the ubiquitin-proteasome system, which is present in all eukaryotes. In this study, the patterns of emergence of HECT genes in plants are described. Phylogenetic and structural data indicate that viridiplantae have six main HECT subfamilies, which arose before the split that separated green algae from the rest of the plants. It is estimated that the common ancestor of all plants contained seven HECT genes. Contrary to what happened in animals, the number of HECT genes has remained quite constant in all lineages, both in chlorophyta and streptophyta, although evolutionarily recent duplications are found in some species. Several of the genes found in plants may have originated very early in eukaryotic evolution, given that they have clear similarities, both in sequence and structure, to animal genes. Finally, in Arabidopsis thaliana, we found significant correlations between the expression patterns of HECT genes and those of some ancient, broadly expressed genes that belong to a different ubiquitin ligase family, called RBR. These results are discussed in the context of the evolution of the gene families required for ubiquitination in plants.
How to determine the community structure of complex networks is an open question. It is critical to establish the best strategies for community detection in networks of unknown structure. Here, using standard synthetic benchmarks, we show that none of the algorithms hitherto developed for community structure characterization perform optimally. Significantly, evaluating the results according to their modularity, the most popular measure of the quality of a partition, systematically provides mistaken solutions. However, a novel quality function, called Surprise, can be used to elucidate which is the optimal division into communities. Consequently, we show that the best strategy to find the community structure of all the networks examined involves choosing among the solutions provided by multiple algorithms the one with the highest Surprise value. We conclude that Surprise maximization precisely reveals the community structure of complex networks.
Most proteins of the TRIM family (also known as RBCC family) are ubiquitin ligases that share a peculiar protein structure, characterized by an N-terminal RING finger domain closely followed by one or two B-boxes. Additional protein domains found at their C termini have been used to classify TRIM proteins into classes. TRIMs are involved in multiple cellular processes and many of them are essential components of the innate immune system of animal species. In humans, it has been shown that mutations in several TRIM-encoding genes lead to diverse genetic diseases and contribute to several types of cancer. TRIM proteins had hitherto been detected only in animals. In this work, by comprehensively analyzing the available diversity of TRIM and TRIM-like protein sequences and evaluating their evolutionary patterns, an improved classification of the TRIM family is obtained. Members of one of the TRIM subfamilies defined, called Subfamily A, turn out to be present not only in animals, but also in many other eukaryotes, such as fungi, apusozoans, alveolates, excavates and plants. The remaining subfamilies are animal-specific and several of them originated only recently. Subfamily A proteins are characterized by containing a MATH domain, suggesting a potential evolutionary connection between TRIM proteins and a different type of ubiquitin ligases, known as TRAFs, which contain quite similar MATH domains. These results indicate that the TRIM family emerged much earlier than previously thought and contribute to our understanding of its origin and diversification. The structural and evolutionary links with the TRAF family of ubiquitin ligases can be experimentally explored to determine whether functional connections also exist.
The analysis of complex networks permeates all sciences, from biology to sociology. A fundamental, unsolved problem is how to characterize the community structure of a network. Here, using both standard and novel benchmarks, we show that maximization of a simple global parameter, which we call Surprise (S), leads to a very efficient characterization of the community structure of complex synthetic networks. Particularly, S qualitatively outperforms the most commonly used criterion to define communities, Newman and Girvan's modularity (Q). Applying S maximization to real networks often provides natural, well-supported partitions, but also sometimes counterintuitive solutions that expose the limitations of our previous knowledge. These results indicate that it is possible to define an effective global criterion for community structure and open new routes for the understanding of complex networks.
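As an illustration only, the sketch below computes one common formulation of Surprise, namely the negative logarithm of a cumulative hypergeometric probability of observing at least the actual number of intra-community links. That formula is not given in the text above, so it should be read as an assumption, as should the function name `surprise`, the input format and the use of the natural logarithm.

```python
from collections import Counter
from math import comb, log


def surprise(n_nodes, edges, communities):
    """Surprise of a partition of an undirected, unweighted network.

    n_nodes     -- number of nodes in the network
    edges       -- list of (u, v) pairs, each undirected link listed once
    communities -- dict mapping every node to a community label

    Assumed formulation (not taken from the text above):
    S = -log sum_{j=p}^{min(M, n)} C(M, j) C(F - M, n - j) / C(F, n)
    """
    F = n_nodes * (n_nodes - 1) // 2              # possible links in the network
    sizes = Counter(communities.values())
    M = sum(s * (s - 1) // 2 for s in sizes.values())  # possible intra-community links
    n = len(edges)                                 # actual links
    p = sum(1 for u, v in edges if communities[u] == communities[v])  # actual intra links

    # Exact integer arithmetic avoids floating-point underflow in the tail sum.
    tail = sum(comb(M, j) * comb(F - M, n - j) for j in range(p, min(M, n) + 1))
    return log(comb(F, n)) - log(tail)


# Hypothetical toy network: two triangles joined by a single link.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
best = max([two_groups, one_group], key=lambda c: surprise(6, toy_edges, c))
```

In this toy case the two-triangle partition scores higher than the trivial single-community partition, so `best` is `two_groups`. Choosing the highest-scoring partition among several candidates, as done with `max` here, is a simplified stand-in for a maximization strategy and not a community detection algorithm in itself.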
The patterns of emergence and diversification of the families of ubiquitin ligases provide insights into the evolution of the eukaryotic ubiquitination system. U-box ubiquitin ligases (UULs) are proteins characterized by containing a peculiar protein domain known as the U box. In this study, the origin of the animal UUL genes is described.
Phylogenetic and structural data indicate that six of the seven main UUL-encoding genes found in humans (UBE4A, UBE4B, UIP5, PRP19, CHIP and CYC4) were already present in the ancestor of all current metazoans and the seventh (WDSUB1) is found in placozoans, cnidarians and bilaterians. The fact that only 4-5 genes orthologous to the human ones are present in the choanoflagellate Monosiga brevicollis suggests that several animal-specific cooptions of the U box to generate new genes occurred. Significantly, Monosiga contains five additional UUL genes that are not present in animals. One of them is also present in distantly related protozoans. Throughout animal evolution, losses of UUL-encoding genes are rare, except in nematodes, which lack three of them. These general patterns are highly congruent with those found for two other families (RBR, HECT) of ubiquitin ligases.
Finding that the patterns of emergence, diversification and loss of three unrelated families of ubiquitin ligases (RBR, HECT and U-box) are parallel indicates that there are underlying, lineage-specific evolutionary forces shaping the complexity of the animal ubiquitin system.
Extracting useful information from complex biological networks is a major goal in many fields, especially in genomics and proteomics. We have shown in several works that iterative hierarchical clustering, as implemented in the UVCluster program, is a powerful tool to analyze many of those networks. However, the amount of computation time required to perform UVCluster analyses imposed significant limitations on its use.
We describe the suite Jerarca, designed to efficiently convert networks of interacting units into dendrograms by means of iterative hierarchical clustering. Jerarca is divided into three main sections. First, weighted distances among units are computed using up to three different approaches: a more efficient version of UVCluster and two new, related algorithms called RCluster and SCluster. Second, Jerarca builds dendrograms based on those distances, using well-known phylogenetic algorithms, such as UPGMA or Neighbor-Joining. Finally, Jerarca provides optimal partitions of the trees using statistical criteria based on the distribution of intra- and intercluster connections. Outputs compatible with the phylogenetic software MEGA and the Cytoscape package are generated, allowing the results to be easily visualized.
The four main advantages of Jerarca with respect to UVCluster are: 1) Improved speed of a novel UVCluster algorithm; 2) Additional, alternative strategies to perform iterative hierarchical clustering; 3) Automatic evaluation of the hierarchical trees to obtain optimal partitions; and 4) Outputs compatible with popular software such as MEGA and Cytoscape.
RBR ubiquitin ligases are components of the ubiquitin-proteasome system present in all eukaryotes. They are characterized by having the RBR (RING – IBR – RING) supradomain. In this study, the patterns of emergence of RBR genes in plants are described.
Phylogenetic and structural data confirm that just four RBR subfamilies (Ariadne, ARA54, Plant I/Helicase and Plant II) exist in viridiplantae. All of them originated before the split that separated green algae from the rest of plants. Multiple genes of two of these subfamilies (Ariadne and Plant II) appeared in early plant evolution. It is deduced that the common ancestor of all plants contained at least five RBR genes and the available data suggest that this number has been increasing slowly along streptophyta evolution, although losses, especially of Helicase RBR genes, have also occurred in several lineages. Some higher plants (e. g. Arabidopsis thaliana, Oryza sativa) contain a very large number of RBR genes and many of them were recently generated by tandem duplications. Microarray data indicate that most of these new genes have low-level and sometimes specific expression patterns. On the contrary, and as occurs in animals, a small set of older genes are broadly expressed at higher levels.
The available data suggest that the dynamics of appearance and conservation of RBR genes are quite different in plants from what has been described in animals. In animals, an abrupt emergence of many structurally diverse RBR subfamilies in early animal history, followed by losses of multiple genes in particular lineages, occurred. These patterns are not observed in plants. It is also shown that while both plants and animals contain a small, similar set of essential RBR genes, the rest evolve differently. The functional implications of these results are discussed.
In Drosophila melanogaster, dosage compensation is mediated by the action of the dosage compensation complex (DCC). How the DCC recognizes the fly X chromosome is still poorly understood. Characteristic sequence signatures at all DCC binding sites have not hitherto been found.
In this study, we compare the known binding sites of the DCC with oligonucleotide profiles that measure the specificity of the sequences of the D. melanogaster X chromosome. We show that the X chromosome regions bound by the DCC are enriched for a particular type of short, repetitive sequences. Their distribution suggests that these sequences contribute to chromosome recognition, the generation of DCC binding sites and/or the local spreading of the complex. Comparative data indicate that the same sequences may be involved in dosage compensation in other Drosophila species.
These results offer an explanation for the wild-type binding of the DCC along the Drosophila X chromosome, help delineate the forces leading to the establishment of dosage compensation and suggest new experimental approaches to understand the precise biochemical features of the dosage compensation system.
HECT ubiquitin ligases (HECT E3s) are key components of the eukaryotic ubiquitin-proteasome system and are involved in the genesis of several human diseases. In this study, I analyze the patterns of diversification of HECT E3s since animals emerged in order to provide the right framework to understand the functional data available for proteins of this family.
I show that the current classification of HECT E3s into three groups (NEDD4-like E3s, HERCs and single-HECT E3s) is fundamentally incorrect. First, the existence of a "Single-HECT E3s" group is not supported by phylogenetic analyses. Second, the HERC proteins must be divided into two subfamilies (Large HERCs, Small HERCs) that are evolutionarily very distant, their structural similarity being due to convergence and not to a common origin. Sequence and structural analyses show that animal HECT E3s can be naturally classified into 16 subfamilies. Almost all of them appeared either before animals originated or in early animal evolution. More recently, multiple gene losses have occurred independently in some lineages (nematodes, insects, urochordates), the same groups that have also lost genes of another type of E3s (RBR family). Interestingly, the emergence of some animal HECT E3s precedes the origin of key cellular systems that they regulate (TGF-β and EGF signal transduction pathways; p53 family of transcription factors) and it can be deduced that distantly related HECT proteins have been independently co-opted to perform similar roles. This may contribute to explain why distantly related HECT E3s are involved in the genesis of multiple types of cancer.
The complex evolutionary history of HECT ubiquitin ligases in animals has been deciphered. The most appropriate model animals to study them and new theoretical and experimental lines of research are suggested by these results.
Cullins are proteins involved in ubiquitination through their participation in multisubunit ubiquitin ligase complexes. In this study, I use comparative genomic data to establish the pattern of emergence and diversification of cullins in eukaryotes.
The available data indicate that there were three cullin genes before the unikont/bikont split, which I have called Culα, Culβ and Culγ. Fungal species have quite strictly conserved these three ancestral genes, with only occasional lineage-specific duplications. On the contrary, several additional genes appeared in the animal or plant lineages. For example, the human genes Cul1, Cul2, Cul5, Cul7 and Parc all derive from the ancestral Culα gene. These results, together with the available functional data, suggest that three different types of ubiquitin ligase cullin-containing complexes were already present in early eukaryotic evolution: 1) SCF-like complexes with Culα proteins; 2) Culβ/BTB complexes; and, 3) Complexes containing Culγ and DDB1-like proteins. Complexes containing elongins have arisen more recently and perhaps twice independently in animals and fungi.
Most of the known types of cullin-containing ubiquitin ligase complexes are ancient. The available data suggest that, since the origin of eukaryotes, complex diversity has been mostly generated by combining closely related subunits, while radical innovations, giving rise to novel types of complexes, have been scarce. However, several protist groups not examined so far contain highly divergent cullins, indicating that additional types of complexes may exist.
The characterization of the global functional structure of a cell is a major goal in bioinformatics and systems biology. Gene Ontology (GO) and the protein-protein interaction network offer alternative views of that structure.
This study presents a comparison of the global structures of the Gene Ontology and the interactome of Saccharomyces cerevisiae. Sensitive, unsupervised methods of clustering applied to a large fraction of the proteome yielded a GO-interactome correlation value of +0.47 for a general dataset that contains both high- and low-confidence interactions and +0.58 for a smaller, high-confidence dataset.
The structures of the yeast cell deduced from GO and interactome are substantially congruent. However, some significant differences were also detected, which may contribute to a better understanding of cell function and also to a refinement of the current ontologies.
Transposable elements are selfish genetic sequences which only occasionally provide useful functions to their host species. In addition, models of mobile element evolution assume a second type of selfishness: elements of different families do not cooperate, but they independently fight for their survival in the host genome.
We show that recombination events among distantly related Athila retrotransposons have led to the generation of new Athila lineages. Their pattern of diversification suggests that Athila elements survive in Arabidopsis by a combination of selfish replication and of amplification of highly diverged copies with coding potential. Many Athila elements are non-autonomous but still conserve intact open reading frames which are under the effect of negative, purifying natural selection.
The evolution of these mobile elements is far more complex than hitherto assumed. Strict selfish replication does not explain all the patterns observed.
The comparison of DNA sequences is a traditional problem in genomics and bioinformatics. Many new opportunities emerge due to the improvement of personal computers, allowing the implementation of novel strategies of analysis.
We describe a new program, called UVWORD, which determines the number of times that each DNA word present in a sequence (target) is found in a second sequence (source), a procedure that we have called oligonucleotide profiling. On a standard computer, the user may search for words of a size ranging from k = 1 to k = 14 nucleotides. Average counts for groups of contiguous words may also be established. The rate of analysis on standard computers ranges from 3.4 million (k = 14) to 16 million words per second (1 ≤ k ≤ 8). This makes the fast screening of even the longest known DNA molecules feasible.
We show that the combination of the ability to analyze relatively long words, which occur very rarely by chance, and the high speed of the program makes it possible to perform novel types of screenings, complementary to those provided by standard programs such as BLAST. This method can be used to determine oligonucleotide content, to characterize the distribution of repetitive sequences in chromosomes, to determine the evolutionary conservation of sequences in different species, to establish regions of similar DNA among chromosomes or genomes, etc.
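The oligonucleotide profiling procedure described above can be sketched in a few lines. This is a minimal illustration and not the UVWORD implementation: the function names `oligo_profile` and `window_means` are hypothetical, only the given strand is scanned, and non-overlapping windows are assumed for the averages, since the text does not specify either choice.

```python
from collections import Counter


def oligo_profile(target, source, k):
    """For each k-letter word of `target`, count its occurrences in `source`.

    Returns one count per word position in `target`, in order.
    Assumption: only the given strand of `source` is scanned.
    """
    # Count every overlapping k-letter word of the source once.
    source_counts = Counter(source[i:i + k] for i in range(len(source) - k + 1))
    # Look up each overlapping word of the target in that table.
    return [source_counts[target[i:i + k]] for i in range(len(target) - k + 1)]


def window_means(profile, size):
    """Average counts over consecutive, non-overlapping groups of `size` words."""
    means = []
    for start in range(0, len(profile), size):
        group = profile[start:start + size]
        means.append(sum(group) / len(group))
    return means


profile = oligo_profile("ACGT", "ACGACG", 3)  # words ACG and CGT of the target
```

A dictionary of word counts built once from the source keeps each lookup cheap, which is the property that matters when the target is as long as a chromosome. Memory use grows with the number of distinct words in the source, so a production tool would likely use a more compact encoding.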
Classification procedures are widely used in phylogenetic inference, the analysis of expression profiles, the study of biological networks, etc. Many algorithms have been proposed to establish the similarity between two different classifications of the same elements. However, methods to determine significant coincidences between hierarchical and non-hierarchical partitions are still poorly developed, in spite of the fact that the search for such coincidences is implicit in many analyses of massive data.
We describe a novel strategy to compare a hierarchical and a dichotomic non-hierarchical classification of elements, in order to find clusters in a hierarchical tree in which elements of a given "flat" partition are overrepresented. The key improvement of our strategy with respect to previous methods is the use of permutation analyses of ranked clusters to determine whether regions of the dendrograms present a significant enrichment. We show that this method is more sensitive than previously developed strategies and how it can be applied to several real cases, including microarray and interactome data. In particular, we use it to compare a hierarchical representation of the yeast mitochondrial interactome and a catalogue of known mitochondrial protein complexes, demonstrating a high level of congruence between those two classifications. We also discuss extensions of this method to other cases which are conceptually related.
Our method is highly sensitive and outperforms previously described strategies. A PERL script that implements it is available at .
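The general idea of testing a tree cluster for enrichment by permutation can be illustrated as follows. This is a deliberately simplified sketch and not the published method: it tests a single cluster and does not reproduce the ranking of clusters described above. The function name `enrichment_p_value`, its parameters and the fixed random seed are all assumptions made for the example.

```python
import random


def enrichment_p_value(cluster, flagged, universe, n_perm=2000, seed=0):
    """Permutation p-value for overrepresentation of `flagged` in `cluster`.

    cluster  -- elements under one node of the hierarchical tree
    flagged  -- elements belonging to the class of interest in the flat partition
    universe -- all classified elements
    """
    rng = random.Random(seed)          # fixed seed so that results are reproducible
    cluster = set(cluster)
    flagged = set(flagged)
    universe = list(universe)
    observed = len(cluster & flagged)

    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Reassign the flat-partition labels at random, keeping their number fixed.
        shuffled = rng.sample(universe, len(flagged))
        if sum(1 for element in shuffled if element in cluster) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical data: 40 elements, a 5-element cluster made up entirely of flagged elements.
p_enriched = enrichment_p_value(range(5), range(5), range(40))
```

When many clusters of the same tree are tested this way, the resulting p-values are not independent and would need a multiple-testing correction, which this sketch omits.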
Sequencing of the genomes of several Drosophila allows for the first precise analyses of how global sequence patterns change among multiple, closely related animal species. A basic question is whether there are characteristic features that differentiate chromosomes within a species or between different species.
We explored the euchromatin of the chromosomes of seven Drosophila species to establish their global patterns of DNA sequence diversity. Between species, differences in the types and amounts of simple sequence repeats were found. Within each species, the autosomes have almost identical oligonucleotide profiles. However, X chromosomes and autosomes have, in all species, a qualitatively different composition. The X chromosomes are less complex than the autosomes, containing both a higher amount of simple DNA sequences and, in several cases, chromosome-specific repetitive sequences. Moreover, we show that the right arm of the X chromosome of Drosophila pseudoobscura, which evolved from an autosome 10-18 million years ago, has a composition which is identical to that of the original, left arm of the X chromosome.
The consistent differences among species, differences among X chromosomes and autosomes and the convergent evolution of X and neo-X chromosomes demonstrate that strong forces are acting on drosophilid genomes to generate peculiar chromosomal landscapes. We discuss the relationships of the patterns observed with differential recombination and mutation rates and with the process of dosage compensation.
The imprint of natural selection on gene sequences is often difficult to detect. A plethora of methods have been devised to detect genetic changes due to selective processes. However, many of those methods depend heavily on underlying assumptions regarding the mode of change of DNA sequences and often require sophisticated mathematical treatments that make them computationally slow. The development of fast and effective methods to detect modifications in the selective constraints of genes is therefore of great interest.
We describe UVPAR, a program designed to quickly test for changes in the functional constraints of duplicate genes. Starting with alignments of the proteins encoded by pairs of duplicate genes in two different species, UVPAR detects the regions in which modifications of the functional constraints in the paralogs have occurred since the two species diverged. Sequences can be analyzed with UVPAR in just a few minutes on a standard PC. To demonstrate the power of the program, we first show how the results obtained with UVPAR compare to those based on other approaches, using data for vertebrate Hox genes. We then describe a comprehensive study of the RBR family of ubiquitin ligases in which we have performed 529 analyses involving 14 duplicate genes in seven model species. A significant increase in the number of functional shifts was observed for the species Danio rerio and for the gene Ariadne-2.
These results show that UVPAR can be used to generate sensitive analyses to detect changes in the selection constraints acting on paralogs. The high speed of the program allows its application to genome-scale analyses.