Over the past two decades, there has been a disorderly explosion of biological data, increasing exponentially in volume with time. To keep pace with the broad classes of new sequence, structural, and functional data arising from compilations of genomic and proteomic data in particular, many powerful approaches have been developed for unearthing meaningful themes and hypotheses from within the jumble. Yet there is still a critical need for improved techniques enabling fast and comprehensive analysis of large sequence data sets, especially to access the biologically useful context that can be extracted from this information. There is a particular demand for easy-to-use techniques to aid experimental biologists in finding useful starting points for analyzing diverse superfamilies of proteins.
Here we address one of these techniques, sequence similarity networks. Sequence similarity networks are a relatively new application of methods commonly used to summarize protein-protein interactions on a large scale. In these networks, the interrelationships between proteins are described as a collection of independent pairwise alignments between sequences, which makes them an attractive adjunct to multiple sequence alignments and phylogenetic trees. Moreover, they offer several important capabilities unavailable to these methods. First, they provide a fast and easily computed framework for observing relationships among very large sets of evolutionarily related proteins; more importantly, when visualized they also allow the perception of trends in orthogonal information (namely, function-related information) mapped onto the context of sequence similarity. Because they provide access to these relationships in an intuitively accessible manner and are easy to create and manipulate, these networks fill a need that is not currently well addressed by other tools.
By enabling the visualization of extremely large sets of related sequences, networks offer advantages over phylogenetic trees, particularly in showing all relationships that score above a user-defined similarity cut-off rather than only the small number of optimally scoring connections. Also, for the same amount of computation, a much larger set of sequences can be analyzed using a network than could be used to infer a tree. Furthermore, there are restrictions on the number of sequences that can be usefully considered in generating a multiple sequence alignment, in part due to the practical limitations of viewing alignments of hundreds of sequences. The corresponding benefit of visualizing a sequence similarity network, rather than analyzing it numerically, is that the displayed network can be overlaid with as many types of derived and orthogonal information as spring to mind. The network can then be interactively explored to see how these different features coalesce into trends (or don't) when viewed in the context of sequence similarity.
Additionally, using interactive software to visualize the networks and to link to other types of information, such as three-dimensional structures, allows the evaluation of individual edges and sets of edges, enabling an informed researcher to decide how much to trust the relationships implied by the network structure.
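The contrast drawn above, between keeping every relationship that scores above a user-defined cut-off and keeping only the optimally scoring connections, can be made concrete with a small sketch. The sequence names and scores below are invented for illustration (the scores are assumed to behave like -log10 of a pairwise alignment E-value, where larger means more similar), and the helper names `threshold_edges` and `best_hit_edges` are hypothetical, not part of any published tool.

```python
# Hypothetical pairwise similarity scores between four sequences.
# Values are invented for illustration; larger means more similar
# (assumed to be something like -log10 of an alignment E-value).
SCORES = {
    ("A", "B"): 80.0,
    ("A", "C"): 45.0,
    ("B", "C"): 50.0,
    ("C", "D"): 30.0,
    ("A", "D"): 5.0,
    ("B", "D"): 8.0,
}


def threshold_edges(scores, cutoff):
    """Network view: keep every pair whose score meets the cutoff."""
    return sorted(pair for pair, s in scores.items() if s >= cutoff)


def best_hit_edges(scores):
    """Best-hit view: keep only each sequence's top-scoring partner."""
    best = {}  # node -> (partner, score)
    for (a, b), s in scores.items():
        for node, other in ((a, b), (b, a)):
            if node not in best or s > best[node][1]:
                best[node] = (other, s)
    # Store each edge once, with endpoints in sorted order.
    return sorted({tuple(sorted((n, o))) for n, (o, _) in best.items()})


network = threshold_edges(SCORES, cutoff=25.0)
best_only = best_hit_edges(SCORES)

print("edges above cutoff:", network)
print("best-hit edges only:", best_only)
```

In this toy data set the thresholded network retains the A–C edge, which the best-hit view discards because A and C each have a higher-scoring partner. This is the kind of secondary relationship that a network keeps visible.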
Sequence similarity network topology changes in a predictable way with the stringency of the threshold.
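One simple way to see this threshold dependence is to count how many connected clusters remain as the cut-off becomes more stringent. The following is a minimal sketch under stated assumptions: the five sequences and their scores are invented, larger scores are assumed to mean greater similarity, and `count_clusters` is a hypothetical helper built on a standard union-find structure rather than the method used in this work.

```python
# Hypothetical similarity scores (larger = more similar); invented data.
NODES = ["A", "B", "C", "D", "E"]
SCORES = {
    ("A", "B"): 90.0,
    ("B", "C"): 60.0,
    ("A", "C"): 55.0,
    ("C", "D"): 20.0,
    ("D", "E"): 70.0,
}


def count_clusters(nodes, scores, cutoff):
    """Count connected components when only edges >= cutoff are drawn."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), s in scores.items():
        if s >= cutoff:
            parent[find(a)] = find(b)  # merge the two components

    return len({find(n) for n in nodes})


cutoffs = [10.0, 50.0, 65.0, 95.0]
sweep = [count_clusters(NODES, SCORES, c) for c in cutoffs]

for c, k in zip(cutoffs, sweep):
    print(f"cutoff {c:>5}: {k} cluster(s)")
```

With these invented scores the single connected network at the most permissive cut-off splits into two, then three, then five clusters as the cut-off rises. The number of clusters can only stay the same or increase as edges are removed, which is what makes the change in topology predictable.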
There has already been a great deal of interest in generating sequence similarity networks. Enright and colleagues recognized that visualizing a network of protein similarity information was a useful extension to basic protein sequence clustering methods (e.g. BLASTCLUST). They then used the MCL algorithm, which was designed for clustering very large networks, to identify natural sequence similarity “families” (ideally, rough functional classes) in a network of the protein universe. A number of other groups followed with innovative approaches to cluster all known proteins and visualize them as attractive, enigmatic maps. More recently, there have been efforts to use sequence similarity networks for more discrete sets of related proteins, and PFAM has released its classification of families into the more general clans, creating many three-level hierarchies bundling sequences into families, and families into clans. Work by Medini et al. began with a sequence similarity network of the protein universe, but also isolated one small and interesting region of the network. Using more careful analyses, they made inferences about the evolution of specific protein families from the isolated region. In our own work, we have begun to use sequence similarity networks to provide context for the analysis of individual proteins that are members of superfamilies, to show the relative outlier status of specific functional classes within a large superfamily, and to illustrate the correlation with lineage of conservation patterns for active site residues in a specific family of enzymes.
Before sequence similarity networks can be adopted for broad use, however, it is important to understand their strengths and weaknesses. In particular, these types of networks need to be validated in comparison to better-understood approaches. A primary motivation of this work is to address whether there is a compelling quantitative argument that sequence similarity networks can competently depict sequence similarity relationships, allowing them to be used as a framework to guide hypotheses about functional relationships. Although it has long been recognized that sequence similarity is an imperfect proxy for functional similarity, a fundamental dogma of structural biology (that sequence conservation implies structural conservation, which in turn implies functional conservation) has been extensively and effectively applied to infer functional properties on every scale. Consistent with this view, our results demonstrate that visualized sequence similarity networks perform well in representing sequence similarity information, and indeed the visualized relationships correlate well with known functional relationships. In contrast to previous studies, which presented formal network representations of sequence similarity and described algorithms for network generation, we show how well the displayed relationships reflect various measures of sequence and evolutionary distance, using relevant examples and quantitative assessments. Additionally, we introduce a concept: the most valuable feature of sequence similarity networks is not the optimal or most accurate display of sequence similarity, but rather the flexible visualization of many alternate protein attributes for all or nearly all sequences in a superfamily. To illustrate the results, we have used three well-studied superfamilies with nuanced functional annotations.
This work is especially applicable to the study of individual superfamilies, and is complementary to previous work in this area, which typically shows that networks can group all known proteins in agreement with broad definitions of functional similarity.
Here we demonstrate, using example data sets of G protein-coupled receptors (GPCRs), kinases, and the crotonase superfamily of enzymes, that sequence similarity networks recapitulate much of the information present in phylogenetic trees; that the relationships implied by networks agree with known sequence and structural relationships; that networks offer a number of practical benefits that improve on current techniques for relating sequences; and that visualization of similarity networks enables the perception of trends from the context of sequence similarity, initiating fruitful hypotheses. Finally, we report a new result relevant to the evolution of domain variation in the crotonase superfamily of enzymes that was obtained from analysis of sequence similarity networks.