SkyLine, a high-throughput homology modeling pipeline tool, detects and models true sequence homologs to a given protein structure. Structures and models are stored in SkyBase with links to computational function annotation, as calculated by MarkUs. The SkyLine/SkyBase/MarkUs technology represents a novel structure-based approach that is more objective and versatile than other protein classification resources. This structure-centric strategy provides a multidimensional organization and coverage of protein space at the levels of family, function, and genome. The concept of “modelability”, the ability to model sequences on related structures, provides a reliable criterion for membership in a protein family (“leverage”) and underlies the unique success of this approach. The overall procedure is illustrated by its application to START domains, which comprise a Biomedical Theme for the Northeast Structural Genomics Consortium (NESG) as part of the Protein Structure Initiative (PSI). START domains are typically involved in the non-vesicular transport of lipids. While 19 experimentally determined structures are available, the family, whose evolutionary hierarchy is not well determined, is highly sequence diverse, and the ligand-binding potential of many family members is unknown. The SkyLine/SkyBase/MarkUs approach provides significant insights and predicts: 1) many more family members (~4,000) than any other resource; 2) the function for a large number of unannotated proteins; 3) instances of START domains in genomes from which they were thought to be absent; and 4) the existence of two types of novel proteins, those containing dual START domains and those containing N-terminal START domains.
As the number of post-genomic sequences and structures continues to grow, both high-throughput experimental and computational methodologies are needed to organize these large bodies of data, principally by means of computational tools. A major goal of the Protein Structure Initiative (PSI; http://www.nigms.nih.gov/Initiatives/PSI/) has been to develop new technologies that facilitate the solution of a large number of sequence-unique protein structures and thereby provide structural coverage of protein sequence space from different perspectives, including the biological [1–5]. These structures can then be used as templates in homology modeling, with the goals of providing more structural information about biological function and broader structural coverage of protein sequence space. In this paper, we introduce a structure-based approach for the coverage (or leverage) and organization of protein sequence space and illustrate how the computational methodology transforms the current view of a particular protein family, the START domains, which is the subject of a Biomedical Theme at the Northeast Structural Genomics Consortium (NESG, http://www.nesg.org).
We developed a computational pipeline (SkyLine) for the automated, high-throughput detection of sequence homologs and the comparative modeling of their sequences. As schematically depicted in Figure 1, the SkyLine process begins with a protein structure: the sequence of the structure is used as a seed for PSI-BLAST profile searches against the NCBI non-redundant protein sequence database (NRdb), and the structure itself serves as a template to construct as many “reliable” homology models as possible for these sequence homologs. In contrast to the usual homology modeling problem of finding a template with which to model a given sequence of unknown structure, SkyLine addresses the inverse problem of determining how many sequences exist for which a particular structure may serve as a template [4,9,10]. SkyLine modeling uses structure evaluation to test whether a sequence is structurally consistent with its template. A model is considered “reliable” if 1) the pG score, which is a log-transformed, length-normalized integration over the residue-by-residue Prosa II profile, is ≥ 0.7, and 2) the percent coverage of the detected sequence relative to the template sequence is ≥ 75%, ensuring that the protein coded by the detected sequence contains a biologically significant number of secondary structural elements. While we recognize that newer measures of model reliability are available, our focus has not been on model details but rather on whether a reasonable model can be built. To this end, the pG score has served as a fast and effective means of evaluating modelability.
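The two reliability criteria can be sketched as a simple filter. This is only a minimal illustration, not the SkyLine implementation: the `Model` record, its field names, and the example values are hypothetical, and coverage is assumed here to be the number of template residues covered by the detected sequence divided by the template length. Only the two cutoffs (pG ≥ 0.7 and coverage ≥ 75%) come from the text.

```python
from dataclasses import dataclass

PG_CUTOFF = 0.7          # minimum pG score (from the text)
COVERAGE_CUTOFF = 75.0   # minimum percent coverage of the template (from the text)


@dataclass
class Model:
    """Hypothetical record for one homology model; field names are assumptions."""
    sequence_id: str
    template_id: str
    pg: float
    aligned_residues: int   # template residues covered by the detected sequence
    template_length: int


def percent_coverage(model: Model) -> float:
    # Assumed definition: covered template residues as a percentage of template length.
    return 100.0 * model.aligned_residues / model.template_length


def is_reliable(model: Model) -> bool:
    # Both criteria must hold for a model to count as "reliable".
    return model.pg >= PG_CUTOFF and percent_coverage(model) >= COVERAGE_CUTOFF


# Made-up example models built on one hypothetical 210-residue template.
models = [
    Model("seqA", "tmpl1", 0.95, 200, 210),  # passes both criteria
    Model("seqB", "tmpl1", 0.65, 205, 210),  # fails the pG cutoff
    Model("seqC", "tmpl1", 0.90, 120, 210),  # fails the coverage cutoff
]
reliable_ids = [m.sequence_id for m in models if is_reliable(m)]
print(reliable_ids)
```

Keeping the two tests in one predicate makes it easy to see that a high pG score alone is not sufficient: a well-scoring model that covers only a fragment of the template is still rejected.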
SkyLine runs have been performed for all structures solved by NESG to date and for a large fraction of the PDB defined by a 60% redundancy cutoff, whose 14,000 structures likely represent virtually all protein functional types. SkyLine provides a means for defining protein families using structure-based criteria. Calculating the biophysical properties of structures and models provides a more detailed, complementary approach to analyzing families. MarkUs exploits global and local structural relationships among proteins to search for the conservation of biologically meaningful structural and functional motifs (which may be another way to view protein space) and provides access to many tools for computational function annotation (http://luna.bioc.columbia.edu/honiglab/mark-us). Structures, models, and annotations derived from the MarkUs server are stored in a publicly available database, SkyBase (http://126.96.36.199/nesg3/nesg.php).
As illustrated in Figure 2, SkyBase may be queried in a multi-factorial manner by entering a sequence into the online search window and/or by selecting from a wide range of search criteria, such as model reliability (pG), model template, sequence identity to template, species, etc. The user of the SkyBase front-end can drill down to the level of an individual model and, as listed in Figure 2, obtain a large amount of information on the model as well as access to external resources such as MarkUs, GenBank and the PDB. A Jmol window is also provided on the output page, in which various user-selected representations of the model may be manipulated and rotated.
Our check for modelability/reliability (pG ≥ 0.7 and a coverage of sequence by template ≥ 75%) is independent of e-value; in most cases, there are anywhere from several to hundreds or even thousands of models for a given sequence that fit these criteria. SkyLine has a definition for discerning a “best” model per sequence, but we have found that it is most helpful to let the user have access to all of the models as well as to data such as the Prosa II structure evaluation profiles and the PSI-BLAST profiles. Different users will have different priorities, and there may be a particular region that is modeled well in one calculation but not in the others. The box marked “OUTPUT” at the bottom of Figure 2 shows that any of the models retrieved according to the search criteria may be subsequently examined for more quantitative information.
The modeling step in SkyLine is key to the success of this approach, as it serves as a filter for true positives among the many false positive hits often included in PSI-BLAST results, especially if the sequence/PSI-BLAST profile inclusion e-value is relaxed beyond the typical cutoff of 0.001. Hence, this notion of a “reliability test” works just as well for models where the inclusion e-value is much larger, i.e. 0.001 < e < 100, and thus allows for the detection of true remote homologs. We refer to this concept as “modelability”, and its application has allowed for the discovery of previously undetected protein family members and protein families, as illustrated in the next section.
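The idea of relaxing the e-value while relying on the modeling step as the filter can be sketched as follows. This is an illustrative assumption about how such a two-stage decision might be coded, not the SkyLine source: the function name, the labels it returns, and the example hits are hypothetical, while the numerical thresholds (0.001, 100, pG ≥ 0.7, coverage ≥ 75%) are those quoted in the text.

```python
STANDARD_EVALUE = 0.001   # typical PSI-BLAST inclusion cutoff (from the text)
RELAXED_EVALUE = 100.0    # upper end of the relaxed range (from the text)
PG_CUTOFF = 0.7           # reliability criterion 1 (from the text)
COVERAGE_CUTOFF = 75.0    # reliability criterion 2, in percent (from the text)


def classify_hit(evalue: float, pg: float, coverage: float) -> str:
    """Label one PSI-BLAST hit after modeling (labels are hypothetical)."""
    if evalue >= RELAXED_EVALUE:
        return "rejected"            # outside even the relaxed e-value range
    if pg < PG_CUTOFF or coverage < COVERAGE_CUTOFF:
        return "rejected"            # model fails the reliability test
    if evalue <= STANDARD_EVALUE:
        return "homolog"             # reliable model, conventional e-value
    return "remote homolog"          # reliable model with 0.001 < e < 100


# Made-up hits: (e-value, pG, percent coverage)
hits = [(1e-20, 0.90, 95.0), (5.0, 0.85, 80.0), (5.0, 0.40, 80.0), (500.0, 0.90, 90.0)]
labels = [classify_hit(*hit) for hit in hits]
print(labels)
```

In this sketch the e-value only decides which label a reliable model receives; acceptance itself rests on the modeling criteria, which mirrors the role the text assigns to modelability.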
Members of the steroidogenic acute regulatory (StAR)-related lipid transfer (START) domain family function in the binding and non-vesicular transport of lipids and other ligands [15–18]. Since many START domains appear in multi-domain proteins, they may serve as lipid sensors that signal biological responses. The START domain module is typically 210 residues long, and, currently, 19 experimentally determined structures are available in the PDB [19–21]. The three main classes of START domains, classical (CSD, from mammals), birch antigen (BA, from plants [22–23]) and bacterial (BAC), share a common topology (Figure 3A): a C-terminal alpha-helix packed against a core beta-sheet provides support for a hydrophobic tunnel, which has been shown to accommodate lipid molecules in classical and BA START domains [25,26].
START domains are found in 15 distinct human proteins and are designated StARD1 through StARD15. Lipid specificity is known for only about half of these, and genetic disorders involved in cancers, autoimmune diseases and obesity have been found in all 15. Structures are known for four human START domains (StARD2/PCTP, StARD3/MLN64, StARD5, and StARD13) and one mouse START domain (StARD4). The phosphatidylcholine transfer protein (PCTP or StARD2; PDB id 1LN1) is depicted in Figure 3; its structure was solved in the presence of a phosphatidylcholine analog, which is accommodated in the domain's long hydrophobic groove, as illustrated in Figure 3D. The birch antigen START domains (not shown) contain similarly deep ligand-binding pockets. NESG has contributed all of the bacterial START domain structures solved to date, and these structures are especially valuable because they provide the first structural pictures of this class of START domains, about which very little is known. As illustrated in Figure 4A, the presence of the C-terminal packing helix suggests that bacterial START domains also accommodate ligands; however, the absence of the three most N-terminal secondary structural elements (α1, β1 and β2) of human StARD2 (Figure 3A) suggests that the binding site is shallower.
Intriguingly, the evolutionary hierarchy of START domains is not well established. Thus, the START domain functional super-family is an excellent target for our computational structure-based modeling and annotation approach.
SkyLine was used to search for all instances of START domains in the NCBI non-redundant sequence database. Each of the 19 available START domain structures, whose PDB identifiers and classifications are listed in the first two columns of Table 1, was used both as a sequence seed for PSI-BLAST searches against the NRdb and as the structural template for modeling the detected homologs.
The results from the SkyLine runs are summarized in Table 1. The number of sequence-unique reliable models per structure is given in Column 3. The non-redundant total across all structures, i.e. the true leverage or “universe” of predicted START domains, is 3,886. Note that this number is much smaller than the sum across the 19 structures (13,720), because more than one START domain structure may detect and serve as the template for the reliable modeling of a given sequence; these multiple models per sequence are retrievable through SkyBase (see “OUTPUT” in Figure 2). This number of START domains far exceeds the number catalogued in any other available resource, which immediately raises the question of whether many of the SkyLine results may be false positives. This issue is addressed below, in the context of a specific genome (Arabidopsis thaliana) for which a detailed analysis is conducted. Of course, the most direct and only true way to test the efficacy of the prediction of novel sequences is to subject these discoveries to experimental analyses. Thus far, all of the many novel sequences tested have provided hypotheses that have been experimentally confirmed. One example involves the yeast genome, from which START domains were previously thought to be absent. SkyLine predicts the presence of four yeast START domains, one of which was recently characterized in the lab of Catherine Clarke at UCLA.
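The difference between the per-structure sum and the non-redundant total is simply the difference between adding set sizes and taking the size of a set union. The toy example below illustrates this with made-up template and sequence identifiers (all names are hypothetical and the counts are not those of Table 1).

```python
# Hypothetical mapping: template structure -> sequences it models reliably.
models_by_template = {
    "templateA": {"seq1", "seq2", "seq3"},
    "templateB": {"seq2", "seq3", "seq4"},
    "templateC": {"seq3", "seq5"},
}

# Summing per-template counts counts a sequence once per template that models it.
summed_count = sum(len(seqs) for seqs in models_by_template.values())

# The non-redundant total ("leverage") counts each sequence exactly once.
leverage = len(set().union(*models_by_template.values()))

# Invert the mapping to list every template that models a given sequence,
# analogous to retrieving the multiple models per sequence from a database.
templates_by_sequence = {}
for template, seqs in models_by_template.items():
    for seq in seqs:
        templates_by_sequence.setdefault(seq, []).append(template)

print(summed_count, leverage)
print(sorted(templates_by_sequence["seq3"]))
```

Here `seq3` is modeled by all three templates, so it contributes three to the summed count but only one to the leverage, which is the same effect that separates 13,720 from 3,886 in the text.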
In order to compare our results for the number of START domains with other resources, we break down the analysis into three groupings as follows: 1) the number of predicted START domains based on SkyLine runs of CSD structures is 911; 2) the number of predicted START domains based on SkyLine runs of BA structures is 1,933; and 3) the number of predicted START domains based on SkyLine runs of BAC structures is 2,009. The sum of models from groups 1, 2 and 3 is 4,853, which exceeds the non-redundant total and thus provides evidence that START structures from the different classes can detect and reliably model sequences from other classes. As of Oct 2009, the number of START domains for Pfam families representing classic START domains is 609, and the corresponding number in the SMART database in genomic mode is 314. Hence, for classical START domains, SkyLine predicts ~300 and ~600 more START domains than Pfam and SMART, respectively. The SMART database in “normal mode” reports a total of 817 START domains in 817 proteins, i.e. each START domain appears exactly once in its parent protein. A similar statement can be made based on Pfam results. SkyLine predicts a total of 33 proteins containing two instances of START domains; in each case the second, C-terminal instance is a novel discovery. Furthermore, no resources report any Archaeal START domains, and, in fact, it is stated in recent review articles that there are no START domains in Archaea. As with the discovery of START domains in yeast, SkyLine predicts a total of 15 sequence-unique, reliably modeled Archaeal sequences. Finally, column 8 of Table 1 shows that SkyLine produces a non-redundant total of 61 START domains in protein sequences that are annotated as “hypothetical”, “unnamed”, or “unknown” in sequence databases.
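The arithmetic behind these comparisons can be checked directly. The short script below uses only numbers quoted in the text (the per-class counts, the non-redundant total of 3,886 reported for Table 1, and the Pfam and SMART counts); the variable names are our own.

```python
# Per-class counts of predicted START domains (from the text).
per_class = {"CSD": 911, "BA": 1933, "BAC": 2009}

non_redundant_total = 3886   # overall non-redundant total (from the text)
pfam_classic = 609           # Pfam, classic START domains, as of Oct 2009
smart_genomic = 314          # SMART, genomic mode

class_sum = sum(per_class.values())

# A positive excess means some sequences are modeled by templates
# from more than one class and are therefore counted more than once.
excess = class_sum - non_redundant_total

# Additional classical START domains relative to each resource.
extra_vs_pfam = per_class["CSD"] - pfam_classic
extra_vs_smart = per_class["CSD"] - smart_genomic

print(class_sum, excess, extra_vs_pfam, extra_vs_smart)
```

The class sum reproduces the 4,853 quoted above, and the differences of 302 and 597 correspond to the rounded figures of ~300 and ~600.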
Table 1 makes additional interesting points related to the modeling procedure. The comparison of columns 3 and 4 shows that utilizing an e-value ≤ 0.001 results in ~2,500 more START domains than imposing a sequence identity ≥ 30% between homolog and template sequences. Finally, columns 6 and 7 show that the approach predicts a significant number (~100) of remote homologs.
There is little overlap in the SkyLine results between classical and bacterial START domains. However, human COQ10A (AAH47444), a polyketide cyclase, is, for example, detected and reliably modeled by 1T17, a bacterial START domain. Note that this human START domain likely does not bind a lipid as its ligand. Bacterial START domains are thought to take polyketides as ligands; hence, these two diverse START domains may share the bacterial ligand, suggesting an evolutionary linkage.
START domain-containing proteins are more broadly represented in plants than in animals. A recent study identified 35 putative START domains in the Arabidopsis thaliana genome with a sequence-based approach (BLASTP), and, thus, these data provide an opportunity to test the efficacy of our structure-based approach in comparison. SkyLine detects all 35 sequences at the PSI-BLAST stage and builds reliable models for 32 of them. Reliable models are built for the three remaining sequences upon refinement of the modeling alignments. Therefore, at least for sequences in this genome, the SkyLine modeling criteria are more conservative than sequence criteria, and while there are no false positives, it may be necessary to examine the modeling results in greater detail.
SkyLine expands the set of 35 sequences in several ways. First, reliable models are built for a group of sequences that includes sequence variants of the 35 BLASTP sequences as well as two sequences that have not previously been annotated as START domains by any other resource (Table 2, rows 1 and 2). Furthermore, SkyLine predicts the presence of a START domain at the N-terminus of protein BAB01397, which contains a kinase domain at its C-terminus. According to the Conserved Domain Architecture Retrieval Tool, a similar architecture is also detected in a grapevine protein. This is an exciting result both because it has previously been thought that START domains appear invariably at the C-termini of multi-domain proteins and because START domains appear in concert with a kinase domain only in plant genomes. The inverse architecture (kinase-START) is found in two wheat proteins, WSK1 and WSK2, where both domains are necessary to provide resistance to a devastating fungal disease. This suggests that the Arabidopsis protein BAB01397 may perform a similar defensive function. More intriguingly, SkyLine predicts, with high reliability, the presence of a second START domain in ten sequences of the reference set of 35 Arabidopsis proteins (Table 2, rows 3–12). These are the first reported instances of proteins containing multiple START domains. For example, each of the 817 proteins reported by the SMART database is predicted to contain a single START domain; similar results are reported in Pfam, the NCBI Conserved Domain Database (CDD) and other available resources.
All of the novel Arabidopsis sequences predicted by SkyLine are listed in Table 2. Structural models for the Arabidopsis START domains can be retrieved and analyzed using the tool MAPArT (Models of the Membrane-Associated proteins of Arabidopsis thaliana; http://188.8.131.52/araba2_ts/at_search.php), which is a SkyBase-derived resource that contains additional functional annotation performed for the NSF-funded Arabidopsis 2010 project. This example illustrates specifically how SkyBase may be used to organize protein sequence space at multiple levels, i.e. according to genome, family and function.
Because of the central role of START domains in non-vesicular lipid transport, it is of great importance both to understand the membrane-associating function of START domains and to design inhibitors of START domain function. At least eight bacterial genomes contain START domains, and some of these are pathogenic strains. Bacterial START domains are thought to be involved in the maintenance of membrane integrity; thus, it would be desirable to generate experimentally testable hypotheses of membrane-binding function and to find inhibitors for this class as well. SkyLine is designed to quickly detect a set of true homologs of a given structure, rather than necessarily providing the best possible models. However, experience has revealed that the models built with the SkyLine procedure are often reliable enough for accurate family classification and function annotation as predicted by calculations of 1) biophysical properties, such as surface curvature and electrostatic properties, 2) sequence conservation, and 3) ligand-binding propensities, as is performed in MarkUs. Figures 3 and 4 display the conservation of residues in the ligand-binding pocket, the electrostatic surface potential, and the calculation of the volume of the ligand-binding pocket [35,36] for human and bacterial START domains, respectively. Information obtained from the calculation of biophysical properties of START domains, such as those depicted in Figures 3 and 4, is critical in guiding and interpreting experiments. For example, highly conserved residues in the predicted ligand-binding pockets of START domains may be substituted in order to provide insight into the mechanism of ligand binding. Electrostatic potential calculations provide clues to the membrane-adsorption properties of lipid-binding START domains. Finally, the calculation of ligand-binding volumes may be used to screen chemical libraries to detect potential ligands.
The comparison of panels C and D in Figure 3 illustrates that the ligand-binding pocket volume calculated for a START domain whose structure was solved in the presence of ligand faithfully represents the molecular shape of the ligand. Panels D and E show two views of the calculated ligand-binding pocket volume for a START domain whose structure was solved in the absence of ligand. Whether these types of calculations, for START domain structures without ligands and for high-quality homology models, have merit remains to be determined. We plan to use these calculated volumes to aid high-throughput experimental docking studies of START domain inhibitors. The integration of computational and experimental docking results into a single database will facilitate the design of more potent START domain inhibitors.
The information obtained from the computational analysis of START domain structures and models is stored in the START domain database (http://184.108.40.206/data/start/start.html) and can be used for guiding and interpreting the different kinds of experiments, for example, the cellular, molecular biological and biophysical characterization of the membrane and ligand binding functions of START domains.
We have outlined in this work how structural information can be used to provide novel insights about a single protein family that are well beyond what is possible with sequence alone. An essential feature of our approach is the use of modelability as a criterion to evaluate sequence relationships. Filtering based on modeling allows the relaxation of sequence-based criteria and thus significantly expands the number of homologs that can be identified. Together with the use of structural alignments and a range of functional information, we are now able to explore protein sequence/structure/function space in ways that were not previously possible.
Among the ~4,000 START domains identified in this work are a large number of remote homologs, including those from genomes in which START domains were previously not identified and within novel START domain-containing sequences for which our method predicts the presence of a second START domain. However, since so many of these sequences represent potential new instances of START domains, it is critical that these predictions be validated by experimental approaches, both functional studies, e.g. in the case of novel yeast START domains, and structure determination. To this end, NESG has instituted a target selection strategy in order to test these novel predictions and to provide comprehensive coverage of our expanded START domain universe.
The efficacy of our strategy is clearly evident in the new insights about the START domain family that we have provided. A critical aspect of the approach is the identification of a large number of putative relationships and their subsequent filtering based on modelability, as done in SkyLine, or on functional criteria, as used in MarkUs. However, we recognize that the meaningfulness of a finding is often best determined by a researcher who has used computational tools with a hypothesis in mind or with the goal of identifying new hypotheses that can then be subjected to experimental verification. We believe that the computational infrastructure we have established will greatly facilitate this type of discovery process. More generally, our results highlight the fact that, combined with appropriate computational tools, structural genomics can have a major impact on biological research in ways that are just now becoming evident.
B.H. acknowledges the support of National Institutes of Health Grants GM030518, GM074958, and CA121852. D.M. acknowledges the support of National Institutes of Health Grants GM074958 and GM071700 and of National Science Foundation Grant NSF0738311.