Membrane protein annotation pipeline
The annotation pipeline consists of five main steps (Fig. , Methods). In the first step, we started with 1,779,528 sequences in Pfam-A (01/09/07) [11]. Of these, 172,079 are predicted by TMHMM [21] to have one or more transmembrane α-helices (TMHs); 99,937 have three or more TMHs. Because of the difficulty of accurately identifying signal peptides and possible errors in TMH prediction, only integral membrane protein sequences predicted to have at least three TMHs were selected. These sequences belong to 598 unique Pfam families (Supplementary Table 1), of which 476 appear in at least one of the 34 organisms of interest (Fig. ). Organisms were selected based on the following considerations: completeness of the genome, status as a model organism, relevance to human disease, diversity within each kingdom, and the availability of genomic DNA for cloning and expression. The 122 families with no representatives include photosystem-related families (e.g., PF00421) that are not found in any of the selected organisms, families that are only found in a single organism (e.g., PF03303), and families with no characterized function (e.g., PF06836 and PF07099).
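The selection rule in this first step can be illustrated with a minimal sketch. It assumes that TMH counts and Pfam assignments are already available as simple tuples; the sequence and family identifiers below are hypothetical placeholders, and the function name is not part of the pipeline described here.

```python
from collections import defaultdict

MIN_TMH = 3  # threshold stated in the text: at least three predicted TMHs


def select_integral_membrane(records, min_tmh=MIN_TMH):
    """Group sequences with at least `min_tmh` predicted TMHs by family.

    `records` is an iterable of (sequence_id, n_tmh, family_id) tuples.
    """
    families = defaultdict(list)
    for seq_id, n_tmh, family_id in records:
        if n_tmh >= min_tmh:
            families[family_id].append(seq_id)
    return dict(families)


# Toy records (hypothetical IDs and helix counts, not real predictions).
records = [
    ("seqA", 12, "famX"),
    ("seqB", 1, "famY"),
    ("seqC", 6, "famZ"),
    ("seqD", 3, "famX"),
    ("seqE", 2, "famZ"),
]
selected = select_integral_membrane(records)
print(sorted(selected))
```

Families whose members all fall below the threshold (here `famY`) drop out entirely, which is how a family count shrinks relative to the full database.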
In the second step, sequences for the 34 genomes were collected from UniProt, Ensembl and organism-specific sequencing projects. In total, there were 21,379 proteins predicted to have at least three TMHs, corresponding to the “integral membrane genomes” (IMGs) of the 34 organisms (Supplementary Fig. 1). We were able to annotate between 41% (Plasmodium falciparum) and 93% (Mus musculus) of each IMG with Pfam family IDs; 16 of the genomes had more than 25% of their IMG unannotated, suggesting that there are many undiscovered membrane protein families.
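As an illustration of the per-genome bookkeeping in this step, the sketch below computes the annotated fraction of each IMG and lists genomes whose unannotated fraction exceeds 25%. The genome names and counts are invented for the example, and the function is a hypothetical helper rather than part of the published pipeline.

```python
def annotation_summary(genomes, threshold=0.25):
    """Return annotated fractions and the genomes above the unannotated threshold.

    `genomes` maps a genome name to (n_annotated, n_total) for its IMG.
    """
    fractions = {}
    poorly_annotated = []
    for name, (annotated, total) in genomes.items():
        fractions[name] = annotated / total
        if (total - annotated) / total > threshold:
            poorly_annotated.append(name)
    return fractions, sorted(poorly_annotated)


# Toy counts (hypothetical, chosen only to exercise the threshold).
toy = {"genome_1": (41, 100), "genome_2": (93, 100), "genome_3": (75, 100)}
fractions, poorly = annotation_summary(toy)
print(poorly)
```

The comparison is strict, so a genome with exactly 25% of its IMG unannotated (`genome_3`) is not counted.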
In the fourth step, each PSSM was compared to the whole IM database, using profile-profile alignments to identify related proteins.
The fifth step estimated the impact of newly solved atomic-resolution structures on the coverage of membrane protein sequence space. While it would be ideal to use crystal structures resulting from the target selection pipeline described here, our experimental pipeline based on yeast expression has not yet yielded any structures. To demonstrate the utility of this final step in the pipeline, we therefore used other recently solved membrane protein structures that were not part of our yeast target selection scheme. As of October 2008, the CSMP had determined crystallographic structures for the following seven integral membrane proteins: the Escherichia coli ammonia channel AmtB (PDB ID: 2ns1), the Nitrosomonas europaea Rh50 ammonium transport protein (PDB ID: 3bhs), two structures of E. coli lactose permease (PDB IDs: 2cfq, 2v8n), the archaeal aquaporin AqpM (PDB ID: 2f2b), a mutant structure of E. coli AqpZ (PDB ID: 2o9d), and an aquaglyceroporin from Plasmodium falciparum (PDB ID: 3c02).
These seven structures fall into three Pfam protein families: the ammonia channel and ammonium transport protein are members of the PF00909 ammonium transporter family, the two aquaporins and the aquaglyceroporin are members of the PF00230 major intrinsic protein (MIP) family, and lactose permease is a member of the PF01306 LacY proton/sugar transporter family. The structures were used as input to the final step of the computational pipeline to calculate how many sequences can be modeled based on these structures (i.e., the modeling leverage).
To assess the value of each new CSMP template for comparative modeling, we compared models for sequences that could be modeled using both a CSMP structure and any non-CSMP membrane structure as templates. To take partially modeled sequences into account, the comparison was performed at the residue level. Additionally, a “transmembrane region” was defined for each of the CSMP template structures that included all amino acid residues from the first TMH residue to the last TMH residue.
A total of 178,627 sequences were included in the membrane template-based modeling calculations, 13,317 of which were modeled on CSMP templates. Of this set, 11,240 sequences with 1,108,633 residues in transmembrane regions could be modeled using both a CSMP template and another template. Of these residues, 18% (199,684) were modeled with higher target-template sequence identity using a CSMP template than using any other available membrane protein structure, demonstrating the value of these additional structures for comparative modeling of membrane proteins. Among the individual templates, the lactose permease structures had the best modeling leverage, with 24% of residues modeled with higher target-template sequence identity using the CSMP structures 2v8n and 2cfq. The aquaporin and aquaglyceroporin structures (2f2b, 2o9d, and 3c02) had less impact, with 8% of residues modeled best with the CSMP template (Supplementary Table 2).
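The residue-level comparison can be sketched as follows. This is a simplified illustration, not the implementation used here: it assumes each model is given as a (start, end, identity) span in target coordinates, with 0-based start and exclusive end, and that the transmembrane region has already been mapped onto the target. All names and the toy spans are hypothetical.

```python
def best_identity_per_residue(models, length):
    """Highest target-template identity covering each residue (0.0 if unmodeled)."""
    best = [0.0] * length
    for start, end, identity in models:
        for i in range(start, end):
            best[i] = max(best[i], identity)
    return best


def csmp_leverage(csmp_models, other_models, tm_start, tm_end, length):
    """Count TM-region residues modeled by both sets and those where CSMP is better."""
    csmp = best_identity_per_residue(csmp_models, length)
    other = best_identity_per_residue(other_models, length)
    both = [i for i in range(tm_start, tm_end) if csmp[i] > 0 and other[i] > 0]
    better = sum(1 for i in both if csmp[i] > other[i])
    return better, len(both)


# Toy target of 10 residues with a transmembrane region spanning residues 1-8.
better, both = csmp_leverage(
    csmp_models=[(0, 8, 40.0)],
    other_models=[(2, 10, 35.0), (5, 10, 45.0)],
    tm_start=1, tm_end=9, length=10,
)
print(better, both)

# Arithmetic check of the overall figure quoted in the text.
print(round(100 * 199_684 / 1_108_633))
```

Working per residue means a sequence that is only partly covered by a CSMP-based model still contributes exactly the residues that both model sets cover.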
Calculating the impact of new membrane protein structures on coverage of membrane protein sequence space will aid in assessing target selection efforts by structural genomics consortia. Furthermore, this modeling approach is applicable to any new membrane protein structure.
Membrane protein family distribution in the three kingdoms of life
All sequences representing membrane protein families from each genome were collected and the number of times each family appeared in each genome was counted. Counts were assembled into a matrix (http://salilab.org/projects/integral_membrane_proteins/memb_counts.txt.gz). The counts ranged from 0 counts of a family in an organism to 1,468 for rhodopsin-like GPCRs in the mouse genome, demonstrating that some families are highly represented in multiple genomes and others are rare or restricted to only a few organisms. There are 13,139, 2,079, 1,956, and 30 families with 0, 1, 2–49, and 50–1,468 representatives, respectively.
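A count matrix of this kind, and the binning of its entries, can be sketched as below. The input format (one (family, genome) pair per annotated sequence) and all identifiers are assumptions made for illustration; the layout of the downloadable file is not reproduced here. The bins mirror those reported above, which presumably tally individual family-genome entries of the matrix.

```python
from collections import Counter


def build_count_matrix(hits, families, genomes):
    """Rows are families, columns are genomes, entries are sequence counts."""
    counts = Counter(hits)  # (family, genome) -> number of sequences
    return [[counts[(fam, gen)] for gen in genomes] for fam in families]


def bin_entries(matrix):
    """Tally matrix entries into the bins 0, 1, 2-49, and 50 or more."""
    bins = {"0": 0, "1": 0, "2-49": 0, "50+": 0}
    for row in matrix:
        for n in row:
            if n == 0:
                bins["0"] += 1
            elif n == 1:
                bins["1"] += 1
            elif n < 50:
                bins["2-49"] += 1
            else:
                bins["50+"] += 1
    return bins


# Toy input with hypothetical family and organism labels.
families = ["famA", "famB"]
genomes = ["org1", "org2", "org3"]
hits = [("famA", "org1")] * 60 + [("famA", "org2")] + [("famB", "org2")] * 3
matrix = build_count_matrix(hits, families, genomes)
bins = bin_entries(matrix)
print(matrix, bins)
```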
Target selection for the structural genomics of integral membrane proteins in yeast
Two subsets of target proteins for structural studies were selected. First, we aimed to maximize the coverage of the Saccharomyces cerevisiae IMG while minimizing the number of targets for expression. Second, we also selected a number of targets to further PMT’s functional and clinical studies of ABC and SLC membrane transporters in drug disposition.
Target selection for sequence leverage. Pfam annotations were used to cover all membrane protein families in yeast and the associations between multiple sequence profiles were used to select sequences that are absent from Pfam (Methods). There are 621 predicted IM sequences in yeast and these were the input to our computational annotation pipeline. Of these, 490 sequences could be annotated with 165 unique Pfam membrane protein families and 131 could not be annotated with a Pfam identifier. Of the 165 annotated families, 79 were represented by a single sequence, meaning the family appeared only once in the yeast genome.
The 79 singletons initiated our target list. For the remaining 83 annotated families, two sequences were selected from each family to improve the likelihood of successful structural characterization for that family. These two members were selected to ensure optimal coverage of each family (Methods), which is especially important for larger families. For example, the major facilitator family (MFS) has 57 sequences, the most of any membrane protein family in yeast. The MFS sequences fall into two major clusters, one with 44 MFS members and one with six. We selected VBA1_YEAST, which is associated with 24 MFS-annotated sequences in the first cluster (55%), and MCH4_YEAST, which is associated with five MFS sequences in the second cluster (83%).
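One simple way to pick a well-connected representative from a cluster is sketched below. It assumes that "coverage" means the number of cluster members a sequence is associated with, which is an interpretation for illustration only; the actual selection criterion is given in the Methods. The identifiers and association sets are hypothetical.

```python
def pick_representative(cluster, associations):
    """Return (member, n_linked, fraction) for the member linked to the most cluster members.

    `associations` maps a sequence ID to the set of IDs it is associated with.
    Ties are broken alphabetically so the result is deterministic.
    """
    members = set(cluster)
    best, best_n = None, -1
    for seq in sorted(members):
        n = len(associations.get(seq, set()) & members)
        if n > best_n:
            best, best_n = seq, n
    return best, best_n, best_n / len(members)


# Toy cluster of four sequences with a single hub.
cluster = ["s1", "s2", "s3", "s4"]
associations = {
    "s1": {"s2"},
    "s2": {"s1", "s3", "s4"},
    "s3": {"s2"},
    "s4": {"s2"},
}
rep = pick_representative(cluster, associations)
print(rep)

# Percentages quoted in the text for the two MFS clusters.
print(round(100 * 24 / 44), round(100 * 5 / 6))
```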
Of the 131 unannotated sequences, 16 were in two completely unannotated clusters of 8 sequences each, 62 hit no other sequences, six sequences fell in two unannotated clusters of three sequences each, and 14 fell into seven clusters of two sequences each. The remaining 33 sequences were associated with at least one other annotated sequence and were discarded. Because two sequences were selected from each unannotated cluster, there are an additional 98 targets. Complete coverage of the yeast genome therefore requires 347 targets out of the 621 IMG proteins. If a target fails in any stage of the experimental process, a similar yeast target can be selected for a subsequent trial [23].
The results of our computational annotation pipeline were entered into an experimental structure determination pipeline, as detailed in the Results and Discussion and a companion paper [23].
Target selection for biological significance. Two of the targets, the yeast genes STE6 and YN_99, code for ATP-binding cassette transporters that are homologous to human multidrug transporters in the B and G families, respectively. There are 48 characterized ABC transporters in the human genome and 18 are disease-associated [7, 8, 26]. There are many atomic structures available for isolated nucleotide binding domains (NBDs) from ABC transporters, and these structures have been successfully used to assess the role of interface-disrupting point mutants with clinical phenotypes in human ABC transporters [20]. However, a molecular-level understanding of the clinical impact of genetic variation requires high-resolution structural data for the substrate-binding transmembrane domains (TMDs) of these proteins, providing the rationale for their inclusion in the target list.
In humans, ABCB1 (also known as MDR1) and other members of the B family, such as ABCB4 (MDR3) and ABCB11 (BSEP), are associated with multidrug resistance in cancer therapy. ABCB4 and ABCB11 are also associated with several forms of cholestasis [13]. Our collaborators at the PMT have identified 29 non-synonymous single nucleotide polymorphisms in these proteins. The STE6 structure would be particularly useful for structural modeling of sequence variations in humans because its domain organization of two TMDs and two NBDs is the same as in the human transporters ABCB1, ABCB11 and ABCB4 (Supplementary Fig. 2). The most similar structurally characterized homolog of the ABCB family is currently the Staphylococcus aureus transporter Sav1866 [5]. This transporter has only a single TMD and a single NBD and forms a homodimer; thus, it is not an ideal template for modeling the four-domain multidrug resistance-associated transporters from the ABCB family.
The yeast nucleoside transporter target YAL022C (FUN26) [39] is homologous to the equilibrative nucleoside transporters ENT1 (SLC29A1) and ENT2 (SLC29A2). The PMT has identified two non-synonymous SNPs in SLC29A1 as well as two non-synonymous SNPs and seven insertion/deletion mutations in SLC29A2 [22].
Additional structural data from these transporter families will be invaluable for interpreting the results of functional studies and suggesting molecular mechanisms for clinical phenotypes.
The final set of 384 targets was entered into the structural characterization pipeline of the CSMP [23]. Of these targets, 273 are significantly related to at least one human gene. In all, 1,249 human sequences are significantly similar to the 273 yeast sequences, suggesting that about 40% of the human IMG has a corresponding gene in yeast (Supplementary Fig. 3a, b). Our clustering of the yeast IMG is generally in agreement with the manual “clans” clustering in Pfam (Supplementary Fig. 3c) [11].
Defining the scope of membrane protein structural genomics
The scope of structural genomics of membrane proteins is the number of target structures needed to achieve some desired coverage of the membrane protein sequence space. Current comparative modeling coverage of integral membrane protein sequences in UniProt [38] was examined first. Next, the total number of structures required for desired sequence coverage of the 598 Pfam integral membrane protein families described above was calculated.
The ModBase database [30] contains 806,266 models for the 1,733,721 sequences in the complete UniProt (as of 6/1/2005) for which the target-template identity is at least 30%. Of these, 61,749 models for 55,161 unique sequences are predicted by TMHMM to contain at least three TMHs. This estimate suggests that domains in only approximately 8% of integral membrane proteins can currently be modeled at reasonable accuracy (implied by the 30% target-template sequence identity) using available template structures (Supplementary Fig. 4).
To improve the coverage, it would be ideal to select sequences for structural characterization that yield the greatest improvement in the number of modelable sequences based on the 598 Pfam integral membrane families. At 30% sequence identity, the 375,155 sequences in these families fall into 13,395 clusters. Thus, a representative structure from such a cluster provides a reasonable template for comparative modeling of the other sequences in its cluster. Using a target selection strategy in which sequences from the largest clusters are selected for structural characterization first, 90% of the sequences in the currently known integral membrane families could be covered by 2,454 structures. In contrast, a random selection of crystallographic targets would require approximately eight times more structures (i.e., 20,000) to achieve the same coverage. For 70% coverage of sequence space, a more realistic goal, the ranking by cluster size requires 504 structures versus 2,500 for the random selection (Fig. ).
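The largest-cluster-first strategy amounts to a cumulative sum over cluster sizes in descending order. The sketch below is a minimal illustration under the assumption that one structure covers every sequence in its cluster; the cluster sizes are toy values, not the 13,395 clusters analyzed here, and the function name is hypothetical. The `largest_first=False` option simply keeps the supplied order and is not a model of the random baseline.

```python
def structures_needed(cluster_sizes, percent, largest_first=True):
    """Number of clusters to solve before `percent` of all sequences are covered.

    Integer arithmetic is used for the comparison to avoid floating-point
    edge cases at exact coverage boundaries.
    """
    order = sorted(cluster_sizes, reverse=True) if largest_first else list(cluster_sizes)
    total = sum(order)
    covered = 0
    for n_structures, size in enumerate(order, start=1):
        covered += size
        if covered * 100 >= percent * total:
            return n_structures
    return len(order)


# Toy clusters covering 100 sequences in total.
sizes = [50, 20, 10, 5, 5, 4, 3, 2, 1]
print(structures_needed(sizes, 70))  # largest clusters first
print(structures_needed(sizes, 90))
print(structures_needed(sorted(sizes), 70, largest_first=False))  # smallest first
```

Because cluster sizes are highly uneven, the ordering matters a great deal: in the toy data two structures reach 70% coverage when the largest clusters are taken first, whereas all nine are needed when the smallest are taken first.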
Applications of the membrane protein annotation pipeline
Identification of unannotated homologs in seven membrane protein families related to multidrug resistance. In the 34 genomes, 793 sequences were annotated as coming from one of seven Pfam-A families with experimentally established links to multidrug resistance (MDR). Of these sequences, 292 were not described by either the “Protein name” field in UniProt or the “DEFINITION” field in GenBank as MDR-related, but rather with descriptions such as “conserved membrane protein” or “uncharacterized protein”.
Between 2% (mouse) and 8% (Mycobacterium tuberculosis) of the IMG of each organism is devoted to MDR. Furthermore, pathogenic organisms tend to have higher percentages of MDR membrane proteins in their genomes. For example, the pathogens M. tuberculosis, Cryptosporidium parvum, Cryptosporidium hominis, Pseudomonas aeruginosa, and Leishmania major, and the obligate parasite Mycoplasma pneumoniae all had more than 5% of their IMGs devoted to MDR.
Tracing the evolutionary history of human ABC transporters. ABC transporters are found in all three kingdoms of life. These proteins couple ATP binding, hydrolysis, and release to substrate transport across a membrane. They share a common architecture consisting of combinations of transmembrane domains (TMDs) and nucleotide-binding domains (NBDs). While the NBDs are well conserved, the TMDs, which contain the substrate binding sites, are more divergent.
The 72 TMDs in the human ABC transporter superfamily were associated with 669 unique sequences in the 34 organisms. In total, there were 16,503 connections between the human TMDs and sequences in the IM database (Fig. ).
Identification of new membrane protein families. Finally, the analysis suggests that there exist additional unidentified membrane protein families. Out of the 21,385 sequences of membrane proteins in the selected genomes, 4,389 (21%) could not be annotated with a Pfam membrane protein family.
Of the 51 putative new families, 27 and 16 had one or more Pfam-B identifiers, respectively (11/10/08). Because groups of TMHs may act as a functional unit, a family definition needs to cover as long a stretch of conserved TMHs as possible; our analysis extends the membrane region coverage of 10 Pfam-A families. For example, in Eukaryotic Cluster 14, the analysis indicates a conserved group of three TMHs, whereas the Pfam-A family hit Mpv17_PMP22 (PF04117) covers only one or two of the TMHs. Furthermore, the latest version of Pfam-A now includes new families, such as DuoxA (PF10204), Tmp39 (PF10271), and DUF2453 (PF10507), that each correlate with one of our newly identified eukaryotic families. Finally, Bacterial Cluster 4 has no Pfam-A or Pfam-B classification in the conserved membrane region.