The overall SSGCID structure determination pipeline involves a number of activities distributed between the Target Selection, Cloning & Expression Screening, Protein Production, Crystallization, and Data Collection & Structure Solution teams at the five different locations (see ). In order to maximize the likelihood of success of each target, yet minimize the cost-per-structure, we have adapted a multi-pronged serial escalation approach, whereby targets initially enter a standard high-throughput bacterial protein expression system (Tier 1), and enter more expensive “rescue pathways” (Tiers 2–9) only after failing the initial approach. In addition, there are a number of tiers dedicated to specialized activities such as NMR structure solution (Tier 10), co-crystallization (Tier 11), ligand screening (Tiers 12–15) and RNA targets (Tier 16), as well as expression of constructs supplied by requestors from the scientific community (Tier 0). A more detailed description of each activity within the SSGCID structure determination pipeline follows.
SSGCID Structure Determination Pipeline
A series of bioinformatic and manual filters are used by the SSGCID Target Selection Team (at SBRI) to select proteins predicted from a single representative genome sequence for each of the 31 bacterial, eukaryotic and viral genera indicated above. Positive selection criteria include sequence similarity to known drug targets, documented or potential roles in cell growth, pathogenesis, or drug resistance, as well as markers of infection and vaccine candidates. Negative filters include physical properties (such as size, amino acid composition, presence of transmembrane domains and low complexity sequences) predictive of difficulty in soluble expression and/or crystallization and close similarity to sequences already present in PDB or already targeted by other Structural Genomics projects. In order to achieve our annual goal of determining one or more structures from each of ~50 different proteins, we anticipate selecting at least 500 targets each year. In addition, we expect a smaller number (50–100 annually) of targets from community requests (see below).
The initial round of target selection (Batch01) involved identification of potential drug targets in three bacterial species (Burkholderia pseudomallei, Brucella melitensis, and Rickettsia prowazekii), by virtue of their sequence similarity (>50% over >75% of their length) with protein sequences in the Drug
]. A series of physical screens were subsequently used to eliminate proteins longer than 500 amino acids, containing more than eight cysteine residues and/or containing any transmembrane spanning domains (except for N-terminal signal sequences) predicted using Tm
) and/or Tm
]. The remaining candidate proteins were screened for near-identity (>95% similarity over >80% of their length) to proteins with known structure or those selected by other structural genomics centers by BlastP searching against TargetDB. This ultimately resulted in 196 targets from these three species, which were supplemented with 13 ftsZ orthologues selected from several different species within the Burkholderia, and Rickettsia genera. FtsZ was chosen because of its particular interest as a bacterial drug target.
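The Batch01 screens described above can be sketched as a single predicate. This is only an illustration of the stated thresholds, not the Target Selection Team's actual software: the `Candidate` record, its field names, and the `passes_batch01` function are hypothetical, and the similarity and coverage values are assumed to have been computed beforehand by the search tools named in the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one predicted protein (field names are illustrative)."""
    name: str
    length: int              # number of amino acids
    cysteines: int           # number of cysteine residues
    has_tm_domain: bool      # predicted TM domain, not counting N-terminal signal sequences
    drug_similarity: float   # % similarity to the best drug-target hit
    drug_coverage: float     # % of the protein's length covered by that hit
    known_similarity: float  # % similarity to the best known or already-targeted structure
    known_coverage: float    # % of the protein's length covered by that hit


def passes_batch01(c: Candidate, max_length: int = 500) -> bool:
    """Apply the positive, physical and redundancy screens with the thresholds in the text."""
    # Positive selection: >50% similarity over >75% of the length.
    resembles_drug_target = c.drug_similarity > 50 and c.drug_coverage > 75
    # Physical screens: drop long, cysteine-rich or membrane-spanning proteins.
    physically_tractable = (
        c.length <= max_length and c.cysteines <= 8 and not c.has_tm_domain
    )
    # Redundancy screen: >95% similarity over >80% of the length.
    already_covered = c.known_similarity > 95 and c.known_coverage > 80
    return resembles_drug_target and physically_tractable and not already_covered


good = Candidate("example_A", 320, 3, False, 62.0, 90.0, 40.0, 85.0)
too_long = Candidate("example_B", 640, 3, False, 62.0, 90.0, 40.0, 85.0)
print(passes_batch01(good), passes_batch01(too_long))
```

The `max_length` parameter is exposed because the size limit was raised to 750 amino acids for Batch02; calling `passes_batch01(too_long, max_length=750)` would accept the second example.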
For Batch02, a list of 42 bacterial drug targets being actively pursued by pharmaceutical and academic researchers was compiled by literature survey, and their orthologues were identified within representative B. pseudomallei, B. melitensis, R. prowazekii, Mycobacterium tuberculosis, Bartonella henselae, and Borrelia burgdorferi genomes. These were then screened through similar filters (except that the size limit was raised to 750 amino acids) as described for Batch01, resulting in 143 additional targets.
Both of these approaches were combined in Batch03 to identify an additional 1477 targets from five bacterial (M. tuberculosis, B. henselae, B. burgdorferi, Anaplasma phagocytophilum, and Ehrlichia chaffeensis) and seven eukaryotic (Babesia bovis, Coccidioides immitis, Cryptosporidium parvum, Encephalitozoon cuniculi, Entamoeba histolytica, Giardia lamblia, and Toxoplasma gondii) species.
Batch04 marked a significant departure from most structural genomics efforts by selection of five RNA riboswitch elements from three bacterial species. Riboswitches are non-coding RNA elements that bind small-molecule metabolites with high affinity and specificity and regulate the expression of associated genes [3]. These targets include a thiamine pyrophosphate-sensing (thi-box) riboswitch from M. tuberculosis, an S-adenosyl methionine (SAM-II) riboswitch from B. melitensis, two pre-queuosine-1 or 7-aminomethyl-7-deazaguanine (preQ1) riboswitches from Bacillus anthracis and one (preQ1) riboswitch from Listeria monocytogenes.
To date, the four batches described above have resulted in 1834 targets being approved for entry into the SSGCID structure determination pipeline, as well as an additional 67 targets from Community Requests (see below). Target selection Batch05 is currently in progress and will include orthologues from an additional 45 hand-selected potential drug targets in all bacterial and eukaryotic species above. We also anticipate selecting a number of viral targets (Batch06) in the coming months, as well as two different approaches (Batch07 and Batch08) to identify target proteins with potential roles in pathogenesis.
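As a quick consistency check, the per-batch counts quoted in the preceding paragraphs add up to the stated total of 1834 approved targets. The dictionary labels below are shorthand chosen for this sketch; the numbers are taken from the text.

```python
# Target counts reported for each selection batch.
batch_targets = {
    "Batch01 drug-target homologues": 196,
    "Batch01 ftsZ orthologues": 13,
    "Batch02": 143,
    "Batch03": 1477,
    "Batch04 riboswitches": 5,
}

total = sum(batch_targets.values())
print(total)  # 1834
```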
Cloning and Expression Screening
Targeted genes are initially PCR amplified from genomic DNA (bacteria) or cDNA (eukaryotes and viruses) and cloned into bacterial expression vectors (Tier 1) using a ligation-independent cloning (LIC) methodology [4]. Both vectors (BG1861 and AVA0421) are derivatives of pET14b, are regulated by the T7 promoter, and contain the amp gene encoding ampicillin resistance. BG1861 yields protein constructs with a minimal N-terminal His6-Tag: MAHHHHHHM-ORF, while AVA0421 yields protein constructs with an N-terminal His6-Tag and a 3C protease cleavage site: MAHHHHHHMGTLEAQTQGPGSM-ORF [5]. Cleavage of the His6-Tag by 3C protease yields proteins with an N-terminal sequence: GPGSM-ORF. Cloning steps are relatively high throughput, being carried out in 96-well plates and trays. The resultant plasmids are transformed into the BL21(DE3) bacterial host, grown in auto-induction medium [5], the cells lysed, the supernatant passed over Ni2+ beads, and soluble protein quantified by SDS-PAGE. Once again all steps are carried out in 96-well format, so the screening proceeds relatively rapidly. Glycerol stocks of all clones are made at this stage and DNA prepared for sequencing to confirm that the correct target has been cloned and does not contain frame-shifts or premature stop codons.
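The relationship between the two tagged constructs and the cleaved product can be illustrated with a few lines of string handling. This is a minimal sketch: the N-terminal prefixes are copied from the sequences given above, the cut position (after the Q that precedes GP) is inferred from the reported pre- and post-cleavage sequences, and the function names and the example ORF residues are hypothetical.

```python
# N-terminal sequences preceding the ORF, as written in the text.
N_TERMINAL_PREFIX = {
    "BG1861": "MAHHHHHHM",
    "AVA0421": "MAHHHHHHMGTLEAQTQGPGSM",
}

# Assumed cut: ...LEAQTQ | GP..., inferred from the GPGSM-ORF product.
SITE_BEFORE_CUT = "LEAQTQ"
SITE_AFTER_CUT = "GP"


def tagged_construct(vector: str, orf: str) -> str:
    """Return the expressed protein: vector-encoded prefix followed by the ORF residues."""
    return N_TERMINAL_PREFIX[vector] + orf


def cleave_3c(protein: str) -> str:
    """Remove everything upstream of the cut; return the input unchanged if no site is found."""
    idx = protein.find(SITE_BEFORE_CUT + SITE_AFTER_CUT)
    if idx == -1:
        return protein
    return protein[idx + len(SITE_BEFORE_CUT):]


orf = "KLVTE"  # placeholder residues, not a real target
print(cleave_3c(tagged_construct("AVA0421", orf)))  # GPGSMKLVTE
print(cleave_3c(tagged_construct("BG1861", orf)))   # MAHHHHHHMKLVTE (tag is not removable)
```

The BG1861 construct carries no protease site, so its tag remains on the purified protein, whereas the AVA0421 construct leaves only the five residues GPGSM ahead of the ORF.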
Targets that fail to express sufficient soluble protein in Tier 1 are prepared for cell-free expression (Tier 2) by PCR amplification using a common primer set and vectors (pEU-E01-LIC1 and -LIC2) re-engineered to facilitate LIC. DNA from the resultant clones is transcribed in vitro using SP6 RNA polymerase to produce sufficient RNA for small-scale expression testing using the ENDEXT® Wheat Germ cell-free protein synthesis system [6]. Unfortunately, attempts to use linear template obtained by PCR for small-scale expression testing generally showed low expression levels and were of limited utility in predicting the success (or failure) of large-scale purification with a plasmid template. Thus, Tier 2 screening is currently not carried out in high throughput mode. While we have thus far screened only a small number (<50) of targets in Tier 2, our results indicate that the majority produce protein, with about half of them being soluble. This is in agreement with the experience of other laboratories [7], and suggests that cell-free protein synthesis may provide a valuable technique for “rescuing” targets that fail to produce soluble protein in E. coli.
Failure to make soluble protein in Tier 2 results in entry into Tier 3 for synthetic gene construction and cloning into a different bacterial expression vector (pET28-HisSMT). Targets that fail cloning in Tier 1 and those from organisms with difficult-to-obtain DNA (generally community requests) are moved directly to Tier 3. At least four different constructs (terminal/internal deletions, point mutations, and codon-optimization) are designed for each synthetic gene and clones are screened for soluble expression in much the same way as Tier 1, except that eluate from the magnetic nickel bead purification is screened for protein content using a high-throughput Caliper LC90 capillary electrophoresis system. The synthetic genes are designed using Gene Composer™ software [11] to harmonize the codon usage of the gene to the E. coli expression host. A comparison of the native and codon-optimized constructs showed little difference in the success rate [12], although only bacterial targets have been tested to date.
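The escalation rules for the three implemented expression tiers can be summarized as a small routing function. This is a sketch of the logic as described in the text, not the pipeline's tracking software; the function names, arguments and return labels are hypothetical, and the handling of a cloning failure in Tier 2 is an assumption, since the text does not cover that case.

```python
def entry_tier(dna_readily_available: bool) -> str:
    """Targets whose DNA is difficult to obtain skip straight to gene synthesis."""
    return "Tier 1" if dna_readily_available else "Tier 3"


def next_step(tier: int, cloned: bool, soluble: bool) -> str:
    """Route a target after screening in Tier 1, 2 or 3."""
    if cloned and soluble:
        return "protein production"  # scale-up and purification
    if tier == 1:
        # Cloning failures go directly to synthetic genes; insoluble clones try cell-free first.
        return "Tier 2" if cloned else "Tier 3"
    if tier == 2:
        # Assumption: any Tier 2 failure, including cloning, escalates to Tier 3.
        return "Tier 3"
    return "no further implemented tier"


print(next_step(1, cloned=True, soluble=False))   # Tier 2
print(next_step(1, cloned=False, soluble=False))  # Tier 3
print(next_step(2, cloned=True, soluble=False))   # Tier 3
```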
Additional “rescue” Tiers are planned, but have not yet been implemented. Targets that produce only small amounts of soluble protein in Tiers 1–3 will be screened for improved protein production in the presence of different additives (Tier 4), and we will carry out refolding (Tier 5) to attempt rescue of clones producing insoluble protein. Expression of orthologues in bacterial (Tier 6) or cell-free (Tier 7) systems will also be attempted, especially for targets that express soluble protein but fail to crystallize. A subset of targets that fail to express soluble protein in Tiers 1–7 will be selected for baculovirus expression (Tier 8). This will be particularly important for eukaryotic and viral targets likely to contain post-translational modifications (e.g. viral capsid proteins). A limited number of high-value eukaryotic targets will also be tried in mammalian cell expression systems (Tier 9) before being abandoned.
Cloned targets that produce soluble protein are scaled up and the protein purified in milligram quantities at the different protein production facilities. Most scaled-up growth of Tier 1 targets is carried out at the UW-PPG using a LEX 48 Bioreactor, a novel air-based system specifically designed to support the typical needs of HTP protein production labs. The LEX uses compressed air to mix and oxygenate bacterial media and regulate culture temperature, allowing simultaneous upscale of up to 48 different targets in individual 1 L bottles or 24 targets in 2 L bottles. Tier 1 protein purification is carried out at both the UW-PPG and SBRI, and involves lysis, clarification, Ni2+ affinity chromatography, size exclusion chromatography (SEC), protein characterization, and final concentration before packing and shipping. Together, these sites have the capacity to purify 10–15 proteins per week, with yields typically ranging from 10 to 150 mg of protein at >95% purity.
Tier 2 protein production occurs at SBRI using automated protein production and purification on the Protemist® DTII. Yields obtained in the large-scale reaction have typically been rather modest, with only low- to submilligram quantities of purified proteins (at ~0.6–0.8 mg/ml), produced from each Protemist® run. We are currently exploring whether these yields can be improved using an automated repeat-batch desk-top machine currently in beta-testing at CellFree Sciences.
After purification, all protein samples are delivered to deCODE for crystallization screening. High-throughput crystallization is performed using sophisticated liquid handling devices including Matrix Maker™, Drop Maker™, and the new nanovolume Microcapillary Protein Crystallization System (MPCS™). The MPCS™ technology, which was developed with funding from the Accelerated Technologies Center for Gene to 3D Structure (a PSI-2 Specialized Center), is capable of producing diffraction-ready crystals in the plastic MPCS CrystalCard, and has already been used to solve the structure of one SSGCID target [13].
Data Collection and Structure Solution
Diffraction screening of crystals is routinely conducted on deCODE’s in-house XRD (X-ray diffraction) systems, with data collection and structure solution attempted on all crystals that diffract to better than 3.0 Å resolution. Initially, most crystals required a final, high-resolution data set to be collected at a synchrotron radiation source, but the recent addition of a new automated Rigaku “Ultimate Homelab” system has allowed collection of many final data sets in-house, especially for ligand-bound structures in Tier 11 (see below). So far, all SSGCID structures have been solved by molecular replacement techniques using homologous PDB structures, pointing to one of the advantages inherent to our Target Selection strategy. Only three targets have failed structure solution by molecular replacement and these have been scheduled for selenomethionine (Se-met) protein production and MAD (multiwavelength anomalous diffraction) phasing.
Not all proteins are amenable to crystallization; therefore, proteins less than 150 amino acids in length that fail to crystallize are 15N-labeled and two-dimensional NMR data are collected for Heteronuclear Single Quantum Coherence (HSQC) screening. Those proteins that show good 1H-15N HSQC spectra are then 13C- and 15N-labeled to allow collection of the full suite of three-dimensional NMR experiments needed to assign the backbone and determine the solution structure (Tier 10). Ample spectrometer time is available to collect all the NMR data at two sites: UW-NMR and PNNL. However, while data collection is not a bottleneck, chemical shift assignment and final NMR-based structure determination are laborious and time-consuming. Because SSGCID resources are limited for this activity, we expect only 6–8 structures to be determined by NMR each year. To date the structures of two targets have been solved by this method, although several others are nearing completion.