Many proteins do not function as monomers in the cell but interact with partners in stable or transient complexes. Understanding their function therefore requires characterisation of these multi-component entities and their subcomplexes
[1]–
[4]. Characterisation of protein complexes has received considerable attention in the post-genomic era, and large-scale experimental and bioinformatic studies have identified the subunit content of many protein complexes. These subunits span a continuum from completely unstructured proteins that fold upon binding to those that fold individually and subsequently dock together
[5]–
[10]. Although the components of many protein complexes have been catalogued using proteomics methods (e.g.
[11]), recombinant expression of intact complexes for structural studies remains a major challenge. In particular, careful experimental validation of complexes predicted from high-throughput studies is necessary to filter out transient, unstable or non-existent complexes before recombinant expression trials begin
[12].
A common strategy for obtaining protein complexes is to express single proteins separately and then reconstitute complexes from purified components. Various experimental approaches for assembling protein complexes under
in vitro conditions have been developed
[13]–
[15]. Although these methods can be efficient, the formation of protein complexes depends on soluble expression of each component. In many cases when heterologous expression systems are used, complex subunits cannot fold in the absence of their partners, and so co-expression strategies are employed to produce subunits in the same host cell
[16],
[17]. Co-expression facilitates soluble complex formation by allowing co-folding or stabilisation through binding of protein partners. This can reduce or prevent aggregation or degradation, and removes the need for
in vitro purification and reconstitution
[18]. Several studies have revealed how co-expression can perform better than reconstitution from separately purified components
[4],
[19].
Among various systems to produce heterologous proteins for structural and functional studies, protein expression in
Escherichia coli is the most commonly used system because it is genetically simple, inexpensive for producing large quantities of proteins and permits the isotopic or heavy atom labelling of proteins that is necessary for some structural methods. However, when full-length eukaryotic proteins are produced in
E. coli, aggregation and insolubility problems often arise, resulting in low yields
[20]. Contributing factors include large size, susceptibility to proteases, intrinsic segmental flexibility or requirements for post-translational modifications. In fields such as structural biology, expression of more stable sub-full-length protein constructs is a common strategy, but this requires prediction of domain boundaries in order to design constructs. Multiple sequence alignments are the most common tool for domain prediction and are used to guide subsequent trial-and-error PCR subcloning experiments. However, many proteins are poorly understood and have no significant sequence similarity with others, which precludes alignment-based prediction. In these cases, secondary structure predictions and order/disorder predictors can help identify folded domains. Several convenient metaserver tools combine different secondary structure and order predictors with additional information sources to provide more accurate domain predictions and even associated automated primer design, for example ProteinCCD
[21] and the SGC Domain Boundary Analyser
[22]. Such tools can be very valuable, but do not always result in successful expression, in part because their predictions are generally of low resolution and even small variations at the edges of a construct can affect the expression level and stability of the products in an unpredictable manner.
For such problematic targets, a number of random library-based strategies have been developed that generate large collections of randomly truncated or fragmented constructs and couple these to a screen or selection process to identify rare soluble clones
[23]–
[28], reviewed in
[29]–
[31]. The ESPRIT technology developed in our laboratory uses exonuclease III/mung bean nuclease protocols
[32] to generate unidirectionally or bidirectionally truncated construct libraries. Tens of thousands of clones can then be screened in a colony array format using the efficiency of
in vivo biotinylation of a fused biotin acceptor peptide to enrich soluble clones from the library
[28]. Positive clones are then further validated in 96-well plates by automated affinity chromatography purification
[33]. ESPRIT has been used to express a number of challenging proteins for further structural study
[27],
[34]–
[39].
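To give a sense of the size of the construct space that a unidirectional truncation library samples, the following minimal Python sketch enumerates every possible 5′ deletion endpoint of a coding sequence and flags those that preserve the reading frame. It is an illustration only, not the ESPRIT protocol: it assumes that truncation can end at any nucleotide position, that an endpoint is in frame when its offset is a multiple of three, and that the stop codon is excluded. The function name `five_prime_truncations` is hypothetical; the 759 aa length of PB2 is taken from the text.

```python
def five_prime_truncations(cds_length_nt):
    """Yield (start_nt, in_frame, residues_remaining) for each 5' deletion endpoint.

    Assumption (illustrative): an endpoint preserves the reading frame
    only when its offset from the original start is a multiple of three.
    """
    for start in range(cds_length_nt):
        in_frame = start % 3 == 0
        # Out-of-frame endpoints encode no useful fragment of the target.
        residues = (cds_length_nt - start) // 3 if in_frame else 0
        yield start, in_frame, residues


PB2_LENGTH_AA = 759              # PB2 length quoted in the text
cds_nt = PB2_LENGTH_AA * 3       # coding nucleotides, stop codon excluded

constructs = list(five_prime_truncations(cds_nt))
in_frame = [c for c in constructs if c[1]]

print(f"possible endpoints: {len(constructs)}")
print(f"in-frame constructs: {len(in_frame)}")
print(f"in-frame fraction: {len(in_frame) / len(constructs):.3f}")
```

Under these assumptions, one endpoint in three is in frame, so the number of distinct in-frame constructs equals the number of residues in the target. A screen of tens of thousands of clones would therefore sample such a library many times over.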
To date, there has been no detailed description of library methods being used to express protein complexes directly, i.e. incorporating co-expression approaches. Here, we have established a high-throughput automated strategy in which a library of constructs is screened for solubility in the presence of an interacting bait protein. As such, it is similar in concept to two-hybrid methods, but in the context of recombinant over-expression of the multi-milligram quantities of material required for many downstream applications, including structural biology and vaccine research. Soluble protein complexes identified by this method can result either from association of pre-folded partners or from inter-folded polypeptide chains. To demonstrate the isolation of both types of complex, we used subunits from the heterotrimeric influenza RNA polymerase, which comprises the subunits PA, PB1 and PB2. This complex catalyses the transcription and replication of the viral RNA genome in the nucleus of infected cells
[40]. The PB2 subunit has been shown to interact with importin α to achieve nuclear localisation
[27]. For many years the polymerase subunits resisted all attempts at soluble recombinant expression due to their relatively large sizes (PA: 716 aa; PB1: 757 aa; PB2: 759 aa) and their lack of homology with other proteins, which prevented domain identification using multiple sequence alignments. The PB2 subunit was previously studied using the ESPRIT method, leading to the expression and subsequent structure solution of a series of novel domains key to viral function
[27],
[37],
[39], reviewed in
[41],
[42].
Here we screened PB2 gene libraries against bait proteins with the aim of isolating purifiable complexes. Firstly, a 5′ truncation library of the gene encoding the polymerase PB2 subunit was screened against importin α1, which has been shown to bind the purified C terminus of PB2 when mixed
in vitro [39]. Secondly, a 3′ truncation library of the same subunit was screened against a poorly behaving, marginally soluble C-terminal construct isolated from the PB1 polymerase subunit in an earlier ESPRIT experiment (data not shown). A similar PB1 construct was recently shown by X-ray crystallography to form an inter-folded complex with a short N-terminal fragment of PB2
[43], explaining its poor behaviour in isolation. In both experiments, a series of soluble complexes were isolated, some of which were similar to structurally validated forms, while others may be of potential interest in future functional studies.
The application of ESPRIT in this co-expression format (CoESPRIT) provides a powerful way of identifying well-expressing soluble complexes for in vitro and in vivo biochemical and structural characterisation, as well as immunisation, high-throughput screening and other applications that require multi-milligram quantities of material. Additionally, the same format has the potential for co-expression of other interacting proteins such as chaperones and modifying enzymes, widening the repertoire of expression tools for obtaining sufficient quantities of purified protein complexes.