Most GFP-based techniques fuse either fragments of genomic libraries or individual clones from cDNA libraries to the coding sequence of GFP, then express the fusions in cells or tissues and determine their subcellular localizations by microscopic inspection. Subsequently, the respective cDNAs or genes are rescued from the cells or tissues, cloned and sequenced. Such strategies have already been conducted on a genome-wide scale in yeast [5] and have identified the localization of so-far uncharacterized proteins, or fragments thereof. The GFP-tagged proteins can be immediately followed in living cells by time-lapse microscopy to determine their cellular dynamics, which adds a further level of information to such screens. At least 50% of the cDNAs isolated in this way are already known and well characterized, however [6]. Furthermore, the same cDNA clones are isolated several-fold in one screen, as the primary criterion for selection is simply localization [5]. These aspects are major disadvantages of such morphological screens and make them inefficient. For example, in an attempt to isolate novel nuclear-envelope proteins, 550,000 starting cDNA clones were required to identify 27 clones localizing to this compartment, of which only two proved to be novel [9].
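The inefficiency of the nuclear-envelope screen can be made concrete with a little arithmetic. The sketch below uses only the three figures quoted above (550,000 starting clones, 27 localizing clones, two novel clones); the function name `screen_yield` and its return fields are illustrative assumptions, not part of any published analysis.

```python
def screen_yield(starting_clones, localized_clones, novel_clones):
    """Summarize the yield of a localization screen.

    Hypothetical helper for illustration only; the inputs are the
    clone counts reported for a screen.
    """
    return {
        # Fraction of starting clones that localize to the compartment.
        "hit_rate": localized_clones / starting_clones,
        # Fraction of localizing clones that turn out to be novel.
        "novel_fraction": novel_clones / localized_clones,
        # Starting clones that must be screened per novel protein found.
        "clones_per_novel": starting_clones / novel_clones,
    }


# Figures quoted in the text for the nuclear-envelope screen.
result = screen_yield(550_000, 27, 2)
print(f"hit rate: {result['hit_rate']:.4%}")
print(f"novel among hits: {result['novel_fraction']:.1%}")
print(f"clones screened per novel protein: {result['clones_per_novel']:,.0f}")
```

On these figures, roughly 0.005% of starting clones localized to the nuclear envelope, about 7% of those were novel, and 275,000 clones had to be screened for each novel protein identified.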
When tagging cDNA libraries with GFP, consideration must also be given to the possibility that the reporter masks targeting signals contained within the expressed proteins. Amino-terminal fusions of GFP to target proteins potentially block signal sequences associated with import into mitochondria or the endoplasmic reticulum, for example. Conversely, when using either random DNA fragments or non-full-length cDNAs (of which there are significant numbers in cDNA libraries), the expressed proteins may appear to localize clearly, but the recorded localization may be aberrant, resulting simply from exposing a peptide sequence that is normally hidden in the full-length protein. This was clearly demonstrated in the 'motif-trap method', by which a large number of cryptic mitochondrial targeting signals were isolated - many corresponding to sequences derived from non-coding genomic DNA [10]. In an attempt to circumvent the problem of hidden amino-terminal targeting sequences, in one study [11] cDNAs were cloned from a library containing cDNA fragments upstream of GFP, and a retrovirus-mediated expression system was used to determine the cellular localizations of the encoded fusion products. Although this expression system is highly effective, the authors themselves concede that none of their cDNAs was full-length, and that the interpretation of the localization results is dependent upon the targeting sequences being present in the partial cDNA [11]. Thus, strategies using GFP tagging of whole cDNA or genomic libraries generate significant amounts of redundant or inaccurate data, all of which are time-consuming, and therefore expensive, to eliminate.
Methods are therefore now being devised to focus more rapidly on the specific localizations of interest. For example, one possibility is first to isolate GFP-positive cells from the non-fluorescent cells using fluorescence-activated cell sorting (FACS), which is able to sort thousands of GFP-expressing cells within minutes into individual wells of multiwell plates, and subsequently to clone them. In this way only GFP-expressing cells have to be examined microscopically, which increases the speed of analysis. An improved variant of such an approach was described recently [7] with the aim of identifying proteins localizing to the nucleus. Pichon and co-workers first mildly permeabilized intact cells with detergent, in order to remove cytosolic but not nuclear GFP-fusion proteins, and then sorted the remaining GFP-positive cells using FACS. This resulted in a 70-fold enrichment of cells expressing GFP-fusion proteins in the nucleus compared to cultures that had not been treated and sorted.
Clearly, tagging sequenced full-length cDNAs on an individual basis retains the advantages but overcomes many of the drawbacks of the approaches described above [12]. One advantage is the availability of a large clone resource from genome projects, the cDNA sequences of which can be prescreened for already-known genes or species variants, so that only novel cDNAs need to be GFP-tagged and screened. In addition, different versions of full-length GFP fusions - tagged at either the amino or the carboxyl terminus - can be generated and compared, helping to circumvent the risk of masking targeting sequences. Indeed, as expected, often only one version of a GFP-tagged protein shows proper subcellular localization [13]. Although the tagging of full-length cDNAs is a relatively low-throughput process and is reliant upon the identification of novel cDNAs by other means such as systematic sequencing [14], it has the further clear advantage that no additional cloning is required once an interesting localization has been identified. Until recently, tagging of full-length cDNAs suffered from the problem that conventional restriction-enzyme-based cloning had to be used, which is tedious and virtually impossible to do for any large set of molecules [12]. To overcome this problem, we have recently devised a method that uses a recombination-based cloning system to systematically tag with GFP the open reading frames of full-length cDNAs that have been identified and sequenced by large-scale genome projects [13]. The whole procedure is amenable to automation, and other characterization studies (for example, mutagenesis, protein dynamics and identification of interacting partners) can follow the localization screen immediately without further generation of new reagents or lengthy cloning procedures to identify the full-length cDNAs.