Taking each coding sequence from the human genome in turn and identifying the subcellular localization of the corresponding protein would be a significant contribution to understanding the function of each of these genes and to deciphering functional networks. This article highlights current approaches aimed at achieving this goal.
The spatial and temporal regulation of biochemical reactions in eukaryotic cells is achieved by a high degree of compartmentalization. Each protein is part of a functional biochemical network and all proteins within a particular network are at least once in their lifetime localized close to each other, within (or at) a particular organelle or compartment. This facilitates interactions and yet allows the segregation of different networks. Exchange of information between different organelles, and of proteins between networks, is essential for the proper function of the cell as an entity and is achieved by the active transport of material.
One of the best examples of such an assembly of networks is the secretory pathway. Secretory proteins move sequentially through the distinct membrane-bounded organelles of this pathway, receiving at each step specific enzymatic modifications necessary for their quality control and proper function. The communication and specific transfer of material between membrane organelles is mediated by distinct small membrane-bounded transport carrier vesicles containing a myriad of regulatory proteins. A key feature of any protein functionally involved in the secretory pathway is its permanent or transient localization to one of the appropriate transport carriers or organelles. Extending this concept to the whole cell, the determination of the subcellular localization of a novel protein is one of the essential steps in resolving its function. This includes imaging not only the protein's steady-state distribution but also the changes in localization that can occur in response to environmental conditions, during specific stages of the cell cycle or of cell differentiation. Indeed, changes in localization can also be caused by the breakdown of remote but functionally related organelles and/or cellular structures, such as Golgi fragmentation resulting from microtubule reorganization (see for example Figure 1c,d).
Although studies to follow these dynamic events have been a difficult task in the past, the availability of green fluorescent protein (GFP) and its spectral variants has now facilitated localization experiments particularly aimed at observing protein dynamics in living cells [1,2,3,4]. The cDNA encoding GFP was cloned several years ago and encodes a 27 kDa protein that emits green fluorescence when excited with blue light, without the need for any co-factors. Thus, any cDNA can be fused with the coding sequence of GFP, and the localization of the expressed GFP fusion can be followed in living cells. This unique feature of GFP has led to the development of a number of 'localization screening assays', which can be performed in a systematic 'high-throughput' manner as typically required for large-scale post-genome projects.
Most GFP-based techniques fuse either fragments of genomic libraries or individual clones from cDNA libraries to the coding sequence of GFP, then express the fusions in cells or tissues and determine their subcellular localizations by microscopic inspection. Subsequently, the respective cDNAs or genes are rescued from the cells or tissues, cloned and sequenced. Such strategies have already been conducted on a genome-wide scale in yeast [5,6] and have identified the localization of so-far uncharacterized proteins, or fragments thereof. The GFP-tagged proteins can be immediately followed in living cells by time-lapse microscopy to determine their cellular dynamics, which adds a further level of information to such screens. At least 50% of the cDNAs isolated in this way are already known and well characterized, however [6,7,8,9]. Furthermore, the same cDNA clones are isolated several times in one screen, as the primary criterion for selection is simply localization. These aspects are major disadvantages of such morphological screens and make them inefficient. For example, in an attempt to isolate novel nuclear-envelope proteins, 550,000 starting cDNA clones were required to identify 27 clones localizing to this compartment, of which only two proved to be novel.
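To make the inefficiency of such a screen concrete, the figures quoted for the nuclear-envelope example can be turned into simple yield ratios. The following is a minimal sketch that uses only the three numbers given in the text; the variable names are illustrative and nothing here is taken from the original study's analysis.

```python
# Figures quoted in the text for the nuclear-envelope screen.
clones_screened = 550_000   # starting cDNA clones
clones_localized = 27       # clones localizing to the nuclear envelope
clones_novel = 2            # localizing clones that proved to be novel

# Fraction of all screened clones that showed the desired localization.
hit_rate = clones_localized / clones_screened

# Fraction of localizing clones that were novel.
novel_share = clones_novel / clones_localized

# Number of starting clones needed, on average, per novel protein found.
clones_per_novel = clones_screened / clones_novel

print(f"Hit rate: {hit_rate:.4%} of screened clones")
print(f"Novel share of hits: {novel_share:.1%}")
print(f"Clones screened per novel protein: {clones_per_novel:,.0f}")
```

On these numbers, fewer than one clone in 20,000 shows the desired localization, and 275,000 clones have to be screened for each novel protein recovered.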
When tagging cDNA libraries with GFP, consideration must also be given to the effect of the reporter on masking targeting signals contained within the expressed proteins. Amino-terminal fusions of GFP to target proteins potentially block signal sequences associated with import into mitochondria or the endoplasmic reticulum, for example. Conversely, when using either random DNA fragments or even non-full-length cDNAs (of which there are significant numbers in cDNA libraries), the expressed proteins may appear to localize clearly, but the recorded localization may be aberrant, resulting simply from exposing a peptide sequence normally hidden in the full-length protein. This was clearly demonstrated in the 'motif-trap method' by which a large number of cryptic mitochondrial targeting signals were isolated - many corresponding to sequences derived from non-coding genomic DNA. In an attempt to circumvent the problem of hidden amino-terminal targeting sequences, in one study cDNAs were cloned from a library containing cDNA fragments upstream of GFP, and a retrovirus-mediated expression system was used to determine the cellular localizations of the encoded fusion products. Although this expression system is highly effective, the authors themselves concede that none of their cDNAs was full-length, and that the interpretation of the localization results is dependent upon the targeting sequences being present in the partial cDNA. Thus, strategies using GFP tagging of whole cDNA or genomic libraries generate significant amounts of redundant or inaccurate data, all of which are time-consuming, and therefore expensive, to eliminate.
Methods are therefore now being devised to focus more rapidly and more specifically on the localizations of interest. For example, one possibility is first to isolate GFP-positive cells from the non-fluorescent cells using fluorescence-activated cell sorting (FACS), which is able to sort thousands of GFP-expressing cells within minutes into individual wells of multiwell plates, and subsequently to clone them. In this way only GFP-expressing cells have to be examined microscopically, which increases the speed of analysis. An improved variant of such an approach was described recently with the aim of identifying proteins localizing to the nucleus. Pichon and co-workers first mildly permeabilized intact cells with detergent, in order to remove cytosolic but not nuclear GFP-fusion proteins, and then sorted the remaining GFP-positive cells using FACS. This resulted in a 70-fold enrichment of cells expressing GFP-fusion proteins in the nucleus compared to cultures that had not been treated and sorted.
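The practical effect of such an enrichment on the microscopy workload can be sketched with a short calculation. Only the 70-fold figure comes from the text. The baseline fraction of nuclear-positive cells is a hypothetical value chosen for illustration, and the sketch assumes that 'enrichment' means the ratio of the fraction of nuclear-positive cells after treatment and sorting to the fraction before.

```python
def cells_per_hit(fraction_positive: float) -> float:
    """Average number of cells to inspect to find one positive cell."""
    return 1.0 / fraction_positive

# From the text: enrichment achieved by permeabilization plus FACS.
enrichment_fold = 70

# HYPOTHETICAL baseline: 0.1% of untreated cells show nuclear GFP.
baseline_fraction = 0.001

# Enriched fraction, capped at 1.0 because a fraction cannot exceed 100%.
enriched_fraction = min(1.0, baseline_fraction * enrichment_fold)

before = cells_per_hit(baseline_fraction)
after = cells_per_hit(enriched_fraction)

print(f"Cells inspected per nuclear hit before sorting: {before:.0f}")
print(f"Cells inspected per nuclear hit after sorting:  {after:.1f}")
```

Under this assumed baseline, the number of cells that must be examined per nuclear hit falls from about 1,000 to about 14; with a different baseline the absolute numbers change but the 70-fold reduction does not, as long as the enriched fraction stays below 100%.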
Clearly, tagging sequenced full-length cDNAs on an individual basis retains the advantages but overcomes many drawbacks of the approaches described above [12,13]. One advantage is the availability of a large clone resource from genome projects, the cDNA sequences of which can be prescreened for already-known genes or species variants, so that only novel cDNAs need to be GFP-tagged and screened. In addition, different versions of full-length GFP fusions - tagged at either the amino or the carboxyl terminus - can be generated and compared, helping to circumvent the risk of masking targeting sequences. Indeed, as expected, often only one version of a GFP-tagged protein shows proper subcellular localization. Although the tagging of full-length cDNAs is a relatively low-throughput process and is reliant upon the identification of novel cDNAs by other means such as systematic sequencing, it has the further clear advantage that no additional cloning is required once an interesting localization has been identified. Until recently, tagging of full-length cDNAs suffered from the problem that conventional restriction-enzyme-based cloning had to be used, which is tedious and virtually impossible to do for any large set of molecules. To overcome this problem, we have recently devised a method that uses a recombination-based cloning system to systematically tag with GFP the open reading frames of full-length cDNAs that have been identified and sequenced by large-scale genome projects [13,14]. The whole procedure is amenable to automation, and other characterization studies (for example, mutagenesis, protein dynamics and identification of interacting partners) can follow the localization screen immediately without further generation of new reagents or lengthy cloning procedures to identify the full-length cDNAs.
Several bioinformatic tools have been developed with the aim of predicting protein localization on the basis of sequence features within the respective gene or cDNA. One of the early methods, PSORT [15,16], detects in sequences the signals required for sorting proteins to particular subcellular compartments. Although PSORT is a heavily used program and is widely applicable to different organisms, its overall accuracy - at best, for yeast - is still in the region of 50%. Others have used phylogenetic profiles, more careful use of annotated databases such as the Meta-A evaluation of SWISS-PROT entries, or expression levels as means to tap into the knowledge that can be gained from determining localization. More profitable, perhaps, is to concentrate on specific organelles and the sequence motifs that direct proteins to them. For example, defined signals for directing proteins to mitochondria, the secretory pathway or chloroplasts are now well characterized, and the success rate of prediction can be as high as 90%. Even the correct prediction of cleavage sites for the signal sequences is possible with a success rate of more than 50%. Certainly the speed and cost of these methods are currently unsurpassed. As more genome sequencing projects are completed, more data for comparisons become available, and so the quality of results from screening algorithms based on sequence homologies rises steadily. More databases, which integrate all this information, are therefore being implemented [21,22]. Experimental data gathered for individual genes, and ideally proteins, also feed into such databases, where the information is then accessible to in silico tools. For many novel proteins, however, these tools remain at present suggestive at best, and for these molecules there is still no alternative to actual experimental verification.
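The general idea of detecting a sorting signal from sequence alone can be illustrated with a deliberately simple toy. The sketch below is not the PSORT algorithm or any published predictor: it merely scans the amino-terminal region of a protein for a hydrophobic stretch, a feature typical of secretory signal sequences, using the standard Kyte-Doolittle hydropathy scale. The window size, the length of the scanned region, the threshold and both example sequences are arbitrary assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values (a standard published scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def max_window_hydropathy(seq, window=8, n_terminal=30):
    """Highest mean hydropathy of any window in the amino-terminal region.

    Returns None if the region is shorter than the window.
    Unknown residue codes are scored as 0.0 (an assumption of this sketch).
    """
    region = seq[:n_terminal].upper()
    if len(region) < window:
        return None
    means = [
        sum(KD.get(aa, 0.0) for aa in region[i:i + window]) / window
        for i in range(len(region) - window + 1)
    ]
    return max(means)

def has_hydrophobic_n_terminus(seq, threshold=2.0):
    """Toy flag: True if the amino terminus contains a hydrophobic stretch.

    The threshold of 2.0 is an arbitrary illustrative choice.
    """
    score = max_window_hydropathy(seq)
    return score is not None and score >= threshold

# Both sequences are made up for illustration; they are not real proteins.
hydrophobic_example = "MKRSLLLLALLAVLAGASAQDTEEK"
hydrophilic_example = "MSDEEKRKRDEESDNKRQEDSKGSE"

print(has_hydrophobic_n_terminus(hydrophobic_example))
print(has_hydrophobic_n_terminus(hydrophilic_example))
```

A single-feature rule of this kind flags the first made-up sequence and not the second, but it would misclassify many real proteins, which is consistent with the point made above that predictions for novel proteins still require experimental verification.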
In summary, a protein's localization and its subcellular dynamics are important parameters to know when trying to determine its function. With the availability of GFP and its variants, new in vivo approaches have been made possible, and these have already identified novel proteins in various desired locations. In due course, these techniques will undoubtedly be applied and perfected on a genome-wide scale. Furthermore, the reagents generated during the course of such projects (such as GFP-tagged proteins) are extremely useful for subsequent microscope-based functional studies with different foci - for example, the analysis of a protein's posttranslational modifications or the dynamics of interactions with binding partners in living cells. This will ultimately allow us to identify functional networks of proteins in a morphological context and will greatly contribute to our understanding of whole-cell function.
We thank Jan Ellenberg for useful comments on the manuscript. J.C.S. was in part supported by an EMBO Long Term fellowship. The Wiemann and Pepperkok laboratories are supported by grants from the BMBF, numbers 01KW9987 (German cDNA Consortium), 01KW0012 (to S.W.) and 01KW0013 (to R.P.).