DNA sequencing technology has advanced sharply over the past 15 years [1]. These advances have enabled the sequencing of many large and small genomes, resulting in the completion of over 3,000 prokaryotic genomes (including ~150 archaeal genomes) and nearly 200 eukaryotic genome sequences, including mammalian genomes (http://www.ncbi.nlm.nih.gov/sites/genome). Access to this massive quantity of data has had a strong ripple effect, increasing the demand for new technologies that will enable scientists to study the activities and functions of these gene sequences in a high-throughput manner. Among the numerous discoveries enabled by genome sequence data, one somewhat unanticipated finding is that at least one-third of the open reading frames (ORFs) encoded in genomes have no predicted function based on BLAST analysis [2-4]. Interestingly, the number of genes of unknown function increases linearly as additional genomes are sequenced [5]. One might expect that as more genomes are sequenced, the rate at which novel genes are identified would decrease rapidly. This is clearly not the case, which strongly supports the view that the number of unique gene sequences and functions encoded on our planet is very large. For most microbial species, 10-30% or more of the ORFs encoded in one strain’s genome are novel compared to another strain of the same species. The gene pool of many bacterial species may exceed several tens of thousands of unique genes. It is likely that by the end of this decade, we will have sequenced over 10 million genes of unknown function!
This humbling realization emphasizes the need for substantial improvements in functional genomics if we are to keep pace with the ever-increasing ease with which genes and genomes are sequenced. One documented phenomenon, referred to as non-orthologous gene displacement (NOD), may provide an inroad to tackling the monumental problem of determining the function of uncharacterized genes. NODs represent cases where two proteins perform the same cellular function but do not share an ancestral relationship. Several cases are known, such as eukaryotic and prokaryotic DNA polymerases, which carry out essentially the same cellular functions but do not share a common ancestor. In other words, these functions arose independently during evolution. The vast majority of assigned gene functions are based on BLAST and orthology (conservation of DNA or amino acid sequence). If genes arise independently, they by definition share neither ancestry nor amino acid sequence identity, so homology-based annotation cannot assign their function. Because the scientific research community has developed strategies to assay a wide range of known protein functions over the years, screening novel proteins of unknown function with familiar assay systems may yield a surprising number of experimentally determined gene functions. While NODs may partially explain why we are accumulating more and more genes of unknown function in our databases, we remain largely ignorant of the frequency of NODs in nature.
Massively parallel technologies, such as microfluidics and DNA and protein microarrays, have been developed and present important vehicles that partially enable the large-scale characterization of gene/protein function [6-12]. Our ability to determine the function of genes places strong demands on a variety of disciplines related to recombinant protein technologies. The large-scale characterization of protein function requires very efficient recombinant protein production in a high-throughput environment and the automation necessary to perform high-throughput functional screens [13,14]. Likewise, complementary technologies that broaden the use of recombinant proteins, such as labeling methods and assays for sub-cellular localization, enzymatic activity and substrate specificity, will also need to be developed and advanced if we are to make significant progress.
Among the numerous challenges associated with large-scale functional characterization of proteins is the choice of expression system. Because each of several systems offers some discrete advantage, one would ideally employ many platforms. For practical reasons, however, researchers are forced to make difficult decisions regarding which platform provides the greatest overall utility for the objectives in question. Among the variety of tools being developed that show promise for enabling the functional characterization of proteins, the HaloTag technology developed by scientists at Promega (Madison, WI) is notable [15,16]. Here we provide an overview of the functional assays and experience we have developed in conjunction with the HaloTag technology.
We have used the HaloTag technology for a number of functional studies, including protein microarrays, affinity purification of DNA-protein and protein-protein interactions, and protein complex identification [7,17]. The HaloTag is a modified haloalkane dehalogenase designed to covalently bind a series of chloroalkane derivatives, such as fluorophore-labeled ligands (Promega). We have observed improved solubility of fusion proteins using this system, comparable to that achieved with the best solubilization fusion partner, maltose-binding protein (MBP) [18]. The HaloTag vector (Promega) uses the Flexi cloning system, which relies on traditional restriction site cloning methods. We found this cloning method to be inadequate for high-throughput cloning of genes and have adapted the cloning platform for compatibility with Gateway and Ligation Independent Cloning (LIC) procedures [19-22]. We have used these vectors in a number of studies, including the expression and purification of proteins derived from Influenza virus H1N1, Y. pestis, S. pneumoniae and B. mallei. Genes were expressed using several expression systems, including E. coli, a cell-free (wheat germ) system and mammalian cells. The HaloTag supports the development of functional assays, such as fluorescence polarization, FRET and on-chip purification in protein microarrays, and also allows monitoring of sub-cellular protein localization. The rapid covalent attachment of the HaloTag to its specific ligand is a critical feature that distinguishes the HaloTag from tags that rely on reversible interactions [23]. The high-affinity covalent interaction is extremely rapid, allowing binding reactions to be carried out in minutes. This has proven advantageous in that we observe a dramatic reduction in the background, non-specific binding events that lower assay signal-to-noise ratios [16,24].