The sequences of several million proteins are currently known, and this number is growing ever more rapidly as a result of the relentless efficiency of genomic and metagenomic sequencing projects. Around 30%–40% of these gene products are classified as so-called “hypothetical proteins.” This term is somewhat of a misnomer, but it is the accepted way of indicating that no information is available about them other than the translated nucleotide sequence. It is interesting to note that this group of proteins persists despite years of annotation efforts in the genome sequencing projects. “Hypothetical proteins” are not merely artifacts, and many have been validated as gene products in function-based, genome-scale surveys, such as essentiality analysis, disease association studies, genome-wide DNA expression arrays, and cDNA- and proteomics-based environmental surveys. They are therefore bona fide proteins that simply have not yet been the focus of any detailed study. The importance of such “conserved hypotheticals” has been discussed many times in the literature and proposed as an important subject area for further studies: “experimental characterization of […] ‘conserved hypothetical’ proteins is expected to reveal new, crucial aspects of microbial biology and could also lead to better functional prediction for medically relevant human homologs.” We can expect that most of the yet undiscovered functionality of these families will represent novel chemistry, novel biochemical pathways, alternative solutions to known reactions, or new regulatory mechanisms. The fact that they are usually overlooked or even omitted from many studies may introduce significant biases in “-omics” analyses. Thus, the NIH Protein Structure Initiative (PSI; http://www.nigms.nih.gov/Initiatives/PSI/) has made a concerted and systematic effort to explore these uncharted regions of the protein universe as a means to uncover new insights into the evolution and diversity of protein structure and function.
Protein space can be dissected and organized by grouping proteins into families of homologs, based on inferred evolutionary and functional relationships. Many specialized resources have been developed to provide information on protein families. All of these sources paint a similar picture of the protein universe, with only some quantitative differences that arise from the use of different protocols and definitions of protein families. One of the oldest and best-known of these resources, the PFAM database, in its 23rd release, lists over 10,000 protein families that cover around 70% of an average genome. The number of protein families listed by PFAM and other resources increases over time; for instance, 5 years ago PFAM listed only 5,000 families. Part of this increase can be accounted for by more rigorous analysis of the existing data, but the rapidly increasing number of known protein sequences is the main factor driving the apparent growth in the number of protein families. One of the most interesting questions in biology concerns the implications of this growth: should we expect the number of protein families to grow linearly with the number of known sequences, or does it start to saturate at some point? Results from the analysis of metagenomics open reading frames (ORFs), presented in this journal 2 years ago, seemed to suggest that we are still in the linear phase of growth of the number of protein families, but as we will show here, the picture is different when we look at the higher level of organization of the protein universe.
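The distinction between linear and saturating growth can be made concrete with an accumulation curve, which tracks how many distinct families have been seen after each newly added sequence. The following is only a minimal sketch under stated assumptions: the function names `accumulation_curve` and `discovery_rate` and the toy family labels are hypothetical, invented for illustration, and are not the analysis used in this work.

```python
from typing import Hashable, List, Sequence


def accumulation_curve(labels: Sequence[Hashable]) -> List[int]:
    """Number of distinct families seen after each successive sequence."""
    seen = set()
    curve = []
    for label in labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


def discovery_rate(curve: Sequence[int], window: int) -> float:
    """Average number of new families per sequence over the last `window` sequences."""
    if not 0 < window < len(curve):
        raise ValueError("window must be between 1 and len(curve) - 1")
    return (curve[-1] - curve[-1 - window]) / window


# Toy, made-up family labels in order of "sequencing"; not real PFAM data.
linear_like = ["F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8"]
saturating = ["F1", "F2", "F3", "F1", "F2", "F3", "F1", "F2"]

linear_curve = accumulation_curve(linear_like)
saturating_curve = accumulation_curve(saturating)
linear_rate = discovery_rate(linear_curve, window=4)
saturating_rate = discovery_rate(saturating_curve, window=4)
```

In this toy setting, a late discovery rate that stays close to its early value corresponds to the linear regime, whereas a rate that falls toward zero indicates saturation. Real data would also require controlling for the order in which sequences are added, for example by averaging over random permutations.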
Protein families are most commonly defined by sequence similarity, as it represents the most obvious trace of an evolutionary relationship between proteins. However, as our ability to recognize sequence similarity between proteins has progressed from simple residue-by-residue comparisons measured by mutation matrices to sequence profiles, position-specific mutation matrices, or hidden Markov models (HMMs), and then to comparisons between such profiles or between HMMs, it has become eminently clear that statistically significant sequence similarity between proteins may extend far beyond the intuitive definition based on sequence identity. Such a realization correlates well with our understanding of molecular evolution, which often obliterates easily recognizable sequence similarity among genes that diverged a long time ago, but leaves behind traces of statistically significant patterns of conserved residues that are apparent only when multiple, related sequences are aligned. To reflect the concept of different degrees of divergence between genes, proteins are often subjected to multilevel classification, with the term “family” reserved for groups of proteins related by short evolutionary distances that still retain traces of similarity in their primary sequences. Families can, in turn, be organized into higher-level groups that are linked by more far-reaching relationships. For instance, in PFAM such groups are called “clans,” whereas “superfamily” is often used in other resources. We can expect that further development of even more sensitive algorithms for recognition of distant homologs will expand the list of clans or equivalent groupings in other classification systems. The growth of the protein universe can then be investigated on the level of individual proteins, protein families, or clans/superfamilies, and we can expect qualitatively different answers on each level.
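To illustrate why the three levels can give different answers, the sketch below counts the same toy dataset at the level of proteins, families, and clans. It is a hedged illustration only: the function `count_levels`, the mapping names, and all identifiers (`P1`, `FamA`, `Clan1`, and so on) are hypothetical and do not correspond to real PFAM entries. It also assumes that a family with no clan assignment is counted as its own top-level group.

```python
from typing import Dict


def count_levels(protein_to_family: Dict[str, str],
                 family_to_clan: Dict[str, str]) -> Dict[str, int]:
    """Count distinct groups at the protein, family, and clan levels."""
    families = set(protein_to_family.values())
    # A family with no clan assignment is counted as its own top-level group.
    top_level = {
        ("clan", family_to_clan[f]) if f in family_to_clan else ("family", f)
        for f in families
    }
    return {
        "proteins": len(protein_to_family),
        "families": len(families),
        "clans_or_unassigned_families": len(top_level),
    }


# Hypothetical toy assignments; identifiers are invented for illustration.
protein_to_family = {
    "P1": "FamA", "P2": "FamA", "P3": "FamB",
    "P4": "FamC", "P5": "FamD", "P6": "FamD",
}
family_to_clan = {"FamA": "Clan1", "FamB": "Clan1", "FamC": "Clan1"}

counts = count_levels(protein_to_family, family_to_clan)
```

Here six proteins fall into four families but only two top-level groups, so adding a new family that belongs to an existing clan increases the family count without changing the count at the clan level.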
“Hypothetical proteins” can also be grouped into families, and the latest release of PFAM contains 2,156 families annotated as domains of unknown function (DUF), with 91 further families listed as Uncharacterized Protein Families (n.b. since 95% of families of unknown function in PFAM are called DUFs, from here on we will use the term “DUF” to denote both DUF and Uncharacterized Protein Families). Classifying DUF families into superfamilies and clans is more problematic, as such classification often depends on additional information, such as three-dimensional structures and/or protein function, and such information is obviously not available.
Structural genomics, represented in the United States by the NIH NIGMS PSI (http://www.nigms.nih.gov/Initiatives/PSI), has pioneered a novel approach to structural biology that is highly complementary to strategies pursued in individual structural biology labs. Instead of focusing on individual proteins, US structural genomics and, specifically, the four large-scale production centers of the PSI have focused their attention on substantially increasing structural coverage of protein space. DUF families have thus become natural targets, as such families cover a significant fraction of the unexplored protein universe. In contrast, “classical” structural biology efforts are mainly focused on well-characterized systems, leaving the majority of protein families outside of their sphere of interest, including, by default, almost all DUF families.
Here, we investigate structures of representatives of DUF families determined by the PSI as a means to gain insights into the yet unexplored regions of protein space. While perhaps not as statistically rigorous a sampling as will eventually be possible, the substantial size of the sample (~250 protein families) offers a rare opportunity to make some general observations and conclusions and to predict trends and features of the uncharted regions of the protein universe. In particular, we are now able to determine the distribution of the folds in these families and deduce the evolutionary relationships of many DUF families to previously characterized families. For many of these families, determination of their three-dimensional structures offers the first hypotheses about their function and represents a powerful approach to initiate and promote studies for experimental verification of the biological function of these unexplored and underappreciated regions of the protein universe.