“Life” is a particular state of matter, and matter is composed of various molecules. The state corresponding to “life” is ultimately determined by the genome sequence, and this sequence determines the conditions necessary for survival of the organism. In order to elucidate one parameter characterizing the state of “life”, we analyzed the amino acid sequences encoded in the total genomes of 557 prokaryotes and 40 eukaryotes using a membrane protein prediction online tool called SOSUI. SOSUI uses only the physical parameters of the encoded amino acid sequences to make its predictions. The ratio of membrane proteins in a genome predicted by the SOSUI online tool was around 23% for all genomes, indicating that this parameter is controlled by some mechanism in cells. In order to identify the property of genome DNA sequences that is the possible cause of the constant ratio of membrane proteins, we analyzed the nucleotide compositions at codon positions and observed the existence of systematic biases distinct from those expected based on random distribution. We hypothesize that the constant ratio of membrane proteins is the result of random mutations restricted by the systematic biases inherent to nucleotide codon composition. A new approach to the biological sciences based on the holistic analysis of whole genomes is discussed in order to elucidate the principles underlying “life” at the biological system level.
Biological organisms exhibit two contrasting characteristics: on the one hand, they show an enormous diversity, yet on the other hand, the state of “life” is common to all organisms. Since biological systems are ultimately dependent on the genome DNA sequence of the organism, biological diversity can be understood as the result of many mutations in genomes. In contrast, it is difficult to understand the common state of “life”, because relatively few homologous genes throughout the biological kingdom have been identified by sequence homology searches. New methods of analyses, such as analysis of the physical properties of sequences, are necessary for elucidating parameters that characterize the state of “life” in genome sequences. “Big data” accumulated from thousands of genome sequences are now available [1,2] and can be probed using diverse methods to solve the open question about the common state of “life”: namely, what parameters characterize “life”?
The above question is somewhat obscure, since “life” is not well defined. So, we will first clarify the question on the basis of our current understanding of biology. Figure 1 represents a flow chart of a biological organism. Organisms comprise four layers: a genome DNA sequence (a blueprint), amino acid sequences (materials), proteins (functional units), and the organism itself (a system). These layers are connected by four processes: the biosynthesis of amino acid sequences from DNA sequences; the folding of amino acid sequences resulting in proteins, which are the functional units of the system; the actual system, formed by an appropriate combination of proteins; and the redrawing of the genome sequence due to many mutations. Here, we assume that “life” is the product of the four layers and processes of a biological organism shown in Figure 1. We point out that “life” is frequently discussed in connection with the human mind, but the mind cannot be defined at the molecular level. Therefore, we confine the state of “life” to the cycle shown in Figure 1.
The open question about “life” can now be defined clearly. The principle of the first process in Figure 1, biosynthesis, is well known as the central dogma of biology, together with the genetic code, and these are universal to all biological organisms. However, the principles underlying the other processes remain unknown. Therefore, in order to answer the open question about “life”, we must elucidate the principles of the other three processes: the folding of proteins, the formation of biological systems, and the mutations required to redraw a genome. Because it is very difficult to elucidate all three principles concurrently, we rephrase the questions regarding these principles in terms of the biological hierarchy, as shown in Figure 2. Biological systems are deeply hierarchical, from fundamental structures such as α-helices and β-sheets to the ecological system in its entirety. The principles underlying higher-order structures such as a biological organism are too complicated to elucidate directly, because organisms are composed of many diverse molecules that cannot be dealt with by theoretical and computational methods. It is therefore reasonable to begin physical research on the formation of biological structures and systems from the lower end of the hierarchy, namely, the structure of proteins. Even there, however, the principles underlying protein structure have not yet been established.
However, as shown in Figure 2, the complicated hierarchical structure in the right column corresponds to the simple nesting of sequences in the left column, and this nesting provides the blueprint for biological systems. Here, we divide the question regarding the parameters characterizing “life” into two more tractable questions: the question at the molecular level, “How are protein structures encoded by sequences?”, and the question at the system level, “How are biological systems encoded by sequences?” In this work, we focus on the latter question and show that physical analyses of genome sequences are very useful for identifying the underlying parameter(s) responsible for the state of “life”.
An analogy to statistical mechanics in material science is useful for developing a new method of analysis of total genome sequences. The essence of statistical mechanics is that detailed properties of every molecule are not necessary for determining the macroscopic (thermodynamic) properties of a material, and that the distribution of molecules is sufficient for calculating the macroscopic properties of a material. This analogy raises the question of whether any macroscopic properties of a biological system can be determined by the distribution of biological units such as proteins without knowing the details of every protein molecule. If the answer is “yes”, the comprehension of the mechanism of biological systems will become much easier than the current systems biology, which tries to construct a database of all pathways connecting all proteins and substrates. The difference between the current systems biology and the approach based on the distribution of biological units is manifested by the two arrows in Figure 2. The upper arrow represents the structure formation of individual proteins from sequences of genes, such that a biological system is constructed according to the network of individual proteins. In contrast, the lower arrow assumes that there is some order in the total genome sequence that determines the framework of a biological system, namely the ordered distribution of proteins. After that, the network of individual molecules is formed on the basis of the framework.
However, deeper insight into the distribution of all biological units such as proteins is necessary for ascertaining the distribution of all proteins (including orphan proteins) from a total genome sequence. Here, we point out four important aspects of the distribution analyses of proteins and genome DNA sequences.
Based on the above considerations, we carried out extensive analyses of total genomes at the level of proteins as well as that of nucleotide composition.
We previously developed an online software tool called SOSUI for predicting membrane proteins from amino acid sequences with an accuracy of better than 95%. This tool is suitable for the analysis of genome sequences because it uses only the physical properties of amino acids: the indices of hydrophobicity and amphiphilicity [4,5]. An accurate and robust method is required to characterize arbitrary amino acid sequences, and SOSUI meets this requirement because it relies only on simple physicochemical parameters and does not depend on empirical parameters specific to sequences from certain kinds of organisms.
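To illustrate what a prediction based only on physical properties of amino acids can look like, the following is a minimal sketch of a sliding-window hydropathy scan. It is not the SOSUI algorithm: it uses the published Kyte-Doolittle hydropathy scale, ignores amphiphilicity altogether, and the window length of 19 residues and threshold of 1.6 are the commonly quoted Kyte-Doolittle rule of thumb rather than SOSUI parameters. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index (a published scale; NOT SOSUI's own parameters).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Mean hydropathy of every full window along an amino acid sequence.

    Returns an empty list when the sequence is shorter than the window.
    Raises KeyError for residues outside the 20 standard amino acids.
    """
    values = [KD[aa] for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def has_hydrophobic_segment(seq, window=19, threshold=1.6):
    """True if any window reaches the assumed hydropathy threshold."""
    return any(h >= threshold for h in window_hydropathy(seq, window))
```

A sequence containing a long hydrophobic stretch, such as `"L" * 25`, is flagged by `has_hydrophobic_segment`, whereas a highly charged sequence such as `"D" * 25` is not.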
We analyzed the putative amino acid sequences of all genes from 557 prokaryotic genomes and 40 eukaryotic genomes (Supplementary Material) using the SOSUI online tool. Figure 3 shows plots of the ratio of the number of membrane proteins compared to the number of all genes. The ratio of membrane proteins is essentially constant for all biological organisms: the average ratio (arithmetic mean) was 0.228 for prokaryotes with a standard deviation of 0.029, and the respective values for eukaryotes were 0.240 and 0.036. Furthermore, the genomic data set included many extremophilic organisms, yet their ratio of membrane proteins was the same as that of other organisms. Therefore, the constant ratio of membrane proteins appears to be a universal parameter for all organisms.
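The statistic reported above can be sketched as follows, assuming that each genome has already been reduced to one boolean prediction per gene (membrane protein or not). The toy data and function names are hypothetical and do not come from the study, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def membrane_ratio(predictions):
    """Fraction of genes predicted to encode membrane proteins.

    predictions: iterable of booleans, one per gene in a genome.
    """
    flags = list(predictions)
    if not flags:
        raise ValueError("genome has no predicted genes")
    return sum(flags) / len(flags)


def summarize(ratios):
    """Arithmetic mean and sample standard deviation across genomes."""
    return mean(ratios), stdev(ratios)


# Hypothetical toy predictions for three genomes (not real data).
toy_genomes = {
    "genome_A": [True, False, False, False],
    "genome_B": [True, True, False, False, False, False, False, False],
    "genome_C": [False, False, True, False, False],
}

ratios = [membrane_ratio(p) for p in toy_genomes.values()]
avg, sd = summarize(ratios)
print(f"mean ratio = {avg:.3f}, standard deviation = {sd:.3f}")
```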
The universality of the constant ratio of membrane proteins suggests that this parameter is not determined by accident but instead is due to the regulation of mutations in the genome sequences. In order to reveal the mechanism of this regulation, we analyzed the nucleotide compositions at the first, second, and third positions of codons, as well as the average for all positions; the resulting data are shown in Figure 4. We found that the nucleotide compositions at each letter position of the codons showed very large biases from the compositions expected from completely random mutations. Since the nucleotide compositions are the result of the accumulation of mutations, the large and systematic biases of the compositions strongly suggested that the mutations occurring in genome sequences are regulated by some cellular mechanism.
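A minimal sketch of this kind of analysis is given below. It tallies the nucleotide composition at each codon position of an in-frame coding sequence and compares it with a simple null model. The null model is an assumption about what "completely random" means here: for a given GC content g, G and C are each expected at g/2, and A and T at (1 - g)/2, at every codon position. All function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def codon_position_composition(cds):
    """Nucleotide fractions at codon positions 1, 2 and 3 of an in-frame CDS.

    Codons containing characters other than A, C, G, T are skipped, and a
    trailing partial codon is ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    counts = [Counter(), Counter(), Counter()]
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if set(codon) <= set(BASES):
            for pos, base in enumerate(codon):
                counts[pos][base] += 1
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({b: (c[b] / total if total else 0.0) for b in BASES})
    return result


def gc_content(cds):
    """Fraction of G and C among the A, C, G, T characters of a sequence."""
    seq = [b for b in cds.upper() if b in BASES]
    return sum(b in "GC" for b in seq) / len(seq) if seq else 0.0


def expected_random_composition(gc):
    """Assumed null model: composition depends only on the GC content."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


comp = codon_position_composition("ATGGCTGGTTAA")
```

For the toy sequence `ATGGCTGGTTAA`, guanine makes up half of the first codon positions, while the second positions contain each base once. Plotting such per-position fractions against `gc_content` for many genomes, alongside `expected_random_composition`, would give a comparison of the kind described for Figure 4.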
The average nucleotide compositions across the three codon positions are almost the same as the compositions expected from completely random mutations, but the biases at the individual codon positions show very different tendencies from one another. From the viewpoint of comparative genomics, two features of the graphs in Figure 4 are important. First, the values for all biological organisms fall close to particular lines, and the scattering of the data is small. Second, the tendencies of the nucleotide biases are correlated with the physical properties of the amino acid groups at the codon positions. The compositions of guanine (diamond in the upper panel of Fig. 4A) and thymine (triangle in the lower panel of Fig. 4A) at the first codon position are much higher and lower, respectively, than the compositions expected from completely random mutations. This observation suggests that small amino acids are more abundant, making polypeptides statistically more flexible, whereas aromatic amino acids (which make polypeptides rigid) are rare. The GC content dependence of the compositions at the second position shows much smaller slopes than those expected from completely random mutations. This indicates that the hydrophobicity and the frequency of charged amino acids are largely independent of the GC content.
We hypothesized that the constant ratio of membrane proteins is closely related to the characteristic biases in the nucleotide compositions at the codon positions. It is known that the groups of amino acids sharing the same nucleotide at the second codon position have similar hydrophobicities. In particular, the amino acids in the group with thymine at the second codon position are all very hydrophobic. In other words, the average hydrophobicity of amino acid segments, which determines the positions of transmembrane segments, can be controlled by the nucleotides at the second codon position of the encoding gene. Therefore, the small slopes of the second-position nucleotide compositions against the GC content are consistent with the universally constant ratio of membrane proteins. If the composition of thymine at the second position fluctuated, the composition of hydrophobic residues in proteins would fluctuate, and the proportion of membrane proteins could change.
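The relationship between the second codon position and hydrophobicity can be checked directly from the standard genetic code. The sketch below groups amino acids by the second base of their codons and averages a hydropathy index over the sense codons in each group. The Kyte-Doolittle scale is used here as an assumed stand-in, since the text does not specify a hydrophobicity index for this argument, and the function and variable names are hypothetical.

```python
# Standard genetic code in the conventional TCAG ordering ("*" marks stop codons).
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Kyte-Doolittle hydropathy index (assumed scale, chosen for illustration).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def residues_per_codon_by_second_base():
    """For each second base, the residue encoded by every sense codon."""
    return {
        b: [aa for codon, aa in CODON_TABLE.items() if codon[1] == b and aa != "*"]
        for b in BASES
    }


per_codon = residues_per_codon_by_second_base()
groups = {b: set(residues) for b, residues in per_codon.items()}
means = {b: sum(KD[aa] for aa in residues) / len(residues)
         for b, residues in per_codon.items()}

for b in BASES:
    print(b, sorted(groups[b]), round(means[b], 2))
```

Under these assumptions, the thymine group contains only Phe, Leu, Ile, Met, and Val, all of which have positive hydropathy values, and its codon-averaged hydropathy is the highest of the four groups.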
Previously, we reported computer simulations of extensive single-nucleotide mutations in prokaryotic genome sequences, in which the ratio of membrane proteins was evaluated using the SOSUI system [6,7]. The results of that work indicated that the ratio of membrane proteins is preserved when the characteristic biases of the nucleotide compositions are assumed. In contrast, the ratio of membrane proteins shows much larger variance (scattering) when completely random mutations are assumed. The present analyses of amino acid and DNA sequences for both prokaryotes and eukaryotes strongly suggest that a universal mechanism producing the systematic biases of the nucleotide compositions keeps the ratio of membrane proteins constant.
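The simulation procedure of [6,7] is not described in this text, so the following is only a hypothetical sketch of one way to contrast completely random point mutations with mutations constrained by codon-position-specific compositions. Each mutation picks a random site and redraws its base from a target composition for that codon position. The function name, the redraw rule, and the example compositions are all assumptions made for illustration.

```python
import random

BASES = ["A", "C", "G", "T"]

# Completely random mutations: every base equally likely at every position.
UNIFORM = [{b: 0.25 for b in BASES} for _ in range(3)]


def mutate(cds, n_mutations, composition_by_position, rng):
    """Apply single-nucleotide substitutions to an in-frame coding sequence.

    composition_by_position: three dicts (codon positions 1, 2, 3) mapping
    each base to the probability of drawing it as the replacement.
    A draw may return the base already present, leaving the site unchanged.
    """
    seq = list(cds.upper())
    for _ in range(n_mutations):
        i = rng.randrange(len(seq))
        comp = composition_by_position[i % 3]
        weights = [comp[b] for b in BASES]
        seq[i] = rng.choices(BASES, weights=weights, k=1)[0]
    return "".join(seq)


# Hypothetical biased model: only the second codon position departs from
# uniform (values invented for illustration, not measured).
BIASED = [
    {b: 0.25 for b in BASES},
    {"A": 0.30, "C": 0.22, "G": 0.18, "T": 0.30},
    {b: 0.25 for b in BASES},
]

rng = random.Random(0)  # fixed seed so that runs are repeatable
random_variant = mutate("ATGGCTGGTTAA", 100, UNIFORM, rng)
biased_variant = mutate("ATGGCTGGTTAA", 100, BIASED, rng)
```

In a fuller experiment, each mutated genome would be translated and passed to a membrane protein predictor, and the spread of the resulting membrane protein ratios would be compared between the two mutation models.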
Biological systems have hierarchical structures with many classes, and all classes of the hierarchy are embodied in each genomic sequence, as shown in Figure 2. Not only the hierarchical structures of proteins, but also the harmonious biological system of cells or biological bodies, are embodied in each genomic sequence. The lowest class in the biological hierarchy comprises the secondary structure elements of proteins (e.g., the α-helix and β-sheet motifs); the average size of these elements is much smaller than that of a whole protein. The easiest problem to solve in structure determination of biological systems is considered to be the folding of the tertiary structure of a protein from combinations of its secondary structure units. Hence, the prediction of secondary structure and the modeling of tertiary structure have been topics of investigation for over half a century, yet unambiguous physical solutions remain elusive. Therefore, it is sensible to consider that order in the higher classes of biological systems cannot be elucidated without knowing the principle underlying the lower classes of the biological hierarchy.
The constant ratio of membrane proteins suggests, however, that order in the top class of the biological system is directly regulated by genome sequences. Biological science is currently based on reductionism, in which understanding of the higher class of hierarchy is derived based on knowledge of the lower class. However, the constant ratio of membrane proteins could be directly obtained by a physical approach to the total genome without complete information about the lower classes of biological hierarchy. ‘What parameters characterize “life”?’ is one of the biggest questions in biological science. We do not claim that the present work provides a complete answer to this question. However, we propose that the holistic and physical approach to biological big data described in this work provides a novel route toward determining the answer to this question.
“Life” is a particular state of matter, and matter consists of diverse molecules. The state corresponding to “life” is ultimately determined by the genome sequence. In order to elucidate a parameter characterizing the state of “life”, we analyzed all the amino acid sequences from the genomes of 557 prokaryotes and 40 eukaryotes using a membrane protein prediction online tool called SOSUI. SOSUI uses only the physical parameters of the encoded amino acid sequences to make its predictions. The ratio of membrane proteins in a genome identified by the SOSUI online tool was around 23% for all genomes, indicating that this ratio constitutes one parameter of “life”.
The authors wish to acknowledge Dr. Kei Yura for the information about the preparation of a Special Issue saluting the late Prof. Nobuhiko Saito.
Conflicts of Interest
All authors declare that they have no conflict of interest.
S. M. directed the entire study and wrote the manuscript. R. S. performed the analysis of genome sequences.