Evolution from Primitive Life to Homo sapiens Based on Visible Genome Structures: The Amino Acid World

It is not too much to say that molecular biology, including genome research, has progressed through the determination of nucleotide and amino acid sequences. However, these approaches are limited to the analysis of relatively small numbers of the same genes among species. On the other hand, by graphically presenting the ratios of the number of each amino acid to the total number of amino acids presumed from the target gene(s) or genome, or the ratios of the number of each nucleotide to the total number of nucleotides calculated from the target gene(s) or genome, we can readily draw conclusions from extremely large data sets integrated by human intelligence. 1) In a simulation analysis that assumed polymerization of amino acids or nucleotides by random choice, proteins were formed by simple amino acid polymerization, while nucleotide polymerization to form nucleic acids encoding specific proteins needed certain specific control. These results suggest that protein formation chronologically preceded codon formation during the establishment of primitive life forms. In the prebiotic phase, amino acid composition was a dominant factor that determined protein characteristics: the “Amino Acid World”. 2) The genome is constructed homogeneously from putative small units displaying similar codon usages and coding for similar amino acid compositions; the unit is a gene assembly encoding 3,000-7,000 amino acid residues, and this unit size is independent not only of genome size, but also of species. 3) In codon evolution, all nucleotide alterations are correlated, not only in coding regions, but also in non-coding regions; the correlations can be expressed by linear formulas, y = ax + b, where “y” and “x” represent nucleotide contents, and “a” and “b” are constants. 4) The basic pattern of cellular amino acid compositions obtained from whole cell lysates is conserved from bacteria to Homo sapiens, and resembles that calculated from complete genomes. This basic pattern is characterized by a “star-shape” that changes slightly among species, and changes in amino acid composition seem to reflect biological evolution. 5) Organisms can essentially be classified according to two codon patterns. Biological evolution due to nucleotide substitutions can be expressed by simple linear formulas based on mathematical principles, while natural selection must affect species preservation after nucleotide alterations. Therefore, although Darwin’s natural selection is not directly involved in producing nucleotide alterations, it clearly contributes to selecting among them. Thus, Darwin’s natural selection is doubtless an important factor in biological evolution.


INTRODUCTION
It is well known that Alfred R. Wallace's theory based on the geographical distribution of animal species, represented by the Wallace line, and the voyage on HMS Beagle, contributed to the development of Darwin's theory.
Molecular biology has progressed with the purification of proteins and the cloning of the genes encoding them, accompanied by the sequencing of nucleotides and amino acid residues to understand complicated metabolic pathways. Therefore, the contributions to molecular biology of Frederick Sanger, who developed methods of amino acid [1,2] and nucleotide [3] sequence analysis, and of Allan Maxam and Walter Gilbert, who also developed a method of nucleotide sequence analysis [4], are inestimable. An approach using nucleotide sequences has the merit of excluding standard errors. Changes in nucleotide or amino acid sequences in a single gene have been applied to evolutionary research based on the assumption that amino acid sequence changes are linked to biological evolution, the so-called "molecular clock" [5]. In general, it is possible to compare sequences among the same kinds of genes or proteins, but it is hard to compare different kinds of genes or their products. Thus, the approach using nucleotide sequences seems unsuitable for genome research, which handles genomes consisting of different kinds and numbers of genes among species. On the other hand, focusing on constitutional differences in proteins, the ratios of the number of each amino acid to the total number of amino acids presumed from the target gene(s) or genome, and the ratios of the number of each nucleotide to the total number of nucleotides in the target gene(s) or genome, can be used to compare not only the same kinds of genes, but also different kinds of genes and different genomes. Ratios based on amino acid or nucleotide sequences can exclude deviations, and the combined distributions of the 20 amino acids or the four nucleotides can characterize genomes comprising a huge amount of data. Therefore, these ratios are a useful tool for genome research, which handles enormous data sets. In addition, using certain graphical presentations, huge data sets on genomes can be easily recognized as simple patterns representing complicated organisms.
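The ratio described above can be illustrated with a minimal sketch. It assumes protein sequences are given in the one-letter amino acid code; the function names composition and pooled_composition are hypothetical and are not taken from the cited work.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the ratio of each amino acid to the total number of residues."""
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acid residues")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def pooled_composition(sequences):
    """Composition of a gene assembly: pool all residues, then take ratios."""
    return composition("".join(sequences))
```

Because each value is a ratio to the total, proteins, gene assemblies and genomes of different sizes end up on the same scale, which is what allows them to be overlaid on one radar chart.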
Graphic representation or a diagram approach to the study of complicated biological systems can provide an intuitive picture and useful insights. The historic puzzle of Chargaff's second parity rule in molecular biology has recently been solved using a simple graphic DNA model [6]. Various graphical approaches have been successfully used, for example, to study codon usage [7][8][9][10][11][12], enzyme-catalyzed systems [13][14][15][16][17][18], and HIV reverse transcriptase inhibition mechanisms [19,20]. Graphical approaches have also been used recently to represent DNA sequences [21].

Biological Evolution Based on Cellular Amino Acid Compositions
Microorganism fossils have been found in 2,500-2,800 million-year-old rocks [22][23][24]. Evidence for the existence of microorganisms in ancient rocks indicates that these microorganisms were close to primitive life forms on earth. Australopithecus afarensis, a forebear of Homo sapiens, is thought to have appeared about 4 million years ago in Africa, based on the fossil record [25]; this record, together with the existence of many extinct species such as dinosaurs, strongly supports Darwin's theory.
The scientific discovery that explained hereditary characteristics, namely the double helix structure of DNA, was made by James D. Watson and Francis Crick [26]. The pairing of A with T and of G with C in the double helix underlies heredity through the replication and transcription systems. Through the transcription system, in which U is used instead of T in RNA, cellular proteins are the products of DNA, including the various genes that are responsible for genetic characteristics. Thus, cellular proteins naturally reflect genetic characteristics, even though the amount of each protein may differ. Cellular amino acid analysis was first carried out in bacteria by Noboru Sueoka [27]. Then, my group investigated the cellular amino acid composition not only of bacteria, but also of archaea and eukaryotes, and found by graphical presentation of data on radar charts that the basic pattern of cellular amino acid compositions is conserved from bacteria to mammalian cells [28]. This basic pattern, the "star-shape", is formed with high concentrations of Asp, Glu, Gly, Ala, Val, Ile, Leu and Lys, and with low concentrations of Ser, His, Arg, Pro, Tyr, Met, Cys and Phe (Figure 1). In archaea [29] and plants [30], similar basic patterns of cellular amino acid compositions are obtained. The fact that the basic pattern, the "star-shape", is conserved from bacteria to Homo sapiens suggests that the pattern is extremely important for organisms on earth. The content of each amino acid changes slightly while the basic pattern is conserved, and these minor changes seem to reflect biological evolution. Intracellular free amino acid compositions also show species-specific patterns [31].
Figure 1. Cellular amino acid compositions on radar charts. Values are expressed as percentages of total amino acids and are the means of 3 or 4 independent experiments. Gln and Asn were incorporated into Glu and Asp, respectively, because the former two are converted to the latter two during acidic hydrolysis (Sorimachi 1999). In addition, Trp was omitted because of its extensive decomposition during acidic hydrolysis.
Whole cell lysates consist of many different proteins in differing quantities, yet their overall amino acid compositions are similar among various organisms; however, species differences are observed. It would be quite interesting to evaluate whether this "star-shape" is conserved on other planets, if life is found on any of them in the future.

Primitive Life Formation
Based on the principles of molecular biology, the parental genetic information is transferred to daughter cells by the replication system. The fact that the basic pattern of cellular amino acid composition appears to be conserved from bacteria to Homo sapiens suggests that the presumed amino acid composition of primitive life forms might resemble the cellular amino acid composition obtained from modern organisms, because the original pattern could have been maintained by the replication system after codon establishment.

Chronological Precedence of Protein Formation over Codon Formation
We can easily understand that proteins are translated from codons within genes in modern organisms. However, it is unclear whether codon formation really preceded protein formation. Although there have been several reports explaining the mechanisms of codon formation [32][33][34], no single theory has become established. At present, we cannot experimentally make life in the laboratory, because there are too many unknown factors. On the other hand, computational analysis is well suited to problems that cannot be solved experimentally. On the basis of molecular biological research, we cannot deny that codons are linked to the determination of the amino acid residues in proteins.
Assuming that a structure can sometimes reveal its formation process, it is possible to investigate the relationship between protein and codon formation based on the amino acid compositions presumed from codon usages. Before the well-known codon-dependent pathway of protein synthesis was established, protein formation occurred via the polymerization of amino acids, the monomers of proteins. Indeed, amino acid polymerization has been shown to occur by heat, without enzymes, in clay [35]. Proteins can be synthesized computationally by selecting amino acids in a random order from an amino acid pool presumed from a protein. When more than 300 amino acid residues are chosen at random, the amino acid composition resembles that of the original protein, and even the first 100 residues chosen give a composition that is similar, although less so (Figure 2). On the other hand, the amino acid composition presumed from more than 900 randomly selected nucleotides, equivalent to 300 amino acid residues, does not show the same pattern of amino acid composition. The amino acid composition based on fewer than 300 nucleotides also fails to show the specific pattern. These results clearly indicate that mere polymerization of nucleotides, modeled by a random choice of nucleotides, cannot produce a specific protein. Ultimately, the amino acid compositions of proteins obtained from freely polymerized nucleotides depend on both the concentrations of the four nucleotides and the genetic code, and proteins with specific amino acid compositions cannot be obtained from nucleic acids formed by free nucleotide polymerization (Figure 2). When codon conversion is neglected, the nucleotide composition of polynucleotides can be expressed by a simple quadrangle based on the concentrations of the four nucleotides on radar charts. A consistent result was obtained when various genes were analyzed [36]. In a gene encoding 5,005 amino acid residues, the amino acid compositions of small segments encoding 100 amino acid residues resemble that of the complete gene, and the gene is constructed homogeneously from putative small units encoding similar amino acid compositions [36]. This result, based on gene segments, is consistent with that based on selecting amino acids or nucleotides in a random order. Thus, the initial codon formation was probably controlled by certain factors so as to form specific proteins. In contrast, protein formation could occur via simple polymerization of free amino acids without codons.
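The random-choice comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the simulation of the cited work: the toy protein, the one-codon-per-residue toy gene, sampling with replacement, and the sum of absolute differences used as a distance are all hypothetical choices made here. Only the standard genetic code table is factual.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA_BY_CODON)}

# Hypothetical toy data: a 500-residue protein and one arbitrary codon per residue.
TOY_PROTEIN = ("A" * 40 + "L" * 30 + "G" * 20 + "K" * 10) * 5
TOY_CODONS = {"A": "GCT", "L": "CTG", "G": "GGC", "K": "AAA"}
TOY_GENE = "".join(TOY_CODONS[aa] for aa in TOY_PROTEIN)

def composition(residues):
    """Ratio of each residue type to the total number of residues."""
    counts = Counter(residues)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def distance(comp_a, comp_b):
    """Sum of absolute differences between two compositions (0 = identical)."""
    keys = set(comp_a) | set(comp_b)
    return sum(abs(comp_a.get(k, 0.0) - comp_b.get(k, 0.0)) for k in keys)

def simulate(n_residues, seed=0):
    """Return (distance via amino acid choice, distance via nucleotide choice)."""
    rng = random.Random(seed)
    target = composition(TOY_PROTEIN)
    # Route 1: polymerize amino acids drawn at random from the protein's pool.
    by_amino_acid = rng.choices(TOY_PROTEIN, k=n_residues)
    # Route 2: polymerize nucleotides drawn at random from the gene's pool,
    # then translate the resulting codons (stop codons are skipped).
    bases = rng.choices(TOY_GENE, k=3 * n_residues)
    codons = ("".join(bases[i:i + 3]) for i in range(0, len(bases), 3))
    by_nucleotide = [CODON_TABLE[c] for c in codons if CODON_TABLE[c] != "*"]
    return (distance(composition(by_amino_acid), target),
            distance(composition(by_nucleotide), target))
```

Calling simulate(300) returns two distances. With this toy input the amino acid route should stay close to the original composition, whereas the nucleotide route yields many residues that the toy protein never contained, because random triplets are decoded through the whole genetic code.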

A Hypothesis Based on Simulation Analysis
Although it is difficult for us to envisage an inverse mechanism in which the information within polypeptides is transferred to nucleotide polymerization, this is the mathematical conclusion of a simple simulation analysis using random choice, which assumes free polymerization of amino acids or nucleotides. In Miller's experiments, which simulated the assumed atmosphere of primitive Earth, certain amino acids were formed by electrical discharges [37]. Amino acids have also been identified in meteorites [38,39]. Thus, proteins might have been formed even without codons in prebiotic states, and polynucleotides, including codons, might then have been formed under conditions that enabled the transfer of protein information.
Based on this assumption, primitive life forms might have consisted of proteins reflecting the concentrations of free amino acids that existed on primitive Earth. The concentrations of amino acids would have been controlled by various factors, such as gamma rays, UV light and heat, in a manner analogous to natural selection. These effects must have induced homogeneous amino acid concentrations and, eventually, the proteins formed must have had similar amino acid compositions. Indeed, considering the concentrations of each amino acid in cells, the concentrations of those with an aromatic ring in their side chains, Tyr, Phe and His, are comparatively very low (Figure 1); UV light induces photo-decomposition of organic compounds. For example, the thyroid hormone thyroxine, an amino acid derivative having two benzene rings in its structure, is easily decomposed by UV light irradiation [40,41]. Sometimes, though, this irradiation produces new compounds from certain organic compounds [42,43]. Trp is heat sensitive and is decomposed during cell hydrolysis. On the other hand, the concentrations of amino acids with high hydrophobicity, such as Ala, Ile and Leu, are comparatively high on radar charts. This must have contributed to the self-assembly of proteins from relatively low concentrations of proteins on primitive Earth. The hydrophobic interaction must have been an important factor in forming the "coacervates" proposed by Aleksandr Ivanovich Oparin. In addition, Gly and Ala were formed in Miller's experiments using electrical discharges [37]. In the prebiotic world, amino acid concentration was a dominant factor in the formation of primitive life forms. Therefore, based on both experimental and genomic data, I propose here the existence of an "Amino Acid World" in the prebiotic era as a hypothesis for primitive life forms.
An "RNA world" has been proposed as a hypothesis for primitive life forms, as certain RNAs, the "ribozymes", have enzymatic activity for self-replication [44]. Even in this case, it is hard to imagine that free nucleotides formed primitive RNA molecules possessing template characteristics that would induce codon formation. In addition, nucleic acids are very sensitive to UV light, and UV irradiation is commonly used for sterilization. Thus, RNA might not have played a crucial role in primitive life formation on primitive Earth, which would have been exposed to strong UV light and gamma rays.

Homogeneity of Genome Structures
Simulations based on a random choice of amino acids or nucleotides suggest that primitive life forms consisted of proteins with the same amino acid composition, because the polymerization of amino acids into proteins occurred in the presence of the same amino acid pool, as mentioned above. Therefore, the genomes of primitive life forms must have been homogeneous in terms of amino acid composition, and this characteristic must have been conserved in the genomes of modern organisms by a later-established replication system. In addition, the basic pattern of cellular amino acid composition is conserved from bacteria to Homo sapiens, even though the cells are constructed from many different kinds of proteins in different quantities [28]. This measurement of cellular amino acids is experimentally possible at present. However, we cannot evaluate the degree of expression of each gene in live cells. To overcome this problem, the calculation was carried out under the convenient assumption that each gene is expressed equally [29]; this assumption is equivalent to treating the genome as if it were constructed from a single large coding region consisting of many genes, and a single non-coding region. The relationship between nucleotide contents can be expressed by different linear formulas for coding and non-coding regions [11]. This suggests that the two regions were formed at different stages during the establishment of primitive life forms. Surprisingly, the amino acid composition calculated from the complete genome is extremely similar to that obtained from amino acid analysis of cell lysates, as shown in Figure 3.
This puzzle was solved as follows. I proposed that a genome may be constructed from putative small units encoding similar amino acid compositions [45]. On the other hand, each gene has a different amino acid sequence and a different amino acid composition, although some genes show an amino acid composition similar to that of the whole group. Thus, a gene assembly containing certain genes can show an amino acid composition similar to that of the whole group. Similarly, as proteins are gene products, it is possible to assume that cell lysates consist of assemblies of proteins. Therefore, the cellular amino acid composition based on amino acid analysis resembles that based on genomic calculation.
To test this, the complete genome of the archaeon Methanobacterium thermoautotrophicum was examined. Both one-tenth segments (encoding 30,000-60,000 amino acid residues) and one-twentieth segments (encoding 20,000-30,000 amino acid residues) showed almost the same amino acid composition, and small units encoding 3,000-7,000 amino acid residues obtained by dividing the genome showed similar amino acid compositions (Figure 4). In Saccharomyces cerevisiae, chromosomes of different sizes showed almost the same amino acid composition. As shown in Figure 4, it is clear that the genome is constructed homogeneously from putative small units having almost the same amino acid compositions, not only in bacteria, but also in eukaryotes. The putative unit size is independent of its location in the genome. This fact implies that mutations occurred synchronously across the genome during biological evolution; as a result, genome structure is homogeneous in terms of codon usage [9] and amino acid composition [45].

Mathematical Proof of the Unit Size
In general, natural proteins are polymers of 20 kinds of amino acid residues. To clarify why a gene assembly encoding 3,000-7,000 amino acid residues represents the total population of amino acids based on the complete genome, a multinomial distribution analysis [46] was carried out. In this analysis, residues of 17 kinds of amino acids were chosen at random from the amino acid pool based on the complete genome, so that the resulting composition could be compared with those calculated from gene assemblies on the complete genome; 17 kinds were used because Gln and Asn were converted to Glu and Asp, respectively, and Trp was decomposed, during our amino acid analyses using cell lysates [28]. Mathematical analysis clearly showed that the 17-amino acid composition based on a random choice of 3,000-7,000 amino acid residues represents the amino acid composition of the pool with 95% simultaneous confidence intervals for all amino acid probabilities in the sample [47]. Reducing the level of the simultaneous confidence intervals or the sample size decreases the similarity of the amino acid composition.
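The effect of sample size can be explored empirically with a small Monte Carlo sketch. This is not the analytical multinomial method of [46,47]; it only draws repeated random samples from a 17-category pool and reports how far the sampled composition strays from the pool. The weights in EXAMPLE_WEIGHTS are hypothetical placeholders, not measured amino acid frequencies, and all function names are assumptions of this sketch.

```python
import random
from collections import Counter

# Hypothetical 17-category pool (weights sum to 100); not measured data.
EXAMPLE_WEIGHTS = [10, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 5, 4, 4, 3, 2, 2]

def normalize(weights):
    """Convert raw weights into frequencies that sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def max_deviation(pool_freqs, n, rng):
    """Largest absolute gap between sampled and pool frequencies for one sample."""
    counts = Counter(rng.choices(range(len(pool_freqs)), weights=pool_freqs, k=n))
    return max(abs(counts[i] / n - p) for i, p in enumerate(pool_freqs))

def deviation_at_95_percent(pool_freqs, n, trials=200, seed=0):
    """Empirical bound: about 95% of samples of size n deviate no more than this."""
    rng = random.Random(seed)
    deviations = sorted(max_deviation(pool_freqs, n, rng) for _ in range(trials))
    return deviations[max(0, (95 * trials) // 100 - 1)]
```

Evaluating deviation_at_95_percent for sample sizes such as 100, 1,000 and 5,000 shows the bound shrinking as the sample grows, which is the qualitative behavior the paragraph above describes for units of several thousand residues.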

Bacterial Classification Based on Complete Genomes
Bacteria can be classified by Gram staining into two groups, Gram-positive and Gram-negative bacteria, and both biochemical and morphological characteristics contribute to precise classification [48]. At the end of the 20th century, the methodology for genomic research was established, and the genomes of several hundred bacteria have been completely analyzed to date. The first complete genome analysis of a free-living organism was carried out in Haemophilus influenzae in 1995 [49], and the complete human genome was analyzed at the beginning of the 21st century [50,51]. Bacteria seem worthy of classification based on genome sequence, because using the ratios of the number of each amino acid to the total number of amino acids presumed from the target gene(s) or whole genome, or the ratios of the number of each nucleotide to the total number of nucleotides in the target gene(s) or whole genome, makes it possible to directly compare different genes or genomes, as mentioned above. As the genome is constructed homogeneously from putative small units encoding almost the same amino acid composition, genome size is a negligible factor in comparisons of amino acid compositions.
The patterns of amino acid compositions based on the complete genomes of various bacteria, 11 Gram-positive and 12 Gram-negative bacteria, are star shaped, as mentioned above. According to differences in concentrations of Ala, Arg or Lys, bacteria are classified into two groups, "S-type", represented by Staphylococcus aureus, and "E-type", represented by Escherichia coli; this classification is independent of Gram staining [52]. Differences in Gram staining based on structural differences in cell walls are not detected in genomic structures, while precise changes in amino acid composition, expressed by the "star-shape", seem to reflect biological evolution.

Classification of Organisms into Dendrograms
Changes in nucleotide or amino acid sequences have been applied to evolutionary research, and the results are expressed by phylogenetic trees on the assumption that these changes are linked to biological evolution [53][54][55][56][57][58]. This analytical method is applicable to genes for which amino acid or nucleotide sequences have been determined, but it is not suitable for genome research, which handles extremely large data sets. In addition, we cannot examine organisms that lack a certain target gene. Using the ratios of the number of each amino acid to the total number of amino acids presumed from the whole genome, or the ratios of the number of each nucleotide to the total number of nucleotides in the whole genome, organisms consisting of numerous different genes can be examined. Indeed, a small set of 23 bacteria has been classified into two groups on the basis of only one amino acid, Arg, Ala or Lys [52]. To quantitatively examine a large number of organisms, multivariate analysis using many factors can be applied in the form of cluster analysis [59]. A total of 145 organisms, consisting of 112 bacteria, 15 archaea and 18 eukaryotes, were classified into two major groups by multivariate analysis using GC contents at the three different codon positions, calculated from complete genomes (Figure 5). When 20 amino acid concentrations or 64 codon usages are used as traits instead of GC content, similar dendrograms are obtained [59]. The 145 organisms were classified into "GC-type", equivalent to E-type, and "AT-type", equivalent to S-type, characterized by high G or C (low T or A) and high A or T (low G or C) contents, respectively, at the third codon position. The organism that has the highest GC content at the third codon position is Streptomyces coelicolor [60], and that which has the lowest GC content at the third codon position is Ureaplasma urealyticum [61]. Reciprocal changes between G or C and A or T contents at the third codon position occurred synchronously in every codon among the organisms, as shown in Figure 6. Thus, all organisms can basically be classified into two groups according to their characteristic codon patterns: one with low GC and high AT contents at the third codon position, and the other with the opposite. A similar conclusion was obtained from research that examined the content of G + C in a large number of genes [62]. These facts indicate that codon alterations occur synchronously, not only within the three codon positions, but also among codons, to form new species, as codon alterations occur synchronously over the genome [9,10,45]. This principle is independent of genome size as well as species, from bacteria to Homo sapiens.
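The GC contents at the three codon positions used as traits above can be computed with a short sketch. It assumes an in-frame coding sequence given as a DNA string. The function names and the fixed 0.5 threshold in codon_type are hypothetical simplifications for illustration only; the classification described in the text rests on cluster analysis of many genomes [59], not on a single cut-off.

```python
def gc_by_codon_position(coding_sequence):
    """Return (GC1, GC2, GC3): the G+C fraction at each codon position."""
    seq = coding_sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    fractions = []
    for position in range(3):
        bases = seq[position:usable:3]
        gc_count = sum(1 for base in bases if base in "GC")
        fractions.append(gc_count / len(bases))
    return tuple(fractions)

def codon_type(coding_sequence, threshold=0.5):
    """Crude two-way label from GC3 alone (threshold is an assumed value)."""
    gc3 = gc_by_codon_position(coding_sequence)[2]
    return "GC-type" if gc3 >= threshold else "AT-type"
```

For a whole genome, the coding sequences would first be concatenated in frame, so that the three fractions summarize every codon at once.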

Biological Evolution Can Be Expressed by Linear Formulas
Half a century ago, two great scientific discoveries regarding DNA structure were made. One of them is the helical double-stranded structure of DNA [26], which can explain heredity. The other is the set of Chargaff's parity rules, obtained experimentally: Chargaff's first parity rule [63], in which the C/G, T/A and (C + T)/(A + G) ratios are one in the DNA extracted from organisms; and Chargaff's second parity rule [64], in which these ratios are nearly one in single-stranded DNA isolated from double-stranded DNA. The first parity rule is entirely based on the physicochemical and inter-strand characteristics of nucleotides. Thus, the rule is independent of biological and intra-molecular influences, and biological divergences are excluded from this rule. Based on the first rule, the relationships between the contents of two nucleotides are expressed by straight lines whose regression coefficients are one. The second rule has historically been a puzzle in molecular biology, because we cannot imagine that the pairings of G with C and of A with T are formed in single-stranded DNA. This is an intra-molecular rule governing single-stranded DNA. Quite recently, however, I was able to solve the puzzle, based on our results that genome structure is homogeneous [6], and that the sizes of the coding regions are nearly equal between the forward and reverse strands [11]. Thus, the mitochondrial genome, in which coding sizes differ between the forward and reverse strands, appears not to be subject to the second parity rule [65,66]. It has been indicated that the double-stranded DNA structure is important for biological evolution and that the double strand might have been established during primitive life formation [6]. This second parity rule has recently been applied to complete genomes derived from double-stranded DNA [67]. Chargaff's rules are universal for all replicating organisms, but they cannot reflect evolutionary differences among kingdoms. Finding the rules that govern biological evolution will help us to understand scientifically an evolutionary process that has extended over an extremely long time and depends on unknown factors.
Fortunately, a huge amount of data regarding genomes has been accumulated by a large number of scientists. The present state could not have been imagined in Darwin's age. When the nucleotide (G, C, T and A) contents based on complete genomes are plotted against the content of each nucleotide among various organisms, their relationships can clearly be expressed by a linear formula, y = ax + b, where "y" and "x" represent nucleotide contents, and "a" and "b" are constants. These constant values differ between the coding and non-coding regions. This linear relationship is obtained from the complete single-stranded DNA forming the nuclear genome [11,67]. The values of "a" and "b" in either the coding or the non-coding region differ slightly among kingdoms, such as bacteria, archaea and eukaryotes [11]. Thus, nucleotide alterations are governed by slightly different rules in different kingdoms. Among these linear regression lines, the constant value "b" has never been zero, and the regression coefficients have never been one. This confirms that the formulas differ from Chargaff's formulas, while differences in regression lines among kingdoms are the results of biological divergence.
As the relationships between two nucleotide contents are expressed by linear experimental formulas among various organisms, the determination of any one nucleotide content can essentially allow the estimation of all four nucleotide contents.In addition, because the relationships between nucleotide content and 64 codon usages are also governed by linear formulas, the 64 codons in the coding region can be estimated from the content of just one nucleotide (Figure 7).
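How such lines could be fitted and then used for estimation is sketched below with ordinary least squares. The sketch assumes each genome is summarized as a dictionary of normalized G, C, T and A contents that sum to one; the function names are hypothetical, and the actual constants "a" and "b" must come from real genome data as in [11], not from this code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def fit_nucleotide_lines(genomes, predictor="G"):
    """Fit one line per nucleotide against the chosen predictor nucleotide."""
    xs = [genome[predictor] for genome in genomes]
    return {base: fit_line(xs, [genome[base] for genome in genomes])
            for base in "GCTA"}

def estimate_contents(lines, x):
    """Estimate all four nucleotide contents from one content x."""
    return {base: a * x + b for base, (a, b) in lines.items()}
```

The same fit_line helper could be applied to each of the 64 codon usages against a single nucleotide content, which is the sense in which one measured content allows the others to be estimated.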
In mitochondria and chloroplasts, nucleotide alterations are also expressed by similar linear formulas with slightly different constant values for the slope and the intercept [12]. All nucleotide alterations in nuclei, mitochondria and chloroplasts are expressed by linear formulas whose constant values differ according to organelle characteristics among various organisms. Namely, a certain nucleotide content "y" can be expressed across species by linear formulas, y = ax + b, based on a single nucleotide content "x". Among the four equations representing the four nucleotide contents after normalization, the sum of the slopes, "a", is zero and the sum of the intercepts, "b", is one [11]. This relationship is mathematically definitive and independent of the correlations among the four nucleotide contents. Chargaff's parity rules, G/C = 1, A/T = 1 and (A + G)/(C + T) = 1, can be rewritten as follows: G = G, C = G, T = -G + 0.5, and A = -G + 0.5. Thus, Chargaff's parity rules, even those governing single species DNA, are derived from the general formula, y = ax + b, when the slope "a" is 1 or -1, and the intercept "b" is 0 in the equations with a slope of 1 and 0.5 in the equations with a slope of -1. On the other hand, the values of "a" and "b" in both codon evolution [11] and organelle evolution [68] have shifted from 1 or -1 and from 0.5 or 0, respectively, because of biological divergences, and the regression coefficient has also shifted from one. The shift of the regression coefficient from one represents biological divergence.
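The statement about the sums of the slopes and intercepts follows from normalization alone, as the short derivation below shows. The subscripted symbols a_N and b_N are notation introduced here for the slope and intercept of nucleotide N when all four contents are written as functions of the G content x; the only assumption is that the four linear formulas hold over more than one value of x.

```latex
\begin{align*}
  &N(x) = a_N x + b_N, \qquad N \in \{G, C, T, A\}, \qquad G(x) = x
     \;\Rightarrow\; a_G = 1,\; b_G = 0, \\
  &G + C + T + A = 1
     \;\Rightarrow\;
     \Bigl(\sum_N a_N\Bigr) x + \sum_N b_N = 1 \quad \text{for every } x, \\
  &\text{hence} \quad \sum_N a_N = 0, \qquad \sum_N b_N = 1. \\[4pt]
  &\text{Chargaff case: } (a_G, a_C, a_T, a_A) = (1, 1, -1, -1), \quad
     (b_G, b_C, b_T, b_A) = (0, 0, 0.5, 0.5), \\
  &\text{so} \quad 1 + 1 - 1 - 1 = 0, \qquad 0 + 0 + 0.5 + 0.5 = 1 .
\end{align*}
```

The empirically fitted lines satisfy the same two sums even though their individual slopes and intercepts depart from the Chargaff values.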
It has been thought that cellular organelles such as mitochondria [68] and chloroplasts [69] were derived during biological evolution from proteobacteria and cyanobacteria, respectively, and that their evolutionary processes appear different from nuclear genome evolution, as mentioned above. In addition, it is known that the mutation rate is remarkably high in mitochondrial DNA [70].
In our study, the amino acid compositions of chloroplasts and plant mitochondria resemble those based on nuclear DNA, whereas those of vertebrate mitochondria differ from those of other organelles [12]. In particular, the content of Leu was extremely high in animal mitochondria [12].
Comparing the shapes of the radar charts based on amino acid compositions, that of the ancient fish, the coelacanth (Latimeria chalumnae), resembles those of salamanders and birds more closely than those of other fish (Diodon holocanthus) [12]. In a further study using multivariate analysis based on amino acid compositions, lungfish (Neoceratodus forsteri) and coelacanth were both found to belong to the cluster containing a reptile, a cluster separate from the one containing other fish (carp, rainbow trout and killifish). These results are consistent with the already established phylogenetic concept.
The apparently great divergence of Homo sapiens from bacteria can be expressed by linear formulas with small fluctuations, based on complete genomes. Thus, biological evolution appears to be the result of mere nucleotide substitutions based on simple mathematical principles, while natural selection affects species preservation after nucleotide alterations. This conclusion is consistent with the idea that evolution is based on neutral mutation [71,72]. Therefore, natural selection does not directly regulate nucleotide substitutions, but is indirectly involved in biological evolution.

PERSPECTIVES
The present paper shows that an analytical method using the ratios of the numbers of amino acids present to the total number of amino acids presumed from the whole genome, or the ratios of the numbers of nucleotides present to the total number of nucleotides in the whole genome, is useful for genome research alongside methods using amino acid or nucleotide sequences. These ratios, being based on nucleotide sequences, can exclude deviations in certain calculations. The fact that genome structures are homogeneous with regard to amino acid compositions and codon usages makes it possible to compare genomes that differ in size and gene content. Namely, a large data set obtained from a complete genome can be expressed by a single point on a graph. Thus, using the ratios of amino acids or nucleotides to their total numbers seems to be an excellent method for genome research based on extremely large data sets. In addition, even a gene assembly of a certain size can be used instead of the complete genome for limited purposes.
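As an illustration of the ratio calculation described above, the following minimal sketch pools the residues of a set of protein sequences and expresses each amino acid as a percentage of the total. It is not the authors' software: the function name `composition`, the `merge_amides` option and the two toy sequences are hypothetical. The option mirrors the adjustment noted in the legend of Figure 3, where Gln and Asn are incorporated into Glu and Asp for comparison with amino acid analysis.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def composition(proteins, merge_amides=False):
    """Return {amino acid: percentage of total residues}, pooled over all
    sequences in `proteins`. Characters outside the 20 standard one-letter
    codes (e.g. '*', 'X') are ignored.

    merge_amides=True adds Gln (Q) to Glu (E) and Asn (N) to Asp (D).
    """
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    if merge_amides:
        counts["E"] += counts.pop("Q", 0)
        counts["D"] += counts.pop("N", 0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    keys = [aa for aa in AMINO_ACIDS if not (merge_amides and aa in "QN")]
    return {aa: 100.0 * counts[aa] / total for aa in keys}

# Toy input (hypothetical sequences, not taken from any genome).
toy_proteins = ["MKLLE", "MQNDE"]
for aa, pct in composition(toy_proteins, merge_amides=True).items():
    if pct > 0:
        print(f"{aa}: {pct:.1f}%")
```

Because every value is a ratio to the pooled total, the result has the same form whether the input is a single gene, a unit of several thousand residues or all predicted proteins of a genome, which is what allows genomes of different sizes to be placed on the same radar chart.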
In prebiotic evolution, amino acid composition might have been the strongest factor determining the characteristics of the biopolymers used for the establishment of primitive life forms, whereas since the establishment of the codon system, biological evolution has proceeded through nucleotide alterations expressed by linear formulas based on nucleotide contents, as shown in Figure 8. Thus, the usages of all 64 codons can be estimated from just one nucleotide content (Figure 7), and the characteristic amino acid composition is expressed by the "star-shape" (Figures 1-7), not only in cell analysis, but also in genome analysis. This fact strongly suggests that this "star-shape" may be conserved in both primitive life forms and future organisms, because all organisms on earth must be governed by universal rules, without exception. Thus, the amino acid composition represented by the "star-shape" may reflect the "Amino Acid World".
We, Homo sapiens, stand merely in the middle of a line (Figure 8). We are not at the end of the line, nor do we have an "ultimate" status. Therefore, we have been and will continue to be exposed to natural selection, without exception.

Figure 2. Computational amino acid compositions of a Ureaplasma urealyticum gene. Upper panel: random choice of amino acids was carried out in the original gene (5,005 amino acid pool). Lower panel: random choice of nucleotides was carried out in the original gene (15,018 nucleotides). In the simulation using nucleotides, the stop codon and Trp were discarded from the calculation of amino acid compositions, and each triplet formed was immediately counted as an amino acid. This figure was reproduced from Kenji Sorimachi and Teiji Okayasu (2007) Mathematical proof of the chronological precedence of protein formation over codon formation. Curr. Top. in Pep. Prot. Res. 8, 25-34.

Figure 3. Cellular and genomic amino acid compositions on radar charts. Values are expressed as percentages of total amino acids. Methanobacterium thermoautotrophicum was examined. The cellular amino acid composition was obtained from 3 independent analyses. In the genomic calculations, Gln and Asn were incorporated into Glu and Asp, respectively, to allow comparison with data based on amino acid analysis.

Figure 4. Amino acid compositions calculated from various units of the complete genomes of Methanobacterium thermoautotrophicum and Saccharomyces cerevisiae on radar charts. A, the complete M. thermoautotrophicum genome consisting of 1,869 protein genes (Smith et al. 1997) was divided into 10 (9 units consisting of 186 genes and one unit consisting of 195 genes) or 20 (5 units consisting of 93 genes). B, Saccharomyces cerevisiae. This figure was reproduced from Kenji Sorimachi and Teiji Okayasu (2005) Genomic structure consisting of putative units coding similar amino acid composition: synchronous mutations in biological evolution. Dokkyo J. Med. Sci. 32, 101-106.

Figure 6. Codon usage patterns and amino acid compositions of Staphylococcus aureus and Escherichia coli. Codon usage (bar) and amino acid composition (radar chart) are expressed as percentages of total codons and amino acids, respectively. These figures were reproduced from Kenji Sorimachi and Teiji Okayasu (2008) Codon evolution is governed by linear formulas. Amino Acids 34, 661-668.

Figure 7. Codon usage patterns and amino acid compositions of Homo sapiens. Codon usage (bar) and amino acid composition (radar chart) are expressed as percentages of total codons and amino acids, respectively. Upper and lower panels represent genomic and estimated data, respectively. These figures were reproduced from Kenji Sorimachi and Teiji Okayasu (2008) Codon evolution is governed by linear formulas. Amino Acids 34, 661-668.

Figure 8. Correlation of G content to C content in various organisms based on their complete genomes. Red, blue and green symbols represent 112 bacteria, 15 archaea and 18 eukaryotes, respectively. Each line was drawn computationally. This figure was reproduced from Kenji Sorimachi and Teiji Okayasu (2008) Codon evolution is governed by linear formulas. Amino Acids 34, 661-668.