The Pattern of Occurrence of Cytosine in the Genetic Code Minimizes Deleterious Mutations and Favors Proper Function of the Translational Machinery

The standard genetic code consists of 64 combinations of base triplets made from four different bases. The research aim of this study was to investigate the pattern of occurrence of cytosine in the genetic code. By exploring the base composition and sequence of all 64 codons, the author found some important features based on the instability of cytosine. Because cytosine undergoes spontaneous deamination that converts it into uracil, it is evolutionarily favorable to exclude cytosine from codons critical to the initiation and termination of translation. For amino acids that have one to three synonymous codons (also called synonyms), the frequency of occurrence of C in the first and second positions of their mRNA codons is significantly lower than the frequencies of A, U, and G. For mRNA codons that encode amino acids with four synonyms, the trend of base composition is opposite to those encoding amino acids with one to three synonyms; the instability of C could be inhibited or reduced via formation of hydrogen bonds with a G and/or with a protonated C, and the secondary structure of the resultant mRNA could be adjusted via the multiple synonymous alternates at the third position of their codons to facilitate the translation process. The overall pattern of occurrence for C in the genetic code not only minimizes deleterious mutations and favors proper function of the translational machinery by excluding C from certain positions within codons, but also allows the occurrence of genetic diversity via mutation by including C in less-critical positions.


Introduction
The standard genetic code is nearly universal, and consists of 64 combinations of base triplets made from four different bases: adenine (A), guanine (G), uracil (U), and cytosine (C). Since 61 of the 64 base triplets are used to encode only 20 amino acids, most amino acids are encoded by more than one codon. The remaining three triplets, called stop codons, designate the termination of translation [1]. To the author's knowledge, no study has investigated the pattern of occurrence of cytosine in the genetic code; this therefore became the objective of the present study. The author explored the base composition and sequence of all 64 codons, and inferred some important features in view of the instability of cytosine.

Methods
Since the genetic code is highly degenerate, meaning that most amino acids are encoded by more than one mRNA codon, the author divided the codons of the standard genetic code into two groups: the base triplets encoding amino acids that have one to three synonymous codons (Table 1), and those encoding amino acids with four synonymous codons (Table 2). The amino acids serine, leucine, and arginine each have six synonymous codons (also called synonyms); each is counted as one two-synonym occurrence plus one four-synonym occurrence. The author determined the percentage (%) of A, U, G, and C at every position of the base triplet for mRNA codons with one to three synonyms (Table 1) and for those with four synonyms (Table 2).
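The grouping and counting described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's actual procedure. The function names `split_by_root` and `position_percentages` are hypothetical. The sketch assumes that a codon belongs to the four-synonym group when all four third-position variants of its first two bases encode the same amino acid, and that every remaining codon, including the three stop codons, belongs to the one-to-three-synonym group. Under that assumption each group contains 32 codons.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; "*" marks a stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_by_root(table):
    """Split codons by whether their first two bases fix the amino acid.

    Assumption: codons that are not in a fourfold box (stop codons included)
    are placed in the one-to-three-synonym group.
    """
    one_to_three, four = [], []
    for root in ("".join(p) for p in product(BASES, repeat=2)):
        box = [root + third for third in BASES]
        if len({table[codon] for codon in box}) == 1:
            four.extend(box)
        else:
            one_to_three.extend(box)
    return one_to_three, four


def position_percentages(codons, position):
    """Percentage of A, U, G, and C at one 0-based codon position."""
    counts = Counter(codon[position] for codon in codons)
    return {base: 100.0 * counts[base] / len(codons) for base in "AUGC"}


group_1_3, group_4 = split_by_root(CODON_TABLE)
for name, group in (("one to three synonyms", group_1_3),
                    ("four synonyms", group_4)):
    print(f"{name}: {len(group)} codons")
    for pos in range(3):
        print(f"  position {pos + 1}: {position_percentages(group, pos)}")
```

Splitting on the first two bases handles serine, leucine, and arginine automatically, because their two-synonym and four-synonym codons have different roots.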

Results
The first feature is the absence of cytosine (C) in both the start (AUG, also the only codon for methionine) and stop codons (UAA, UAG, and UGA) of translation. The initiation and termination of translation are critical for protein synthesis; therefore, evolution has resulted in a higher frequency of the more stable A, U, and G to avoid a fatal malfunction in the translation process. Cytosine is also absent from the only codon for the amino acid tryptophan (UGG). The author infers that the absence of cytosine from the codons for methionine and tryptophan, neither of which has an alternate mRNA codon, is the result of evolutionary selection to avoid translation errors due to the spontaneous deamination of cytosine to uracil [2] [3] [4].
In contrast to the standard genetic code referred to above, alternate start codons are used in mitochondrial genomes (e.g., AUA and AUU in humans) and in prokaryotes (e.g., GUG and UUG). All vertebrate mitochondria use AGA and AGG as translation terminators. Mitochondrial mRNAs from vertebrates and microorganisms use UGA to encode tryptophan rather than as a translation terminator [6]. Again, C is absent from these critical codons. While the author will focus on the nuclear genetic code in the following discussion, it is noted that the pattern of occurrence for cytosine seems to hold for mitochondrial codons as well.
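The absence of C from the codons named in the two paragraphs above can be checked mechanically. The following sketch simply lists those codons as given in the text and searches each for C; the dictionary labels and the helper name `codons_containing` are hypothetical and chosen for illustration only.

```python
# Codons named in the text, grouped by the role the text assigns to them.
CRITICAL_CODONS = {
    "standard start (methionine)": ["AUG"],
    "standard stop": ["UAA", "UAG", "UGA"],
    "tryptophan": ["UGG"],
    "alternate start": ["AUA", "AUU", "GUG", "UUG"],
    "vertebrate mitochondrial stop": ["AGA", "AGG"],
    "mitochondrial tryptophan": ["UGA"],
}


def codons_containing(base, codons):
    """Return the codons that contain the given base at any position."""
    return [codon for codon in codons if base in codon]


for role, codons in CRITICAL_CODONS.items():
    with_c = codons_containing("C", codons)
    print(f"{role}: {codons} -> codons containing C: {with_c}")
```

Every list of codons containing C comes back empty, which is the observation the text describes.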

B. Wang Open Journal of Genetics
The right-hand column in Table 1 ("All Three Positions" column) provides the total base composition, including the total number and percentage of A, U, G, and C in the mRNA codons shown. Overall, A and U residues are more abundant than G and C residues in the codons for amino acids with one to three synonyms. Data presented in Table 1 ("1st Position" column) provide the base composition at the 5'/left end of the base triplet of the mRNA codons studied. The frequencies of A and U are 37.5% each, whereas G and C residues are less frequent (12.5% each). At the second/middle base of the mRNA codons studied, the frequencies of A, U, and G are 50%, 25%, and 25%, respectively, as shown in Table 1 ("2nd Position" column); C does not occur at the second position.
For mRNA codons that encode amino acids with four synonyms, the trend of base composition is opposite to that of codons encoding amino acids with one to three synonyms. As shown in Table 2 ("All Three Positions" column), C and G residues are more abundant than U and A residues for codons encoding amino acids with four synonyms. The frequencies of C and G at the first position of the mRNA codons studied (Table 2, "1st Position" column) are 37.5% each, whereas the frequencies of U and A are 12.5% each. At the second position of the mRNA codons studied (Table 2, "2nd Position" column), the frequencies of C, G, and U are 50%, 25%, and 25%, respectively; A does not occur at the second position. At the third position of the mRNA codons studied (Table 2, "3rd Position" column), there is an equal abundance of A, U, G, and C (25% each).

Discussion
Because cytosine is known to undergo spontaneous deamination into uracil, it is evolutionarily favorable to exclude cytosine from codons critical to the initiation or termination of translation. For amino acids that have one to three synonyms, the frequency of occurrence of C in the first and second positions (the root) of their mRNA codons is significantly lower than the frequencies of occurrence of A, U, and G (see Table 1, "1st and 2nd Positions" column). Furthermore, since the middle position of a base triplet is the most critical location for mRNA codon-tRNA anticodon binding [7] [8] [9] [10] [11], the complete absence of C from the second position that is observed for base triplets encoding amino acids with one to three synonyms is not surprising.
In Table 1, the only mRNA codons containing C in the root are those encoding histidine (CAU and CAC) and glutamine (CAA and CAG). If spontaneous hydrolytic deamination occurs at the first position, the histidine codons are converted into tyrosine codons (UAU and UAC), and the glutamine codons are converted into stop codons (UAA and UAG). Since histidine and tyrosine both have polar side chains, in theory, this C-to-U mutation may be less likely to introduce significant changes in a protein's structure or function. However, histidine is often found in the active sites of enzymes because its imidazole-containing side chain can perform many different roles in catalysis, whereas tyrosine has a phenol-containing side chain [1] [6]. Therefore, the histidine-to-tyrosine mutation may allow for genetic variation. The C-to-U mutation within a glutamine codon would cause translation to stop prematurely. Glutamine is the most abundant nonessential amino acid in the human body, and humans can synthesize it in sufficient amounts; further studies are needed to determine the effects of the conversion of a glutamine codon into a stop codon on human health and on genetic diversity, although the loss of a protein is likely to have deleterious effects.
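The codon conversions described in this paragraph can be reproduced with a small sketch. It assumes the standard genetic code and models deamination as a simple C-to-U substitution at one codon position; the function name `deaminate` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; "*" marks a stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def deaminate(codon, position):
    """Return the codon after a C-to-U change at the given 0-based position."""
    if codon[position] != "C":
        raise ValueError(f"{codon} has no C at position {position + 1}")
    return codon[:position] + "U" + codon[position + 1:]


# The four codons in Table 1 that carry C in the root (first position).
for codon in ("CAU", "CAC", "CAA", "CAG"):
    mutant = deaminate(codon, 0)
    print(f"{codon} ({CODON_TABLE[codon]}) -> {mutant} ({CODON_TABLE[mutant]})")
```

The two histidine codons (H) map to tyrosine codons (Y), and the two glutamine codons (Q) map to stop codons (*), as stated in the text.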
For amino acids that have four synonyms, the effects of an unstable C on translation mutations may not be as deleterious as for amino acids with fewer synonyms, due to the high percentages of C and G in the root, and to the existence of multiple synonymous alternates at the third position of these codons.
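As an illustration of why a C at the third position is less critical, the sketch below checks every codon of the standard genetic code that ends in C and asks whether replacing that C with U changes the encoded amino acid. It is an illustrative check rather than part of the author's analysis, and the variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; "*" marks a stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# All codons with C at the third position.
third_position_c = [codon for codon in CODON_TABLE if codon[2] == "C"]

# Keep those for which a third-position C-to-U change is synonymous.
silent = [codon for codon in third_position_c
          if CODON_TABLE[codon] == CODON_TABLE[codon[:2] + "U"]]

print(f"{len(silent)} of {len(third_position_c)} third-position "
      "C-to-U changes are synonymous")
```

In the standard code, codons ending in U and C with the same root always encode the same amino acid, so a third-position C-to-U change leaves the protein sequence unchanged.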
Frederico et al. demonstrated that the rate of hydrolytic deamination of cytosine in a double helix was approximately 140-fold slower than in single-stranded DNA at 37 °C [12]; this difference is mainly due to the decreased accessibility of the N3 and C4 positions in a cytosine that is paired to guanine via hydrogen bonds, blocking the attack from water. The mRNA codons encoding amino acids with four synonyms are CG-rich in the root (see Table 2, "1st and 2nd Positions" column), which indicates that they have the potential to inhibit or reduce cytosine deamination by folding upon themselves to form a C≡G double helix, and/or to form a hydrogen-bonded C⁺-C i-motif if the RNA sequence is C-rich.
(Note: Previous studies have demonstrated the existence of i-motifs at physiological pH [13] [14].) Since CG-rich mRNA regions may form complicated secondary structures that hinder the translation process, producing the same amino acid no matter which of the four mRNA bases is in the third position allows the adjustment of the secondary structure of the resultant mRNA.
Table 2 shows that no A is present at the second position of base triplets encoding amino acids with four synonyms. Previous studies have indicated that the second base of mRNA codons determines the hydrophobicity of the encoded amino acids: the majority of codons for hydrophilic (polar and/or charged) amino acids have A in the second position, while the majority of codons for hydrophobic amino acids have U in the second position [7] [15] [16]. From Table 1, we can see that hydrophilic amino acids with one to three synonyms have A or G in the second position of their mRNA codons, while hydrophobic amino acids with one to three synonyms have U or G in the second position. From Table 2, we can see that hydrophilic amino acids with four synonyms have C or G in the second position of their mRNA codons, while hydrophobic amino acids with four synonyms have U, C, or G in the second position. Since the majority of hydrophilic amino acids have two synonyms, it is reasonable that A is absent from the second position of mRNA codons that encode amino acids with four synonyms.

Conclusion
In summary, for amino acids that have one to three synonyms, the frequency of occurrence of C in the root of their mRNA codons is significantly lower than the frequencies of A, U, and G. For amino acids that have four synonyms, the instability of C may be inhibited or reduced via the formation of hydrogen bonds with a G and/or with a protonated C. In addition, the "new" secondary structure of the resultant mRNA could be adjusted via the multiple synonymous alternates in the codons' third positions, which could facilitate the translation process. The overall pattern of occurrence for C in the genetic code not only minimizes deleterious mutations and favors proper function of the translational machinery by excluding C from certain positions within codons, but also allows the occurrence of genetic diversity via mutation by including C in less-critical positions. Evolu-