Codon evolution in double-stranded organelle DNA: strong regulation of homonucleotides and their analog alternations

In our previous study, complete single DNA strands obtained from nuclei, chloroplasts and plant mitochondria obeyed Chargaff’s second parity rule, whereas those obtained from animal mitochondria deviated from the rule. On the other hand, plant mitochondria obeyed a different rule after their classification. Complete single DNA strand sequences obtained from chloroplasts, plant mitochondria, and animal mitochondria were divided into coding and non-coding regions. The non-coding region, which was the complementary coding region on the reverse strand, was incorporated as a coding region in the forward strand. When the nucleotide contents of the coding or non-coding regions were plotted against the composition of the four nucleotides in the complete single DNA strand, chloroplast and plant mitochondrial DNA obeyed Chargaff’s second parity rule in both the coding and non-coding regions. However, animal mitochondrial DNA deviated from this rule. In chloroplast and plant mitochondrial DNA, which obey Chargaff’s second parity rule, the regression lines for G (purine) and C (pyrimidine) intersected the regression lines for A (purine) and T (pyrimidine), respectively, at around 0.250 in all cases. On the other hand, in animal mitochondrial DNA, which deviates from Chargaff’s second parity rule, only the regression lines for the content of homonucleotides or their analogs in the coding or non-coding region against those in the complete single DNA strand intersected at around 0.250 on the horizontal axis. Conversely, the intersection of the two regression lines (G and A or C and T) against the contents of heteronucleotides or their analogs shifted from 0.250 in both coding and non-coding regions. Nucleotide alternations in chloroplasts and plant mitochondria are strictly regulated, not only by the proportion of homonucleotides and their analogs, but also by that of heteronucleotides and their analogs. In animal mitochondria, they are strictly regulated only by the content of homonucleotides and their analogs.


INTRODUCTION
"Chargaff's second parity rule" [1], G ≈ C, A ≈ T and [(G + A) ≈ (T + C)], holds in a single DNA strand derived from double-stranded DNA; however, it is difficult to imagine how G and C or A and T base pairs could be formed within a single DNA strand, or why G ≈ C and A ≈ T. Therefore, the biological significance of Chargaff's second parity rule (first described 40 years ago) has not yet been elucidated, because its fundamental basis has been unclear. In fact, it is unclear whether Chargaff's second parity rule is even linked to biological evolution. However, this historic puzzle [2] has recently been solved, based on the facts that genome nucleotide composition is homogeneous [3] and that the forward and reverse strand compositions are very similar [4]. The second parity rule derives from the similarity of nucleotide composition between the forward and reverse strands. On the other hand, Chargaff's first parity rule [5], G = C, A = T and [(A + G) = (T + C)], is mathematically definitive and independent of biological significance. Under this rule, the nucleotide contents in nuclei are defined by these equations in any organism from bacteria to Homo sapiens.
The existence of deviations from Chargaff's second rule was reported by other groups [6,7]. Only single DNA strands that form double-stranded genomic DNA obey Chargaff's second parity rule, whereas organelle DNA does not obey this rule [8]. Nikolaou and Almirantis reported that mitochondrial DNA might be classified into three groups based on GC and AT skews, and that their DNA deviated from Chargaff's second parity rule [7]. They also reported that chloroplasts shared the patterns of bacterial genomes [7]. Mitochondrial gene sequences support the view that the evolutionary antecedents of mitochondria are a subgroup of the alpha-Proteobacteria [9], such as Rickettsia, Anaplasma, and Ehrlichia [10]. In addition, molecular phylogenetic studies showed that the closest bacterial homologs of chloroplasts are cyanobacteria [11]. Recently, deviations from Chargaff's second parity rule in animal mitochondrial DNA were attributed to a different rule, and a single origin of species was derived from these mathematical genomic analyses [12]. We have also examined nuclear and organelle DNA and shown that the nucleotide compositions are correlated with each other, and are correlated within the coding region of nuclear DNA [4]. Additionally, only homonucleotide contents are correlated with each other between the coding or non-coding regions and the single DNA strand in organelles [13]. These analyses indicate that biological evolution is expressed by linear formulae [4]. In the present study, we investigated the precise nucleotide relationships between the coding or non-coding regions and the complete single DNA strand that forms the double-stranded DNA. These analyses included not only homonucleotides, but also heteronucleotides, as Chargaff's second parity rule is linked to the double-stranded DNA structure [2,8]. To understand biological evolution, it would be interesting to determine whether Chargaff's second parity rule is preserved not only in the complete genome, but also in the separated coding and non-coding regions.

MATERIALS AND METHODS
Genome data were obtained from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/sites), and the list of organelles examined has been described in our previous paper [13]. The same species examined in our previous study were used so that the present results could be compared with the previous data [13]. To evaluate the biological evolution of whole organelles, the coding region in the reverse strand was incorporated into the coding region in the forward strand as the complement [2]. Calculations were performed using Microsoft Excel (version 2003).
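The region-splitting step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the authors worked in Microsoft Excel, and the function names and the (start, end, strand) annotation format used here are hypothetical, not taken from the paper or from any NCBI tool.

```python
# Illustrative sketch only. Function names and the (start, end, strand)
# annotation format are assumptions, not the authors' Excel workflow.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASES = "GCAT"


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def composition(seq):
    """Normalized G, C, A and T contents; the four values sum to 1.

    Symbols other than G, C, A and T are ignored. The sequence must
    contain at least one of the four bases.
    """
    counts = {base: seq.count(base) for base in BASES}
    total = sum(counts.values())
    return {base: counts[base] / total for base in BASES}


def split_regions(genome, genes):
    """Split one DNA strand into coding and non-coding sequence.

    genes: list of (start, end, strand) tuples with 0-based,
    end-exclusive coordinates on the forward strand; strand is '+' or '-'.
    Genes on the reverse strand are added to the coding set as their
    complement, so every gene is read in its own 5'->3' direction.
    """
    coding_parts = []
    covered = [False] * len(genome)
    for start, end, strand in genes:
        segment = genome[start:end]
        if strand == "-":
            segment = reverse_complement(segment)
        coding_parts.append(segment)
        for i in range(start, end):
            covered[i] = True
    noncoding = "".join(b for b, used in zip(genome, covered) if not used)
    return "".join(coding_parts), noncoding
```

For a toy strand, `split_regions("AAGGCCTTAA", [(2, 4, "+"), (6, 8, "-")])` yields the coding string `"GGAA"` and the non-coding string `"AACCAA"`, and `composition` can then be applied to each region and to the complete strand.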

Codon Evolution in Chloroplasts
The coding and non-coding regions were separated because their nucleotide alternations differ [4]. In the present study, however, the coding region in the reverse strand was incorporated into the forward strand as the complement, and the nucleotide content in the coding region was plotted against that in the complete single DNA strand to understand whole genome evolution [2]. According to this process, each of the four nucleotide contents is expressed by four equations, and the homonucleotide content is expressed by a regression line whose regression coefficient is close to 1.0 in normalized values. When the nucleotide contents of the coding region were plotted against the total G content in the complete single DNA strand, the lines for G and C completely overlapped in chloroplasts. Similarly, the lines for T and A also overlapped (Figure 1, upper panel). This suggests that G ≈ C and T ≈ A in the coding region. Similar results were obtained for the non-coding region (Figure 1, lower panel). Each line was computationally characterized and the results are shown in Table 1. Because normalized values are used in the four equations and the sum of the four nucleotide contents is 1, the sum of the four slopes is 0 and the sum of the constants at the vertical intercept is 1.0 in all cases [4]. This follows mathematically from the use of normalized values. As shown in Figure 1, the absolute values of the slopes of the lines for G and C or the lines for T and A were similar for both coding and non-coding regions, while the former two slopes were positive and the latter two were negative. That is, the G and C lines are symmetrical to the A and T lines in both the coding and non-coding regions. In addition, the slopes differed between the coding and non-coding regions, where the absolute values of the slopes were 0.760 ± 0.049 and 1.192 ± 0.078 in the coding and non-coding regions, respectively. Thus, the compositions of the four nucleotides correlated well with the nucleotide compositions in the complete single DNA strand. Regression coefficients were more than 0.9 or close to 0.9, except those for the A and T contents against the total T and A contents, which were 0.79 and 0.77, respectively, in the coding region.
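The statement that the four slopes sum to 0 and the four intercepts sum to 1 when normalized contents are used can be checked numerically. The sketch below fits ordinary least-squares lines by hand; the helper names are hypothetical, the compositions are made-up numbers chosen only so that each genome's four contents sum to 1, and the returned `r` is the correlation coefficient, which is assumed here to correspond to what the text calls the regression coefficient.

```python
# Illustrative sketch with synthetic data; not the values behind Table 1.

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5 if syy > 0 else 0.0
    return slope, intercept, r


def fit_all(total_g, region):
    """Fit each base's content in a region against the G content of the
    complete strand. region maps 'G', 'C', 'A', 'T' to lists of contents."""
    return {base: fit_line(total_g, values) for base, values in region.items()}


# Made-up normalized compositions for five hypothetical genomes.
total_g = [0.15, 0.17, 0.18, 0.20, 0.22]
coding = {
    "G": [0.16, 0.18, 0.19, 0.20, 0.23],
    "C": [0.15, 0.17, 0.18, 0.21, 0.22],
    "A": [0.35, 0.33, 0.31, 0.30, 0.27],
    "T": [0.34, 0.32, 0.32, 0.29, 0.28],
}

fits = fit_all(total_g, coding)
slope_sum = sum(fit[0] for fit in fits.values())      # 0 up to rounding
intercept_sum = sum(fit[1] for fit in fits.values())  # 1 up to rounding
```

The identity holds for any data set in which the four contents of every sample sum to 1, because the four fitted lines must then add up to the constant line y = 1.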
Based on Figure 1, the overlapped lines for G and C clearly intersected with the overlapped lines for T and A. The points of intersection for the overlapped lines of regression were calculated based on the regression line equations presented in Table 1. The combinations of the two lines for calculations were either lines for G and A (purines) or lines for C and T (pyrimidines). All combinations were approximately 0.25 at the point of intersection for both the coding and non-coding regions in chloroplasts (Table 2).
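The point of intersection of two regression lines follows directly from their slopes and intercepts. A minimal sketch is given below; the function name is hypothetical, and the example intercepts are made-up values chosen so that the two lines mirror each other about 0.25 (only the slope magnitude 0.760 is taken from the text).

```python
# Illustrative sketch; the example intercepts are invented for demonstration.

def intersection_x(line1, line2):
    """Horizontal coordinate where two lines cross.

    Each line is a (slope, intercept) pair for y = slope * x + intercept.
    Returns None when the lines are parallel.
    """
    slope1, intercept1 = line1
    slope2, intercept2 = line2
    if slope1 == slope2:
        return None
    return (intercept2 - intercept1) / (slope1 - slope2)


# Hypothetical G and A lines in the coding region plotted against total G.
g_line = (0.76, 0.06)
a_line = (-0.76, 0.44)
crossing = intersection_x(g_line, a_line)
```

The same function applies to the C and T pair; in practice the slopes and intercepts would be read from the regression tables.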

Codon Evolution in Plant Mitochondria
The nucleotide contents in the coding and non-coding regions were plotted against those in the complete single DNA strand for DNA obtained from plant mitochondria (Figure 2). The G line overlapped with the C line, whereas the T line diverged slightly from the A line in the coding region (Figure 2, upper panel). Scattering of the sample points was observed for each of the four nucleotides, particularly in the high G content region of the complete single DNA strand. In the non-coding region, the G line differed significantly from the C line (Figure 2). Furthermore, the C line was parallel to the G line. Scattering of sample points was observed in the complete genome, and was likely due to the small genome sizes [12]. The G and C lines were almost symmetrical to the A and T lines, respectively, for both the coding and non-coding regions.
Each line was computationally characterized and the results are shown in Table 3. The absolute values of the [...] presented in Table 3. All points of intersection of the G and A lines or the C and T lines were approximately 0.24 in both the coding and non-coding regions for DNA obtained from plant mitochondria (Table 4).

Codon Evolution in Vertebrate Mitochondria
Nucleotide contents of the coding and non-coding regions were plotted against nucleotide contents in the complete single DNA strand (Figure 3). When the four nucleotide compositions in the coding region were plotted against the G content in the complete single DNA strand, the G content in the coding region was expressed by a linear regression line with a high regression coefficient (Figure 3, upper panel). The A content could also be expressed by a linear regression line with a relatively high regression coefficient (0.82) (Table 5). On the other hand, C and T compositions were not correlated with G content in the complete single DNA strand (R-values of 0.24 and 0.01). Similar results were obtained in the non-coding region (Figure 3, lower panel and Table 5).
When nucleotide contents in the coding region or non-coding region were plotted against nucleotide contents in the complete single DNA strand, homonucleotides and their analogs (purines or pyrimidines) showed good correlations. However, heteronucleotides and their analog relationships (i.e., G vs. C or T and A vs. C or T) showed no correlation for vertebrate mitochondria (Table 5). This rule was observed in all cases in vertebrate mitochondria.

Table 3. Regression lines representing nucleotide contents in the coding and non-coding regions against the nucleotide contents in the complete single strand DNA based on 47 plant mitochondria.
The calculated points of intersection of two regression line equations are presented in Table 6. The G and A (purines) lines intersected at 0.219 and 0.227 in the coding region and non-coding region, respectively, against the G (purine) content in the complete single DNA strand, while the C and T (pyrimidines) lines intersected at 0.106 and 0.160 in the coding and non-coding regions, respectively, against the G (purine) content (Table 6). The former values were close to 0.250, whereas the latter were relatively far from this value. On the other hand, the G and A (purines) lines intersected at 0.565 and 0.506 in the coding and non-coding regions, respectively, against the C (pyrimidine) content in the complete single DNA strand. The C and T (pyrimidines) lines crossed at 0.266 and 0.283 in the coding and non-coding regions, respectively, against the C (pyrimidine) content in the complete single DNA strand (Table 6). The former two values were significantly different from 0.250, whereas the latter two values were close to 0.250. When the A (purine) and T (pyrimidine) contents in the complete single DNA strand were used instead of the G (purine) and C (pyrimidine) contents, consistently similar results were obtained (Table 6). Combinations of regression line equations (G and A or C and T) against heteronucleotide content in the complete single DNA strand rarely attained 0.250 as a point of intersection.

Codon Evolution in Invertebrate Mitochondria
As determined in a previous study [12], although nucleotide content relationships in the complete invertebrate mitochondrial genome were heteroskedastic, they were classified into two groups, I and II, based on their distributions on the graph. Plotting the C content of the coding region against the G content in the complete single DNA strand in invertebrate mitochondria showed that the mitochondria could be clearly classified into two groups (denoted by a dotted line in Figure 4). This is consistent with the result obtained from the complete genome [12]. Nucleotide content relationships were also investigated in the classified invertebrate I mitochondria. Plotting the four nucleotide compositions in the coding region against the G content in the complete single DNA strand produced four lines of regression (Figure 5, upper panel); however, all four lines differed from each other. Similar results were obtained for the non-coding region (Figure 5, lower panel). The regression coefficients, slopes, and constants for each equation are shown in Table 7. The contents of homonucleotides and their analogs in the coding or non-coding region correlated well with those in the complete single DNA strand for invertebrate I mitochondria (Table 7).
As shown in Figure 5, the lines for C and T against the G content in the complete single DNA strand intersected at around 0.250 for both the coding and non-coding regions. The points of intersection of two regression lines were calculated from the equations presented in Table 7 and are tabulated in Table 8. The points of intersection of the two regression lines (G and A or C and T) against homonucleotide or analog contents in the complete single DNA strand were close to 0.250, while those against heteronucleotide or analog contents were rarely 0.250 for invertebrate I mitochondria (Table 8).
Additionally, the nucleotide composition relationships between the coding or non-coding region and the complete single DNA strand were also examined for invertebrate II mitochondria. The characteristics of the lines of regression are shown in Table 9. The points of intersection for two lines of regression equations are shown in Table 10. The results obtained from invertebrate II mitochondria were similar to those of invertebrate I mitochondria.

[...] proteobacteria [9,10] and cyanobacteria [11]. A comparison of the human genome [14,15] with the sea urchin genome [16] has revealed that the number of protein coding genes is similar between the two species, while the non-coding region of the former is much larger than that of the latter. This fact also suggests that the non-coding region plays an important role in developmental biology. In chloroplasts and plant mitochondria, the rule G ≈ C, T ≈ A and [(G + A) ≈ (C + T)] is observed not only in the complete genome but also in the coding and non-coding regions, based on nucleotide content relationships between the coding or non-coding region and the complete single DNA strand.
On the other hand, the nucleotide content relationships for either the coding or non-coding regions did not obey Chargaff's second parity rule in nuclear genomes [4]; instead, (G + A) > (C + T) in the coding region [17]. In addition, animal mitochondrial evolution seems to differ not only from that of nuclear genomes but also from that of plant organelles. Plasmids, which are not compartmentalized from the nucleus, showed codon frequencies that resemble those of the host [18]. Thus, the compartmentalization of cellular organelles is likely to strongly influence organelle evolution.
To understand the establishment of Chargaff's second parity rule, the existence of both forward and reverse strands is necessary [2,8]. Namely, it is clear that the second parity rule is based on the double helical structure of DNA [19], where the complementary relationship between the two strands plays a role. Primitive genomes might have been constructed from double-stranded DNA, and mutations that occur synchronously over the genome [20] are governed by linear formulae [4]. In addition, Chargaff's parity rules can be converted into four linear formulae based on single nucleotide content, as shown above. Thus, biological evolution is likely to be based on the nucleotide contents expressed by linear formulae.
Chargaff's first parity rule [5], G = C, A = T, and [(G + A) = (C + T)], is well known and uses the four nucleotide contents that are normalized as follows: G + C + A + T = 1. Therefore, 2G + 2T = 1 or 2G + 2A = 1. Finally, T = 0.5 - G or A = 0.5 - G. Consequently, the four nucleotide contents are expressed by the G content alone: G = G, C = G, T = 0.5 - G, and A = 0.5 - G. Namely, each of the four nucleotide contents is expressed by a linear formula based on just one nucleotide content (G). Thus, the lines for G and C overlap, as do the lines for T and A. In addition, the G line intersects the A line at 0.250 and the C line crosses the T line at 0.250. Thus, the four regression lines obtained from a sample that obeys Chargaff's first parity rule cross exactly at 0.250. In addition, the four regression lines based on a sample that obeys Chargaff's second parity rule will intersect at around 0.250. In the present study, four regression lines based on chloroplasts (Figure 1 and Table 1) and plant mitochondria (Figure 2 and Table 3), which both obey Chargaff's second parity rule, intersect at around 0.250 (Tables 2 and 4). On the other hand, for animal mitochondria, only the two regression lines due to homonucleotides or their analogs in the complete single DNA strand intersect at around 0.250, while the other two regression lines due to heteronucleotides or their analogs in the complete single DNA strand rarely intersect at 0.250. Thus, nucleotide alternations, not only in homonucleotides and their analogs but also in heteronucleotides and their analogs, are strictly regulated against the complete single DNA strand in samples that obey Chargaff's second parity rule, namely chloroplasts and plant mitochondria. However, only alternations of homonucleotides and their analogs are strictly regulated in both coding and non-coding regions against the complete single DNA strand in animal mitochondria. These results indicate that the evolutionary process of animal mitochondria differs from that of chloroplasts and plant mitochondria, possibly due to deviations from Chargaff's second parity rule. This is consistent with the previous conclusion that provided evidence for a single origin of life [12].
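The first-parity-rule argument in the preceding paragraph can be written compactly as a worked derivation. The notation is an assumption introduced here for illustration: x denotes the G content on the horizontal axis and y_N the content of nucleotide N on the vertical axis.

```latex
\begin{align*}
G + C + A + T &= 1, \qquad G = C, \quad A = T \\
2G + 2A &= 1 \\
A = T &= 0.5 - G \\[4pt]
y_G = y_C &= x, \qquad y_A = y_T = 0.5 - x \\
y_G = y_A \;\Longrightarrow\; x &= 0.5 - x \;\Longrightarrow\; x = 0.25, \quad y = 0.25
\end{align*}
```

The G and C lines have slope +1 and the A and T lines slope -1, so the four slopes sum to 0 and the four intercepts (0, 0, 0.5, 0.5) sum to 1, in agreement with the normalization property noted for the regression lines.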

Figure 1. Nucleotide relationships in normalized chloroplast values. Upper panel, coding region; lower panel, non-coding region. Red squares, G; green triangles, C; blue diamonds, A; and light blue crosses, T. The composition of each nucleotide in the coding or non-coding region was plotted against the G content in the complete single DNA strand. The vertical axis represents the composition of the four nucleotides; the horizontal axis represents the G content in the complete single DNA strand.

Figure 2. Nucleotide relationships in normalized plant mitochondrial values. Upper, coding region; lower, non-coding region. Red squares, G; green triangles, C; blue diamonds, A; and light blue crosses, T. The composition of each nucleotide in the coding or non-coding region was plotted against the G content in the complete single DNA strand. The vertical axis represents the composition of the four nucleotides; the horizontal axis represents the G content in the complete single DNA strand.
mean the nucleotide content in the coding and non-coding regions, respectively.

Figure 3. Nucleotide relationships in normalized vertebrate mitochondrial values. Upper, coding region; lower, non-coding region. Red squares, G; green triangles, C; blue diamonds, A; and light blue crosses, T. The composition of each nucleotide in the coding or non-coding region was plotted against the G content in the complete single DNA strand. The vertical axis represents the composition of the four nucleotides; the horizontal axis represents the G content in the complete single DNA strand.

Figure 4. Nucleotide relationships in invertebrate mitochondria. Nucleotide contents were normalized, and the C content in the coding region was plotted against the G content in the complete single DNA strand. The vertical axis represents the G and C compositions and the horizontal axis represents the G content in the complete single DNA strand. The dotted line represents the G content in the coding region against the G content in the complete single DNA strand.

Figure 5. Nucleotide relationships in normalized invertebrate I mitochondrial values. Upper, coding region; lower, non-coding region. Red squares, G; green triangles, C; blue diamonds, A; and light blue crosses, T. The composition of each nucleotide in the coding or non-coding region was plotted against the G content in the complete single DNA strand. The vertical axis represents the composition of each of the four nucleotides; the horizontal axis represents the G content in the complete single DNA strand.

Table 1. Regression lines representing nucleotide contents in the coding and non-coding regions against the nucleotide contents in the complete single strand DNA based on 97 chloroplasts.

Table 2. Crossing points obtained from two regression lines based on 97 chloroplasts.

The regression coefficients were around 0.9, except for low regression coefficients of 0.57 and 0.45 from the T and A contents in the coding region. The point of intersection of the two regression lines was calculated based on the regression line equations

Table 4. Crossing points obtained from two regression lines based on 47 plant mitochondria.

Table 5. Regression lines representing nucleotide contents in the coding and non-coding regions against the nucleotide contents in the complete single strand DNA based on 45 vertebrate mitochondria.

Table 6. Crossing points obtained from two regression lines based on 45 vertebrates.

Table 10. Crossing points obtained from two regression lines based on 28 invertebrate II mitochondria.