Structural Features of the Nucleotide Sequences of Virus and Organelle Genomes

(JBiSE)

ABSTRACT
The four nucleotides (bases), A, T (U), G and C, in small genomes (virus DNA/RNA, organelle and plastid genomes) were also arranged in a sophisticated manner in a single strand, with the following structural features: 1) reverse-complement symmetry of bases or base sequences, 2) bias of the four bases, and 3) multiple fractality of the distribution of each of the four bases depending on the distance, in a double logarithmic plot (power spectrum) of L (the distance from a base to the next base) vs. P (L) (the probability of the base distribution at L). These features were observed although the genomes were composed of small numbers of the four bases and their base symmetry was rather lower than that of prokaryotic and eukaryotic cells. For genomic DNA composed of less than 10,000 nt, it was better to partition the L-value at 10, and the structural features of biologically active genomic DNA were then observed as in the large genomes. These results suggest that the structural features of the base sequences of genomic DNA, including genomic RNA, might be universal in all genomes. In addition, the relationship between the structural features of the genome and biological complexity was discussed.


INTRODUCTION
Watson and Crick deduced that DNA had a double-helical structure with complementary and anti-parallel strands [1], based on the equal amounts of adenine (A) and thymine (T), and of guanine (G) and cytosine (C), reported by Chargaff [2], and on the X-ray diffraction patterns of DNA fibers obtained by R. Franklin and M. Wilkins [3,4]. After that, Chargaff and co-workers also observed that a single strand of Bacillus subtilis DNA had nearly equal amounts of A and T, and of G and C ([5]; Chargaff's second parity rule, 1968). About fifty years later, the genome base sequences of many organisms described below have been determined, and an artificial bacterial genome (582,970 bp) was chemically synthesized based on Mycoplasma genitalium [6], although partially unreadable regions still remain in each genome. Structural analysis of DNA based on the entire genome base sequence is necessary to understand living organisms. To do this, we have to characterize the structural features of genomic DNA.
Genome projects have so far been completed to obtain the base sequences of many prokaryotic and eukaryotic organisms [7][8][9]. The base sequences of many virus, plastid and organelle genomes have also been revealed. These genomes are essentially small and are diverse, because some viruses use RNA or single-stranded DNA/RNA as their genome.
As the genome projects revealed, an individual gene is an integral part of a genome. A genome contains many genes together with their associated regulatory regions; the genes are replicated, transcribed and translated into proteins, and all participate in biological phenomena. Each gene can be converted to its respective protein through the maturation of mRNA according to the "Central Dogma" [10]. The genes might be organized with the support of the other regions of the chromosome, the so-called non-coding regions, which regulate gene expression in living cells as a biological system. If so, we should face the entire genome as a three-dimensional molecule, including not only the coding regions but also the non-coding regions. The genome, including the coding and the non-coding regions, might be organized in living cells as a biological system that has grown with the passage of time. Therefore, we have reported on the entire genome as a systematized molecule in order to understand living cells [11].
Studies of entire genomic base sequences have not been numerous, because few effective tools, including hardware and software, are currently available to analyze a molecule as large as a genome. Some challenging bioinformatics papers [12][13][14][15][16] have reported on stem-loop structures, and on analyses of whole genomes using the structural features of the genomic DNA and specific base sequences [17][18][19][20][21][22][23][24].
In prokaryotic cells, and in viruses and bacteriophages, most regions of the genome are occupied by coding regions, whereas in eukaryotic cells the coding regions make up a smaller fraction of the entire genome, and this fraction varies with the genome size (the number of bases composing the genomic DNA); for example, the coding regions occupy only several percent of H. sapiens genomic DNA [25]. Furthermore, each gene on a chromosome or genome is arranged with a particular order, direction (using either the Watson strand or the Crick strand for transcription), and distance to the genes on both sides. When one of these three characters of a gene on a genome (the order, the direction, or the distance) is changed, the living cells become different ones. For instance, changes of these characters might occur through chromosomal translocation [26][27][28], and the cells are then forced to live in their surroundings. Therefore, the coding regions alone, i.e., the genes, cannot explain all the biological phenomena in living cells, especially in eukaryotic cells [11].
Genomic DNA might also be regarded as "a molecule with the four bases A, T, G and C aligned, and with three dimensions", even if it is huge. So, if a large region were deleted, the genome would presumably become a molecule with a different conformation, which would affect gene expression and the ability to interact with biological materials such as bioorganic compounds, proteins, nucleic acids, sugars, fatty acids and so on. To express genes, all or some of the regulatory elements on genomic DNA are necessary: the promoter (trigger), the SAR (scaffold), the insulator (boundary), the poly-A signal (stability), ncRNAs (controller), etc. [25][29][30][31][32][33][34]. Thus, both the coding and the non-coding regions should be necessary to express genes precisely, rapidly and stably in order to carry out the various biological phenomena.
The small genomes are compact because they contain few non-coding regions, which raises the question of whether the structural features of their genomic DNA/RNA are the same as those of the large genomes. If they are, the base sequences of the small genomes might serve as a model of genomic DNA [11,35,36].
In this paper, the author analyzed small genomic DNA/RNAs such as virus, mitochondrial and chloroplast genomes, and found that they were also arranged in a sophisticated manner under rules similar to those of the large genomes and chromosomes.

Appearance Frequencies of Bases or Base Sequences
The appearance frequency of bases or base sequences (three successive bases = triplet) was calculated as described previously [11,35].

The Parameters "d"-, "m"-, "p"-, and "w"-Values of the SSM Analysis for the Interaction
The controllable parameters in the sequence spectrum were the base size "d" of the key sequence, the average width "m", the skip base number (the size factor) "p" and the window width "w" of homology, as described previously [35,36].
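The triplet appearance frequency named at the start of this section can be illustrated with a minimal Python sketch. The function name `triplet_frequencies`, the use of overlapping windows advanced by one base, and the exclusion of triplets containing unreadable symbols are assumptions made for illustration; the exact procedure is given in refs. [11,35] and may differ.

```python
from collections import Counter


def triplet_frequencies(seq):
    """Relative appearance frequency of each overlapping triplet in seq.

    Assumption: windows advance by one base, and triplets that contain
    any symbol other than A, C, G or T (e.g. unreadable 'N') are skipped.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if set(seq[i:i + 3]) <= valid
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {triplet: n / total for triplet, n in counts.items()}
```

For a real genome, `seq` would be the single-strand sequence read from a database file; the sketch returns at most 64 entries, one per triplet that occurs.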

f (α) Spectrum Analysis [37,38]
The f (α) and α were calculated from the base distribution curve of the adenine base as follows. 1) L was 1 through 15 (the base distribution curve of adenine in, for example, S. cerevisiae chromosome 1 was calculated as $y = ae^{-bx}$, where x is the L-value, a = 0.3736 and b = 0.3365), and 2) L was 16 or more (the base distribution curve of adenine in S. cerevisiae was calculated with a = 0.2148 and b = 0.2770).
When the L-value was between 1 and 15, the base distribution in the genome was expressed by Eq.1:

$$y = ae^{-bx} \qquad (1)$$

Differentiating both sides of Eq.1 with respect to x gives

$$\frac{dy}{dx} = -ab\,e^{-bx} = -by \qquad (2)$$

Here, the distribution curve P (L) is correlated with f (α) as in Eq.3, where c is a constant.
In order to exclude the effect of c in this equation, P (L) is normalized by its maximum value, P (L)/P (L)max, and the normalized value is used instead of P (L). P (L)max is the maximum value of P (L).
In each case (L = 1-10 or 1-15, and L = 11 or more, or 16 or more), the f (α) spectrum of adenine (A) is calculated and plotted as α (x-axis) vs. f (α) (y-axis). When f (α) varies as a function of α, the fractality must be multifractal (red diamond, the linearly decreased region of the "A" base in the double logarithmic plot of L vs. P (L)); in contrast, when f (α) is constant at any given α-value, the fractality must be unifractal (black square, the exponentially decreased region of the "A" base in the double logarithmic plot of L vs. P (L)).
A similar calculation was carried out for each base, T, G, or C in a single-strand of DNA in the genome from the genome database.
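The constants a and b of Eq.1 can be estimated from a measured distribution curve in several ways. The sketch below is one assumed approach, not necessarily the one used by the author: ordinary least squares on ln y, which turns y = a·exp(-bx) into a straight line. The function name `fit_exponential` is hypothetical, and x is taken to be the L-value as stated in this section.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(-b * x) by least squares on ln(y).

    Returns (a, b). Points with y <= 0 are ignored because their
    logarithm is undefined. This is an illustrative method only.
    """
    points = [(x, math.log(y)) for x, y in zip(xs, ys) if y > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points with y > 0")
    mean_x = sum(x for x, _ in points) / n
    mean_ly = sum(ly for _, ly in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in points)
    slope = sxy / sxx                      # equals -b
    intercept = mean_ly - slope * mean_x   # equals ln(a)
    return math.exp(intercept), -slope
```

Applied separately to the two ranges (L = 1 to 15, and L = 16 or more), such a fit would give one (a, b) pair per range, in the same form as the values quoted above for S. cerevisiae chromosome 1.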

RESULTS AND DISCUSSION
The databases of NCBI [7], the Sanger Institute [8], SGD [9] and MIPS [39] were used for the analysis, and the following structural features were revealed in a single strand of genomic DNA.

The Genome Base Sequence Showed Reverse-Complement Symmetry Even in a Single Strand of DNA
Genomic DNA/RNA is composed of four different bases, A, T (U), G and C. The base number (nt) and GC content of each genome and chromosome for virus, plastid and mitochondrial (mt) DNAs were calculated as shown in Table 1. In viruses (DNA/RNA), mtDNA and chloroplast (ch) DNA, the symmetry of the base sequences was somewhat low in comparison with large genomes such as eukaryotic chromosomes, because of the small genome size (base numbers) of the genomic DNA/RNA. Nevertheless, the number of A was almost equal to that of T, and the number of G was almost equal to that of C; the symmetry of a single strand of DNA was thus maintained, in agreement with Chargaff's second parity rule and with previous reports [Table 1, ref. 5,11]. The results also indicated that single-stranded genomic DNA might sometimes have a closed structure with partial hydrogen bonding (stem-loops), as seen in RNA secondary structure [13][14][15][16]. Although the reverse-complement base symmetry in a single strand of DNA/RNA was rather low in the virus genomes and in a part of the mtDNAs, this structural feature of the genome could be maintained regardless of the genome size, the GC content and the form of the genome, as in the larger genomes [11]. The appearance frequencies of three successive base sequences corresponded to the species-dependent genetic codons (triplets) [11,22,40], which in turn correspond to the 20 amino acids. This structural feature of the genomes might be related to the "peak" and "pocket" of the sequence spectra of the genomic DNAs, and might be connected to identifying the homology of the interactive sites of proteins and DNAs [35,36].
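The symmetry discussed in this section can be checked on any single-strand sequence with a short Python sketch. The helper names (`reverse_complement`, `base_symmetry`, `triplet_pairs`) and the definition of the symmetry ratio as the smaller count divided by the larger count are assumptions for illustration; the ratio actually tabulated in Table 1 may be defined differently.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a sequence (RNA 'U' is read as 'T')."""
    return seq.upper().replace("U", "T").translate(_COMPLEMENT)[::-1]


def base_symmetry(seq):
    """Ratios A/T and G/C within one strand (1.0 means exact symmetry).

    Assumption: ratio = smaller count / larger count; None if both are 0.
    """
    counts = Counter(seq.upper().replace("U", "T"))

    def ratio(x, y):
        larger = max(counts[x], counts[y])
        return min(counts[x], counts[y]) / larger if larger else None

    return {"A/T": ratio("A", "T"), "G/C": ratio("G", "C")}


def triplet_pairs(seq):
    """Map each triplet to (its count, count of its reverse complement)."""
    s = seq.upper().replace("U", "T")
    counts = Counter(s[i:i + 3] for i in range(len(s) - 2))
    return {t: (n, counts[reverse_complement(t)]) for t, n in counts.items()}
```

Under Chargaff's second parity rule, both ratios from `base_symmetry` would be close to 1.0 and the two numbers in each `triplet_pairs` entry would be similar; a low value, as reported here for some virus genomes, indicates weaker symmetry.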
The differences in the GC content and in the ratio of the base symmetry between the nuclear chromosomes and the organelle and chloroplast genomes might be caused by the symbiotic origin of these genomes in the host cells [41][42][43].

The Genome Base Sequence Was Localized
We calculated the distribution of the bases in 1) Simian virus 40, 2) Autographa californica virus, 3) Human immunodeficiency virus 2, 4) Arabidopsis thaliana mtDNA, 5) Plasmodium falciparum mtDNA, and 6) Arabidopsis thaliana chDNA (Figure 1). For each genome, an artificial sequence with the same appearance frequencies of triplets (three successive bases) and the same number of bases as the real sequence was generated using random numbers, as previously reported [11]. S. cerevisiae chromosome 1 (230,203 nt, Figure 1(g)) was used as a control for the real and the artificial chromosomes. The window length (w, base number, nt) for each genome depended on the genome size, as described in the MATERIALS AND METHODS section.
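One possible way to generate such an artificial sequence is sketched below in Python. The generation procedure is only described in ref. [11], so this is an assumption: triplets are drawn at random, with replacement, in proportion to their observed counts, and concatenated until the original length is reached. The function name `artificial_sequence` and the `seed` parameter are hypothetical, and the triplet frequencies of the output match those of the input only approximately (in expectation).

```python
import random
from collections import Counter


def artificial_sequence(seq, seed=0):
    """Random sequence with the same length as seq, built from triplets
    sampled in proportion to their appearance counts in seq.

    Illustrative assumption only; not necessarily the method of ref. [11].
    """
    seq = seq.upper()
    triplets = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    if not triplets:
        return seq  # shorter than one triplet: nothing to resample
    rng = random.Random(seed)          # fixed seed for reproducibility
    keys = list(triplets)
    weights = [triplets[k] for k in keys]
    n_draws = -(-len(seq) // 3)        # ceiling of len(seq) / 3
    drawn = rng.choices(keys, weights=weights, k=n_draws)
    return "".join(drawn)[:len(seq)]
```

Because positions are drawn independently, any long-range ordering of the real genome is destroyed while the local composition is roughly kept, which is the property the real-versus-artificial comparison relies on.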
The four bases were localized on the real genome of each species (Figure 1, left panels), whereas the "A", "T", "G" and "C" frequencies were distributed uniformly on each artificial genome sequence (Figure 1, right panels). In addition, the frequency of "A" was similar to that of "T", and that of "C" was similar to that of "G", which corresponded to the base symmetry, i.e., the hydrogen bonding of A-T and G-C (GC content) of each chromosome [11]. When the genome was AT-rich, the frequency of A (or T) was higher than that of C (or G) (Figure 1).
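The localization of bases along a genome can be examined by counting one base in successive windows of length w. The Python sketch below assumes non-overlapping windows and discards a final incomplete window; both choices, and the function name `windowed_base_frequency`, are illustrative assumptions rather than the settings used for Figure 1.

```python
def windowed_base_frequency(seq, base, window):
    """Fraction of `base` in each successive non-overlapping window.

    Assumption: windows do not overlap and a trailing partial window
    is dropped. Returns a list with one value per complete window.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    seq = seq.upper()
    base = base.upper()
    return [
        seq[start:start + window].count(base) / window
        for start in range(0, len(seq) - window + 1, window)
    ]
```

For example, `windowed_base_frequency("AAAATTTTAATT", "A", 4)` gives `[1.0, 0.0, 0.5]`. A localized (biased) base shows large swings between windows, while a uniformly distributed base gives values that stay close to its overall frequency.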
Similar results were observed in the base distribution between real genomes and their artificial genome sequences when single- or double-stranded RNA was used as the genome (Figure 1(c)). These results indicated that there might be many A-T (A-U for RNA) and G-C hydrogen bonds within a single strand of DNA as an intra-chromosomal molecule, regardless of whether the organism is a eukaryote or a prokaryote. The artificial sequence of each genome or chromosome showed the reverse-complement symmetry, but its four bases were distributed uniformly, with the same molar contents of A and T and of G and C as in the genomic DNA molecule [11]. The low symmetry of A/T and G/C, as in HIV type 2 (Table 1), or of three successive base sequences, as in SV40 (Table 2), affected the distribution of the four bases (Figure 1); in particular, the low symmetry of the four bases in the HIV type 2 genome (Table 1) affected the distribution of its bases (Figure 1(c)). Other small RNA genomes, such as Fujinami sarcoma virus (RNA, 4788 nt) and RSV (ss-RNA, 9392 nt), maintained the base symmetry and the base distribution (Table 1 and data not shown). Therefore, the low symmetry of the four bases in the HIV type 2 genome might be caused not only by the genome being RNA, but might also be related to the origin and the evolution of the genome.

The Genome Bases Had Multiple Fractality
Figure 2 shows the distribution curves of the adenine base "A" in small genomes. For most genomes, the "L" value should be partitioned at 15 to observe the multiple fractality, but in the very small genomes composed of 10,000-15,000 nt such as SV40 (a), HIV (c) and P. falciparum mtDNA (e), the linearly decreased region (power-law tail) at long distances could be observed when the L-value was partitioned at 10, i.e., L = 1-10 and L more than 10, in the double logarithmic plot of L vs. P (L).
Real chromosomes had the base symmetry (the reverse-complement symmetry) as well as the base bias, whereas the artificial genome sequences had only the reverse-complement symmetry and not the bias of the base distribution. Based on the above results, how are the four bases, A, T (U), G and C, placed on a single strand of DNA in a genome? To address this issue, we investigated the fractal characteristics of the real genomes and the artificial genomes based on the distribution of the base distance (L). Each base-distribution curve P (L) expresses the distribution of the distance L between a base and the next identical base. For the base "A", for example, the L-value corresponded to the number of bases from one "A" to the next "A" in the genomic DNA, and P (L) is the sum of the occurrences with the same base distance L in the genomic DNA [11].
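The base-distance distribution P (L) described above can be computed with a few lines of Python. In this sketch, L is the difference between the positions of one occurrence of a base and the next occurrence of the same base, and the counts are divided by their total so that P (L) is a probability, as in the abstract; the function name `base_distance_distribution` and this normalization are assumptions for illustration.

```python
from collections import Counter


def base_distance_distribution(seq, base="A"):
    """P(L): probability that the next occurrence of `base` lies L
    positions after the current one (L = 1 means adjacent bases).

    Assumption: counts are normalized by the total number of gaps.
    """
    seq = seq.upper()
    base = base.upper()
    positions = [i for i, b in enumerate(seq) if b == base]
    gaps = Counter(nxt - cur for cur, nxt in zip(positions, positions[1:]))
    total = sum(gaps.values())
    if total == 0:
        return {}
    return {L: n / total for L, n in sorted(gaps.items())}
```

Plotting log L against log P (L) from this dictionary, for a real genome and for its artificial counterpart, reproduces the kind of comparison made in Figure 2: an exponential decay at short L, and for real genomes a linearly decreasing tail at long L.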

A simple distinction between the multifractality (the linearly decreased fractality = power-law tail) and the unifractality (the exponentially decreased fractality) of the base distribution in a sequence was made using the fractal analysis described in the MATERIALS AND METHODS section.
For example, let us consider the case of adenine "A" in the SV40 genome. When the L-value was 1 through 10, the distribution curve P (L) of adenine (A) was fitted to an exponential equation, $y = ae^{-bx}$ (Eq.1, x = logL, y = logP(L); a and b are constants). For adenine "A" in the SV40 genome, the a and b values calculated from Eq.1 were 0.3819 and 0.3400, respectively (Figure 2(a)).
The identification of the multiple fractality of the bases in these genomes was also confirmed by the f (α) spectrum (Figure 3). When f (α) varied as a function of α, the fractality must be multifractal (red diamond, Figures 2 and 3); in contrast, when f (α) was constant at any α-value, the fractality must be unifractal (black square, Figures 2 and 3).
The other three bases, thymine "T", guanine "G" and cytosine "C", in the SV40 genome also behaved in a manner similar to "A", with the multiple fractality at the boundary of the L-value. In addition, the a and b values of A and T, and of G and C, were identical. These fractal characteristics of a single strand of DNA of the genome were also obtained for other species (Figure 1, data not shown, ref. 11). In contrast, in the artificial genome sequences, neither the bias of the four bases on the genome nor the multiple fractality was observed, regardless of the distance in the base distribution (L-value = 10 or more). Thus, the bases of the artificial genome sequences showed only the exponentially decreased fractality (Eq.1, unifractal) even when L was more than 10, and the multiple fractality seen in the real genomes was not observed anywhere in the artificial sequences, although the base numbers (nt) and the appearance frequencies of the base sequences were the same as in each real genome, as described in the MATERIALS AND METHODS section (data not shown).
Many studies using a part of the genomic DNA of E. coli and other model DNA sequences have reported that genomic DNA has fractality [44][45][46][47]. These studies might have been based on bacteriophage and prokaryotic genomes, because the fractality of large genomes such as the S. cerevisiae and H. sapiens genomes had not been analyzed in those days; in addition, the multiple fractality might not have been observed in the previously published literature. Essentially, all genomes or chromosomes might have three structural features in a single strand of DNA: the co-existence of the reverse-complement symmetry, the bias, and the multiple fractality [11].
These three structural features of single-stranded genomic DNA could be observed only in the real (active) genome, and not in an individual gene, in short DNA or in randomly ordered DNA such as the artificial sequence of the genome [11]. When these three structural features co-existed, the genes on the genome could be expressed, and the resulting products might function timely and properly in living cells, even in the small genomes. The bases of genomes were not placed randomly, but seemed to be placed in a sophisticated manner by generation rules as a single strand of genomic DNA, even in the small genomes. Presumably, two such single strands of DNA with the structural features described above might be assembled to form the anti- ... when there were several chromosomes in one organism. In addition, individual chromosomes were used for H. sapiens because a personal computer could not calculate the sum of chromosomes 1-22, X and Y owing to its limited capacity.
The genome data are drafts as described above, but the unreadable areas were a very small part of the huge entire chromosome. So, when there was an unreadable region in a chromosome, we could skip that region in calculating the base frequencies of the chromosome or genome, because the number of bases in the unreadable region was negligible in comparison with the large number of bases in the genomic DNA. The complexity of organisms might depend on the capacity of the non-coding regions in the entire genome.

CONCLUSIONS
The structural features, 1) the reverse-complement symmetry of the bases or of three successive base sequences, 2) the bias of the distribution of the four bases, and 3) the multiple fractality of the distribution of the four bases, were universal in a single strand of genomic DNA, or of the RNA genomes of some viruses, even in small genomes such as virus, plastid and organelle genomes. These three characters co-existed in the single strand of DNA in all genomes. The molar ratios of the bases in the plastid and organelle genomes were different from those of the nuclear chromosomes of the host cells, because most plastid and organelle genomes might have evolved through the process of symbiosis in other organisms.
Figure 2. Distribution curves of the adenine base "A" in small genomes in the double logarithmic plot of L vs. P (L) (x is the L-value).

Table 2. Appearance frequency of three successive base sequences (frequency ratio).