Simulation for chaos game representation of genomes by recurrent iterated function systems

some pattern (information) in the DNA sequence Chaos game representation (CGR) of DNA (Goldman 1993). Goldman (1993) interpreted the sequences and linked protein sequences CGRs in a biologically meaningful way. All points from genomes was proposed by Jeffrey (1990) plotted within a quadrant must corresponding to suban d Yu et al. (2004), respectively. In this sequences of the DNA sequence that end with the paper, we consider the CGR of three kinds of base labelling the corner of that quadrant. He also prosequences from complete genomes: whole posed a discrete time Markov Chain model to simugenome DNA sequences, linked coding DNA late the CGR of DNA sequences and use the sequences and linked protein sequences. sequence's dinucleotide and trinucleotide frequenSome fractal patterns are found in these cies to calculate the probabilities in these models. CGRs. A recurrent iterated function systems Goldman's Markov model can be calculated directly (RIFS) model is proposed to simulate the and easily from the raw DNA sequences, without refCGRs of these sequences from genomes and erence to the CGR. their induced measures. Numerical results Deschavanne et al. (1999) used CGR of genomes on 50 genomes show that the RIFS model can to discuss the classification of species. Almeida et al. s i m u l a t e v e r y w e l l t h e C G R s a n d t h e i r (2001) showed the distribution of positions in the induced measures. The parameters est iCGR plane is a generalization of Markov Chain probmated in the RIFS model reflect information ability tables that accommodates non-integer orders. on species classification. Joseph and Sasikumar (2006) proposed a fast algo-


INTRODUCTION
proteins.The idea of CGR of DNA sequences pro-The hereditary information of organisms (except for posed by Jeffrey (1990) was generalized and applied RNA-viruses) is encoded in their DNA sequences for visualizing and analyzing protein databases by which are one-dimensional unbranched polymers Fiser et al. (1994).Generalization of CGR of DNA made up from four different kinds of monomers (numay take place in several ways.In the simplest case, cleotides): adenine (a), cytosine (c), guanine (g), and the square in CGR of DNA is replaced by an n-sided thymine (t).Based on a technique from chaotic regular polygon (n-gon), where n is the number of difdynamics, Jeffrey (1990) proposed a chaos game repferent elements in the sequence to be represented.As resentation (CGR) of DNA sequences by using the proteins consist of 20 kinds of amino acids, a 20four vertices of a square in the plane to represent sided regular polygon (regular 20-gon) is the most a,c,g and t.The method produces a plot of a DNA adequate for protein sequence representation.A few sequence which displays both local and global patthousand points result in an 'attractor' which gives a terns.Self-similarity or fractal structures were found visualization of the rare or frequent residues and in these plots.Some open questions from the biologisequence motifs.Fiser et al. (1994) pointed out that cal point of view based on the CGR were proposed the chaos game representation can also be used to (Jeffrey 1990).study 3D structures of proteins.If the DNA sequences were a random collection of Basu et al. (1998) proposed a new method for the sequences and linked sequences of all protein chaos game representation of different families of sequences from complete genomes.proteins.Using concatenated amino acid sequences For DNA sequences, the CGR is obtained by using of proteins belonging to a particular family and a 12the four vertices of a square in the plane to represent sided regular polygon, each vertex of which reprea,c,g and t (Jeffrey 1990).The first point of the plot is sents a group of amino acid residues leading to conplaced half way between the center of the square and servative substitutions, the method generates the the vertex corresponding to the first letter, the ith CGR of the family and allows pictorial representapoint of the plot is placed half way between the (ition of the pattern characterizing the family.Basu et 1)th point and the vertex corresponding to the ith letal.(1998) found that the CGRs of different protein ter in the DNA sequence.families exhibit distinct visually identifiable patterns.
For linked protein sequences, we outline here the This implies that different functional classes of proway to get the CGR from Yu et al. (2004b).The proteins produce specific statistical biases in the distritein sequence is formed by twenty different kinds of bution of different mono-, di-, tri-, or higher order ami no a cid s, n ame ly Ala nin e (A), Arginine (R), peptides along their primary sequences.
Aspara gine (N), Aspartic acid (D), Cysteine (C), A well-known model of protein sequence analysis Glut amic acid (E), Glutamine (Q), Glycine (G), is the HP model proposed by Dill et al. (1985).In this Histidine (H), Isoleucine (I), Leucine (L), Lysine (K), model 20 kinds of amino acids are divided into two Meth ioni ne (M), Phenylalanine (F), Proline (P), types, hydrophobic (H) (or non-polar) and polar (P) Serine (S), Threonine (T), Tryptophan (W), Tyrosine (or hydrophilic).But the HP model may be too simple (Y) and Valine (V) (Brown 1998, page 109).In the and lacks sufficient information on the heterogeneity detailed HP model, they can be divided into four and the complexity of the natural set of residues classes: non-polar, negative polar, uncharged polar (Wang and Wang 2000).According to Brown (1998) where the four vertices correspond to the four letters Although we proposed the CGR for linked protein 0,1,2,3.The first point of the plot is placed half way sequences from genomes (Yu et al. 2004b), we did between the center of the square and the vertex correnot consider how to simulate the CGRs.In this paper, sponding to the first letter of the sequence X(s); the we extend the CGR to the study of whole-genome ith point of the plot is then placed half way between DNA sequences and linked coding DNA sequences the (i-1)th point and the vertex corresponding to the from genomes.Then we use the RIFS model to simuith letter.We then call the obtained plot the CGR of late the CGR of these 3 kinds of data from genomes the protein sequence s based on the detailed HP and their induced measures.The probability matrix in model.our RIFS model is similar to the one in Markov model Usually whole-genome DNA sequences and linked used by Goldman (1993), but the way to estimate this coding DNA sequences are relatively long, hence the matrix is different.
resulting CGRs are too dense to visualize any pattern directly.The linked protein sequences are 3 times shorter than the linked coding DNA sequences, and

CHAOS GAME REPRESENTATION OF
their CGRs produce clearer self-similar patterns.For

GENOMES
example, we show the CGR of the linked protein Three kinds of sequences from complete genomes are sequence of the bacterium Mycobacterium tuberculoconsidered, namely, whole-genome DNA sequences sis CDC1551 (MtubC) in .(including protein-coding and non-coding regions), Considering the points in a CGR of an organism, linked sequences of all protein-coding DNA The coefficients in the contractive maps and the l the number of points lying in a subset B of the CGR and probabilities in the RIFS are the parameters to be esti-N is the length of the sequence.We divide the square mated for the me asure that we wa nt to simulate .We A major result for RIFS is that there exists a unique solving the above linear equations.invariant measure of the random walk (Eq.2) When m=0, n 1 whose support is A (Barnsley et al., 1989).(2)

SciRes
If we denote by G the moments obtained directly mn from a given measure, and g the formal expression mn of moments obtained from the above formulae, then solving the optimization problem hence the moments are given by the solution of the linear equations will provide the estimates of the parameters of the RIFS.
Once the RIFS (S (x),p ,i,j=1,2, ,N ) has been estiii j mated, its invariant measure can be simulated in the following way: Generate the attractor of the RIFS via the random walk (Eq.2).Let be the indicator func-B When n=0,m 1 tion of a subset B of the attractor A. From the ergodic theorem for RIFS (Barnsley et al., 1989), the invariant measure is then given by hence the moments are given by the solution of the linear equations By definition, an RIFS describes the scale invariance of a measure.Hence a comparison of the given measure with the invariant measure simulated from the RIFS will confirm whether the given measure has this scaling behaviour.This comparison can be undertaken by computing the cumulative walk of a measure visualized as intensity values on a JJ mesh; here J=128 in this paper.When m,n 1 If we convert the two-dimensional matrix A=( ) to an one dimensional vector by concatekl J J nate every row in A at the end of previous row.We denote the one-dimensional vector as f=(f ,f ,, f ).
12 JJ The cumulative walk is defined as Where f is the average value of all element in vector f.
Returning to the CGR, an RIFS with 4 contractive hence the moments are given by the solution of the maps {S ,S ,S ,S } is fitted to the measure obtained linear equations 1234 from the CGR using the method of moments.Here we can fix Hence the parameters needed to be estimated are the probabilities in the matrix P. Once we have estimated the probability matrix in the RIFS, we can start from the point (0.5, 0.5) and use the chaos game algorithm Eq. (2) to generate a random point sequence{x } with     The probability matrix in our RIFS model characterizes the dinucleotide or di-amino acid frequencies (information) which is similar to the one in Markov model used by Goldman (1993), but the way to estimate this matrix is different.
The values of the RSE of the simulation for 50     better than that of measures for whole-genome DNA are much less than 1.0, confirming that the RIFS data and linked coding DNA data.Finally, the estimodel can simulate very well the measures of three mated parameters in the RIFS models for the linked kinds of data.The values of e for whole-genome DNA protein data from genomes can be used to characterdata are generally larger than those for linked coding ize the phylogenetic relationships of the genomes.DNA data, which in turn are larger than those for linked protein data.In other words, the RIFS model can simulate the measures for linked protein data better than the measures for linked coding DNA data, ACKNOWLEDGEMENTS and can simulate measures for linked coding DNA Financial support was provided by the Chinese National Natural data better than the measures for whole-genome DNA Science Foundation (grant no.30570426), Fok Ying Tung Educa- tion Foundation (grant no.101004) and the Youth foundation of data.We notice that the linked protein sequence is Educational Department of Hunan province in China (grant no. shorter than the corresponding linked coding DNA 05B007) (Z.-G.Yu), and the Australian Research Council (grant no.sequence, while the linked coding DNA is shorter DP0559807) (V.V. Anh).than the whole-genome sequence.We guess the length of the data reflects the information complexity , and positive polar.The eight residues A, I, L , M, F, P, one can divide the polar class in the HP model into W, V designate the non-polar class; the two residues three classes: positive polar, uncharged polar and neg-D, E designate the negative polar class; the seven resiative polar.So 20 different kinds of amino acids can dues N, C, Q , G, S, T, Y designate the uncharged polar be divided into four classes: non-polar, negative class; and the remaining three residues R, H, K desigpolar, uncharged polar and positive polar.In this nate the positive polar class.model, one considers more details than in the HP For a given protein sequence s=s s with length l, 1l model.We call this model a detailed HP model (Yu et where s is one of the twenty kinds of amino acids for i al.2004a).Based on the detailed HP model, we pro-i=1,, l ,we define posed a CGR for the linked protein sequences from the genomes (Yu et al. 2004b).The recurrent iterated function system in fractal theory(Barnsley and Demko, 1985;Falconer, 1997) has been applied successfully to fractal image construction(Barnsley and Demko, 1985;Vrscay, 1991), one dimensional measure representation of genomes(Anh et al. 2002;Yu et al. 2001 Yu et al.  , 2003) )  and magneticWe then obtain a sequence X(s)=a a , where a is field data(Wanliss et  al. 2005; Anh et al. 2005) for 1l i example.Yu et al. (2007) proposed a CGR for the a letter of the alphabet {0,1,2,3}.We next define the magnetic field data and used the RIFS model to simu-CRG for a sequence X(s) in a square [0,1] [ 0,1], late the CGR.

l
now describe the method of moments to perform this [0,1] [0,1] into meshes of sizes 64 64, 128 128, task.In the two-dimensional case of our CGRs, we 512 512 or 1024 1024.This results in a measure for consider a system of N contractive maps each mesh.We then obtain a 64 64, 128 128, 512 512 or 1024 1024 matrixA=( ) , where kl J J J=64,128,512 or 1024, each element is the measure kl value on the corresponding mesh.We call A the measure matrix of the organism.The measure based on If is the invariant measure and A the attractor of 2 a 128 128 mesh on the CGRs are considered in this the RIFS in R , the moments of are paper.For example, the measure based on a 128 128 mesh of the CGR in is shown in .Using the properties of the Markov operator 3. RECURRENT ITERATED FUNCTION defined by (S, P) (Vrscay, 1991), we get SYSTEM FOR A MEASURE Consider a system of contractive maps S={S, S, 12 S} and the associated matrix of probabilities P=(p ) Ni j such that p= 1 , i =1,2, ,N.We consider a random ji j sequence generated by a dynamical system where x is any starting point and is chosen 0n among the set {1,2, ,N}with a probability that depends on the previous index : P( =i)=p .n-1 ni -1,i Then (S, P) is called a recurrent iterated function sys-When n=0, m=0, from we have tem.Then there exist compact setsA,A ,i=1,2, N i such that for i=1,2, ,N.where set A is called the attractor of the RIFS (S, P). j Then we can get the values for g , j12, ,N by , 00

Figure 1 .
Figure 1.Chaos game representation of the linked protein sequence from genome of Mycobacterium tuberculosis CDC1551(MtubC) (with 1325681 amino acids).

Figure 2 .
Figure 2. The measure based on a 128 128 mesh of the CGR in Figure 1.

i
the same length N of the whole-genome DNA l for i=1,2, ,N.sequence, linked coding DNA sequence or the linked

Figure 3 .
Figure 3.The RIFS simulated CGR for the CGR in Figure 1.Figure 4. The measure based on a 128 128 mesh of the RIFS simulated CGR in Figure 3.

Figure 4 .
Figure 3.The RIFS simulated CGR for the CGR in Figure 1.Figure 4. The measure based on a 128 128 mesh of the RIFS simulated CGR in Figure 3.

Figure 5 .
Figure 5.The walk representation of measures induced by the CGR in Figure 1 and its RIFS simulation in Figure 4. Pixel position Figure 6 Figure 7 Figure 6 Figure 7Figure8

Figure 7 .
Figure 7.The RIFS simulated CGR for the CGR in Figure 6.

Figure 8 .
Figure 8.The walk representation of measures induced by the CGR in Figure 6 and its RIFS simulation in Figure 7.

Table 1 .
The goodness of fit for the walk representations of three kinds of data from 50 genomes.