DNA Sequences with Forbidden Words and the Generalized Cantor Set

Abstract

In this work, we establish relations between DNA sequences with missing subsequences (the forbidden words) and the generalized Cantor sets. Various examples associated with some generalized Cantor sets, including Hao’s frame representation and the generalized Sierpinski Set, along with their fractal graphs, are also presented in this work.

Share and Cite:

Yang, Z. and Wang, P. (2019) DNA Sequences with Forbidden Words and the Generalized Cantor Set. Journal of Applied Mathematics and Physics, 7, 1687-1696. doi: 10.4236/jamp.2019.78115.

1. Introduction

Researchers have been interested in the relationships between fractals and DNA structures for years. Just recently, Anitas and Slyamov [1] studied multiscale fractal representing DNA sequences using small-angle scattering analysis. Cattani and Pierro [2] conducted a multifractal analysis of binary images of DNA in order to define a methodological approach to the classification of DNA sequences. Badea and her collaborators [3] characterized the geometry of some medical images of tissues in terms of complexity parameters such as the fractal dimension (FD). Carlo Cattani presented analysis of DNA based on the indicator matrix together with some elementary approach to a fractal estimate of DNA sequences in the book [4] edited by Elloumi and Zomaya. Albrecht-Buehler [5] identified explicitly the GA-sequences as a class of fractal genomic sequences. Ainsworth [6] investigated how the cell’s nucleus holds molecules that manage human’s DNA in the right location. In a book edited by Crilly, Earnshaw and Jones, Voss applied standard spectral density measurement techniques to demonstrate the ubiquity of low frequency noise and long range fractal correlations.

The study of the genome or DNA sequences through fractal analysis is very interesting. DNA sequences can be seen as sequences over the alphabet Σ = { a , c , g , t } . Subsequences that do not appear in DNA are considered as forbidden words. A visualization method of the forbidden words in [7] [8] [9] [10] [11] has been designed by B.-L. Hao since 2000. This method is now called Hao’s frame representation. Recently, C.-X. Huang and S.-L. Peng discussed this method in detail, and many beautiful graphics were provided in [12] [13] . From these geometric intuitions, it can be observed that these forbidden words demonstrate certain fractal properties. In fact in this work we generated some amazing fractal graphs associated with DNA sequences with forbidden words as shown in Figure 1.

It is important to explore the fractal generating mechanism that is associated with the forbidden words in the sequence. H. J. Jeffrey [14] [15] and P. Tiňo [16] [17] tried to associate the forbidden words with the IFS (Iterated Functions Systems) using chaos game algorithm. Denote Σ as the set of all finite sequences over Σ . Then how to find a generating formula or the mapping σ : Σ w Σ , where w is a sequence that does not contain forbidden subsequences, or corresponding iteration method? As was pointed out by P. Tiňo, the IFS is a multifractal and therefore the generating formula would be relatively complicated.

In order to detect the structures of some symbolic sequences, one has to find the properties of their topology and metric and be able to visualize these sequences. To do this, we have to provide a type of graphical representation together with their topology and metric properties so that we can directly reveal their corresponding fractal graphs. This kind of representation method is important and necessary.

For an alphabet with cardinal 3, the well known CGR method (that is, Chaos Game Representation method) was first introduced by M.F. Barnsley by considering the points in an equilateral triangle. The substrings of a string were shown graphically (see [18] ). For an alphabet Σ = { a , c , g , t } with cardinality 4, the CGR method was later generalized by H.J. Jeffrey so that the DNA sequences can be visualized (see [14] [15] ). The authors have transformed the DNA sequences into pseudo random walk in a 2-dimensional plane or in a 3-dimensional space [19] [20] [21] . We notice here that an iterated function system can be applied to construct a graphical representation of some DNA sequences [16] [17] . The points in the unit square [ 0,1 ] × [ 0,1 ] can be used to denote the substrings of the DNA sequences. Consequently, the four vertices of the unit square are labelled as a , c , g , t .

In application, the frame representation method proposed by Hao et al. is more intuitive and visual [9] [10] . The unit square [ 0,1 ] × [ 0,1 ] is divided equally with vertical and horizontal lines so that there are 4 k congruent small squares with side length 2 k and area 4 k . For the alphabet Σ = { a , c , g , t } with cardinality 4, each small square of side length 2 k is used to denote the string in Σ k ( k = 1 , 2 , 3 , ) regularly (See 1-, 2- and 3-frame graphs in Figures 2(a)-(c)).

Figure 1. Graphs of some forbidden words.

(a) (b) (c)

Figure 2. The frame representation method of B.L. Hao et al. (a) 1-frame graph; (b) 2-frame graph; (c) 3-frame graph.

With the frame representation method of B.L. Hao, the repetition topology structure of the subsequences (i.e. the strings in Σ k ) of a DNA sequence can be easily visualized and efficiently drawn. The avoided or the under-represented short strings in the genome sequence form the forbidden words. These forbidden words are the reasons or the basis of the constructed fractals.

P. Tino [16] [17] proved the equivalence of the CGR method and the frame representation method of B.L. Hao et al. He noted that the cardinality of an alphabet can be generalized to a square integer ( | Σ | = b 2 simultaneously for some integer b). We will in this paper extend the above methods and relax the restriction to the cardinality of an alphabet.

The order of this paper is as follows. In Section 2, we will first convert the problem into the discussion on certain type of generalized Cantor set, which can naturally correspond to multifractals, and then in Section 3, we will induce Hao’s frame representation according to the principle that the correspondence between line segment and unit square is one-to-one [22] . Several examples, along with their fractal graphs, of some generalized Cantor sets are given at the end of this paper.

2. Forbidden Words and the Generalized Cantor Set

Rewrite the alphabet as Σ = { 0 , 1 , 2 , 3 } . We first give the following definition.

Definition 2.1 Let Σ = { 0 , 1 , 2 , 3 } . Denote B as the set consist of l finite sequences with length k ( 1 ) :

B = { t 11 t 1 k , t 21 t 2 k , , t l 1 t l k } , t i j Σ , i = 1 , , l , i = 1 , , k . (1)

Then call the infinite sequences over Σ

s = a 1 a 2 a n , a n Σ , and a 1 a i + 1 a i + k 1 B , i = 1 , 2 (2)

the DNA sequence with no forbidden words B, a.k.a. allowed sequence.

It is known that when x [ 0,1 ] is expanded in ternary representation, the subset in [ 0,1 ]

C = { x = 0 3 . x 1 x 2 x n , x { 0 , 2 } }

is called the Cantor set. Similarly, with quaternary expansion, we give the following definition.

Definition 2.2 When x [ 0,1 ] is represented in quaternary expansion

x = n = 1 a n 4 n = 0 4 . a 1 a 2 a n , a n Σ , (3)

we call

C G = { x = 0 4 . a 1 a 2 a n , a n Σ , a i a i + 1 a i + k 1 B , i = 1 , 2 , } , k 1 (4)

the generalized Cantor set.

Apparently, the discussions on DNA sequences (1) (2) that contain no forbidden words B can be converted into the discussion on the generalized Cantor set C G .

Let b k i = 4 k 1 t i 1 + + 4 t i k 1 + t i k , b k i { 0,1, ,4 k 1 } , i = 1 , 2 , , l , and

B = { b k 1 , b k 2 , , b k l } , (5)

Then, the condition a i a i + 1 a i + k 1 B , i = 1 , 2 , in Definition 2.2 can be rewritten as

4 k 1 a i + 4 k 2 a i + 1 + + a i + k B , i = 1 , 2 ,

Theorem 2.1 The generalized Cantor set C G can be inducted by using an iteration method.

Proof. In fact, for the ( k 1 ) th step of the quaternary expansion of x [ 0,1 ] , there is

x = a 1 4 + + a k 1 4 k 1 + x k 1 4 k 1 , 0 x k 1 1, a 1 , , a k Σ . (6)

Let

x k 1 = a k 4 + x k 4 , 0 x k 1, a k Σ and 4 k 1 a 1 + + a k B . (7)

Substitute (7) into (6),

x = a 1 4 + + a k 4 k + x k 4 k , 0 x k 1 , 4 k 1 a 1 + + a k B . (8)

In general, we let

x i 1 + k = a i + k 4 + x i + k 4 , 0 x i + k 1, a i + k Σ and 4 k 1 a i + + a k + i B . (9)

and as i , we obtain the generalized Cantor set C G (2.2).

[ j = 1 k a j 4 j , j = 1 k a j 4 j + 1 4 k ] , a j Σ are 4 k intervals in [ 0,1 ] with length 1 4 k . From the iteration Equation (7) in the theorem, the iteration acts differently on the l subintervals than on the 4 k l intervals. Hence we have [11] .

Corollary 2.3 The generalized Cantor set C G is multifractal.

Proof. In the construction of the generalized Cantor sets C G , measures on removed portions are redistributed to the neighboring sections repeatedly. Thus C G is multifractal.

Obviously, the generalized Cantor sets are applicable for all p-carry representation (p is an integer).

3. The Hao’s Frame Representation of the Generalized Cantor Set CG

The theoretic foundation of the construction of DNA sequences can be seen in [12] . The subintervals in the quaternary expansion of x [ 0,1 ] can be one-to-one corresponding to the subsquares that are obtained by repeatedly equally dividing the unit square (and its subsquares) into 4 smaller subsquares. Cantor sets are created in one dimension in [ 0,1 ] while Sierpinski sets are constructed in two dimension within [ 0,1 ] × [ 0,1 ] . Using the corresponding relationship between the unit interval and the unit square, we can convert the discussion on the generalized Cantor sets into the discussion on the generalized Sierpinski sets on the unit square.

Let 0 ξ , η 1 . The binary expansion of ( ξ , η ) is

( ξ , η ) = ( n = 1 c n 2 n , n = 1 d n 2 n ) , c n , d n { 0 , 1 } . (10)

The expansion can be related to the quaternary expansion of x [ 0,1 ] as follows:

a i = c i + 2 d i = ( 0 : ( c i , d i ) = ( 0 , 0 ) 1 : ( c i , d i ) = ( 1 , 0 ) 2 : ( c i , d i ) = ( 0 , 1 ) 3 : ( c i , d i ) = ( 1 , 1 ) (11)

Thus the forbidden words b k i in B can be represented as

b k i = 4 k 1 ( c i 1 + 2 d i 1 ) + + 4 ( c i k 1 + 2 d i k 1 ) + ( c i k + 2 d i k ) , c i 1 , d i 1 , , c i k , d i k { 0 , 1 } , i = 1 , 2 , , l (12)

Definition 3.1 Let 0 ξ , η 1 and the binary expansion of ( ξ , η ) is (10). Then call

S G = { ( ξ , η ) = ( n = 1 c n 2 n , n = 1 d n 2 n ) , c n , d n { 0 , 1 } , 4 k 1 ( c i 1 + 2 d i 1 ) + + 4 ( c i k 1 + 2 d i k 1 ) + ( c i k + 2 d i k ) B } (13)

the generalized Sierpinski set that corresponds to the the generalized Cantor set C G ( 4 ) .

Theorem 3.1 The generalized Sierpinski set S G can be inducted by iterating method.

Proof. The ( k 1 ) th binary expansion of ( ξ , η ) is

( ξ , η ) = ( n = 1 k 1 c n 2 n + ξ k 1 2 k 1 , n = 1 k 1 d n 2 n + η k 1 2 k 1 ) , 0 ξ k 1 , η k 1 1. (14)

Let

( ξ k 1 , η k 1 ) = ( c k 2 + ξ k 2 , d k 2 + η k 2 ) , c k , d k { 0 , 1 } , 0 ξ k , η k 1 , 4 k 1 ( c 1 + 2 d 1 ) + + 4 ( c k 1 + 2 d k 1 ) + ( c k + 2 d k ) B , c k , d k { 0 , 1 } (15)

Substitute (15) into (14), we have

( ξ , η ) = ( n = 1 k c n 2 n + ξ k 2 k , n = 1 k d n 2 n + η k 2 k ) , 4 k 1 ( c 1 + 2 d 1 ) + + 4 ( c k 1 + 2 d k 1 ) + ( c k + 2 d k ) B

Generally, let

( ξ k + i 1 , η k + i 1 ) = ( c k + i 2 + ξ k + i 2 , d k + i 2 + η k + i 2 ) , 0 ξ k + i , η k + i 1 , c k + i , d k + i { 0 , 1 } , 4 k 1 ( c i + 2 d i ) + + 4 ( c k + i 1 + 2 d k + i 1 ) + ( c k + i + 2 d k + i ) B

Noticing the corresponding relationship between numbers and the subsquares, naturally we have Hao’s frame representation. The second-order Hao’s frame representation can be inducted from the corresponding relationship illustrated in Figure 3.

The next few examples illustrate analytic structure of some DNA sequences along with the fractal graphs of the relevant generalized Cantor sets.

Example 3.2 Let Σ = { 0 , 1 , 2 , 3 } , B = { 00 , 11 , 22 } . Then B = { 0 , 5 , 10 } . Hence the arithmetic expression of the generalized Cantor set is

x = 0 4 . a 1 a 2 a n , a n { 0,1,2,3 } and 4 a i + a i + 1 0,5,10, i = 1,2,

And the symbolic sequence is

a 1 a 2 a n , a n Σ , a i a i + 1 B , i = 1 , 2 ,

which is shown graphically in Figure 4.

Example 3.3 Let Σ = { 0 , 1 , 2 , 3 } , B = { 10 , 20 , 30 } . Then B = { 4 , 8 , 12 } . Hence the arithmetic expression of the generalized Cantor set is

x = 0 4 . a 1 a 2 a n , a n { 0,1,2,3 } and 4 a i + a i + 1 4,8,12, i = 1,2,

And the symbolic sequence is

a 1 a 2 a n , a n Σ , a i a i + 1 B , i = 1 , 2 ,

with graphs Figure 5:

Example 3.4 Let Σ = { 0 , 1 , 2 , 3 } , B = { 011 , 022 , 100 , 133 , 200 , 233 , 311 , 322 } . Then B = { 5 , 10 , 16 , 31 , 32 , 47 , 53 , 58 } . Hence the arithmetic expression of the generalized Cantor set is

x = 0 4 . a 1 a 2 a n , a n { 0,1,2,3 } and 4 2 a i + 4 a i + 1 + a i B , i = 1,2,

And the symbolic sequence is

a 1 a 2 a n , a n Σ , a i a i + 1 B , i = 1 , 2 ,

which are shown below

Similarly, we could produce the following amazing fractal graphs shown in Figure 6, Figure 7, of different DNA sequences with various forbidden words.

Figure 3. Hao’s frame representation of k = 2 .

Figure 4. B = { 0 , 5 , 10 } .

Figure 5. B = { 4 , 8 , 12 } .

Figure 6. B = { 5 , 10 , 16 , 31 , 32 , 47 , 53 , 58 } .

Figure 7. Other examples.

4. Conclusion

We established relations between the generalized Cantor sets and some DNA sequences with missing words. And we have associated Hao’s frame representations and the generalized Sierpinski set with the generalized Cantor sets. The authors are interested in applying the analytical representation method to study the graphical results of space filling research works (cf. [23] [24] [25] ).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Mircea Anitas, E. and Slyamov, A. (2017) Structural Characterization of Chaos Game Fractals Using Small-Angle Scattering Analysis. PLoS ONE, 12, e0181385.
https://doi.org/10.1371/journal.pone.0181385
[2] Cattani, C. and Pierro, G. (2013) on the Fractal Geometry of DNA by the Binary Image Analysis. Bulletin of Mathematical Biology, 75, 1544-1570.
https://doi.org/10.1007/s11538-013-9859-9
[3] Badea, A.F., et al. (2013) Fractal Analysis of Elastographic Images for Automatic Detection of Diffuse Diseases of Salivary Glands: Preliminary Results. Computational and Mathematical Methods in Medicine, 2013, Article ID: 347238.
https://doi.org/10.1155/2013/347238
[4] Elloumi, M. and Zomaya, A.Y. (2014) Biological Knowledge Discovery Handbook. John Wiley & Sons, Hoboken, NJ.
https://doi.org/10.1002/9781118617151
[5] Albrecht-Buehler, G. (2012) Fractal Genome Sequences. Gene, 498, 20-27.
https://doi.org/10.1016/j.gene.2012.01.090
[6] Ainsworth, C. (2009) Cells Go Fractal. Nature-International Weekly Journal of Science.
[7] Falconer, K.J. (1998) Techniques in Fractal Geometry. John Wiley & Sons, Hoboken, NJ.
[8] Hao, B.-L., Xie, H.-M. and Chen, G.-Y. (2000) Factorizable Language: From Dynamics to Bacterial Complete Genomes. Physica A: Statistical Mechanics and Its Applications, 288, 10-20.
https://doi.org/10.1016/S0378-4371(00)00411-8
[9] Hao, B.-L., Xie, H.-M., Chen, G.-Y. and Chen, G.-Y. (2000) Avoided Strings in Bacterial Complete Genomes and a Related Combinatorial Problem. Annals of Cominatorics, 4, 247-255.
https://doi.org/10.1007/PL00001279
[10] Yu, Z.-G., Hao, B.-L., Xie, H.-M. and Chen, G.-Y. (2000) Dimensions of Fractals Related to Languages Defined by Tagged Strings in Complete Genomes. Chaos, Solitons & Fractals, 11, 2215-2222.
https://doi.org/10.1016/S0960-0779(99)00141-1
[11] Crilly, A.J., Earnshaw, R. and Jones, H. (2013) Applications of Fractals and Chaos: The Shape of Things. Springer-Verlag, Berlin, Heidelberg.
[12] Huang, C.X. and Peng, S.L. (2006) A Note on Fractals of One Forbidden Word and Their Box Dimensions. Fractals, 14, 327-337.
https://doi.org/10.1142/S0218348X06003325
[13] Huang, C.X. and Peng, S.L. (2008) Fractals of Forbidden Words and Approximating Their Box Dimensions. Physica A: Statistical Mechanics and Its Applications, 387, 703-716.
https://doi.org/10.1016/j.physa.2007.09.009
[14] Jeffrey, H.J. (1990) Chaos Game Representation of Genetic Sequences. Nucleic Acids Research, 18, 2163-2170.
https://doi.org/10.1093/nar/18.8.2163
[15] Jeffrey, H.J. (1992) Chaos Game Visualization of Sequences. Computers & Graphics, 16, 25-33.
https://doi.org/10.1016/0097-8493(92)90067-6
[16] Tiňo, P. (1999) Spatial Representation of Symbolic Sequences Iterative Function Systems. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 29, 386-393.
https://doi.org/10.1109/3468.769757
[17] Tiňo, P. (2002) Multifractal Properties of Hao’s Geometric Representations of DNA Sequences. Physica A: Statistical Mechanics and Its Applications, 304, 480-494.
https://doi.org/10.1016/S0378-4371(01)00574-X
[18] Barnsley, M.F. (1988) Fractals Everywhere. Academic Press, Cambridge, MA.
[19] Berthelsen, C.L., Glazier, J.A. and Skolnik, M.H. (1992) Global Fractal Dimension of Human DNA Sequences Treated as Pseudorandom Walks. Physical Review A, 45, 8902-8913.
https://doi.org/10.1103/PhysRevA.45.8902
[20] Stanley, H.E., Buldyrev, S.V., Goldberger, A.L., Goldberger, Z.D., Havlin, S., Mantegna, R.N., Ossadnik, S.M., Peng, C.K. and Simons, M. (1994) Statistical Mechanics in Biology: How Ubiquitous Are Long-Range Correlations? Physica A: Statistical Mechanics and Its Applications, 205, 214-253.
https://doi.org/10.1016/0378-4371(94)90502-9
[21] Leong, P.M. and Morgenthaler, S. (1995) Random Walk and Gap Plots of DNA Sequences. Bioinformatics, 11, 503-507.
https://doi.org/10.1093/bioinformatics/11.5.503
[22] Solovyev, V.V., Korolev, S.V. and Lim, H.A. (1993) A New Approach for the Classification of Functional Regions of DNA Sequences Based of Fractal Representation. International Journal of Genomic Research, 1, 109-128.
[23] Bagga, S., Girdhar, A., Trivedi, M.C. and Yang, Y.Z. (2016) RMI Approach to Cluster Based Cache Oblivious Peano Curves. 2016 Second International Conference on Computational Intelligence & Communication Technology, Ghaziabad, India, 12-13 February 2016, 89-95.
https://doi.org/10.1109/CICT.2016.26
[24] Makarov, B.M. and Podkorytov, A.N. (2017) On the Coordinate Functions of Peano Curves. St. Petersburg Mathematical Journal, 28, 115-125.
https://doi.org/10.1090/spmj/1441
[25] Platos, J., Kromer, P. and Snasel, V. (2015) Efficient Area Association Using Space Filling Curves. 2015 International Conference on Intelligent Networking and Collaborative Systems, Taipei, 2-4 September 2015, 322-326.
https://doi.org/10.1109/INCoS.2015.39

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.