DNA Sequences with Forbidden Words and the Generalized Cantor Set

In this work, we establish relations between DNA sequences with missing subsequences (the forbidden words) and generalized Cantor sets. Various examples associated with generalized Cantor sets, including Hao's frame representation and the generalized Sierpinski set, are presented along with their fractal graphs.

The study of genomes or DNA sequences through fractal analysis is of considerable interest. DNA sequences can be seen as sequences over the alphabet $\Sigma = \{a, c, g, t\}$. Subsequences that do not appear in a DNA sequence are called forbidden words. Since 2000, B.-L. Hao has developed a method for visualizing the forbidden words [7] [8] [9] [10] [11]; this method is now called Hao's frame representation. Recently, C.-X. Huang and S.-L. Peng discussed this method in detail and provided many beautiful graphics [12] [13]. These geometric pictures suggest that the forbidden words exhibit certain fractal properties. Indeed, in this work we generate fractal graphs associated with DNA sequences with forbidden words, as shown in Figure 1.
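To make the notion of a forbidden word concrete, the following minimal sketch lists all words of a given length $k$ over $\Sigma$ that never occur in a finite sequence. The function name `forbidden_words` and the example string are illustrative assumptions, not part of the cited methods.

```python
from itertools import product

ALPHABET = "acgt"  # the alphabet Sigma = {a, c, g, t}

def forbidden_words(sequence, k):
    """Return the length-k words over ALPHABET that do not occur in sequence.

    Hypothetical helper for illustration: it scans every window of length k
    and reports the words that were never seen.
    """
    seq = sequence.lower()
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    all_words = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [w for w in all_words if w not in present]

if __name__ == "__main__":
    # A short made-up sequence; real genomes are far longer.
    print(forbidden_words("acgtacgtaacc", 2))
```

For a real genome, the set of absent words depends strongly on $k$ and on the sequence length, since a short sequence cannot contain all $4^k$ words.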
It is important to explore the fractal generating mechanism associated with the forbidden words in a sequence. H. J. Jeffrey [14] [15] and P. Tiňo [16] [17] tried to associate the forbidden words with an IFS (Iterated Function System) using the chaos game algorithm. Denote by $\Sigma^*$ the set of all finite sequences over $\Sigma$. The question is then how to find a generating formula, a mapping defined on the sequences $w$ that contain no forbidden subsequences, or a corresponding iteration method. As was pointed out by P. Tiňo, the IFS is multifractal, and therefore the generating formula would be relatively complicated.
In order to detect the structure of symbolic sequences, one has to determine their topological and metric properties and be able to visualize the sequences. To do this, we need a graphical representation that carries these topological and metric properties, so that the corresponding fractal graphs can be revealed directly. This kind of representation method is important and necessary.
For an alphabet of cardinality 3, the well-known CGR method (Chaos Game Representation) was first introduced by M. F. Barnsley by considering points in an equilateral triangle; the substrings of a string were shown graphically (see [18]). For an alphabet of cardinality 4, the CGR method was later generalized by H. J. Jeffrey so that DNA sequences can be visualized (see [14] [15]). DNA sequences have also been transformed into pseudo-random walks in a 2-dimensional plane or in 3-dimensional space [19] [20] [21]. We note that an iterated function system can be applied to construct a graphical representation of some DNA sequences [16] [17]. In this paper we extend the above methods and relax the restriction on the cardinality of the alphabet.
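The chaos game for an alphabet of cardinality 4 can be sketched as follows: each symbol is attached to a corner of the unit square, and the current point moves halfway toward the corner of the next symbol. The corner assignment and the starting point below are assumptions made for illustration; other assignments appear in the literature.

```python
# Assumed corner assignment for the four nucleotides (illustrative only).
CORNERS = {
    "a": (0.0, 0.0),
    "c": (0.0, 1.0),
    "g": (1.0, 1.0),
    "t": (1.0, 0.0),
}

def chaos_game(sequence, start=(0.5, 0.5)):
    """Return the list of CGR points visited while reading the sequence.

    Each step replaces the current point by the midpoint between it and
    the corner attached to the next symbol.
    """
    x, y = start
    points = []
    for symbol in sequence.lower():
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points

if __name__ == "__main__":
    for point in chaos_game("acgt"):
        print(point)
```

Under this construction, the point reached after reading a word of length $k$ lies in a sub-square of side $2^{-k}$ determined by that word, so a word that never occurs leaves its sub-square empty.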
This paper is organized as follows. In Section 2, we first convert the problem into a discussion of a certain type of generalized Cantor set, which naturally corresponds to multifractals. In Section 3, we derive Hao's frame representation from the principle that the correspondence between the line segment and the unit square is one-to-one [22]. Several examples of generalized Cantor sets, along with their fractal graphs, are given at the end of the paper.

Forbidden Words and the Generalized Cantor Set
Rewrite the alphabet as [...]. We first give the following definition.
Then we call the infinite sequences over $\Sigma$ the DNA sequences with no forbidden words $B$, also known as allowed sequences.
It is known that when [...] the generalized Cantor set.
Apparently, the discussion of DNA sequences (1) (2) that contain no forbidden words $B$ can be converted into a discussion of the generalized Cantor set. Then, the condition [...]. Substitute (7) into (6), [...]. In general, we let [...], and as $i \to \infty$, we obtain the generalized Cantor set $C_G$ (2.2). From the iteration Equation (7) in the theorem, the iteration acts differently on the $l$ subintervals than on the $4^k - l$ intervals. Hence we have [11].
Corollary 2.3 The generalized Cantor set $C_G$ is multifractal. Proof. In the construction of the generalized Cantor set $C_G$, the measures on the removed portions are repeatedly redistributed to the neighboring sections. Thus $C_G$ is multifractal.
Obviously, the generalized Cantor sets are applicable to all base-$p$ representations ($p$ an integer).
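The correspondence between allowed sequences and a Cantor-type set can be illustrated with a small sketch. It assumes the digit assignment $a \to 0$, $c \to 1$, $g \to 2$, $t \to 3$, so that a word of length $n$ becomes a base-4 interval of length $4^{-n}$ in $[0, 1]$; the intervals of words containing a forbidden word are removed. The function name `allowed_intervals` and the digit assignment are assumptions for illustration and are not taken from the theorem above.

```python
from fractions import Fraction
from itertools import product

# Assumed base-4 digit for each nucleotide (illustrative only).
DIGIT = {"a": 0, "c": 1, "g": 2, "t": 3}

def allowed_intervals(forbidden, n):
    """Return the level-n intervals of [0, 1] whose words avoid `forbidden`.

    A word w of length n is kept if no forbidden word occurs in it as a
    substring; it is mapped to [0.w, 0.w + 4**-n] in base 4. Exact
    fractions avoid floating-point rounding.
    """
    intervals = []
    for letters in product("acgt", repeat=n):
        word = "".join(letters)
        if any(b in word for b in forbidden):
            continue  # this interval is removed at some level <= n
        left = sum(Fraction(DIGIT[s], 4 ** (i + 1)) for i, s in enumerate(word))
        intervals.append((left, left + Fraction(1, 4 ** n)))
    return intervals

if __name__ == "__main__":
    # Forbidding the single letters c and g keeps the outer quarters only.
    for left, right in allowed_intervals(["c", "g"], 2):
        print(left, right)
```

Increasing `n` refines the approximation; the intersection over all levels is the set of points whose base-4 expansions avoid every forbidden word under the assumed digit assignment.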

Hao's Frame Representation of the Generalized Cantor Set $C_G$
The theoretical foundation of the construction of DNA sequences can be found in [12]. [...] the generalized Sierpinski set that corresponds to the generalized Cantor set [...]. Substitute (15) [...]. Similarly, we can produce the fractal graphs shown in Figure 6 and Figure 7 for different DNA sequences with various forbidden words.
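A frame-type representation can be sketched by splitting each base-4 symbol into one row bit and one column bit, which mirrors the one-to-one correspondence between the line segment and the unit square. The 2 × 2 layout assumed below (g and c in the top row, a and t in the bottom row) is an illustrative choice and should be checked against [12] before use; the function names are hypothetical.

```python
# Assumed 2x2 layout of the four nucleotides: top row (g, c), bottom row (a, t).
ROW = {"g": 0, "c": 0, "a": 1, "t": 1}
COL = {"g": 0, "c": 1, "a": 0, "t": 1}

def frame_cell(word):
    """Return the (row, column) cell of a word in a 2**k x 2**k frame.

    Each symbol contributes one binary digit to the row index and one to
    the column index, most significant digit first.
    """
    row = col = 0
    for symbol in word.lower():
        row = 2 * row + ROW[symbol]
        col = 2 * col + COL[symbol]
    return row, col

def frame_counts(sequence, k):
    """Count the occurrences of every length-k word, arranged as a frame."""
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    seq = sequence.lower()
    for i in range(len(seq) - k + 1):
        row, col = frame_cell(seq[i:i + k])
        counts[row][col] += 1
    return counts

if __name__ == "__main__":
    for line in frame_counts("acgtacgtaacc", 2):
        print(line)
```

Cells whose count stays at zero correspond to words of length $k$ that do not occur in the sequence, that is, to forbidden words at that length.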