J. Biomedical Science and Engineering, 2011, 4, 272-281 (JBiSE)
doi:10.4236/jbise.2011.44037
Published Online April 2011 in SciRes (http://www.SciRP.org/journal/jbise/). Copyright © 2011 SciRes.

Fuzzy splicing in precursor-mRNA sequences: prediction of aberrant splice-junctions in viral DNA context

Perambur S. Neelakanta, Sharmistha Chatterjee, Mirjana Pavlovic, Abijit Pandya, Dolores de Groff
Department of Computer and Electrical Engineering & Computer Science, Florida Atlantic University, Boca Raton, Florida, USA.
Email: neelakan@fau.edu
Received 31 January 2011; revised 25 March 2011; accepted 28 March 2011.

ABSTRACT

RNA splicing normally generates stable splice-junction sequences in viruses that are important in the context of virus mimicry. Potential variability in envelope proteins may occur with point-mutations inducing cryptic splice-junctions, which would remain unrecognized by T-memory cells of higher organisms in vaccine trials. Such aberrant splice-junctions result from evolution-specific non-conservation of actual splice-junction sites due to mutations; as such, the locations of splice-junctions in a test DNA sequence can only be imprecisely specified. This impreciseness of splice-junction locations (or cryptic sites) in a sequence is evaluated in this study via “noisy” attributes (with associated stochastics) ascribed to the mutated subspace; and relevant fuzzy considerations are invoked with membership attributes expressed in terms of a spatial signal-to-noise ratio (SSNR). That is, the SSNR adopted as a membership function expresses the belongingness of a site-region to the exon/intron subspaces. An illustrative example with actual (Dengue 1 viral) DNA data is furnished, demonstrating the pursuit developed in predicting aberrant splice-junctions at cryptic sites in the test sequence.

Keywords: DNA; Exon/Intron; Aberrant/Cryptic Splice-Junction; mRNA Sequence; Fuzzy Subspace; Spatial SNR

1. INTRODUCTION

Eukaryotic genomic data, encoded via the spatial statistical occurrence of the nucleotide set {A, T, C, G}, eventually translates into a protein complex through the transcription and translation processes. Correct translation is, however, subject to the effects of mutations on evolutionary conservation. The underlying corruptions may manifest at the so-called splice-junctions that separate/delineate two subsequences in a DNA sequence, namely, the (genetic) information-bearing codon segment (called an exon) and the non-informative “junk” codon, also known as a non-codon or intron. (Exons bear the information necessary for protein-making, whereas non-codons are non-informative and their genetic role has not been fully elucidated. Exons and introns appear randomly along the DNA sequence. Codons are typically no more than 200 characters long, while non-codons could be tens of thousands of characters in length. Thus, introns make up most of a typical eukaryotic gene.)

Towards the process of protein-making, introns are first scissored out (in the transcription stage) from the sequence and the remaining exons are spliced together, constituting the so-called messenger RNA (mRNA), which is rendered ready for translation into a protein complex (at the cell interior). Should any errors have occurred (due to mutations), they would open the possibility of evolving wrong or cryptic splice-junctions and lead to (imperfect) translations. That is, aberrant splice-junctions may result from the mutational spectrum [1] and would hamper the making of correct proteins.

Illustrated in Figure 1 is the formation of mRNA via the transcription through translation steps. The locations of the splice-junctions shown in Figure 1 may not, however, be reliably distinct. In a canonical sense, the splice-junction consensus (Figure 1(a)) may follow certain rules with regard to introns and exons [2].
For example, the introns almost always begin with the residue set {gt} at the 5’-end and end with an {ag} at the 3’-end. But, inasmuch as the nucleotide sequence corresponds to a set of statistically permuted elements, {A, T, C, G}, numerous putatively occurring {gt} and {ag} locations (other than in the introns as indicated) may prevail and resemble such canonical patterns.

These putatively occurring {gt} and {ag} locations imply that relying on such canonical details alone may not reasonably and robustly indicate the presence of true splice-junctions. Further, in the event of point mutations, the stemming of aberrant splice-sites is inevitable [1]. As such, if a junction is to be recognized and the prevalence of possible cryptic junction sites elucidated, it is necessary to analyze statistically the prevailing long-range genetic information so as to determine the extent to which subsequences surrounding the splice-junctions differ from sequence segments of adjoining spurious analogs; hence, true versus aberrant (cryptic) splice-junctions can be distinguished.

Figure 1. Transcription through translation steps: (a) Typical splice-junction consensus, 5’ …(a/c)ag gt(a/g)agt … (c/t)6 x (c/t)ag g(g/t)… 3’, with the intron beginning at gt, carrying a pyrimidine tract and ending at ag. (b) Illustration of splice-junctions delineating exons and introns in the context of the transcription through translation phases of the central dogma dictating the use of genetic information in the DNA (via precursor-mRNA and post-splicing mRNA) to make the eventual folded protein complex. (UTR: Untranslated region.)
Among the feasible techniques developed for ascertaining the delineation of codon/non-codon parts (that is, for locating the splice-junctions), indicated in [3] is an entropy estimator method that extracts a “meaningful signal” from the exon/intron segments of a test DNA; hence, an entropy technique is applied to detect the underlying splice-junctions between the segments. This is an information-theoretic (or entropy-based) tool envisaged in a classical setting. It demarcates the intron/exon boundary with fair efficacy.

With the advent of newly sequenced genomes, recognition of genes has, however, become a challenge, and detecting the relevant splice-junctions with a system that does not require prior training poses inherent difficulties that warrant more novel approaches. For example, the so-called entropic segmentation method of [4] has shown promising results in using an algorithm based on the Jensen-Shannon (JS) contrast measure to distinguish coding versus non-coding regions in a DNA sequence. This JS-measure is based on conditional entropy aspects of statistical divergence (SD) specified in terms of the well-known Kullback-Leibler (KL) measure [5]. The success of this method is due to the distinguishable statistical characteristics of exon and intron segments. That is, a non-uniform codon usage prevails in the exon part, meaning that in coding regions not all bases of {A, T, C, G} occur with the same probability; rather, subtle differences exist in the statistics of their appearance depending on the position of each base in the codon triplets. In contrast, in non-informative intron segments, the occurrence probabilities of A, T, C and G are the same (equal to 1/4).
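As a minimal sketch of the divergence measures named above, the following Python fragment computes the KL measure of a window's base composition against the equiprobable (1/4 each) intron model, and the JS measure between two compositions. The function names, the use of base-2 logarithms, and the window and step sizes are illustrative assumptions, not details taken from [3-5].

```python
import math
from collections import Counter

BASES = "ACGT"
UNIFORM = [0.25] * 4  # equiprobable model assumed for non-informative segments


def base_distribution(segment):
    """Relative frequencies of A, C, G, T in a DNA segment."""
    counts = Counter(segment.upper())
    total = sum(counts[b] for b in BASES)
    if total == 0:
        raise ValueError("segment contains no A/C/G/T residues")
    return [counts[b] / total for b in BASES]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute zero.

    Assumes q_i > 0 wherever p_i > 0 (true for the uniform model
    and for the mixture used in the JS measure below).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def kl_profile(sequence, window=100, step=1):
    """KL(window composition || uniform) at each window start position."""
    return [
        kl_divergence(base_distribution(sequence[i:i + window]), UNIFORM)
        for i in range(0, len(sequence) - window + 1, step)
    ]
```

A window whose four bases are equally frequent scores zero against the uniform model, while a window made of a single base scores the maximum of 2 bits; a change in this score along the sequence is what signals a possible coding/non-coding transition.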
Developed in [6] is another strategy that identifies the splice-junctions between codon and non-codon regions present in a massive stretch of a DNA chain, especially when the delineating boundary in question is submerged in a subspace where codon and non-codon parts exist as overlapping and ambiguous/fuzzy entities. The fuzzy inference engine (FIE) developed thereof again uses information-theoretic metrics (with relevant algorithms applied to symbolic as well as binary sequence data representing the DNA) so as to score the differentiating extents of codon/non-codon populations at a given site in the DNA sequence. The information-theoretic metrics adopted in [6] refer to various statistical divergence measures (such as KL and JS) as well as distance and discriminant concepts. Further, the algorithms indicated in [6] yield consistent results on the delineation boundary sought on test subspaces that are fuzzy; and simulated studies using human as well as bacterial codon-statistics confirm the efficacy of the approach pursued. Another approach due to Neelakanta et al. [7] uses the concept of information redundancy in complex systems and defines a complexity metric that is adopted to differentiate codon/non-codon segments and thereby specify the intermediate splice-junction.

Notwithstanding the existence of the above pursuits in locating splice-junctions, the computed statistical divergence (SD) is extended in the present study by mapping it into a novel membership function that specifies the fuzzy subspace of overlapping exon and intron segments. The relevant membership function is defined on the basis of the “error” feature prevailing in the overlapping (“noisy”) segment with mutational aberrations. The underlying heuristics are described below.
As indicated before, the evolutionary conservation of splice-junctions could be hampered by inevitable phylogenetic-specific mutations. If such mutations are (assumed) independent, any “noisy” change in the spatial DNA pattern of the sequence (at the splice-junctions) can be marked as a “spatial jitter” with a characteristic parameter called the spatial signal-to-noise ratio (SSNR).

Splice-junctions with a spatial jitter as above correspond to fuzzy offsets of exons and introns at their junctions. That is, the spatially-jittered junction corresponds to an overlapping mix of codon and non-codon entities and hence constitutes a (fuzzy) universe. In other words, the splice-junction information has a fuzzy structure that can only be identified/specified in terms of linguistic descriptions. Such descriptions can be characterized by a membership (function) [5,8] of belongingness to the attributes of exons or introns.

The theme of the present study is to develop an appropriate FIE that delineates fuzzy overlaps of codon/non-codon parts so as to elucidate the underlying cryptic (or aberrant) splice-junctions. This is done on the basis of the SSNR defined with reference to the spatial jitter. The SSNR is also adopted to represent the relevant membership function. The remainder of the paper describes the underlying considerations.

2. SPATIAL JITTER ACROSS SPLICE JUNCTIONS

Consider a small window(-length) accommodating a finite number (say, 100) of putatively occurring base residues along a DNA sequence. Suppose this window traverses a splice-junction. With no a priori information available on the accurate disposition of the splice-junction, it can be initially assumed that the reading gathered thereof is “blurred” information, implying an overlap of the exon/intron regions with a fuzzy codon/non-codon transition. That is, a spreading function is assumed to prevail across the finite window-length.
The resulting spatially-varying 1-D signal gathered from the scan of the entire DNA sequence would resemble a random telegraphic waveform train constituted by the changing statistical profiles of the exons and introns (being scanned). The task in hand is then to detect the spatial transition sites, each delineating adjoining exon/intron (or intron/exon) segments, despite the noisy, blurred spatial information of the transition site.

Suppose $\psi(x)$ represents an uncorrupted DNA sequence pattern metric computed along the variable $x$ denoting the 1-D space of the sequence length. The relevant signal component is assumed to be corrupted whenever mutational changes in {A, C, T, G} have occurred along the sequence. Such mutation-specific effects can be modeled as a contribution of “noise”, $m(x)$, on the signal part, $\psi(x)$. Hence, the signal output of the window-reader can be modeled either by a spatial-domain convolution description, namely, $s(x) = \psi(x) \otimes m(x)$ (with $\otimes$ denoting convolution) or, equivalently, by the corresponding frequency-domain description, $S(f) = \Psi(f)M(f)$, where $S(f)$, $\Psi(f)$ and $M(f)$ are the Fourier transforms of $s(x)$, $\psi(x)$ and $m(x)$ respectively.

Consider the intron-exon splice-junction illustrated in Figure 2. The upper figure (marked as (a)) is a crisp noise-free (uncorrupted) site with a splice-junction at $x_o$ along the DNA sequence constituted by {A, C, T, G} residues. Should mutational corruptions have taken place, this crisp transition-boundary $x_o$ becomes $(x_o \pm \Delta x)$, where $\Delta x$ denotes the spatial jitter. Further, in Figure 2, the y-axis depicts the measure/metric of the (relative) statistical divergence of exon versus intron (or vice versa) prevailing at any point $x$ on the sequence. (This statistical divergence prevails because the exon has a distinct distribution of {A, C, T, G} constituents vis-à-vis the corresponding distribution in the intron segment.)
The effect of (mutation-specific) corruption would make the splice-junction unclear or fuzzy, as shown in Figure 2(b). In essence, $\Delta x$ is a jitter variable superimposed on $s(x)$ corresponding to the crisp disposition of the splice-junction $x_o$.

Figure 2. “Spatially-jittered” splice-junction manifesting as fuzzy exon/intron (or vice versa) transitional residues along the sequence. (a) Unaltered (crisp) splice-junction; (b) Fuzzy splice-junction with a graded variation of divergence (distance) between the statistical features of exon/intron (or intron/exon) along the transition region (specified as a measure on the ordinate, y). The abscissa (x) depicts a scale of residues along the DNA sequence.

The expected root-mean-squared (RMS) jitter $J_r$ at any splice-junction $x_o$ can be expressed by the “noise power” imposed by the mutation errors. In traditional communication theory, the signal-to-noise ratio (SNR) specifies the ratio of the uncorrupted “signal (power) level” to the corrupting “noise power”. Translating this concept, suppose the average length of intron-plus-exon is $\bar{X}$; the corresponding “spatial SNR” (SSNR) with reference to the DNA sequence space (of Figure 2) can then be defined as follows:

$$\mathrm{SSNR} = \bar{X}^2 / J_r^2$$

2.1. Error Probability of Splice-Junction Prediction

Relevant to “noisy” intron/exon (or exon/intron) transitions, the accuracy of locating the transition site $x_o$ is constrained by the probability of error associated with the estimation of $x_o$. In this context, within the specified blurring limits of jitter, the SSNR would implicitly predict the error probability of estimating the splice-junction.

Suppose a sequence of exon/intron (or vice versa) transitions ($x_i^0$’s) prevails at locations indexed by $i = 0, 1, 2, \ldots, m$.
From these data, one can extract the exon or intron widths ($\chi$) as follows: $\chi_{i+1}^{(E\ \mathrm{or}\ I)} = (x_{i+1} - x_i)$ for all values of $i = 0, 1, 2, \ldots, m$, where the suffix (E or I) denotes a measurement done on an exon or an intron respectively. In terms of the average length $\bar{X}$ of the consecutive intron-plus-exon subspaces, the transition (splice-junction) locations in the presence of mutation error-induced jitter can be expressed as follows:

$$x_i^{\mathrm{Noisy}} = \sum_{j=0}^{i} k_j \bar{X} + \delta_i \bar{X}$$

where $k_j$ is an integer with $k_0$ being zero, and $i = 0, 1, 2, \ldots, m$; further, $\delta$ is a dimensionless random variable which, in a simple case, has a zero-mean Gaussian distribution with variance $\sigma^2 = 1/\mathrm{SSNR}$. (This variance is invariant along the sequence length if the sequence statistics are assumed to be stationary.) Now, defining a normalized variable $\kappa_i = \chi_i/\bar{X}$, it can be estimated as $\kappa_i = k_i + \delta_i - \delta_{i-1}$ with $i = 0, 1, 2, \ldots, m$; hence one can specify the probability of correct decoding of the splice-junction, $P_c(m)$, as the probability that $|\kappa_i - k_i| < 0.5$. Inasmuch as $\kappa_i = k_i + \delta_i - \delta_{i-1}$, the aforesaid probability can be restated as follows:

$$P_c(m) = \mathrm{Prob}\{\,|\delta_1 - \delta_0| < 0.5,\ \ldots,\ |\delta_m - \delta_{m-1}| < 0.5\,\} \qquad (1)$$

With the assumed Gaussian statistics for $\delta$, the cumulative probability of correct decoding of the splice-junction, namely $P_c$, can be deduced as follows:

$$P_c(x_o, \Delta x) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\Delta x} \exp\!\left(-\frac{x^2}{2\sigma^2}\right) dx = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{\Delta x}{\sqrt{2}\,\sigma}\right) \qquad (2)$$

where $\Delta x$ with respect to the $i$th junction is set by the jitter difference $(\delta_i - \delta_{i-1})$, and $\mathrm{erf}(u) = \frac{2}{\sqrt{\pi}}\int_0^u \exp(-t^2)\,dt$. Further, the fuzzy space in question enclaves a universe depicting an $m$-dimensional hypercube across the unit interval, $I \in [-0.5, +0.5]$.

Eq. 2 implies that the probability of correct detection (and hence the error probability) of the splice-junction disposition is implicitly dependent on the SSNR parameter. The plot of Eq. 2 is shown in Figure 3, where $P_c$ is plotted as a function of $(x_o \pm \Delta x)/x_o$ with respect to a presumed crisp splice-junction at $x_o$ posing a transitional error-prone width $\Delta x$.
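The dependence of the correct-detection probability on SSNR can be illustrated numerically. The sketch below assumes the Gaussian reading of Eq. 2, $P_c = \tfrac{1}{2} + \tfrac{1}{2}\,\mathrm{erf}(\Delta x/(\sqrt{2}\,\sigma))$ with $\sigma^2 = 1/\mathrm{SSNR}$, and treats $\Delta x$ as an offset normalized by the mean intron-plus-exon length; the function names and the sample numbers are hypothetical.

```python
import math


def spatial_snr(mean_length, rms_jitter):
    """SSNR = (mean intron-plus-exon length)^2 / (RMS jitter)^2."""
    return (mean_length / rms_jitter) ** 2


def prob_correct(delta_x, ssnr):
    """Gaussian cumulative probability with variance sigma^2 = 1/SSNR.

    delta_x is the (assumed) normalized offset from the crisp junction.
    """
    sigma = 1.0 / math.sqrt(ssnr)
    return 0.5 + 0.5 * math.erf(delta_x / (math.sqrt(2.0) * sigma))


if __name__ == "__main__":
    # Hypothetical values: mean segment length 100 residues, RMS jitter 10.
    ssnr = spatial_snr(100.0, 10.0)
    for dx in (-0.2, -0.1, 0.0, 0.1, 0.2):
        print(f"dx = {dx:+.1f}  Pc = {prob_correct(dx, ssnr):.4f}")
```

At the crisp junction (zero offset) the value is 0.5 for any SSNR, and the transition between the two subspaces sharpens as SSNR grows, which is the qualitative behavior described for Figure 3.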
This error-prone region depicts a subspace of overlapping exon/intron subspaces that smear the exact location of $x_o$. This unspecific (error-prone) subspace $\Delta x$ is therefore fuzzy, imposing an imprecision on $x_o$. Relevantly, the generic description of $P_c$ in this fuzzy subspace takes a membership attribute of vagueness vis-à-vis the position variable, $x$. The membership here depicts the belongingness to the exon subspace or the intron subspace. Hence, described in the next section are the underlying aspects of the fuzzy subspace in question, with the object of ascertaining the splice-junction in the fuzzy subspace.

Figure 3. Probability ($P_c$) of correct estimation of a splice-junction versus $(x_o \pm \Delta x)/x_o$. (Curves are drawn for increasing SSNR about the location of the crisp exon/intron transition, $x_o$, which separates the exon and intron subspaces.)

3. FUZZY SPLICE-JUNCTION PREDICTION

Suppose a set of input values $\Delta x_i$ is taken from the sequence and considered as non-specific or fuzzy. By denoting those segment values by $\{\Delta x_i\}_f$, the corresponding $\{(P_c)_i\}_f$ can be written in terms of the uncertain limiting-values of all the vectors in the bounding (lower and upper) interval, $\Delta x \in [\Delta x_L, \Delta x_H]$. Hence it follows that [5]:

$$\{P_c(\Delta x)_i\}_f = \{P_c(\Delta x_L)_i\}_f + \sum_{j=1}^{\alpha-1} \rho_f^{\,j-1}\!\left[P_c(\Delta x_L)_i\right] (\Delta x)^j / j! \qquad (3)$$

where $\rho_f(\cdot)$ depicts the slope equal to $d(P_c)_i/dx_i$ and $\alpha$ is the number of interval-valued parameters for the range within $[\Delta x_L, \Delta x_H]$. Further, Equation (3) denotes an algebraic sum of addenda computed via interval arithmetic, which denotes the “width of the results”. In other words, for the specified vector bounding-limits of $\{(P_c)_i\}_f$, namely, $\Delta x \in [\Delta x_L, \Delta x_H]$, an $\alpha$-set of interval-valued parameters, namely $\{Q\}$, $Q = Q_1, Q_2, \ldots, Q_\alpha$, prevails at or around $x_o$ with no fuzzy attributes.
Then, the relevant crisp-domain relation of $\{x\}$ versus $\{P_c\}$ can be written as a differential equation given by [5]: $d^2P_c/dx^2 + (dP_c/dx)^2 = g(x)$, where $g(\cdot)$ is some arbitrary function of $x$. In the event of overlapping fuzzy attributes existing at $x_o$, the corresponding (fuzzy-)domain relation between $\{x\}$ and $\{P_c\}$ can be generalized by a stochastic discourse of $P_c$ versus $x$ expressed in terms of a fuzzy stochastic differential equation [5]. Further, in such an exon-to-intron transition subspace (denoted as F) having fuzzy attributes, the corresponding demarcation of the exon/intron transition can be assumed to be at a centroid location ($X_C$) with a line-of-delineation through the centroid. This location refers to a defuzzified elucidation based on the membership-of-belongingness of the site-of-interest in the fuzzy space. The procedure to find $X_C$ is described below.

3.1. Centroid of the Fuzzy Subspace

The SSNR and $P_c$ considerations versus $(x_o \pm \Delta x)/x_o$ indicated before imply inherent statistical attributes of the {A, C, T, G} population in the exon and intron regions across the splice-junction. As said earlier, the exon-side statistics encode genetic information (so as to make the necessary protein) and the intron-side statistics are non-informative. In other words, suppose the probabilities of occurrence of the elements {A, C, T, G} in the exon are denoted by the set $\{Q_A, Q_C, Q_T, Q_G\}$ with $(Q_A + Q_C + Q_T + Q_G = 1)$. Then, the associated errors for the elements of {A, C, T, G} are decided by the inequalities, $Q_A \neq Q_C \neq Q_T \neq Q_G$. Now, suppose the corresponding probabilities of occurrence in the intron are $\{\pi_A, \pi_C, \pi_T, \pi_G\}$ with $(\pi_A + \pi_C + \pi_T + \pi_G = 1)$; then, the associated errors for the elements of {A, C, T, G} on the intron-side are set by the condition that $\pi_A = \pi_C = \pi_T = \pi_G = 0.25$. This is because, the intron-side being non-informative, the Laplacian hypothesis applies in presuming that all (four) elements are equally likely to occur.
Hence, by virtue of the distinction between $\{Q\}_{A, C, T, G}$ and $\{\pi\}_{A, C, T, G}$, the relevant entropy/information-theoretic (IT) distances (that is, statistical divergence or SD values) can be computed (for the exon and intron regions). The results would show a distinction in the profiles of SD (in the exon and intron regions) as illustrated in Figure 4. (This SD can be any one of the divergence measures, such as KL or JS, mentioned before. Illustrative measures are presented later in the results with reference to a real DNA structure.)

Following the considerations presented in [9,10], the expression for $P_c$, namely $\tfrac{1}{2} + \tfrac{1}{2}\,\mathrm{erf}\!\left(\Delta x/(\sqrt{2}\,\sigma)\right)$, can be approximately written as $L_q(z)/L'_q(0)$, where $L_q(z)$ denotes the Bernoulli-Langevin function and the prime sign depicts differentiation with respect to the argument $z = \Delta x/(\sqrt{2}\,\sigma)$. Explicitly,

$$L_q(z) = \left(1 + \frac{1}{2q}\right)\coth\!\left[\left(1 + \frac{1}{2q}\right)z\right] - \frac{1}{2q}\coth\!\left[\frac{z}{2q}\right]$$

where $q$ represents a disorder entity associated with the statistics of the population concerned [11]. Described in [11] is that the upper bound, corresponding to isotropic disorder statistics, is decided with $q = 1/2$ and the lower bound (depicting an anisotropic disorder) is specified by $q \to \infty$. Inasmuch as the statistics of the exon-region would differ from those of the intron-region, $q_E \neq q_I$. Further, as indicated in [9], the ratio $L_q(z)/L'_q(0)$ denotes approximately the membership function $\mu_q$ for the fuzzy space (or block, F: $\{x_i\}$) of interest, with its fuzzy range (upper-to-lower) decided by $q = 1/2$ to $q \to \infty$.

Hence, shown in Figure 4 is the mapping of the computed divergence measures (SD) of the intron and exon subspaces (across the splice-junction) into the corresponding membership values, $\mu_q(\mathrm{SD})$ (with $q = 1/2$ yielding the upper-bound values and $q \to \infty$ giving the lower-bound values). For example, suppose a location $x_a$ (in the exon region) gives an SD-value equal to (a).
Then, the value (a) maps on to the membership-plane as the entities (aU) and (aL), depicting respectively the upper- and lower-bound values. Similarly, assuming a location $x_b$ (in the intron region) has an SD-value (b), this value maps on to the membership-plane as (bU) and (bL), denoting respectively the upper- and lower-limits.

The steps as above can be elaborated as follows:

First, the chosen divergence measure (SD: KL or JS) is computed for the entire fuzzy domain F at each pointer-position within a chosen window-size. For this purpose, two subspaces, $F_{Exon}$ and $F_{Intron}$, depicting respectively the exon- and intron-side of the F-space, are specified. Then, the computation of the SD-measures with the exon statistics $\{Q\}_{A, C, T, G}$ in the $F_{Exon}$-subspace and with the intron statistics $\{\pi\}_{A, C, T, G}$ in the $F_{Intron}$-subspace is done with the KL or JS algorithm.

The values of SD generated in each differential window (of the $F_{Exon}$- and $F_{Intron}$-subspaces) account for the extents of codons and non-codons in the relevant fuzzy subspace. Corresponding to the window-specific pointer positions along the sequence, the SD-score profile obtained across each differential block will be distinct for each subspace (exon or intron) in question.

Next, the values of SD obtained are translated via the membership function to provide descriptive details of belongingness in the fuzzy domain.

The translated values gathered can be subjected to a defuzzification process [8,12] in order to get the centroid position (of the pointer) that delineates the boundary of the two fuzzy test subspaces. The relevant local search follows the principle of a “search and score” procedure applied appropriately on the assigned membership values that describe the qualitative aspects of the overlapping and ambiguous codon/non-codon locales across the fuzzy site. The boundary that marks the desired splice-junction being searched corresponds to a defuzzified location obtained via the centroid-finding method. Towards the centroid, the fuzzy exon- and fuzzy intron-domains would converge close to a single membership value.

Figure 4. SD-to-$\mu_q(\mathrm{SD})$ mapping. (I): $(x_o \pm \Delta x)/x_o$ versus SD curves in the intron and exon subspaces. Note that the SD profiles are distinct in each region; (II): $(x_o \pm \Delta x)/x_o$ versus membership function, $\mu_q(\mathrm{SD})$. (Other details are given in the text.)

Referring to Figure 4, the SD-value (a) in the exon subspace yields mapped values of $\mu_q(\mathrm{SD})$: (aL and aU); and the SD-value (b) in the intron subspace maps into $\mu_q(\mathrm{SD})$: (bL, bU). Suppose the set {aL, aU} in turn projects on to the x-axis at $x_{aL}$ and $x_{aU}$ respectively; and, likewise, the set {bL, bU} projects on to the x-axis at $x_{bL}$ and $x_{bU}$ respectively. Then, the mean position of ($x_{aL}$, $x_{aU}$, $x_{bL}$ and $x_{bU}$) would correspond to the centroid being sought.

4. SIMULATION EXPERIMENTS USING REAL DNA DATA

The efficacy of the efforts and procedure described above is illustrated with an example of a real-world DNA sequence, that of Dengue virus type 1 (NCBI Reference Sequence: NC_001477.1) [13]. Its CDS stretches from nucleotide position 95 through 10273. Using the nucleotide population details of this virus, a moving-window based calculation of the KL-measure is plotted in Figure 5 across the entire sequence length.

The data available in [13], for example, show a CDS stretch from position 7574 through 10270 with an indication of a transition at 7574. Presented in Figure 6 is an exclusive plot of the KL-measure across this selected CDS regime at the transition locale around 7574. While the codon (exon)/non-codon (intron) transition is markedly seen (via the change in KL value), there is, however, a subspace of fuzziness wherein an overlap of the exon and intron regimes prevails indistinguishably (when viewed in terms of the simple KL-measure).
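The bounding curves behind the upper- and lower-bound memberships of Section 3.1 can be sketched as follows. The fragment evaluates the Bernoulli-Langevin function $L_q(z) = (1 + 1/2q)\coth[(1 + 1/2q)z] - (1/2q)\coth[z/2q]$ and its two limiting forms: for $q = 1/2$ it reduces to $\tanh(z)$, and for $q \to \infty$ it tends to the classical Langevin function $\coth(z) - 1/z$. How an SD value is scaled into the argument $z$ is not fixed by this sketch, and the function names are illustrative assumptions.

```python
import math


def coth(x):
    """Hyperbolic cotangent, written via tanh to avoid overflow for large x."""
    return 1.0 / math.tanh(x)


def bernoulli_langevin(z, q):
    """L_q(z) = (1 + 1/2q) coth[(1 + 1/2q) z] - (1/2q) coth[z/2q]."""
    if z == 0.0:
        return 0.0  # L_q is an odd function and vanishes at the origin
    if math.isinf(q):
        return coth(z) - 1.0 / z  # limiting (classical Langevin) form
    a = 1.0 / (2.0 * q)
    return (1.0 + a) * coth((1.0 + a) * z) - a * coth(a * z)


def slope_at_origin(q):
    """L'_q(0) = (1 + 1/q) / 3, which follows from coth(x) ~ 1/x + x/3."""
    return 1.0 / 3.0 if math.isinf(q) else (1.0 + 1.0 / q) / 3.0


if __name__ == "__main__":
    for z in (0.25, 0.5, 1.0, 2.0):
        upper = bernoulli_langevin(z, 0.5)        # q = 1/2 curve
        lower = bernoulli_langevin(z, math.inf)   # q -> infinity curve
        print(f"z = {z:4.2f}  L_1/2 = {upper:.4f}  L_inf = {lower:.4f}")
```

For any positive argument the $q = 1/2$ curve lies above the $q \to \infty$ curve, so the pair brackets the family of intermediate-$q$ curves; this is what allows each SD value to be assigned an upper and a lower membership bound.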
Therefore, by assigning a membership attribute, the FIE algorithm (described earlier) can be invoked to decide on the location of the splice-junction in the fuzzy region. Hence, drawn in Figure 7 is the profile of the membership values ($\mu_q$) mapped from the computed KL-measures (of Figure 6) across the transition region of interest. There are two profiles: (A) depicts $\mu_q$-values with $q = 1/2$ (meaning the upper bound on the membership); and (B) denotes $\mu_q$-values with $q = \infty$ (meaning the lower bound on the membership).

From Figure 7, the location of the splice-junction buried in the fuzzy domain can be ascertained. This location corresponds to the centroid coordinate ($x_C$). This centroid position is featured by the upper- and lower-bound profiles of the $\mu$-value. As discussed earlier, $x_C$ corresponds to the mean position of $x_{aL}$, $x_{aU}$, $x_{bL}$ and $x_{bU}$; and, for the data presented in Figure 7, the computed results show that this centroid ($x_C$) is at 7401, as against the crisp value of 7574 indicated in [13]. (The centroid (7401) is the mean of: $[(x_{bL} + x_{bU})/2 = 7401]$ and $[(x_{aL} + x_{aU})/2 = 7401]$.)

5. DISCUSSIONS AND CLOSURE

Depicted in Figure 8 are the base residues reported around an example splice-junction site, namely 7574 of [13]. The present method predicts, in addition, a cryptic set of 7370 and 7419 in the vicinity of the centroid 7401 determined. The selection of this set {7370, 7419} is based on the considerations of [2] suggesting that the intron’s 3’-side preferentially ends with ag. That is, the values 7370 and 7419 are picked around the centroid determined such that they conform to the abutting of ag-residues. Further, in Figures 8(a)-8(b), the intron-subspace ends with the residue set {ag} at 7574 and is consistent with the canonical splice-junction consensus (as mentioned earlier) of [2]. Notwithstanding this canonical pattern, the mutational influences could possibly have induced aberrant splice-junctions.
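The two closing steps described above, namely taking the centroid as the mean of the four projected positions and then looking for ag pairs in its vicinity, can be sketched as below. The helper names, the search radius, the toy sequence, and the choice of reporting the 1-based position of the g that closes each ag pair are all assumptions made for illustration; the actual viral sequence and the exact site-numbering convention are not reproduced here.

```python
def centroid(x_aL, x_aU, x_bL, x_bU):
    """Mean of the four projected positions on the sequence axis."""
    return (x_aL + x_aU + x_bL + x_bU) / 4


def ag_sites_near(sequence, center, radius):
    """1-based positions of the 'g' in every 'ag' pair within center +/- radius."""
    seq = sequence.lower()
    hits = []
    for i in range(len(seq) - 1):
        pos = i + 2  # 1-based position of the 'g' closing the pair at seq[i:i+2]
        if seq[i:i + 2] == "ag" and abs(pos - center) <= radius:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    # Bounds as listed for the 6827-7573 CDS row of Table 1.
    print(centroid(7201, 7601, 7201, 7601))   # 7401.0

    # Toy sequence (hypothetical), searched around a made-up centroid of 5.
    print(ag_sites_near("ccagttagcc", center=5, radius=2))
```

With the bounds 7201 and 7601 on both profiles, the mean is 7401, matching the centroid quoted in the text; candidate cryptic sites are then the ag-terminated positions found close to that centroid.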
A scan through the test DNA indicates that a cluster of sites exists between positions 7500 and 7700 at which the residues a and g occur together. This makes it ambiguous to decide that a single splice-junction (such as 7574 of [13]) alone is the splice-junction of interest. The fuzzy pursuit presented here, however, points out that other cryptic splice-junctions, such as 7370 and 7419, could reasonably be alternative splice-junction sites having adjacent ag residues, as illustrated for the 7419 site in Figures 8(a)-8(b).

Figure 5. Nucleotide position versus computed KL-measure of the DNA sequence of Dengue virus type 1 (NCBI Reference Sequence: NC_001477.1) [13]. (Axes: KL-measure versus nucleotide positions of the DEN1 virus, 5' to 3'.)

Figure 6. Nucleotide position in the limited range of 5000 to 9000 versus computed KL-measure of the DNA sequence of Dengue virus type 1 (NCBI Reference Sequence: NC_001477.1) [13]. (Marked regions: exon subspace, fuzzy subspace and intron subspace.)

Figure 7. Membership profiles across the fuzzy transition region of interest. (a) Membership values with q = 1/2 (meaning the upper-bound on the membership) versus nucleotide positions of the test DNA; (b) membership values for the setting of q that gives the lower-bound on the membership versus nucleotide positions of the test DNA. (Marked regions: exon subspace, fuzzy subspace and intron subspace; the fuzzy centroid location of the intron/exon transition is marked at 7401.)

Figure 8. Details on nucleotides adjacent to the predicted splice-junctions: (a) as per [13], showing g c a c g c g g… …g g a g a g at site 7574; and (b) as per the present method, showing … a g at site 7419. In both cases, the intron subspace ends with the residue pair ag, consistent with the canonical splice-junction consensus (see text).

The complete list of aberrant splice-junctions evaluated for the test viral DNA in the present study is presented in Table 1 and illustrated in Figure 9. Table 1 indicates the centroid values determined as well as the cryptic transition sites predicted on the basis of the details in [2]. It may be noted that the data available in [13] portray overlaps of CDS domains that eventually facilitate the various protein structures listed.

Figure 9. Summary of results on the locations of splice-junctions. Downward arrows indicate values available in [13] for DEN 1 virus. Upward arrows indicate computed values that include details of cryptic sites in the fuzzy subspace.

Table 1. Transition sites indicated in [13] and the predicted sites as per the present method. (UB and LB are the bounds of the membership value.)

| CDS range data from [13] | Description | Transition site [13] | Upper-bound (UB)* | Lower-bound (LB)* | Centroid of UB and LB | Cryptic transition sites predicted** |
|---|---|---|---|---|---|---|
| 95-394 | Capsid protein | 394 | 1, 401 | 301 | 352 | 350, 354, 394 |
| 94-436 | Anchored capsid protein | 436 | 301, 701 | 301, 701 | 501 | 515 |
| 710-934 | Membrane glycoprotein | 710 | 701 | 701 | 701 | |
| 437-934 | Membrane glycoprotein precursor | 934/935 | 701, 1101 | 701, 1101 | 901 | 954 |
| 935-2419 | Envelope protein | 2419/2420 | 1801, 2501 | 2801 | 2151 | 2160 |
| 2420-3475 | Nonstructural protein 1 | 3475/3476 | 3301, 3801 | 3301, 3801 | 3551 | 3553 |
| 3476-4129 | Nonstructural protein 2a | 4129/4130 | 4001, 4301 | 4001, 4301 | 4151 | 4149, 4170 |
| 4130-4519 | Nonstructural protein 2b | 4519/4520 | 4301, 4701 | 4301, 4701 | 4501 | 4326, 4356, 4452, 4505 |
| 4520-6376 | Nonstructural protein 3 | 6376 | 6201, 6701 | 6201, 6701 | 6451 | 6447, 6462 |
| 6377-6757 | Nonstructural protein 4a | 6757 | 6701, 7001 | 6701, 7001 | 6851 | 6758 |
| 6758-6826 | 2k protein | 6826 | 6701, 7001 | 6701, 7001 | 6850 | 6833, 6857 |
| 6827-7573 | Nonstructural protein 4b | 7573/74 | 7201, 7601 | 7201, 7601 | 7401 | 7370, 7419 |
| 7574-10270 | Nonstructural protein 5 | 10270 | 10001, 10401 | 10001, 10401 | 10201 | 10202, 10211 |

* The UB and LB values indicated correspond to the sites where minima of the q-plot (map) in the fuzzy domain of interest are observed (for example, see Figure 7).
** The predicted site is based on locating a site in the vicinity of the centroid, given that introns almost always begin with the residue set {GT} at the 5'-end and end with {AG} at the 3'-end, as illustrated in Figure 8.

Knowing the correct and aberrant splice-junctions in the context of viral DNA (such as DEN 1 virus) is pertinent to, and has implications for, vaccine design [14]. In general, a gene is first transcribed into pre-mRNA, which is a copy of genomic DNA containing exon and intron regions. Gene splicing is an important source of protein diversity and also has regulatory functions; RNA splicing is essential for precisely regulating the process that occurs after gene transcription and before mRNA translation (in which introns are removed and exons are retained). The sequences between the boundaries of introns (denoting regions of DNA or precursor RNA that are not represented in mature RNA but reside between regions that are) and exons (depicting regions of DNA or precursor RNA represented in mature RNA) are not random.

Several splicing events are possible, eventually resulting in exon-skipping, intron-retention, cryptic splice-site usage and alternative 3'- and 5'-side splice-sites [1]. Further, in RNA splicing, the so-called splicing-variants may be formed prior to mRNA translation due to differential inclusion or exclusion of regions in the pre-mRNA structure.
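As an illustration of the site-selection rule stated with Table 1 (take the centroid of the two bounding positions, then look for an ag pair in its vicinity), a minimal Python sketch follows. It is not the authors' implementation: the function names `centroid` and `ag_sites_near`, the `window` parameter, the convention of reporting the position of the g residue, and the short toy sequence are all assumptions made for illustration. The only values taken from the paper are the bounds 7201 and 7601, whose midpoint reproduces the tabulated centroid 7401.

```python
from typing import List


def centroid(lower: int, upper: int) -> int:
    """Midpoint of two bounding nucleotide positions (1-based).

    Assumption: the centroid is taken as the integer midpoint,
    which matches most rows of Table 1 (e.g. 7201, 7601 -> 7401).
    """
    return (lower + upper) // 2


def ag_sites_near(seq: str, center: int, window: int) -> List[int]:
    """Return 1-based positions of the 'g' in every 'ag' pair whose
    'g' lies within `window` nucleotides of `center`.

    Assumption: a candidate site is reported at the 'g' that would
    close an intron ending in 'ag'; `window` is a hypothetical
    search radius, not a value given in the paper.
    """
    seq = seq.lower()
    first = max(2, center - window)        # a pair needs a residue before it
    last = min(len(seq), center + window)
    hits = []
    for pos in range(first, last + 1):
        # seq is 0-indexed: position `pos` is seq[pos - 1]
        if seq[pos - 2] == "a" and seq[pos - 1] == "g":
            hits.append(pos)
    return hits


# Centroid of the bounds listed for nonstructural protein 4b in Table 1
print(centroid(7201, 7601))                 # 7401

# Toy sequence (not Dengue data): 'ag' pairs end at positions 4 and 9
toy = "ccagttcagc"
print(ag_sites_near(toy, center=5, window=2))   # [4]
print(ag_sites_near(toy, center=5, window=4))   # [4, 9]
```

Widening the window returns more candidate sites, which mirrors the situation described above, where several ag-bearing positions fall inside the fuzzy subspace and each is a plausible cryptic junction.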
A systematic analysis of splice-junction sequences in eukaryotic protein-coding genes using the GenBank databank has also revealed a striking similarity among the rare splice-junctions [2] that do not contain ag at the 3' splice site, or gt at the 5' splice site.

As mentioned before, indistinct splice-junctions would result from deleterious effects of mutations that target the splice-sites, causing variability in splicing patterns. Such deleterious effects eventually form a major source of protein diversity, leading to a considerable extent of diverse proteomic functions that stem from a relatively small number of genes. Thus, changes in splice-site (alternative splicing) can induce different effects on the encoded proteins, not only in humans but also in viruses.

With regard to the viral leader sequences, there may be a splice donor site for generation of subgenomic messages, usually the Env (viral envelope) transcript. In general, the role of RNA splicing is to generate a set of stable splice-junctions across viral sequences so that virus mimicry is enabled as a mechanism for potential variability in envelope proteins (which are prone to changes due to point-mutation and thus avoid being recognized by T-memory cells of higher organisms in vaccine trials). The present study offers a systematic way of elucidating cryptic splice-junction sites in viral DNA structures, the knowledge of which can be profitably used in vaccine design efforts. The study is being extended to a variety of viruses in order to elucidate the underlying cryptic aspects of splice-junctions. The pertinent analytical framework and computational aspects are augmented with the details available in [15-17].

REFERENCES

[1] Krawczak, M., Reiss, J. and Cooper, D.N. (1992) The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: Causes and consequences. Human Genetics, 90, 41-54.
[2] Shapiro, M.B. and Senapathy, P. (1987) RNA splice junctions of different classes of eukaryotes: Sequence statistics and functional implications in gene expression. Nucleic Acids Research, 15, 7155-7174. doi:10.1093/nar/15.17.7155
[3] Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner, A. and Ziv, J. (1995) On the entropy of DNA: Algorithms and measurements based on memory and rapid convergence. Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'95), San Francisco, January 1995, 48-57.
[4] Bernaola-Galván, P., Grosse, I., Carpena, P., Oliver, J.L., Román-Roldán, R. and Stanley, H.E. (2000) Finding borders between coding and noncoding DNA regions by entropic segmentation method. Physical Review Letters, 85, 1342-1345. doi:10.1103/PhysRevLett.85.1342
[5] Neelakanta, P.S. (1999) Information-theoretic aspects of neural networks. CRC Press, Boca Raton.
[6] Arredondo, T.V., Neelakanta, P.S. and Groff, D.D. (2005) Fuzzy attributes of a DNA complex: Development of a fuzzy inference engine for codon-"junk" codon delineation. Artificial Intelligence in Medicine, 35, 87-105. doi:10.1016/j.artmed.2005.02.008
[7] Neelakanta, P.S., Arredondo, T.V. and Groff, D.D. (2003) Redundancy attributes of a complex system: Application to bioinformatics. Complex Systems, 14, 215-233.
[8] Jang, J.S.R., Sun, C.T. and Mizutani, E. (1997) Neuro-fuzzy and soft computing. Prentice Hall, New Jersey.
[9] Neelakanta, P.S., Abusalah, S.T., Groff, D.F.D. and Park, J.C. (1998) Fuzzy nonlinear activity and dynamics of fuzzy uncertainty in the neural complex. Neurocomputing, 20, 123-153. doi:10.1016/S0925-2312(98)00006-X
[10] Neelakanta, P.S., Park, J.C. and Degroff, D. (1997) Complexity parameter vis-à-vis interaction systems: Application to neurocybernetics. Cybernetica, XL, 243-253.
[11] Neelakanta, P.S. and Groff, D.D. (1994) Neural network modeling: Statistical mechanics and cybernetic perspectives. CRC Press, Boca Raton.
[12] Neelakanta, P.S. and Deecharoenkul, W. (2000) A complex system characterization of modern telecommunication services. Complex Systems, 12, 31-69.
[13] GenBank, Dengue virus type 1: Complete genome. NCBI Reference Sequence NC_001477.1. Available at: http://www.ncbi.nlm.nih.gov/nuccore/NC_001477 (Accessed on January 28, 2011).
[14] Pavlovic, M., Cavallo, M., Kats, A., Kotlarchyk, A., Zhuang, H. and Shoenfeld, Y. (2011) From Pauling's Abzyme concept to the new era of hydrolytic anti-DNA autoantibodies: A link to rational vaccine design? A review. International Journal of Bioinformatics Research and Applications (accepted for publication).
[15] Krishnamachari, A., Mandal, V.M. and Karmeshu, B. (2004) Study of binding sites using Renyi parametric entropy measure. Journal of Theoretical Biology, 227, 429-436. doi:10.1016/j.jtbi.2003.11.026
[16] Florea, L. (2006) Bioinformatics of alternative splicing and its regulation. Briefings in Bioinformatics, 7, 55-69. doi:10.1093/bib/bbk005
[17] Stephens, R.M. and Schneider, T.D. (1992) Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. Journal of Molecular Biology, 228, 1124-1136. doi:10.1016/0022-2836(92)90320-J