We build a model of storage of well-defined positional information in probabilistic sequence patterns. Once a pattern is defined, it is possible to judge the effect of any mutation in it. We show that the frequency of beneficial mutations can be high in general and that the same mutation can be either advantageous or deleterious depending on the pattern’s context. The model allows one to treat positional information as a physical quantity, to formulate its conservation law and to model its continuous evolution in a whole genome, with meaningful applications of basic physical principles such as optimal efficiency and channel capacity. A plausible example of an optimal solution analytically describes phase-transition-like behavior. The model shows that, in principle, it is possible to store error-free information on sequences with arbitrarily low conservation. The described theoretical framework allows one to approach, from a novel general perspective, such long-standing paradoxes as excessive junk DNA in large genomes or the corresponding G- and C-value paradoxes. We also expect it to have an effect on a number of fundamental concepts in population genetics, including the neutral theory, the cost-of-selection dilemma, error catastrophe and others.

Optimality principles such as Maupertuis’ or the least action and their different formulations and applications are the foundations of physics, but they are applied moderately in life sciences. Another field where the efficiency optimization is a quite practical problem is Information Theory (IT) [

Information theory originally described the process of sending discrete data over a noisy channel, which seemed to be quite similar to transmitting DNA sequences through generations with mutational errors. A few applications of IT in biology were attempted in order to exploit this similarity [3,4]. Nevertheless, the engagement of IT in genetics is disappointingly limited, given the revolutionary role of IT in communications and the strong analogy between DNA sequences and discrete messages. As pointed out by Eigen [

For “information” to have physical meaning it must be “relational”—in IT the information is defined as a degree of correlation between sender and receiver, and in the proposed model the correlation of 3D molecular shapes between interacting molecules signifies the amount of information, corresponding to the degree of specificity of interactions.

All molecular interactions can be viewed as a more or less specific search (“homing”) for an interacting partner with subsequent “docking” and energy dissipation. The most specific molecular interactions can likely be found in biological objects; an example is a “binding factor”, a protein (or protein complex) which seeks and binds to a specific spot on DNA to regulate the expression of the corresponding gene. For the important IT-related reasons explained in the Methods section, many binding factors recognize not a single specific sequence but a large set of sequences with certain properties, which forms the pattern for recognition. Here we present a theoretical model of the evolution of such sets and the corresponding patterns, and provide some validating examples for real binding sites: we used the abundant and well-annotated splice sites of a few mammalian genomes to support our conclusions.

Previous applications of IT in genetics were focused mainly on the problems of binding sites and factors operations in a genome. Von Hippel and Berg addressed the combinatorial and thermodynamic properties of binding, such as their specific recognition mechanisms [

We have to note that the problem of choice from a set, i.e. searching for a site in a genome, can be classified as a combinatorial problem rather than a full-featured IT application per se. Examples of the core notions of IT, which paved its way to broad success, are the “channel capacity theorem”, the “typical set” and the “asymptotic equipartition property” (in its basic version called the Shannon-McMillan-Breiman theorem). To our knowledge none of these concepts has been applied to positional information in population genetics; hence the present work is an attempt at a more complete integration of the IT conceptual framework into genetic models.

The genetic information can be viewed as positional in a general sense: it defines the process of homing and specific binding between molecules, including the binding of a molecule to itself, which is common for proteins and RNAs (i.e. secondary/tertiary structures). Hence such processes turn one-dimensional (sequential) DNA information into 3D shapes, and the energy inflow adds binding/unbinding kinetics, unfolding the temporal dimension. So we have all the basic “physical” properties for a living system: organized dynamic 3D structure with hereditary information stored on a molecular sequence.

The example of a binding site on DNA, which we use widely in this work, merely serves as a convenient visual illustration of a general phenomenon. For instance, the process of protein synthesis starting from transcription of a DNA template can be viewed as a cascade of diverse homing, binding and unbinding events; hence the notion of positional information is quite universal.

Imagine an Engineer who wants to maintain positional information in a population of mutable replicating sequences. He can design recognizers (e.g. proteins, “binding factors”), which recognize specific sub-sequences (“binding sites”). For example, to uniquely define a position on a (quasi-random) sequence of length L, he must use at least log_{2}L bits of information, which takes half of this number of nucleotide positions to define, because each nucleotide position provides 2 bits. The possible number of unique binding sites of that length is then 4^{(log_{2}L)/2} = L. However, in this case any mutation in a site will break the recognition, erasing the information. Hence the only possibility to maintain information is to avoid all mutations: if the mutation rate is sufficiently low and the reproduction rate is high, then some of the progeny sequences will have no mutations, and information can be maintained by discarding all mutated sequences, an extreme example of “purifying” selection. This (rather trivial) mode of maintenance can be accomplished only in small microbial genomes. However, what if mutations in binding sites in progeny cannot be avoided? In that case the Engineer must deploy “redundant coding”, in terms of IT, and store information in redundant patterns: after a round of mutagenesis at least some individuals will retain recognizable sequences, and selection then retains only those individuals which keep the ensemble of patterns unchanged as a whole, so that the information can be maintained indefinitely. Now a binding factor must recognize a set of (“synonymous”) sequences rather than a single sequence.
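The arithmetic behind the Engineer's first design can be sketched in a few lines. This is only an illustration: the function name and the genome length of 3 × 10^9 are assumptions chosen for the example, not values taken from the model.

```python
import math

def location_information(seq_len: int) -> tuple[float, int]:
    """Bits needed to single out one position on a sequence of length
    seq_len, and the number of fully conserved nucleotide positions
    (2 bits each) needed to carry that many bits."""
    bits = math.log2(seq_len)
    positions = math.ceil(bits / 2)
    return bits, positions

# Hypothetical genome length, used only for illustration.
bits, positions = location_information(3_000_000_000)
print(f"{bits:.1f} bits, {positions} fully conserved positions")
```

Because the requirement grows only logarithmically with L, even a very long sequence needs a short site, as long as the site is copied without error.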

Here we define “a site” as a specific site in a genome; “a (typical) set” as the set of functionally acceptable sequences for a site, which keep its functional performance (a phenotype) within acceptable limits; and “a pattern” as a set together with its equilibrium frequencies, since some sequences in a set might be more frequent than others.

Here we are not concerned with specific ways of binding factors functioning or how selection picks individual sequences, for our goals it is sufficient to know the final result—the molecular “homeostasis” of patterns and corresponding sets. Apparently gene-specific binding sites of the same binding factor may have different acceptable sets, and/or equilibrium distributions within a set, depending on individual gene regulation requirements, hence their patterns should be regarded as different, though they are used by the same binding factor. These position-specific pattern differences are commonly neglected in the literature which applies computational methods involving genetic information (GI) formalism, because currently site-specific patterns are unattainable directly in normal populations due to insufficient divergence from the last shared ancestor of a site.

For the storage purposes alone the mutation rate can be pushed to a minimum. However, the evolvability demands non-zero rates, so the balance between information maintenance and evolvability is required. Here, for brevity, we focus mainly on the maintenance phenomenon, because a considerable change of the total genomic information can occur only on geologically large time scales (e.g. the human and chimpanzee genomes are ~ 99% identical), and once the maintenance mode is clarified, it is relatively clear how to model the progressive evolution of genetic information.

For simplicity we assume an asexual population in equilibrium, a constant population size, and a genome with a balanced content of the four nucleotides. We also assume a pattern with independent positions. More sophisticated “encoding” schemes can be evaluated, but at this stage we prefer to keep things simple, because the main predictions and conclusions of the model are sufficiently interesting for this simple encoding scheme and do not lose generality. Here we consider only single base substitutions and do not explore the roles of indels, genomic rearrangements, epigenetics, recombination, ploidy, variability or evolution of “recognizers”, etc.; we consider the concise IT “engineering” problem as defined above. These factors can be added as interesting extensions without interfering with the conclusions drawn from the basic model.

We will use the term single position site or simply position (P), bearing in mind not a specific nucleotide, but a 4-vector (f_{A}, f_{G}, f_{C}, f_{T}), where each f_{N}, N ∈ {A, G, C, T}, is the population frequency of the corresponding nucleotide in a given position, as shown in

In equilibrium, when the composition of a site does not affect the phenotype, selection ignores it and the site contains no information by definition. Due to random mutagenesis this site will be occupied in a population by the four nucleotides with equal frequencies of 1/4. However, if a site is functional, selection will affect the equilibrium frequencies. The variability of a site can be naturally quantified by the entropy:

$$H(P) = -\sum_{N \in \{A,G,C,T\}} f_{N} \log_{2} f_{N}$$

A non-functional site with frequencies of 1/4 has the maximum variability of 2 bits, and for a fully conserved site with a single acceptable nucleotide the variability is zero. To obtain the measure of genetic information we take the complement of the entropy with respect to this maximum: GI(P) = 2 − H(P). Correspondingly, for a fully conserved site GI takes the maximum of 2 bits and for a non-functional site it is zero, while intermediate values quantify the degree of conservation, hence the biological value of this measure.
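The definition can be checked numerically with a minimal sketch. The function names and the intermediate frequency vector are illustrative assumptions; the entropy is the standard Shannon entropy in bits, with the usual convention that a zero frequency contributes nothing.

```python
import math

def entropy_bits(freqs):
    """Shannon entropy (bits) of a nucleotide frequency 4-vector."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def genetic_information(freqs):
    """GI(P) = 2 - H(P) for a single position."""
    return 2.0 - entropy_bits(freqs)

print(genetic_information([0.25, 0.25, 0.25, 0.25]))  # non-functional position
print(genetic_information([1.0, 0.0, 0.0, 0.0]))      # fully conserved position
print(genetic_information([0.7, 0.1, 0.1, 0.1]))      # intermediate conservation
```

Reordering the entries of the vector leaves the result unchanged, since the sum runs over all four frequencies symmetrically.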

GI does not depend on permutations of elements in the nucleotide frequencies vector. Each GI value can be obtained with infinitely many variants of nucleotide frequency vectors, except for the degenerate cases of GI = 0 bit and GI = 2 bits.

This definition of GI was proposed more than 25 years ago by Schneider et al. [

Schneider conjectured [ that GI_{binding} = Σ_{i} GI(P_{i}), i.e., the sum of the GIs of the individual positions in a binding site is equal to the information necessary to locate it in the corresponding sequence context. Hence the hypothesis is that, besides measuring the degree of conservation, GI_{binding} has an additional interpretation. The conjecture is interesting and biologically important but non-trivial because, despite both values being “in bits”, the definitions of GI and localization information are different and not directly related. For sensible GI-modeling applications it is therefore crucial to provide a rigorous proof of this conjecture.

If we describe an abstract binding site in terms of IT as a “source” which “generates” particular sequences (its realizations in a population), these two information values can be related with the aid of the asymptotic equipartition property (AEP) [. According to the AEP, the sequences generated by a source with entropy H_{binding} mostly fall into a “typical set” [ of about 2^{H_{binding}} sequences: although any of the possible sequences (4^{L} for a site of length L) can be an outcome, the ones actually observed belong, with probability close to 1, to the typical set, whose members are distributed with approximately equal probabilities. The exponent value reflects the variability of a binding site, or the “source entropy”.

To select a single site from a sequence of length N the required information is log_{2}N bits, interpretable as a number of binary yes/no questions required for the task. Less specific search requires less information: Selection of any item belonging to a set N_{set} requires log_{2}N - log_{2}N_{set} bits. Returning to the localization information we recall that a binding factor defines the corresponding typical set, recognizing sequences belonging to it and ignoring all others. Then it is easy to see that the corresponding localization information is equal to GI_{binding}. This result naturally links the continuous transversal variability (i.e. across population, orthogonal to multiple sequences alignment) with the discrete “longitudinal” localization on a sequence. The content of a typical set might provide a biological error protection mechanism: if a mutation does not remove a sequence from corresponding typical set, it is effectively “synonymous”.
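The step that is “easy to see” can be written out explicitly. The following is a sketch of one reading of the argument, under the model's assumptions of independent positions and a balanced genome: the binding factor picks any member of the typical set out of the 4^{L} possible sequences of a site of length L, and by the AEP the typical set holds about 2^{H_{binding}} sequences. The symbol H_{binding} for the summed per-position entropy is notation introduced here for the illustration.

```latex
\begin{aligned}
N &= 4^{L} = 2^{2L}, \qquad
N_{set} \approx 2^{H_{binding}}, \qquad
H_{binding} = \sum_{i=1}^{L} H(P_{i}), \\
\log_{2} N - \log_{2} N_{set}
  &\approx 2L - \sum_{i=1}^{L} H(P_{i})
   = \sum_{i=1}^{L} \bigl(2 - H(P_{i})\bigr)
   = \sum_{i=1}^{L} GI(P_{i}) = GI_{binding}.
\end{aligned}
```

The approximation sign comes from the AEP itself, which holds asymptotically for long sequences.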

To our knowledge the additivity of GI has not been proved but has been used as an ad hoc conjecture, since it is impossible to prove it without proving the AEP. However, the additivity of GI is a critical property for whole-genome information modeling. The concept of sequence “typicality” (as an object for the force of selection) may also prove useful, as it represents a collective property of a binding site, naturally accounting for the cumulative effects of single positions, as opposed to modeling the interaction of a large number of separate selection coefficients for each allele in each position. Typicality considerations indicate that the same mutation can make a site either more typical or less typical, depending on the other positions of the site; hence the selective value of a mutation can have different signs depending on the background.
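The background dependence can be illustrated with a toy calculation. Everything specific here is an assumption made for the example: the 4-position pattern, its frequencies, and the “atypicality” score (the distance between a sequence's surprisal and the pattern entropy). The model itself does not prescribe how selection measures typicality.

```python
import math

# Hypothetical 4-position pattern; every position has the same frequencies.
PATTERN = [{"A": 0.7, "G": 0.1, "C": 0.1, "T": 0.1}] * 4

def pattern_entropy(pattern):
    """Sum of per-position entropies (bits); positions assumed independent."""
    return -sum(f * math.log2(f) for pos in pattern for f in pos.values() if f > 0)

def surprisal(seq, pattern):
    """-log2 of the probability of seq under the pattern (bits)."""
    return -sum(math.log2(pos[nt]) for nt, pos in zip(seq, pattern))

def atypicality(seq, pattern):
    """Distance of the sequence surprisal from the pattern entropy.
    Zero would be a perfectly typical sequence (an assumed, illustrative score)."""
    return abs(surprisal(seq, pattern) - pattern_entropy(pattern))

def mutation_effect(seq, index, new_nt, pattern=PATTERN):
    """Change in atypicality caused by one substitution; negative = more typical."""
    mutant = seq[:index] + new_nt + seq[index + 1:]
    return atypicality(mutant, pattern) - atypicality(seq, pattern)

# The same substitution (A -> G at the first position) on two backgrounds.
print(mutation_effect("AAAA", 0, "G"))  # negative: the site becomes more typical
print(mutation_effect("AGGA", 0, "G"))  # positive: the site becomes less typical
```

On the all-consensus background the sequence is “too probable” for the pattern, so a rare nucleotide moves it toward the typical set; on a background that already carries two rare nucleotides, the same substitution moves it further away.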

By definition “population genetics is the study of allele frequency distribution and change” [

Traditional models consider only two alleles due to common observations: the vast majority of observed variants (e.g. SNPs in a population) have two states, because too little time passed since the last common ancestor. However, for our model we ask what if this time goes to infinity in a stable population without progressive evolution and other disturbing events. When we understand the equilibrium we can explore the evolution of variability “snapshots” created by recurring population bottlenecks.

We suggest the law of GI conservation in population genetics–a position with any intermediate value of GI can be at equilibrium, maintaining constant GI and nucleotide frequencies (hence the pattern and positional information of the corresponding binding site). So-called “balancing selection” where the frequencies may be stable due to heterozygotes advantage [18,19] is apparently different from our generalization (possibly interesting ploidy effects are not explored here for brevity).

The information already accumulated in a genome requires maintenance to prevent mutational degradation and the majority of accumulating mutations (in functional sites) reflects the maintenance. Traditional models are often based on historical observational biases: for instance, it is easier to observe and study Mendelian traits as compared to low penetrance [

Anticipating the results, we can say that mutational expansion into this potential variability is perceived as “neutral evolution”, which is in fact “maintenance evolution”, where the observed deleterious (for the GI value) mutations are compensated by an approximately equal number of beneficial mutations. The role of beneficial mutations is usually overlooked in classical models, as common wisdom dictates that they are rare, so that all the maintenance is carried out by purifying selection, which in our model is the special case when GI is close to 2 bits.

The postulated constancy of frequencies and GI can be exemplified by the divergence of splice site patterns: the difference between mouse and human splice logos is quite small despite the large number of mutational and selective events that have happened since the divergence of the two species.

Maximum divergence of GI (less than 0.08 bit) can be observed in the fifth donor site position. Notably the number of splice sites is hundreds of thousands; hence mouse-human divergence shows the phenomenon of constant GI for the total of millions of nucleotide positions for a period of tens of millions of years.

As the average length of exons is ~100 nucleotides, splice sites constitute significant amount of genomic sequence in comparison with coding sequences; and it is natural to assume that this mode of evolution affects significant fraction of a genome besides splice sites. Other commonly known binding sites tend to be of sufficient length and high conservation (computational methods) and/or high binding affinity and specificity (experimental methods), creating observational biases with the preference for long sites with high GI per nucleotide. However, splice sites provide a unique opportunity for our analysis because of their large number and well-defined locations in a genome.

By one of the classical definitions: “Genetic load is the reduction in selective value for a population compared to what the population would have if all individuals had the most favored genotype” [

Traditionally equilibrium states are modeled through their stability to perturbations, i.e. deviation from the equilibrium caused by some external perturbation is returned back by some stabilizing force. In our case the perturbations are random mutations and the force of (purifying) selection is compensating them. Thus it is straightforward to model the maintenance of a pattern: initially nucleotide frequencies are (f_{A}, f_{G}, f_{C}, f_{T}), then mutagenesis pushes them into (f_{a}, f_{g}, f_{c}, f_{t}), then these frequencies are corrected by reproduction and selection, preserving the initial value of GI and returning nucleotides frequencies back to the initial values:

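$$(f_{A}, f_{G}, f_{C}, f_{T}) \xrightarrow{\text{mutagenesis}} (f_{a}, f_{g}, f_{c}, f_{t}) \xrightarrow{\text{selection}} (f_{A}, f_{G}, f_{C}, f_{T}).$$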

The changes in frequencies are assumed to be small.

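Mutations can be of two types: transitions change a purine to another purine or a pyrimidine to another pyrimidine, ti = {A ↔ G, C ↔ T}, and transversions are all others, tv = {A, G ↔ C, T}. Here we assume that all 4 transitions are equiprobable, as are all 8 transversions. The system for descendant nucleotide frequencies can be written as:

$$
\begin{aligned}
f_{a} &= (1-p)\, f_{A} + p\left[k f_{G} + \tfrac{1-k}{2}(f_{C} + f_{T})\right] \\
f_{g} &= (1-p)\, f_{G} + p\left[k f_{A} + \tfrac{1-k}{2}(f_{C} + f_{T})\right] \\
f_{c} &= (1-p)\, f_{C} + p\left[k f_{T} + \tfrac{1-k}{2}(f_{A} + f_{G})\right] \\
f_{t} &= (1-p)\, f_{T} + p\left[k f_{C} + \tfrac{1-k}{2}(f_{A} + f_{G})\right]
\end{aligned}
$$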

where p is the mutation probability and k is the probability of a transition, given that a mutation occurred (k ≈ 2/3 for mammals [

Due to the pressure of mutagenesis, the GI of the descendant frequency vector is always less than or equal to the initial GI (equality happens only if the initial GI = 0 or p = 0).
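One round of mutagenesis can be sketched as follows. The update rule is a reading of the stated assumptions (a mutated nucleotide goes to its transition partner with probability k and to each of its two transversion partners with probability (1 − k)/2); the parent frequencies and the value of p are illustrative, not taken from data.

```python
import math

def mutate(freqs, p, k):
    """One round of mutagenesis on (f_A, f_G, f_C, f_T).
    p: mutation probability; k: probability that a mutation is a transition.
    A-G and C-T are the transition pairs."""
    fA, fG, fC, fT = freqs
    tv = (1.0 - k) / 2.0  # probability of each of the two transversions
    return (
        (1 - p) * fA + p * (k * fG + tv * (fC + fT)),
        (1 - p) * fG + p * (k * fA + tv * (fC + fT)),
        (1 - p) * fC + p * (k * fT + tv * (fA + fG)),
        (1 - p) * fT + p * (k * fC + tv * (fA + fG)),
    )

def genetic_information(freqs):
    """GI = 2 - H, in bits."""
    return 2.0 + sum(f * math.log2(f) for f in freqs if f > 0)

parent = (0.7, 0.2, 0.05, 0.05)          # illustrative frequencies
child = mutate(parent, p=1e-3, k=2 / 3)  # illustrative p; k as quoted for mammals
print(genetic_information(parent), genetic_information(child))
```

For this example the descendant vector is slightly closer to the uniform distribution, so its GI is slightly lower than that of the parent, while a uniform vector is left unchanged.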

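As one example of the many possible optimization parameters, we define a variant of mutational load (ML) as the Manhattan norm of the frequency deviation vector:

$$ML = \sum_{N \in \{A,G,C,T\}} |\Delta f_{N}|, \qquad \Delta f_{N} = f_{n} - f_{N},$$

where f_{N} and f_{n} are the initial and descendant frequencies of nucleotide N.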

Minimizing this measure would minimize the number of mutations rejected by selection, i.e. the rate of “genetic deaths”, which makes it biologically plausible. As can be seen from the expression for the optimal solution (see Equation (6)), in that case Δf_{A} ≤ 0 and Δf_{N} ≥ 0, ∀ N ∈ {G, C, T}, assuming A to be the highest frequency variant. Then selection can correct the frequencies simply by removing the alleles (C, G and T) which increased in frequency. So the number of individuals which must go extinct is proportional to the defined ML, which is equal to −2Δf_{A} for the optimal frequencies.
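The relation ML = −2Δf_{A} follows from the frequency changes summing to zero whenever only the most frequent nucleotide loses frequency. A minimal numerical check is sketched below; the frequency vector is illustrative and is not claimed to be the optimal solution, and the mutation step is the same reading of the assumptions as in the text (transition partner with probability k, each transversion partner with probability (1 − k)/2).

```python
def mutate(freqs, p, k):
    """One round of mutagenesis on (f_A, f_G, f_C, f_T)."""
    fA, fG, fC, fT = freqs
    tv = (1.0 - k) / 2.0
    return (
        (1 - p) * fA + p * (k * fG + tv * (fC + fT)),
        (1 - p) * fG + p * (k * fA + tv * (fC + fT)),
        (1 - p) * fC + p * (k * fT + tv * (fA + fG)),
        (1 - p) * fT + p * (k * fC + tv * (fA + fG)),
    )

def mutational_load(freqs, p, k):
    """Manhattan norm of the frequency deviation vector, plus the deviations."""
    child = mutate(freqs, p, k)
    deltas = [c - f for c, f in zip(child, freqs)]
    return sum(abs(d) for d in deltas), deltas

parent = (0.7, 0.2, 0.05, 0.05)  # illustrative, A is the most frequent variant
ml, deltas = mutational_load(parent, p=1e-3, k=2 / 3)
print(ml, -2 * deltas[0])        # the two values agree for this vector
```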

As we for simplicity consider equilibrium, we keep population size constant. Contrary to typical classical models, the population size does not matter for GI maintenance and evolution. Population size matters for phenomena such as selective sweeps—fixation of a suddenly appeared site with GI = 2 bits, which is a non-equilibrium event and out of the scope of this model.

With biased mutagenesis (k ≠ 1/3), different compositions (e.g. nucleotide permutations) of a 4-vector with the same GI can produce different ML (

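The minimum ML for a given GI is the solution of the following optimization problem:

$$\min_{f_{A}, f_{G}, f_{C}, f_{T}} ML \quad \text{subject to} \quad GI(P) = \text{const}, \quad \sum_{N} f_{N} = 1, \quad f_{N} \ge 0,$$

where ML is the mutational load, which has to be minimized for a given GI value by adjusting the frequencies in the 4-vector. The solution does not depend on the probability of mutation p; it was found numerically using an evolutionary algorithm [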

where f_{1} is the highest frequency, f_{2}—the frequency connected to the f_{1} by transition, f_{3}—maximum of transversions to f_{1}, f_{4}—transition to f_{3}. k—probability of transition, upon condition that the mutation occurred.
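The dependence of ML on how frequencies are assigned to nucleotides can be explored by brute force over permutations. This sketch is not the evolutionary algorithm used for the solution: it only compares the 24 orderings of one illustrative frequency vector, which all share the same GI, under unbiased (k = 1/3) and biased (k = 2/3) mutagenesis, using the same assumed mutation step as above.

```python
from itertools import permutations

def mutational_load(freqs, k, p=1e-3):
    """Manhattan norm of the frequency change after one round of mutagenesis.
    freqs is ordered (A, G, C, T); A-G and C-T are the transition pairs."""
    fA, fG, fC, fT = freqs
    tv = (1.0 - k) / 2.0
    child = (
        (1 - p) * fA + p * (k * fG + tv * (fC + fT)),
        (1 - p) * fG + p * (k * fA + tv * (fC + fT)),
        (1 - p) * fC + p * (k * fT + tv * (fA + fG)),
        (1 - p) * fT + p * (k * fC + tv * (fA + fG)),
    )
    return sum(abs(c - f) for c, f in zip(child, freqs))

BASE = (0.7, 0.2, 0.05, 0.05)  # illustrative values

def distinct_loads(k):
    """Distinct ML values over all orderings of BASE (rounded to hide float noise)."""
    return sorted({round(mutational_load(perm, k), 9) for perm in permutations(BASE)})

print(distinct_loads(1 / 3))  # unbiased mutagenesis: one value for all orderings
print(distinct_loads(2 / 3))  # biased mutagenesis: the ordering matters
best = min(permutations(BASE), key=lambda perm: mutational_load(perm, 2 / 3))
print(best)
```

For this vector the lowest load is reached when the two highest frequencies are transition partners, which is in line with the role of f_{2} as the frequency connected to f_{1} by a transition.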

The solution, i.e. the optimal frequency vector as a function of GI, is shown in

derivative discontinuities near 0.5 and 1 bits, with corresponding changes in the number of “degrees of freedom” and permutation symmetries. That is theoretically interesting because phase transitions are generally assumed to be highly non-analytic.

However, we cannot expect these experimental data to match this particular optimization precisely: other optimization parameters are possible (for instance, the total site length, to optimize the transcription speed), and the pattern (i.e. the logo) itself was derived with the simplifying assumptions outlined earlier (e.g. ignoring exon-specific differences between individual patterns). Moreover, it is natural to expect the existence of non-optimal compositions due to specific regulatory demands.

Using BioMart tools [

We compared the substitution rates for splice sites divergence between human and two other primates—chimpanzee and rhesus (

Genes make up approximately 1.5% of human genome. Functional significance of remaining 98.5% non-coding DNA is still largely undetermined. A number of recent studies show that the signatures of purifying selection are wide-spread in non-coding DNA [

than the above estimates (i.e. 15% of human genome), but the bulk of this functionality simply escapes detection by conventional methods. The provided model shows that it is possible to store any amount of error-free (binding) information with arbitrarily high substitution rates, provided the sequences are sufficiently long. This is analogous to the revelation in signal transmission theory brought about by IT: before IT, it was supposed that the usable signal/noise ratio had to be high and that some transmission errors were inevitable (e.g. in analog broadcast). However, IT showed that efficient error-free communication is possible at any noise level, at rates below the channel capacity. In genetics, the intuition that a functional sequence must have high conservation (high signal/noise) went as far as calling weakly conserved sequences, such as introns and intergenic non-repetitive sequence, “junk DNA” (constituting about 50% of a genome), while we (keeping faith in nature’s thriftiness) speculate that it is an evolutionary innovation for increasing efficiency.
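The trade-off between conservation and length can be made concrete with a small calculation that relies only on the additivity of GI over independent positions. The target of 32 bits and the per-position GI values are hypothetical numbers chosen for illustration.

```python
import math

def positions_needed(target_bits: float, gi_per_position: float) -> int:
    """Number of independent positions whose GI values add up to target_bits."""
    return math.ceil(target_bits / gi_per_position)

# Hypothetical target: 32 bits of positional information.
for gi in (2.0, 0.5, 0.1, 0.01):
    print(f"GI = {gi} bit/position -> {positions_needed(32, gi)} positions")
```

Lower conservation per position is paid for with a proportionally longer sequence, not with a loss of information.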

Another counter-intuitive feature of the proposed model is that significant fraction of random mutations is “positive”—compensatory for GI storage (

The shift of paradigm we introduce here is to model the evolution and/or conservation of probabilistic patterns instead of the evolution of defined sequences. A pattern can be thought of as a superposition of sequences (which forms the corresponding typical set). Instead of fixation as an elementary act of evolution, a mere shift in allele frequencies implies evolution in this framework. This seems to make little sense for a single allele; however, for millions of alleles in a population, and considering that the frequency of beneficial mutations can be high, it introduces quite a different mode of evolution than traditionally considered. For the first time this framework allows one to model quantitatively the evolution of the total genomic information (due to the additivity of GI), rather than modeling the fixation dynamics of single alleles with arbitrarily assigned selective values. The high frequency of beneficial mutations raises the question of which forces impose the limits on progressive evolution, e.g. why some species are stable for millions of years.

A simple gedanken experiment with the provided model shows that for a given genome size and given mutation and reproduction rates there is a limit on the amount of information which can be maintained in a population, so there is a limit on the absolute number of functional mutations a genome can tolerate, possibly explaining Drake’s rule [

With other things being equal, a species with better GI storage optimization (“coding efficiency”) is more efficient, since a lower genetic load effectively implies better survival rates. The “survival of the fittest” is equivalent to the survival of the most efficient, naturally including information processing efficiency. From IT it is known that better efficiency requires higher complexity: coders and decoders must have memory and sufficient algorithmic complexity. In general, approaching the channel capacity limit more closely requires an increase in memory and computational complexity. Hence IT naturally links the drive to efficiency with the drive to complexity. While the drive to efficiency is self-evident in biological systems, the drive to complexity has been difficult to rationalize. Traditionally, complexity is assumed to “emerge” passively, as simple rules (interactions), applied recursively, can generate perceivably complex patterns (which are still algorithmically simple); in contradistinction, the lesson from IT is that there can be an active drive to increase algorithmic complexity. From this perspective the “evolution” of IT itself is quite instructive [

We thank Anatoly Ruvinsky and Peter Krawitz for valuable comments and suggestions.