Computation of the Genetic Code: Full Version

One of the problems in the development of the mathematical theory of the genetic code (a summary is presented in [1] and a detailed account in [2]) is the problem of calculating the genetic code. No similar problem is known elsewhere, and it could only have been posed in the 21st century. This work is devoted to one approach to solving it. It gives the first detailed description of the method of calculating the genetic code, the idea of which was published earlier [3]; the choice of one of the most important sets for the calculation is based on the article [4]. This set of amino acid combinations is a complete representation of the set of overlapping triples of genes belonging to the same DNA strand. A separate issue was the initial point that triggers the iterative search for all codes consistent with the initial data. Mathematical analysis has shown that this set contains some ambiguities, which were found thanks to the compressed representation of the set that we propose. As a result, the developed method of calculation was divided into two main stages, and at the first stage only single-valued domains were used in the calculations. The proposed approach made it possible to reduce significantly the amount of computation at each step in this complex discrete structure.


Introduction
The idea of calculating the genetic code arose after many years of research in mathematical genetics. The basic idea was that today, half a century after its discovery, the code can apparently be calculated on the basis of experimental data that are already known. The approach proposed below is neither the main nor a comprehensive one; it can only be considered as one of the attempts to find an approach to the solution of the task. After all, it is about finding a solution in a

Theorem for Homogeneous Overlaps
We consider unusual ways of recording genetic information: overlapping genes, where the same DNA portion corresponds to more than one protein. We investigated all 5 possible cases of gene overlap permitted by the DNA structure, which were studied earlier [5]. This study was based on a mathematical analysis of all 5 possible overlap cases and relied on sets of so-called elementary genetic overlaps (e.o.), that is, overlaps corresponding to a pair of single amino acids. A brief analysis of such sets is presented in [6], and the final version in [2]. Figure 1 describes the structure of the sets W1-W5 and shows 4 e.o. from each of these sets.
The principal result of this research is given in [2], where it was shown that the presented list of elementary overlaps can make up any (!) overlap allowed by the structure of the genetic code: not only overlaps of 2 genes but also all admissible overlaps of 3 to 6 genes. The urgency of these problems is due to the current situation: overlapping genes, which are common in viruses, mitochondria, bacteria and plasmids, have also been found in the large genomes of eukaryotes, including humans. The number of overlaps is usually high; for the human genome it is about 1700 [7].
In the mathematical analysis of overlaps of more than two genes, we have investigated several problems. Of course, it would be possible to construct the sets of all e.o. for 3 to 6 genes. It is not difficult to do this with modern computers. However, the main question is what new conclusions this can give. That is why we follow the traditional way and start from problems. Let us first briefly discuss only some of them, for which we have already published solutions. The first of these concerns the analysis of ambiguities [8] [9] [10], this is when two
Another problem was connected with the construction and analysis of the set of elementary overlaps for 3 genes overlapping in the same DNA chain. It was established that there are only 307 such overlaps. On the basis of these overlaps, a new problem was posed, connected with the calculation of the genetic code by mathematical methods [11] [12]. The choice of exactly this set for calculating the code was based on a theorem that was published relatively recently [4]. That theorem concerns the calculation and analysis of all homogeneous e.o. for 2 to 6 genes, that is, e.o. that correspond to the same amino acid.
Its solution is given by the following theorem.
For the amino acid Ama (highlighted by hatching) encoded by the triplet n1n2n3, there are 5 alternative amino acids Ama1-Ama5, the encodings of which are formed by −1, +1 shifts in the same DNA chain (→) and by −1, 0, +1 shifts in the complementary DNA strand (←). The symbols n_i, i∈(0,4), denote nucleotides from the set A, T, C, G, and n′_i, i∈(0,4), denote the complementary components: n′_i = A for n_i = T, and n′_i = G for n_i = C, for any i∈(0,4), and vice versa. In order to sequentially isolate e.o.-2 for all 5 cases of pair overlaps from [2] [3], in Figure 3 one should
First of all, it is necessary to exclude from consideration all homogeneous overlaps in which two strands of DNA participate. Consideration of these overlaps requires the introduction of a double strand of DNA, which is an additional condition in the problem. In eliminating such homogeneous overlaps, we proceed from the principle of constructing an algorithm with a minimum number of conditions. Therefore, only homogeneous overlaps belonging to the same DNA chain remain in our examination: for pairs of amino acids (1) there are only 5 of them, and for three amino acids (3) there are 4 in total.
Thus, we selected the main working sets of e.o., namely, those in which these homogeneous overlaps are present. The final version of these sets is presented on pages 312-319 in [2].
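The counts of same-strand homogeneous overlaps quoted above can be checked by brute force. The following is only a minimal sketch, not the author's procedure: it assumes the standard genetic code, excludes stop codons, and treats a homogeneous overlap as a run of consecutive codons shifted by one nucleotide that all encode the same amino acid. The names `CODE` and `homogeneous` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...
# ("*" marks a stop codon). Using this table is an assumption of the sketch.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def homogeneous(n_codons):
    """Amino acids that can fill n_codons consecutive codons, each shifted by +1."""
    found = set()
    for window in product(BASES, repeat=n_codons + 2):
        s = "".join(window)
        acids = {CODE[s[i:i + 3]] for i in range(n_codons)}
        # Homogeneous: every codon in the window encodes the same amino acid.
        if len(acids) == 1 and "*" not in acids:
            found |= acids
    return found


pairs = homogeneous(2)    # codons n0n1n2 and n1n2n3
triples = homogeneous(3)  # codons n0n1n2, n1n2n3 and n2n3n4
print(sorted(pairs), sorted(triples))
```

Under these assumptions the sketch finds five amino acids for pairs (Phe, Gly, Lys, Leu, Pro in one-letter form F, G, K, L, P) and four for triples (Phe, Gly, Lys, Pro), which is consistent with the counts of 5 and 4 given in the text.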
Let us consider the sets with these overlaps in more detail. Earlier [2] we introduced the notion of elementary overlap only with respect to overlapping pairs of genes. Let us generalize this concept to three genes belonging to the same DNA chain. By the term elementary overlap we mean the overlap of the codons of single amino acids in the maximum number of positions. In the notation of Figure 3(b), there is one u1 for an overlap in one position, two u2 for an overlap in two positions, and one u3 for the overlap of 3 amino acids. Figure 4 and Figure 5 present only fragments of these sets, and some of their characteristics are presented in Table 1.
The amino acids are shown in Table 1. The elements are given in the form that is used in this task. Each of the elements consists of three lines: upper, middle and lower. The search covers all the elements of the three sets U1, U2, U3 designated above, which correspond to the genetic experiments.
N. N. Kozlov
Figure 6. The compressed representation for the 307 elements of the main set U3: for each Ama of this set (it is indicated in the corresponding cell), the amino acid Ama1 is given along the abscissa axis and Ama2 along the ordinate axis (see Figure 1(a)). It turned out that the resulting representation is not single-valued but contains multiple ambiguities: these are the cases when more than one Ama value (from 2 to 4) corresponds to the same Ama1 and Ama2 values. These cases are shaded in the figure and denoted A1-A13, i.e. there are only 13 of them, although the figure shows 34 hatchings. The reason is that A6, A9 and A10 are represented 4 times, A1, A5, A7, A8 and A11 three times, A3 and A4 twice, and A2, A12 and A13 only once.
Arg, both along the abscissa axis and along the ordinate axis. However, the most significant area in Figure 6 is the one that corresponds to the cases where neither axis carries any of the amino acids Ser, Leu, Arg. For our calculations this region is reduced further by eliminating from it all cells containing Ser, Leu or Arg. In Figure 7, the shading corresponds to the three amino acids mentioned, and the non-zero elements of the unshaded region have the following property: each Ama value is unique for the corresponding pair Ama1 and Ama2.
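The compressed representation and the uniqueness property can be illustrated with a short sketch. It is not the author's construction: it assumes the standard genetic code, drops every 5-nucleotide window that contains a stop codon, and keys each cell by the pair (Ama1, Ama2) in one-letter notation. The names `compressed_u3`, `cells` and `SIX_FOLD` are hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in the order TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SIX_FOLD = set("SLR")  # Ser, Leu, Arg


def compressed_u3():
    """Map each pair (Ama1, Ama2) to the set of Ama values found between them."""
    cells = defaultdict(set)
    for window in product(BASES, repeat=5):
        s = "".join(window)  # n0 n1 n2 n3 n4
        ama1, ama, ama2 = CODE[s[0:3]], CODE[s[1:4]], CODE[s[2:5]]
        if "*" not in (ama1, ama, ama2):  # assumption: stop codons are excluded
            cells[(ama1, ama2)].add(ama)
    return cells


cells = compressed_u3()
u3_size = sum(len(v) for v in cells.values())  # distinct (Ama1, Ama, Ama2) triples
ambiguous = {k: v for k, v in cells.items() if len(v) > 1}
print(u3_size, len(ambiguous))
```

The printed values can be compared with the 307 elements and the shaded cells reported for Figure 6; they may differ if the author's construction uses conventions that this sketch does not model. What the sketch does confirm under its assumptions is that every cell with neither Ser, Leu nor Arg on its axes holds exactly one Ama value.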
The above property allowed us to define the first stage of the calculation, in which the encoding of each Ama value is computed from the encodings of the corresponding pair Ama1 and Ama2. The results of the step-by-step solution of the problem are presented in Table 2, but the most important stage of the study was the question of finding the initial approximation.
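The first-stage computation rests on the overlap geometry of Figure 3(a): if Ama1 is encoded by n0n1n2 and Ama2 by n2n3n4, then the codon n1n2n3 of Ama is fixed, provided the two codons agree on n2. A minimal sketch of this step follows; the function names and the example triplets over the letters a, b, c, d are illustrative assumptions, not data from the paper.

```python
def middle_codon(codon_ama1, codon_ama2):
    """Return n1n2n3 given n0n1n2 and n2n3n4, or None if they disagree on n2."""
    if codon_ama1[2] != codon_ama2[0]:
        return None  # the two codons cannot share the position n2
    return codon_ama1[1] + codon_ama1[2] + codon_ama2[1]


def candidate_codons(codons_ama1, codons_ama2):
    """All codons for Ama that are compatible with known codon sets of Ama1 and Ama2."""
    result = set()
    for c1 in codons_ama1:
        for c2 in codons_ama2:
            m = middle_codon(c1, c2)
            if m is not None:
                result.add(m)
    return result


# Hypothetical example: two known codons for Ama1 and two for Ama2.
print(candidate_codons({"aab", "aac"}, {"bcd", "cdd"}))
```

In a single-valued cell of the compressed representation every codon produced this way can be assigned to the one Ama in that cell, which is why the first stage is restricted to such cells.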
The initial approximation
THE SOLUTION OF THE PROBLEM. We use the standard three-letter abbreviations for each of the 20 amino acids listed in the first column of Table 1.
We have the set A0: Ama_i, i∈(1,20). We introduce a definition. Let us turn to the previously introduced homogeneous overlaps. As before, we call a combination of amino acids constructed on the basis of an elementary genetic overlap homogeneous if the same amino acid participates in it. For homogeneous elements of the set we have the following.
Property. Let the encodings of Ama for a homogeneous u3 have one of the following forms: (6)
where small letters denote single components of the set N, and capital letters denote some subsets of this set, up to N itself. Then a homogeneous u3 can exist only if at least one base triplet, that is, a triplet with three identical letters, is used.
For the proof we successively substitute each of the representations (6) into u3: (7)
where n'i is a single component of the set Ni, i∈(1,6), and the line na. gives the nucleotide sequences that are formed after this substitution. In the first case, in (3), the base codon n1n1n1 was used to encode the amino acid Ama from the bottom position, in the second n2n2n2 from the middle position, and in the third n3n3n3 from the top position.
We turn to the homogeneous u3 from the set U3, of which there turned out to be 4: (8)
Within the framework of the assumption specified in the Property, the following step-by-step process of searching for a genetic code is proposed; see Table 2.
Step 1. We assign the corresponding base codons to the amino acids from (8). This assignment is not unique. However, in our approach the set of letters N: a, b, c, d is not correlated with the canonical set of 4 nucleotides; this will be discussed at the end of the paper. Therefore, we will continue to operate with only one of the representations for the amino acids from (4), to which we assign respectively the following base triplets:
Lys: aaa, Phe: bbb, Pro: ccc, Gly: ddd (9)
For further calculations, we turn to some generalized data on the sets U2 and U1, which are given in Table 1.
Step 2. From Table 1 it follows that neither column m12 (which gives the number and the list of amino acids overlapping on bases 1 and 2) nor column m23 (similar data for bases 2 and 3) contains mutual overlaps between the amino acids from (8). Such overlaps occur only in one position and belong to the set U1. We have: (10)

Figure 1. Description of the structure of the sets W1-W5. Four e.o. are shown for each of the sets, together with the total number of e.o. in the corresponding set.

Figure 3. (a) For the amino acid Ama, encoded by the triplet n1n2n3, there are 2 alternative amino acids Ama1 and Ama2. Their encodings are formed by −1, +1 shifts in the same DNA chain (→). The codon n0n1n2 for Ama1 overlaps with the codon n1n2n3 for Ama; the overlap contains the two nucleotides n1n2. The codon n2n3n4 for Ama2 overlaps with the codon n1n2n3 for Ama; the overlap contains the two nucleotides n2n3. As a result, the triple overlap contains only one common position, n2. (b) Elements of the sets of combinations of amino acids formed on the basis of the elementary overlap from Figure 3(a). On the left there is one element of the set U1, in the center there are two elements of the set U2, and on the right one element of the set U3.

Figure 3(a), for the amino acid Ama encoded by the triplet n1n2n3, indicates the alternative amino acids Ama1 and Ama2, whose encodings n0n1n2 and n2n3n4, respectively, are formed by shifts of −1 and +1 nucleotides in the same DNA chain (→). It is assumed that all the values n0-n4 belong to the canonical set of four nucleotides. On the basis of Figure 3(a), it is possible to construct three types of combinations of amino acids, represented in Figure 3(b) and designated u1, u2, u3 respectively: one u1 for overlapping in one position, two u2 for overlapping in two positions, and one u3 for the overlap of 3 amino acids.
The named amino acids Met and Arg are shown in the middle line.
Formulation of the problem. Introduction of a compressed set.
FORMULATION OF THE PROBLEM. Let us have a set of 4 letters N: a, b, c, d, and also triplets, that is, any triples of these letters; there are 64 in all. Moreover, each of the 20 canonical amino acids can be encoded by an arbitrary combination of such triplets. The task is to search for all the genetic codes that correspond to

Figure 4. A list of some elements of the set U2. These are two sets of such elements, corresponding to the first (Met) and the last (Arg) amino acids from the corresponding set, represented in the first column of Table 1. Each of the representations contains two lines: the first corresponds to a shift between codons equal to −1, and the second to a shift of +1. Under the name of the amino acid, the number of elements corresponding to these shifts is indicated, and the lower numbers correspond to the numbers in the full list of these elements.

Figure 5. The elements with numbers 1-8 and 279-307 from the set U3, corresponding to the first (Met) and last (Arg) amino acids from the list.

Figure 7. The reduced region of Figure 6: these are the areas in which the code is calculated. All the shaded regions are cut off for the reasons indicated above, and the main area used in the calculation is represented without shading.