Mathematical Modeling the Biology of Single Nucleotide Polymorphisms (SNPs) in Whole Genome Adaptation

As a living information and communications system, 
the genome encodes patterns in single nucleotide polymorphisms (SNPs) 
reflecting human adaptation that optimizes population survival in differing 
environments. This paper mathematically models environmentally induced adaptive 
forces that quantify changes in the distribution of SNP frequencies between 
populations. We make direct connections between biophysical methods (e.g. 
minimizing genomic free energy) and concepts in population genetics. Our 
unbiased computer program scanned a large set of SNPs in the major 
histocompatibility complex region and flagged an altitude dependency on a SNP 
associated with response to oxygen 
deprivation. The statistical power of our double-blind approach is 
demonstrated in the flagging of mathematical functional correlations of SNP 
information-based potentials in multiple populations with specific 
environmental parameters. Furthermore, our approach provides insights for new 
discoveries on the biology of common variants. This paper demonstrates the 
power of biophysical modeling of population diversity for better understanding 
genome-environment interactions in biological phenomena.


Introduction
How to cite this paper: Lindesay, J., Mason, T.E., Hercules, W. and Dunston, G.M. in Whole Genome Adaptation. Advances in Bioscience and Biotechnology, 9, 520-533.
As a complex, dynamic information system, the human genome encodes and perpetuates the principles of life. The information is incorporated within a mostly fixed template, as well as within the structure of human genome sequence variation. Of the approximately 3 billion nucleotides of the human genome, only about 0.1% consist of bi-allelic single nucleotide polymorphisms (SNPs) distributed throughout the genome [1]. Once the statistical distribution of variation reaches homeostasis in a given environment, a human population can be described in terms of the maintained order and patterns of polymorphisms in the whole genome. We define the environment not just in terms of geophysical parameters, but rather as the complete interface of the population to biologic and evolutionary influences. We assert that the stability of whole genome adaptation is reflected in the frequencies of maintained diversity in these common variants (SNPs) for a population in its environment. As dynamic sites in the human genome, SNPs are often highly correlated into combinations referred to as haploblocks whose haplotypes are maintained throughout generations with fixed frequencies within a given population. Such combinations of SNPs are said to be in linkage disequilibrium (LD). This reflects that certain SNP allelic combinations never appear within the population, implying that only certain haplotypes are biologically viable and generationally maintained. In population dynamics, viability manifests as maintained survivability and functionality. The formation of haploblocks is an emergent property of genomic information that cannot be characterized in the absence of the environmental influences that compel such phase transitions among populations.
Therefore, the dynamically independent statistical genomic units we use are SNP haplotypes together with alleles within SNP sites that are not in contiguous LD with any other SNPs. In particular, changes in the distribution of allelic and haplotypic responses to the environment directly reflect adaptive forces on the population. The resilience of living humans as embodiments of the genome allows for the adaptation of groups to new or changing environments. Differing human populations have emerged as a consequence of the various past migratory groups remaining within specific environments and developing the collective coping mechanisms that have allowed the groups to function effectively in their surroundings. We consider adaptation to be the dynamic process of modifying expressions of the genome towards optimizing the survivability of a group that remains in a particular environment. Using measures of genomic information that reflect the interplay of statistical variations due to the environmental baths within which stable populations exist motivates the development of "genodynamics" as an analog to macro-physical "thermodynamics" [2]. This approach offers a novel way of thinking about population diversity, through the discovery of relationships between the environment and genome variation underlying biology. In this paper, we mathematically model genome-environment interactions and demonstrate straightforward environmental influences upon common genomic variants.

Population Variation and Information
We begin by developing expressions that relate the genomic information measures of human groups whose diversity profile is stable over generations, to additive dynamic state variables that depend upon the environment occupied by that group. Most common informatic measures in the physical and communication sciences are related to the entropy of the statistical system being described. In order to develop entropy measures for a genomic population, the dynamic units of relevance must first be ascertained. Within a given environment, the statistical distributions of certain sets of SNPs become highly correlated as emergent units.
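This means that the genomic information dynamics in a specific environment is an emergent phase of expression of the human genome. The specific entropy $s^{(S)}$ (or per capita entropy) of a single SNP location (S) that is not in (contiguous) linkage disequilibrium will take the form of a canonical ensemble state variable in an environmental bath given by
$$s^{(S)} = -\sum_{a=1}^{2} p_a^{(S)} \log_2 p_a^{(S)},$$
where $p_a^{(S)}$ represents the probability (frequency) that allele a occurs at that location in the population. Similarly, the specific entropy of a SNP haploblock (H) is
$$s^{(H)} = -\sum_{h=1}^{2^{n^{(H)}}} p_h^{(H)} \log_2 p_h^{(H)},$$
where $n^{(H)}$ is the number of SNP locations in haploblock (H), and $p_h^{(H)}$ represents the probability (frequency) that haplotype h occurs in the population.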
The upper limit in this sum represents the number of mathematically possible bi-allelic combinations of alleles within the haploblock. Commonly available tools were used to construct the haploblock structures [3].
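As a minimal sketch of the entropy measure described above, the following Python function evaluates the specific entropy of a set of allele or haplotype frequencies. The base-2 logarithm is an assumption (chosen so that a maximally variant bi-allelic SNP has entropy 1), and the function name and the example frequencies are hypothetical, not taken from HapMap data.

```python
import math

def specific_entropy(freqs):
    """Shannon entropy of allele or haplotype frequencies.

    Assumption: base-2 logarithm, so a bi-allelic SNP with equal allele
    frequencies has entropy 1. Unobserved haplotypes (p = 0) contribute 0.
    """
    total = sum(freqs)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("frequencies must sum to 1")
    return -sum(p * math.log2(p) for p in freqs if p > 0)

# Hypothetical non-linked SNP with allele frequencies 0.75 and 0.25
s_snp = specific_entropy([0.75, 0.25])

# Hypothetical 3-SNP haploblock: only 3 of the 2**3 = 8 possible
# haplotypes are observed in the population
s_block = specific_entropy([0.5, 0.25, 0.25])
```

Haplotypes that never appear in the population simply drop out of the sum, which is how linkage disequilibrium lowers the entropy of a block below its mathematical maximum.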
Since entropy is a measure of the disorder of a distribution, a system with maximum disorder (equal statistical distribution of all mathematically possible combinations) is one of maximum entropy $S_{max}$. The information content (IC) of a maintained statistical distribution is measured by the degree of order that the distribution has relative to a completely disordered one, i.e., the difference between the entropy of a completely disordered distribution and that of the given distribution: $IC = S_{max} - S$ [4]. Such an information measure is likewise additive due to the additive nature of the entropy [5]. Thus, both entropy and information content are extensive state variables whose values increase proportionately with the population size. The normalized information content (NIC) for a given SNP haploblock (H) is a (non-additive) intrinsic measure defined by
$$NIC^{(H)} = \frac{s_{max}^{(H)} - s^{(H)}}{s_{max}^{(H)}},$$
where, as previously stated, $s^{(H)}$ is the specific entropy of the haploblock, and $s_{max}^{(H)}$ is its value when all $2^{n^{(H)}}$ mathematically possible haplotypes occur with equal frequency.
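The normalized information content can be illustrated with a short Python sketch. It assumes base-2 entropies, so that the maximum specific entropy of a block of n bi-allelic SNPs is n; the function name and the example frequencies are hypothetical.

```python
import math

def normalized_information_content(freqs, n_snps):
    """NIC = (s_max - s) / s_max for a block of n_snps bi-allelic SNPs.

    Assumption: base-2 logarithm, so s_max = log2(2**n_snps) = n_snps.
    A single non-linked SNP is the special case n_snps = 1.
    """
    s = -sum(p * math.log2(p) for p in freqs if p > 0)
    s_max = float(n_snps)
    return (s_max - s) / s_max

# Fully conserved SNP: complete order, NIC = 1
nic_fixed = normalized_information_content([1.0], n_snps=1)

# Maximally variant SNP: complete disorder, NIC = 0
nic_uniform = normalized_information_content([0.5, 0.5], n_snps=1)

# Hypothetical 3-SNP haploblock with three observed haplotypes
nic_block = normalized_information_content([0.5, 0.25, 0.25], n_snps=3)
```

Because NIC is a ratio, it does not grow with the population size, in contrast to the additive entropy and information content.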

Information Dynamics of the Human Genome
We next develop dimensional scales and units that can quantify the relative plia[...]. The allelic potentials $\mu_a^{(S)}$ and haplotypic potentials $\mu_h^{(H)}$ will parameterize the genomic free energy change in a population from the addition of one individual of allele a or haplotype h. For a given haploblock (H), the differential genomic free energy takes the form
$$dF^{(H)} = \sum_h \mu_h^{(H)} \, dN_h^{(H)},$$
where $N_h^{(H)}$ represents the number of individuals in the population with haplotype h. This form neglects any influence of the population upon the environment. The total genomic free energy is a sum over all SNP haploblocks and non-linked SNPs given by
$$F_{genome} = \sum_H F^{(H)} + \sum_S F^{(S)}.$$
As is the case in thermodynamics, the additive allelic potentials ($\mu_a^{(S)}$, $\mu_h^{(H)}$) are expected to scale relative to the environmental potential $T_E$, and allelic or haplotypic potential differences should be directly reflected in the ratio of the frequencies of occurrence of those dynamic units within the population. We assert that such properties are encompassed in the functional form
$$\frac{\mu_{h_2}^{(H)} - \mu_{h_1}^{(H)}}{T_E} = -\log_2 \frac{p_{h_2}^{(H)}}{p_{h_1}^{(H)}}.$$
Defining a single human Genomic Energy Unit ($\tilde{\mu} \equiv 1\,\mathrm{GEU}$) to be the allelic energy necessary to induce maximal variation within a single non-linked bi-allelic SNP location ($p_a^{(S)} = 1/2$), the potential of the haplotype h or allele a in an environmental bath characterized by the environmental potential $T_E$ that bathes the whole genome can be expressed as
$$\mu_h^{(H)} = (\tilde{\mu} - T_E)\, n^{(H)} - T_E \log_2 p_h^{(H)}, \qquad \mu_a^{(S)} = (\tilde{\mu} - T_E) - T_E \log_2 p_a^{(S)}. \tag{7}$$
If only one allele is present at a SNP location for a given population, the allelic potential of that allele is defined to be at the fixing potential $\mu_{fixed}$ for that environment [...].
We will assume that the population is homeostatic (or at least quasi-homeostatic, which means that any changes occurring in the population distribution require many generations to become significant). Population homeostasis is equivalent to the Hardy-Weinberg condition used in population biology that the statistical distribution is independent of any sub-divisions of the population data, including those associated with differing generations or ages.
Our population stability condition will require that the genomic free energy be a (stable) minimum under changes in the population within the local environment when the population is in homeostasis with its environment, i.e., $dF_{genome} = 0$. By substituting the forms of the allelic potentials ($\mu_a^{(S)}$, $\mu_h^{(H)}$) expressed in terms of the probabilities in Equation (7) into the population stability condition and summing over all haploblocks and SNPs, an explicit expression of the environmental potential can be obtained:
$$T_E = \frac{\tilde{\mu}}{\langle NIC \rangle_{genome}}, \qquad \langle NIC \rangle_{genome} \equiv \frac{\sum_H n^{(H)} NIC^{(H)} + \sum_S NIC^{(S)}}{\sum_H n^{(H)} + \sum_S 1}.$$
This inversely relates the environmental potential to the intrinsic normalized information content characterizing the variation of the whole genome of the population. We define $\mu^{(H)} \equiv \sum_h p_h^{(H)} \mu_h^{(H)}$ as the block potential for haploblock (H) and $\mu^{(S)} \equiv \sum_a p_a^{(S)} \mu_a^{(S)}$ as the SNP potential for location (S). The population stability condition then requires that the sum of all block and SNP potentials for a given population vanishes:
$$\sum_H \mu^{(H)} + \sum_S \mu^{(S)} = 0.$$
This condition demonstrates that balance is established between diversity and conservation in a population to optimize its survivability within the given environment. One should note that the environmental potential $T_E$, the block potentials [...] where the set of SNP haplotypes h and alleles a are unique to the individual. An individual's overall allelic potential is not a universal parameter, but rather depends strongly upon the environment.
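The stability condition can be checked numerically with a small sketch restricted to non-linked bi-allelic SNPs. It assumes the logarithmic form $\mu_a = (\tilde{\mu} - T_E) - T_E \log_2 p_a$ for the allelic potential, takes the SNP potential to be the frequency-weighted average of its allelic potentials, and sets $T_E$ to one GEU divided by the mean NIC. All function names and the example frequencies are hypothetical.

```python
import math

GEU = 1.0  # mu-tilde: one Genomic Energy Unit

def snp_entropy(p_major):
    """Base-2 entropy of a bi-allelic SNP with major-allele frequency p_major."""
    return -sum(p * math.log2(p) for p in (p_major, 1.0 - p_major) if p > 0)

def environmental_potential(major_freqs):
    """T_E = GEU / <NIC>, averaged over non-linked bi-allelic SNPs (s_max = 1).

    Undefined if every SNP is maximally variant (all NIC values equal 0).
    """
    nics = [1.0 - snp_entropy(p) for p in major_freqs]
    return GEU * len(nics) / sum(nics)

def allelic_potential(p_allele, t_env):
    """Assumed form: mu_a = (GEU - T_E) - T_E * log2(p_a)."""
    return (GEU - t_env) - t_env * math.log2(p_allele)

def snp_potential(p_major, t_env):
    """Frequency-weighted average of the two allelic potentials."""
    return sum(p * allelic_potential(p, t_env)
               for p in (p_major, 1.0 - p_major) if p > 0)

# Hypothetical major-allele frequencies for three non-linked SNPs
freqs = [0.9, 0.7, 0.55]
t_env = environmental_potential(freqs)

# Stability condition: the SNP potentials should sum to zero
total = sum(snp_potential(p, t_env) for p in freqs)
```

Under these assumptions an allele at frequency 1/2 sits at exactly 1 GEU, a fixed allele sits at GEU minus the environmental potential, and the highly conserved SNPs carry negative potentials that balance the positive potentials of the highly variant ones.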
To illustrate population dependent spectra of genomic block potentials, the genomic free energies of blocks in the major histocompatibility complex (MHC) region on chromosome 6 are displayed for a few founder populations using phase I, II, and III data from HapMap in Figure 1.
The MHC region encodes genes for the human immune response. This region of the genome is particularly relevant in host response to environmental stressors and is known to display straightforward biological correlations with environmental parameters. The emergent differences in the haploblock structure of the populations are immediately apparent. The block binding potential (which parameterizes the stability of an emergent haploblock) will be defined as the difference in the block potential from the sum of the individual SNP potentials that make up that block if they were not in linkage disequilibrium (LD). The corresponding spectra of binding potentials (per SNP) are demonstrated in Figure 2.
Those SNPs in haploblocks with more negative binding potential per SNP have enhanced biologic favorability for maintaining their correlated statistics throughout generations of the populations in the given environments. SNPs in haploblocks with nearly zero binding potential per SNP are nearly independent, indicative of the environmental transition point of the emergent genomic phase. The binding potential per SNP indicates the degree to which the SNP variation must be correlated in order to maintain a biologically viable population.
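The block binding potential per SNP described above can be sketched as follows. The sketch assumes that a block potential is the frequency-weighted average of haplotype potentials of the form $(\tilde{\mu} - T_E)\, n - T_E \log_2 p_h$, and that the unlinked reference uses the marginal allele frequencies at each SNP position. The function names, the environmental potential, and the haplotype frequencies are hypothetical.

```python
import math
from collections import defaultdict

def entropy(freqs):
    """Base-2 Shannon entropy; zero frequencies contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

def binding_potential_per_snp(hap_freqs, t_env, geu=1.0):
    """Block potential minus the summed potentials of its SNPs treated as
    unlinked, divided by the number of SNPs in the block.

    hap_freqs maps haplotype strings (one character per SNP) to frequencies.
    """
    n = len(next(iter(hap_freqs)))
    mu_fixed = geu - t_env
    # Assumed block potential: n * mu_fixed + T_E * s_block
    block = n * mu_fixed + t_env * entropy(hap_freqs.values())
    unlinked = 0.0
    for i in range(n):
        marginal = defaultdict(float)
        for hap, p in hap_freqs.items():
            marginal[hap[i]] += p  # allele frequency at position i
        unlinked += mu_fixed + t_env * entropy(marginal.values())
    return (block - unlinked) / n

# Two perfectly correlated SNPs: strongly bound block
bound = binding_potential_per_snp({"AG": 0.5, "CT": 0.5}, t_env=2.0)

# Two statistically independent SNPs: binding potential of zero
free = binding_potential_per_snp(
    {"AG": 0.25, "AT": 0.25, "CG": 0.25, "CT": 0.25}, t_env=2.0)
```

With these assumptions the binding potential reduces to the environmental potential times the difference between the block entropy and the summed single-SNP entropies, which is never positive and vanishes only when the SNPs are independent.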

Distributive Genodynamics
The formulation of the information dynamics of the human genome in terms of genomic free energies directly results in well-defined forms for the SNP potentials for SNPs that are not in LD and for block potentials for correlated SNPs that are in LD. Since the SNP haploblock structure has an emergent form that differs between populations, meaningfully defined distributed potentials will reflect the biology underlying the participation of individual SNPs in the informatics architecture of its correlation with other SNPs in the haploblock. We will next develop distributed SNP potentials $\mu_S^{(H)}$ within a haploblock (H) such that they satisfy the following conditions: (i) if the SNP is occupied by an allele that is fixed in the given population, then its distributed SNP potential is the fixing potential $\mu_{fixed}$; (ii) the sum of the distributed SNP potentials should be the same as the block potential; and (iii) [...]. The first condition ensures that if the SNP is not variant within the population, its genomic energy is not modified from that of a SNP that is not in LD, and the second condition requires that the distributed potentials should reconstruct the block potential in an additive way. The third condition represents a simple mechanism for relating the distributed potentials to the degree of variation in the SNP. The mathematical form that satisfies these conditions is given by
$$\mu_S^{(H)} = \mu_{fixed} + \left( \mu^{(H)} - n^{(H)} \mu_{fixed} \right) \frac{\bar{p}_S}{\sum_{S' \in H} \bar{p}_{S'}},$$
where $\bar{p}_S = 1 - p_S$ is the minor allele frequency of the SNP labeled (S). Using this form, the distribution of the haploblock potential to any constituent SNP is proportionate to the occurrence of the minor allele in the population in a manner that increases the SNP's genomic free energy as the SNP has higher variation (i.e., becomes less conserved).
The degree of stability of the participation of the SNP in the biology of the emergent haploblock can be quantified in terms of its binding potential defined by $\mu_{S,binding}^{(H)} \equiv \mu_S^{(H)} - \mu^{(S)}$, where $\mu^{(S)}$ is the potential the SNP would have if it were not in LD. The most straightforward form that uniformly assigns the distributed SNP potential within a haploblock, and maintains the expected correlation that increased genomic potential reflects increased variation, results by simply adjusting the non-linked allelic potentials using the SNP binding potential, i.e., $\mu_a^{(H,S)} = \mu_a^{(S)} + \mu_{S,binding}^{(H)}$. It should be noted that all distributed potentials are only defined at the population level and cannot be ascribed to individuals. Only the emergent haplotype can be ascribed to individuals within the population. However, since distributed potentials are defined for the population as a whole, they can be quite useful for parameterizing the environmental influences upon that population. Distributed potentials are particularly useful for describing the adaptation of the population to stimuli and stressors with known biological correspondence to particular alleles or SNPs. The description of genomic variants using distributed potentials inherently includes any presently unknown whole genome response to specific stressors.
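A short sketch can make the distribution of a block potential among its SNPs concrete. It assumes one linear form that meets the stated requirements: each SNP starts at the fixing potential, and the remainder of the block potential is shared in proportion to minor allele frequency. The function name and all numerical values are hypothetical.

```python
def distributed_snp_potentials(block_potential, mu_fixed, minor_freqs):
    """Distribute a block potential over its SNPs.

    Assumed form: mu_S = mu_fixed
                         + (block_potential - n * mu_fixed) * m_S / sum(m),
    where m_S is the minor allele frequency of SNP S.
    """
    n = len(minor_freqs)
    total_minor = sum(minor_freqs)
    if total_minor == 0:
        # Every SNP is fixed, so each one sits at the fixing potential
        return [mu_fixed] * n
    excess = block_potential - n * mu_fixed
    return [mu_fixed + excess * m / total_minor for m in minor_freqs]

# Hypothetical 3-SNP block: the first SNP is fixed, the third is most variant
potentials = distributed_snp_potentials(
    block_potential=0.5, mu_fixed=-1.0, minor_freqs=[0.0, 0.1, 0.4])
```

In this example the fixed SNP stays at the fixing potential, the distributed values add up to the block potential, and the most variant SNP receives the largest share.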

Adaptive Forces
Once genomic free energy measures have been developed for individual alleles and genomic regions, environmentally induced adaptive forces can be characterized using gradients of those additive measures down the slope of environmental parameters. For a given allele a on the genome that is biologically connected to a definable environmental parameter λ (such as UV light, lactose in diet, prevalence of malarial plasmodia, etc.), we define the environmentally induced adaptive force on that allele by
$$f_a^{(\lambda)} \equiv -\frac{\partial \mu_a}{\partial \lambda},$$
with analogously defined adaptive forces on potentials characterizing SNPs, haploblocks, haplotypes, genes, and even perhaps whole chromosomes. Such an expression is only meaningful if there is a functional relationship between the biology of the genomic unit and the particular environmental parameter λ. In such cases, positive adaptive forces drive the conservation of the given genomic unit down the slope of the genomic potential. Increased survivability might drive the genomic unit towards more diversity, or more conservation, depending on the nature of the environmental influence upon the homeostatic population.
Quantifying such forces inherently involves comparisons between differing environments.
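Since the force is defined through a slope across environments, a simple numerical estimate compares the potential of the same allele in populations living at different values of the environmental parameter. The sketch below assumes the force is the negative slope of the potential and uses finite differences between consecutive environments. The function name and the numbers are hypothetical illustrations, not HapMap results.

```python
def adaptive_forces(env_values, potentials):
    """Negative finite-difference slope of a genomic potential between
    consecutive environments, ordered by the environmental parameter.

    Assumes all environmental values are distinct.
    """
    pairs = sorted(zip(env_values, potentials))
    return [-(m2 - m1) / (e2 - e1)
            for (e1, m1), (e2, m2) in zip(pairs, pairs[1:])]

# Hypothetical allelic potentials (GEU) for populations at three altitudes (km)
forces = adaptive_forces([0.0, 0.5, 1.5], [0.9, 0.4, 0.4])
```

A positive entry means the potential falls as the environmental parameter increases, which corresponds to a force towards increased conservation of that allele.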
To explore environmental impacts on adaptation, we will confine our investigation to phase III data of HapMap, since this represents the broadest set of populations with somewhat uniform genotyping. We have chosen to exclude ASW, CEU, CHD, GIH and MXL from our parameterization of adaptive forces, since these populations do not reside in their geographical origin. In this paper, the genomic potentials of the set of SNPs in the MHC region on chromosome 6 were chosen to conduct a double-blind exploration for possible correlations with three particularly straightforward environmental parameters: annual exposure to UV-B radiation, altitude above sea level, and exposure to malarial vectors. In order to simplify the analysis of any results, the set of all SNPs in this region that are not in LD for most of the populations was pre-selected for the computational search. The algorithm examines whether the genomic potentials for the SNPs and alleles can be fitted to simple functional forms (curves) that depend on a single environmental parameter. If the root-mean-squared (RMS) deviation of the data points from the curves is within 10% of the maximum variation of the data, the SNP is flagged by the program, and adaptive forces are calculated from the curves.
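The flagging criterion can be sketched in a few lines of Python. The sketch assumes a straight line as the simple functional form, which is only one possible choice, and reads the maximum variation of the data as the range between the largest and smallest potential. The function names and data values are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def flag_snp(env_values, potentials, threshold=0.10):
    """Return (flagged, relative_rms, adaptive_force) for one SNP or allele.

    relative_rms is the RMS residual of the fit divided by the range of
    the data; adaptive_force is the negative fitted slope.
    """
    spread = max(potentials) - min(potentials)
    if spread == 0:
        # Constant potential: no environmental dependency to flag
        return False, 0.0, 0.0
    slope, intercept = fit_line(env_values, potentials)
    residuals = [y - (slope * x + intercept)
                 for x, y in zip(env_values, potentials)]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    relative_rms = rms / spread
    return relative_rms <= threshold, relative_rms, -slope

# Hypothetical potentials that fall steadily with the environmental parameter
flagged, rel_rms, force = flag_snp([0, 1, 2, 3], [1.0, 0.8, 0.6, 0.4])

# Hypothetical potentials with no clear dependency
noisy_flagged, noisy_rms, _ = flag_snp([0, 1, 2, 3], [0.0, 1.0, 0.0, 1.0])
```

Normalizing the RMS residual by the range of the data makes the 10% threshold independent of the overall scale of the potentials.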
The averaged ancestral annual UV-B radiation exposure used was expressed in units of Joules per square meter (UV radiance) as estimated from [7]. In these units, estimates of annual UV radiance for the CHB population averaged 2180 (ranging from 1500 to 2600), for the JPT population averaged 2400 (ranging from 2300 to 2500), for the LWK population averaged 5764 (ranging from 5450 to 6500), for the MKK population averaged 5624 (ranging from 5000 to 6125), for the TSI population averaged 1507 (ranging from 950 to 2500), and for the YRI population averaged 5129 (ranging from [...]

Results and Discussion
Our program flagged functional dependencies on altitude of phase III HapMap data for the SNP rs1109771 in the MHC region for the populations CHB, LWK, MKK, TSI and YRI. The curves are plotted in Figure 3.
The relative RMS deviation for the SNP potential was 0.03, for the G allelic potential was 0.008, and for the A allelic potential was 0.001. A significant adaptive force of about +1.5 GEUs/kilometer at lower altitudes on allele A towards increased conservation is apparent. At higher altitudes, significant variation is maintained, as indicated by the SNP potential remaining very near the maximum value of 1 GEU (maximal variation). This implies that the G allele maintains a significant presence in the population in order to optimize its survivability in the higher altitudes available in the HapMap data.

High Altitude and NOTCH4
Over the course of human history, adaptation to challenging environments has necessitated modulation of biological pathways at the genomic level to combat the toxic effects present in said environments. High altitude is an excellent example of how humans have adapted to an environmental stressor (e.g., low oxygen content). The body's response to chronic exposure to alveolar hypoxia is to hyperventilate, thereby increasing resting heart rate and stimulating the production of red blood cells to maintain the oxygen content of arterial blood at or above sea level values [10]. Moreover, an insufficient supply of oxygen prompts the formation of new vessels from the walls of existing ones, i.e. angiogenic sprouting [11]. Growth factors and chemokines are secreted from hypoxic tissues, stimulating endothelial cells to break away from vessel walls. These angiogenic factors then coordinate sprouting, branching, and new lumenized network formation until the oxygen content rises and normoxia can be re-established [12]. The Notch signaling pathway plays a key role in shaping the formation and remodeling of the vascular network under hypoxic conditions [11]. This pathway is an evolutionarily conserved intracellular signaling pathway that was originally identified in Drosophila. Notch has four transmembrane receptors, with Notch 1 and Notch 4 being expressed by endothelial cells [13] [14] [15]. It has been shown that targeted deletion of Notch 4 in mice results in the deregulation of arterial and venous specification of endothelial cells as well as the deformation of arteries and veins [16] [17]. In addition, overexpression of the intracellular domain of Notch 4 in endothelial cells results in a β1 integrin-mediated increase in adhesion to collagen resulting in cells that show a reduced sprouting response to vascular endothelial growth factor both in vitro and in vivo [18]. 
Thus, it appears that Notch signaling promotes cellular responses in endothelial cells that help to alleviate the harmful effects of hypoxia in the human body. Consequently, population differences in allelic frequencies in this pathway could effectively provide an adaptive advantage for survival in response to this environmental stressor.
As a demonstration of the potential guidance offered by this formulation towards future discovery in the biology of whole genome adaptation, our program flagged functional dependencies on Plasmodium parasite load from HapMap data for rs430620 in the MHC region for the populations CHB, LWK, MKK, TSI and YRI. The curves plotted in Figure 4 represent a strong flag for parasite dependency of a SNP in the intervening sequence of the genome with no known association to any gene. The relative RMS deviation for the SNP potential was 0.007, for the G allelic potential was 0.02, and for the A allelic potential was 0.008. A significant adaptive force of about +3 GEUs/unit PfPR for initial parasite loads on allele A towards increased conservation is apparent. The A allele has very low occurrence within populations with no parasite load, and the SNP approaches fixation towards allele G. Once again, for higher parasite loads, significant variation is maintained, as indicated by the SNP potential approaching [...]
Moreover, population diversity in genome-wide common variants, such as SNPs that are non-randomly embedded in the human genome, represents a "quintessential experiment of nature" in whole genome adaptation to environmental stimuli and stressors associated with population diversity in health outcomes. SNPs associated with common diseases not only reveal mechanisms underlying the complex biology of common diseases, but also the "genomic cost" to populations in whole genome adaptation to environmental stimuli and stressors. By parameterizing the information dynamics of SNPs in HapMap populations, we developed a mathematical model of environmentally induced adaptive forces as drivers of population health and diversity in health outcomes. Our model provides new lenses through which SNP data can be explored to solve problems in population-based patterns of genome variation in common complex diseases, which we submit is significant for clinical translation.