Statistical Methods of SNP Data Analysis and Applications

We develop various statistical methods important for multidimensional genetic data analysis, and establish theorems justifying their application. We concentrate on multifactor dimensionality reduction, logic regression, random forests and stochastic gradient boosting, along with their new modifications. We use complementary approaches to study the risk of complex diseases such as cardiovascular ones. The roles of certain combinations of single nucleotide polymorphisms and non-genetic risk factors are examined. To perform the data analysis concerning coronary heart disease and myocardial infarction, the Lomonosov Moscow State University supercomputer "Chebyshev" was employed.


Introduction
In the last decade new high-dimensional statistical methods have been developed for data analysis (see, e.g., [1]). Special attention has been paid to the study of genetic models (see, e.g., [2][3][4]). The detection of genetic susceptibility to complex diseases (such as diabetes and others) has recently drawn much attention in leading research centers. It is well known that such diseases can be provoked by variations in different parts of the DNA code which are responsible for the formation of certain types of proteins. One of the most common variations in an individual's DNA is a single nucleotide polymorphism (SNP), i.e. a nucleotide change in a certain fragment of the genetic code (for some percentage of the population). Quite a number of recent studies (see, e.g., [5,6] and references therein) support the paradigm that certain combinations of SNPs can increase the complex disease risk whereas separate changes may have no dangerous effect.
There are two closely connected research directions in genomic statistics. The first is aimed at disease risk estimation assuming the genetic portrait of a person is known (in turn this problem involves estimation of the disease probability and classification of genetic data into high- and low-risk domains). The second is to identify relevant combinations of SNPs having the most significant influence, either pathogenic or protective.
In this paper we propose several new versions of statistical methods to analyze multidimensional genetic data, following the above-mentioned research directions. The methods developed generalize multifactor dimensionality reduction (MDR) and logic regression (LR). We also employ some popular machine learning methods (see, e.g., [2]) such as random forests (RF) and stochastic gradient boosting (SGB).
Ritchie et al. [7] introduced MDR as a new method of analyzing gene-gene and gene-environment interactions. Rather soon the method became very popular: according to [8], more than 200 papers applying MDR in genetic studies have been written since the first publication.
LR was proposed by Ruczinski et al. in [9]; further generalizations are given in [6,10] and other works. LR is based on the classical binary logistic regression and an exhaustive search for relevant predictor combinations. For genetic analysis it is convenient to use explanatory variables taking three values. Thus we employ ternary variables and ternary logic regression (TLR), whereas the authors of the above-mentioned papers employ binary ones.
RF and SGB were initiated by Breiman [11] and Friedman [12] respectively. They belong to ensemble methods, which combine multiple predictions from a certain base algorithm to obtain better predictive power. RF and SGB have been successfully applied to genetic data in a number of papers (see [2,13] and references therein).
We compare the various approaches on real datasets concerning coronary heart disease (CHD) and myocardial infarction (MI). Each approach (MDR, TLR and machine learning) is characterized by its own way of constructing disease prediction algorithms. For each method one or several prediction algorithms admitting the least estimated prediction error are found (a typical situation is that there are several with almost the same estimated prediction error). These prediction algorithms provide a way to determine the domains where the disease risk is high or low. It is also possible to select combinations of SNPs and non-genetic risk factors that essentially influence the liability to disease. Some methods present such combinations immediately; others, which employ more complicated forms of dependence between explanatory and response variables, need further analysis based on modifications of permutation tests. New software implementing the mentioned statistical methods has been designed and used.
This work was started in 2010 in the framework of the general MSU project headed by Professors V. A. Sadovnichy and V. A. Tkachuk (see [14]). The MSU supercomputer "Chebyshev" was employed to perform the data analysis.
The rest of the paper is organized as follows. In Section 2 we discuss various statistical methods and prove theorems justifying their applications. Section 3 is devoted to the analysis of the CHD and MI datasets. Section 4 contains conclusions and final remarks.

Methods
We start with some notation. Let N be the number of patients in the sample and let the vector X^j = (X_1^j, ..., X_n^j) consist of the genetic (SNP) and non-genetic risk factors of individual j, j = 1, ..., N. Here n is the total number of factors and X_i^j is the value of the i-th variable (genetic or non-genetic factor) of individual j. These variables are called explanatory variables or predictors. If X_i^j stands for a genetic factor (i.e. characterizes the i-th SNP of individual j) we set X_i^j = 0 if the SNP is homozygous for the dominant allele, X_i^j = 1 if the SNP is heterozygous, and X_i^j = 2 if the SNP is homozygous for the recessive allele. For biological background we refer, e.g., to [15]. We assume that non-genetic risk factors also take no more than three values, denoted by 0, 1 and 2. For example, we can specify the presence or absence of obesity (or hypercholesterolemia etc.) by the values 1 and 0 respectively. If a non-genetic factor takes more values (e.g., blood pressure), we can divide individuals into three groups according to its values.
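To make this 0/1/2 coding concrete, here is a minimal sketch; the genotype strings and the allele passed as "dominant" are illustrative assumptions, not data from the paper:

```python
def encode_snp(genotype, dominant_allele):
    """Map a two-letter genotype to the ternary code used in the text:
    0 - homozygous for the dominant allele, 1 - heterozygous,
    2 - homozygous for the recessive allele."""
    count = genotype.upper().count(dominant_allele.upper())
    if count == 2:
        return 0          # homozygous for the dominant allele
    elif count == 1:
        return 1          # heterozygous
    else:
        return 2          # homozygous for the recessive allele

# Illustrative genotypes for a SNP with alleles A (dominant) and G
codes = [encode_snp(g, "A") for g in ("AA", "AG", "GG")]
```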
Further on, the response variable Y^j equals 1 for a case, i.e. whenever individual j is diseased, and -1 otherwise (for a control). Introduce a random vector (X, Y) independent of the sample and having the same law as (X^1, Y^1). All random vectors (and random variables) are considered on a probability space (Ω, F, P); E denotes integration w.r.t. P. The main problem is to find a function of the genetic and non-genetic risk factors describing the phenotype (that is, whether an individual is healthy or sick) in the best way.

Prediction Algorithms
Let X denote the space of all possible values of the explanatory variables. A prediction algorithm is a function f : X → {-1, 1}; its prediction error Err(f) depends also on the law of (X, Y), but we simplify the notation. Following [8,16], we use the balanced prediction error

Err(f) = (1/2) [ P(f(X) = -1 | Y = 1) + P(f(X) = 1 | Y = -1) ],

the trivial cases P(Y = 1) ∈ {0, 1} being excluded. A sample with equal numbers of cases and controls is called balanced, and for such a sample the balanced error agrees with the usual misclassification probability. The reason to consider this weighted scheme is that a misclassification in the rarer class should be taken into account with a greater weight. Otherwise, if the probability of disease is small, the trivial function f(x) ≡ -1 may have the least prediction error.
It is easy to prove that the optimal theoretical prediction function minimizing the balanced prediction error is given by f*(x) = 1 if p(x) ≥ P(Y = 1) and f*(x) = -1 otherwise, where p(x) = P(Y = 1 | X = x). Then each multilocus genotype (with added non-genetic risk factors) is classified as high-risk if p(x) ≥ P(Y = 1) or low-risk if p(x) < P(Y = 1). Since p(x) and P(Y = 1) are unknown, the immediate application of (3) is not possible. Thus we try to find an approximation of the unknown function f* with values in {-1, 1} which depends on x and the sample.
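A minimal sketch of the balanced error and of this threshold rule (the four-observation sample is synthetic):

```python
def balanced_error(preds, labels):
    """Balanced prediction error: the average of the misclassification
    rates computed separately among cases (y = 1) and controls (y = -1)."""
    case_miss = [p != y for p, y in zip(preds, labels) if y == 1]
    ctrl_miss = [p != y for p, y in zip(preds, labels) if y == -1]
    return 0.5 * (sum(case_miss) / len(case_miss)
                  + sum(ctrl_miss) / len(ctrl_miss))

# Toy sample: three cases, one control; one case is misclassified,
# so the balanced error is 0.5 * (1/3 + 0) = 1/6, whereas the plain
# misclassification rate would be 1/4.
labels = [1, 1, 1, -1]
preds = [1, -1, 1, -1]
err = balanced_error(preds, labels)
```

This weighting is exactly why a rare-class error counts more: each class contributes at most 1/2 to the total, regardless of its size.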

 
The simplest way is to employ formula (3) with p(x) and P(Y = 1) replaced by their statistical estimates. Namely, one can take

p^(x) = #{j ≤ N : X^j = x, Y^j = 1} / #{j ≤ N : X^j = x},

where I_A stands for the indicator of an event A and #D denotes the cardinality of a finite set D.
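This counting estimate can be sketched as follows; the two-predictor sample is synthetic and the function name is ours:

```python
def mdr_cell_estimate(xs, ys, cell):
    """Empirical estimate of p(cell) = P(Y = 1 | X = cell): the fraction
    of cases among all sample members whose predictor vector equals `cell`.
    Returns None for an empty cell, where the estimate is undefined."""
    in_cell = [y for x, y in zip(xs, ys) if x == cell]
    if not in_cell:
        return None
    return sum(1 for y in in_cell if y == 1) / len(in_cell)

# Synthetic sample: two ternary predictors per individual
xs = [(0, 1), (0, 1), (0, 1), (2, 2)]
ys = [1, 1, -1, -1]
p_hat = mdr_cell_estimate(xs, ys, (0, 1))          # 2 cases out of 3
prevalence = sum(1 for y in ys if y == 1) / len(ys)
label = 1 if p_hat >= prevalence else -1           # high-risk cell
```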
Along with this basic estimate we consider a more general version involving penalty functions, the basic estimate being a special case. If one deals with too many parameters, overfitting is likely to happen, i.e. the estimated parameters depend too strongly on the given sample. As a result, the constructed estimates give poor predictions on new data. On the other hand, too simple a model may fail to capture the dependence structure of the various factors efficiently. The trade-off between a model's complexity and its predictive power allows one to perform reliable statistical inference via modern model validation techniques (see, e.g., [17]). The main tool of model selection is cross-validation, see, e.g., [16]. Its idea is to estimate the parameters using only part of the sample (the training sample) and afterwards use the remaining observations (the test sample) to assess the predictive power of the obtained estimates. Then an average over several realizations of randomly chosen training and test samples is taken, see [18].
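A cross-validation sketch under these conventions; the trivial majority-class learner below is only a stand-in for a real base method such as MDR:

```python
import random

def kfold_epe(xs, ys, fit, k=6, seed=0):
    """Estimated prediction error (EPE) via K-fold cross-validation:
    train the base method `fit` on K-1 folds, count the errors on the
    held-out fold, and average over the K folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errs = []
    for fold in folds:
        held = set(fold)
        train = [j for j in idx if j not in held]
        predict = fit([xs[j] for j in train], [ys[j] for j in train])
        errs.append(sum(predict(xs[j]) != ys[j] for j in fold) / len(fold))
    return sum(errs) / k

# Illustrative base learner: always predict the majority class
def majority_fit(xs, ys):
    label = 1 if sum(ys) >= 0 else -1
    return lambda x: label
```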
The next result provides wide sufficient conditions for the consistency of this cross-validation estimate of the prediction error.

Remark. The statement of Theorem 1 remains valid under a more restrictive assumption 3').

Proof of Theorem 1 (sketch). The estimated prediction error is decomposed into several terms. By the strong law of large numbers for arrays (SLLNA), see [19], the remainder terms tend to 0 almost surely as the number of observations grows; combining these relations we obtain the desired consistency.

An important problem is to make sure that a prediction algorithm gives statistically reliable results. The quality of an algorithm is determined by its prediction error, which is unknown; therefore the inference is based on consistent estimates of this error. Clearly, high quality of an algorithm means that it captures the dependence between the predictors and the response variable, so errors are made more rarely than they would be if these variables were independent. Consider the null hypothesis H_0 that X and Y are independent. If they are in fact dependent, then any reasonable prediction algorithm should have an error significantly smaller than under H_0; this shows that the results of the algorithm could not be obtained by chance. For such a procedure we use a permutation test, which allows one to estimate (see [20]) the true p-value; for details we refer to [21]. Now we pass to the description of the various statistical methods and their applications (in Section 3) to cardiovascular risk detection.
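Such a test can be sketched generically as follows; the `epe` argument stands for any estimated-prediction-error routine, and the add-one p-value correction is a common convention rather than the paper's formula:

```python
import random

def permutation_pvalue(xs, ys, epe, observed_err, n_perm=200, seed=1):
    """Permutation test of H0: X and Y independent.  Recompute the
    estimated prediction error on samples in which the responses are
    randomly permuted, and report the fraction of permuted errors
    that are <= the observed error (with an add-one correction)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        ys_perm = ys[:]
        rng.shuffle(ys_perm)
        if epe(xs, ys_perm) <= observed_err:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A small p-value means the observed error is rarely matched under independence, i.e. the algorithm's performance could not be obtained by chance.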

Multifactor Dimensionality Reduction
MDR is a flexible non-parametric method for analyzing gene-gene and gene-environment interactions; it does not depend on a particular inheritance model. We give a rigorous description of the method following the ideas of [7] and [8].
To calculate the balanced error of a theoretical prediction function we use formula (2). Note that the approach based on penalty functions is not the only one possible; nevertheless, it substantially outperforms other approaches involving over- and undersampling (see [8]).
As mentioned earlier, the probability p(x) in (4) is unknown. To find its estimate one can apply the maximum likelihood approach, assuming that the number of cases among the individuals with a given value of x has a binomial distribution with unknown parameter p(x). Then we come to (6).
A direct calculation of estimate (6) with an exhaustive search over all possible values of x is highly inefficient, since the number of different values of x grows exponentially with the number of risk factors. Moreover, such a search leads to overfitting. Instead, it is usually supposed that p(x) depends non-trivially not on all, but only on certain variables x_i. That is, there exist l < n, l ∈ N, and a vector of indices specifying these variables. In other words, only a few factors influence the disease, where S is introduced in (5).

Theorem 2 shows that, for any fixed combination size, the combination with the smallest estimated prediction error approximates the most significant combination of factors whenever an analogue of the assumptions of Theorem 1 is met. In view of this result it is natural to pick one or a few combinations of factors with the smallest EPEs as an approximation for the most significant combination.
The last step in MDR is to determine the statistical significance of the results. Here we test the null hypothesis of independence between the predictors X and the response variable Y. This can be done via the permutation test mentioned in Section 2.1.

MDR Method with "Independent Rule"
We propose the multifactor dimensionality reduction with "independent rule" (MDRIR) method to improve the estimate of the probability p(x). This approach is motivated by [22], which deals with classification of large arrays of binary data. The principal difficulty with the employment of formula (6) is that the numbers of observations in the numerator and denominator of the formula might be small even for large N (see, e.g., [23]). This can lead to inaccurate estimates and, finally, to a wrong prediction algorithm. Moreover, for some samples the denominator of (6) might equal zero.
The Bayes formula implies that p(x) can be expressed through the conditional probabilities P(X = x | Y = y), y = ±1 (the trivial cases being excluded), which together with (3) yields an expression for the prediction function. As in the standard MDR method described above, we assume that formula (17) holds. It was proved in [22] that for a broad class of models (e.g., Bahadur and logit models) the conditional probability P(X = x | Y = y), y = ±1, can be estimated by the product of the estimated marginal probabilities of the individual components (cf. (8)); this is the "independent rule". Combining (17) and (21)-(23) we find the desired estimate of f*(x). The number of observations in the numerator and denominator of (23) increases considerably compared with (18). This allows one to estimate the conditional probability more precisely whenever the estimate introduced in (22) is reasonable. For instance, sufficient conditions justifying the application of (22) are provided in [22, Cor. 5.1]. MDRIR might have some advantage over MDR when the size l of the most significant combination is large; however, for small l MDR can behave better than MDRIR.
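A sketch of the independent-rule estimate of P(X = x | Y = y); the sample is synthetic and the formula numbering of the text is not reproduced here:

```python
def independent_rule_estimate(xs, ys, cell, y):
    """Estimate P(X = cell | Y = y) as the product of the estimated
    marginal probabilities P(X_i = cell_i | Y = y) -- the "independent
    rule".  Every marginal uses all observations of class y, so the
    counts are far larger than in the cell-wise estimate."""
    in_class = [x for x, lab in zip(xs, ys) if lab == y]
    prob = 1.0
    for i, v in enumerate(cell):
        prob *= sum(1 for x in in_class if x[i] == v) / len(in_class)
    return prob

xs = [(0, 1), (0, 2), (1, 1), (2, 2)]
ys = [1, 1, -1, -1]
p = independent_rule_estimate(xs, ys, (0, 1), 1)   # = 1 * 1/2
```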

Thus, as opposed to the standard MDR method, MDRIR uses alternative estimates of the conditional probabilities; all other steps (prediction algorithm construction, EPE calculation) remain the same. As far as we know, this modification of MDR has not been applied before. It is based on a combination of the original MDR method (see [7]) and the ideas of [22].

Ternary Logic Regression
We suppose that the trivial situations do not occur and omit them from consideration. Instead of estimating p*(x) directly we pass to the logistic transform q*(x) = log( p*(x) / (1 - p*(x)) ), where S was introduced in (5). The next theorem guarantees the strong consistency of this estimation method whenever the model is correctly specified, i.e. q* ∈ M.

Proof (sketch). Using the SLLNA one shows that the empirical score functional converges almost surely to its theoretical counterpart for any x ∈ X. By the (conditional) information inequality the theoretical functional attains its maximum over all functions h only at q* introduced in (24); under the conditions of the theorem q* ∈ M. Combining these facts we obtain the first assertion. The second part of the theorem follows from the Remark after Theorem 1. The proof is complete.

A wide and easy to handle class of models is obtained by taking functions linear in the variables x_1, ..., x_n or/and in their products. In turn, these functions admit a convenient representation by elementary polynomials (EP). Recall that an EP is a function T in ternary variables x_1, ..., x_n; the addition and multiplication of ternary variables is considered modulo 3. Any EP can be represented as a binary tree in which knots (vertices which are not leaves) contain either the addition or the multiplication sign, and each leaf corresponds to a variable. Different trees may correspond to the same EP, so this relation is not one-to-one; however, this does not affect our problem, and we keep the notation T for a tree. A finite set of trees is called a forest.

Define also the normalized prediction error of a forest. A subgraph B of a tree T is called a branch if it is itself a binary tree (i.e. it can be obtained by selecting one vertex of T together with its offspring). The addition and multiplication signs standing in the knots of a tree are called operations; thus * stands for sum or product. Following [9], call the tree T' a neighbor of T if it is obtained from T via one and only one of the following transformations. 1) Changing one variable to another in a leaf of the tree T (variable change).
2) Replacing an operation in a knot of a tree T with another one, i.e. sum to product or vice versa (operator change).
3) Changing a branch of two leaves to one of these leaves (deleting a leaf).
4) Changing a leaf to a branch of two leaves, one of which contains the same variable as in initial leaf (splitting a leaf).
5) Replacing a branch of the tree T (branch change). The neighborhood relation extends to forests and defines a finite connected graph on all forests of equal size s with complexity not exceeding r. To each vertex F of this graph we assign a number φ(F). To find the global minimum of a function defined on a finite graph we apply the simulated annealing method (see, e.g., [24]). This method constructs a specified Markov process which takes values in the graph vertices and converges with high probability to the global minimum of the function. To avoid stalling at a local minimum, the process is allowed to pass with some small probability to a point F having a greater value of φ(F) than the current one. We propose a new modification of this method in which the output is the forest corresponding to the minimal value of φ(F) over all (randomly) visited points. Since the simulated annealing performed involves random walks on a complicated graph having forests of trees as vertices, the algorithm was implemented on the MSU supercomputer.
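The scheme just described can be sketched on a toy graph; minimizing |x - 17| on a path graph stands in for minimizing φ(F) over forests, and the logarithmic cooling schedule is an illustrative choice:

```python
import math
import random

def simulated_annealing(start, neighbors, score, steps=2000, t0=1.0, seed=7):
    """Minimize `score` on a finite graph.  Following the modification
    described above, the output is the best state among ALL visited
    points, not merely the final state of the Markov process."""
    rng = random.Random(seed)
    state = start
    best, best_val = start, score(start)
    for k in range(1, steps + 1):
        temp = t0 / math.log(k + 1)          # cooling schedule (illustrative)
        cand = rng.choice(neighbors(state))
        delta = score(cand) - score(state)
        # Downhill moves are always accepted; uphill moves only with
        # a small probability that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = cand
            if score(state) < best_val:
                best, best_val = state, score(state)
    return best, best_val

# Toy problem: minimize |x - 17| over the path graph on {0, ..., 30}
nbrs = lambda x: [v for v in (x - 1, x + 1) if 0 <= v <= 30]
best, val = simulated_annealing(0, nbrs, lambda x: abs(x - 17))
```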

Machine Learning Methods
Let us describe two machine learning methods: random forests and stochastic gradient boosting. They will be used in Section 3. We employ classification and regression trees (CART) as the base learning algorithm in RF and SGB because it showed good performance in a number of studies (see [18]). A classification tree T is a binary tree with the following structure. Any leaf of T contains either 1 or -1, and to any vertex P of T (including the leaves) there corresponds a subset A_P of the explanatory variable space X such that: 1) A_P = X if P is the root of T; 2) if vertices P' and P'' are children of P, then the subsets A_P' and A_P'' form a partition of A_P. In particular, the sets corresponding to the leaves form a partition of X. The classifier defined by a classification tree is introduced as follows. To obtain a prediction of Y given a certain value x ∈ X of the random vector X, one goes along the path which starts from the root and ends in some leaf, turning at each parent vertex P to the child P' for which x ∈ A_P'. At the end of this x-specific path one gets either 1 or -1, which serves as the prediction of Y. A classification tree can be constructed via the CART algorithm, see [18].
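A minimal sketch of such a tree and of the path-following classifier; the splits below are invented for illustration and do not come from the CHD data:

```python
class Node:
    """A vertex of a classification tree.  An internal node tests
    whether x[index] lies in `values`; a leaf stores a label +/-1."""
    def __init__(self, index=None, values=None, left=None, right=None, label=None):
        self.index, self.values = index, values
        self.left, self.right = left, right
        self.label = label

def predict(tree, x):
    """Follow the x-specific path from the root to a leaf and
    return the label stored there."""
    node = tree
    while node.label is None:
        node = node.left if x[node.index] in node.values else node.right
    return node.label

# Invented example: high risk (label 1) iff x[0] == 2, or
# x[0] == 1 and x[1] != 0
tree = Node(index=0, values={2},
            left=Node(label=1),
            right=Node(index=0, values={1},
                       left=Node(index=1, values={0},
                                 left=Node(label=-1),
                                 right=Node(label=1)),
                       right=Node(label=-1)))
```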
RF is a non-parametric method of estimating the conditional probability p(x). Its idea is to improve the prediction power of a single CART tree by averaging the estimates of many trees grown on bootstrap samples, see [18, ch. 15]. The advantages of this method are low computational costs and the ability to extract relevant predictors when the number of irrelevant ones is large, see [25].
SGB is another non-parametric method of estimating the conditional probability p(x). The SGB algorithm proceeds iteratively in such a way that, on each step, it builds a new estimate of p(x) and a new classifier decreasing the number of misclassified cases from the previous step, see, e.g., [12].
Standard RF and SGB work poorly for unbalanced samples. One needs either to balance the given datasets (as in [26]) before these methods are applied or to use special modifications of RF and SGB. To avoid overfitting, a permutation test is performed. A common problem of all machine learning methods is the complicated functional form of the final probability estimate p^(x). In genetic studies one wants to pick up all relevant combinations of SNPs and risk factors, based on a biological pathway causing the disease. Therefore the final estimate p^(x) is to be analyzed. We describe one possible method for such analysis within the RF framework, called the conditional variable importance measure (CVIM). One can determine the CVIM for each predictor X_i in X and rank all X_i in terms of this measure. Following [27], the CVIM of a predictor X_i given a certain subvector Z_i of X is calculated as follows: for each value taken by Z_i, one randomly permutes the corresponding elements of the i-th predictor, constructs a CART tree on each of these samples, and calculates the resulting increase of the prediction error. As Z_i (i = 1, ..., n) one can take all the components X_k (k ≠ i) such that the hypothesis of independence between X_k and X_i is rejected at some significance level (e.g., 5%). A CVIM-like algorithm could be used to rank combinations of predictors w.r.t. the level of association with the disease. This will be published elsewhere.
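The conditional permutation scheme can be sketched as follows; the grouping of rows by the values of Z_i is passed in explicitly, and the error functional `err` is a placeholder for the CART-based error:

```python
import random

def permutation_importance(xs, ys, err, i, groups, seed=3):
    """CVIM-style sketch: permute column i of the data only WITHIN the
    given groups (row-index lists defined by the values of the
    conditioning subvector Z_i) and report the resulting increase of
    the prediction error `err`."""
    rng = random.Random(seed)
    col = [x[i] for x in xs]
    for g in groups:
        vals = [col[j] for j in g]
        rng.shuffle(vals)                  # permute within the group only
        for j, v in zip(g, vals):
            col[j] = v
    xs_perm = [x[:i] + (col[j],) + x[i + 1:] for j, x in enumerate(xs)]
    return err(xs_perm, ys) - err(xs, ys)
```

Permuting within groups breaks the link between X_i and Y while preserving the dependence between X_i and Z_i, which is the point of the conditional (rather than marginal) importance measure.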

Applications: Risks of CHD and MI
We employ here the various statistical methods described above to analyze the influence of genetic and non-genetic (conventional) factors on the risks of coronary heart disease and myocardial infarction, using data for 454 individuals (333 cases, 121 controls) and 333 individuals (165 cases, 168 controls) respectively. These data contain the values of seven SNPs and four conventional risk factors. Namely, we consider the glycoprotein Ia (GPIa), connexin-37 (Cx37), plasminogen activator inhibitor type 1 (PAI-1), glycoprotein IIIa (GPIIIa), blood coagulation factor VII (FVII), coagulation factor XIII (FXIII) and interleukin-6 (IL-6) genes, as well as obesity (Ob), arterial hypertension (AH), smoking (Sm) and hypercholesterolemia (HC). The choice of these SNPs was based on biological considerations. For instance, to illustrate this choice we recall that connexin-37 (Cx37) is a protein that forms gap-junction channels between cells. The mechanism by which the SNP in the Cx37 gene influences atherosclerosis development is not fully understood, but some clinical data suggest its importance for CHD development [28]. In the Russian population a homozygous genotype can induce MI development, especially in individuals without CHD anamnesis [29].
The age of all individuals in the case and control groups ranges from 35 to 55 years to reduce its influence on the risk analysis. For each of the considered methods we use K-fold cross-validation with K = 6. As shown in [16], the standard choice of the partition number K from 6 to 10 does not change the EPE significantly; we take K = 6 as the sample sizes do not exceed 500. The MSU supercomputer "Chebyshev" was involved to perform the computations. As shown below, all applied methods permit to choose appropriate models having an EPE for the CHD dataset less than 0.25; thus the constructed predictions have significant predictive power. Note that, e.g., in [30] the interplay between genotype and MI development was also studied, with estimated prediction errors 0.30 - 0.40.

MDR and MDRIR Method
Coronary heart disease. Table 1 contains the EPEs of the most significant combinations of predictors obtained by MDR analysis of the coronary heart disease data. Note that we write the gene meaning the corresponding SNP. To estimate the empirical c.d.f. of the prediction error when the disease is not linked with the explanatory variables, we used the permutation test. Namely, 100 random uniform permutations of the variables Y_1, ..., Y_N were generated. Table 1 also contains the results of the MDRIR method, which are similar to those of MDR; however, MDRIR allows one to identify additional combinations (listed in the same table). It follows from the table that hypertension and hypercholesterolemia are the most important non-genetic risk factors: indeed, these two factors appear in each of the 6 combinations. To perform a more precise analysis of the influence of SNPs on CHD provoking we analyzed gene-gene interactions. We used two different strategies. The first was to apply the MDR method to the individuals who were not subject to any of the non-genetic risk factors, i.e. to non-smokers without obesity and without hypercholesterolemia: 51 cases and 97 controls (the risk-free sample). The second strategy was to apply the MDR method to the whole sample, but taking into account only genetic factors rather than all factors. Table 2 contains the most significant combinations of SNPs and their EPEs.
Thus, based on the coronary heart disease data, with the help of Tables 1 and 2 we can make the following conclusions. The combination of two SNPs (in GPIa and Cx37) and two non-genetic factors (hypertension and hypercholesterolemia) has the biggest influence on CHD. Also, FXIII gives additional predictive power if AH and HC are taken into account.
It turns out that both strategies yield similar results. The combination of SNPs in GPIa and Cx37 has the biggest influence on CHD. The EPE is about 0.28 - 0.34, the smaller error corresponding to the risk-free sample, as follows from Tables 1 and 2.

Ternary Logic Regression
We performed several research procedures for the CHD and MI data, with different restrictions on the statistical model. We considered several models in order to analyze both the total influence of genetic and non-genetic factors and the loss of predictive force appearing when some factors were excluded. In our Model 1 any function h ∈ M is a linear combination, with real coefficients β_v, of polynomials T_v identified with trees, where we require that non-genetic factors are present only in trees consisting of one variable.
Model 2. Now we assume that any function h ∈ M has a representation as a linear combination of trees, with coefficients β_v ∈ R and polynomials T_v identified with trees. Thus we allow the interaction of genes and non-genetic factors in order to find significant gene-environment interactions. However, we impose additional restrictions to avoid too complex combinations of non-genetic risk factors; we do not tackle here the effects of interactions in which several non-genetic factors are involved. Namely, we consider only the trees satisfying the following two conditions. 1) If there is a leaf containing a non-genetic factor variable, then the root of that leaf contains the product operator.
2) Moreover, another branch growing from the same root is also a leaf and contains a genetic (SNP) variable.
MDR analysis yielded the same conclusions. When gene-environment interactions were allowed (Model 2), no considerable increase in predictive power was detected. The first tree has the greatest weight (the coefficient equals -1.144), the second tree (i.e. the SNP in Cx37) is in second place, and the non-genetic factors are less important. As seen from this table, the elimination of the Cx37 SNP leads to a noticeable increase in the EPE. This fact agrees with the results obtained by the MDR analysis of the same dataset.

Since RF and SGB produce their own prediction algorithms, we have two natural ways to compare them with our methods: namely, the prediction error and the final significance of each predictor. The given datasets were unbalanced w.r.t. the response variable, and we first applied a resampling technique to them. That means enlarging the smaller of the two groups (cases or controls) in the sample by additional bootstrap observations until the final proportion case:control is 1:1. Note that due to the resampling the following effect arises: some observations in the small group appear in the new sample more frequently than others. Therefore, we took the average over 1000 iterations.
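The balancing step can be sketched as follows (the helper name and the fixed seed are ours):

```python
import random

def balance_by_resampling(xs, ys, seed=5):
    """Enlarge the smaller of the two groups (cases y = 1, controls
    y = -1) by bootstrap observations drawn from that group until the
    case:control proportion is 1:1, then shuffle the merged sample."""
    rng = random.Random(seed)
    cases = [(x, y) for x, y in zip(xs, ys) if y == 1]
    ctrls = [(x, y) for x, y in zip(xs, ys) if y == -1]
    small, large = sorted((cases, ctrls), key=len)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    merged = cases + ctrls + extra
    rng.shuffle(merged)
    return [x for x, _ in merged], [y for _, y in merged]
```

Because the bootstrap draws are random, repeated observations appear with different multiplicities from run to run, which is why the results are averaged over many iterations.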
Coronary heart disease. The results of the RF and SGB methods are given in Table 6. It shows that RF and SGB gave statistically reliable results (the EPE in the permutation test is close to 50%). Moreover, additional SNP information improved the predictive ability by 13% (SGB). It seems that the SGB method fits the CHD data better than RF.
To compute the CVIM for each X_i, we constructed a vector Z_i as follows: Z_i contains all predictors X_j, j ≠ i, for which the chi-square criterion rejected the independence hypothesis between X_j and X_i at the 5% significance level. Table 7 shows that the most relevant predictors for CHD are AH, HC and Cx37. For the MI dataset the CVIM was calculated according to (29) and is also given in Table 7; thus, the most relevant predictors for MI are Cx37, Sm and AH.

Conclusions and Final Remarks
In the current study we developed several statistical methods for the analysis of multidimensional genetic data. Namely, we proposed the MDR method with "independent rule" and ternary logic regression with a new version of simulated annealing. We compared them with several popular methods which appeared during the last decade. It is worth emphasizing that all the methods considered yielded similar qualitative results for the datasets under study.
Let us briefly summarize the main results obtained. The analysis of the CHD dataset showed that two non-genetic risk factors out of the four considered (AH and HC) had a strong connection with the disease risk (the error of classification based on non-genetic factors only is 0.25 - 0.26 with p-value less than 0.01). Also, the classification based on SNPs only gave an error of 0.28, which is close to the one obtained by means of non-genetic predictors. Moreover, the most influential SNPs were in the genes Cx37 and GPIa (FXIII entered the analysis only when AH and HC were present). The EPE decreased to 0.13 when both SNP information and non-genetic risk factors were taken into account and SGB was employed. Note that excluding any of the 5 remaining SNPs (all except the two most influential) from the data increased the error by approximately 0.01 - 0.02. So, while the most influential data were responsible for the situation within a large part of the population, there were smaller parts where other SNPs came into effect and provided a more efficient prognosis ("small subgroups effect"). The significance of the SNPs in the GPIa and FXIII genes was observed in our work. AH and HC influence the disease risk by affecting the vascular wall, while GPIa and FXIII may improve prognosis accuracy because they introduce the haemostatic aspect into the analysis.
The MI dataset gave the following results. The most significant factors of MI risk were the SNP in Cx37 (more precisely, homozygosity for the recessive allele) and smoking, with a considerable gene-environment interaction present. The smallest EPE of the methods applied was 0.33 - 0.35 (with p-value less than 0.01). The classification based on non-genetic factors only yielded a greater error of 0.42; thus genetic data improved the prognosis quality noticeably. While these two factors were important, the other SNPs considered did not improve the prognosis essentially, i.e. no small subgroups effect was observed.
While the CHD data used in the study permitted to specify the most important predictors with an EPE of about 0.13, the MI data led to less exact prognoses. Perhaps this complex disease requires a more detailed description of an individual's genetic characteristics and environmental factors.
The conclusions given above are based on several complementary methods of modern statistical analysis. These new data mining methods allow one to analyze other datasets as well. The study can be continued, and the medical conclusions need to be replicated with larger datasets, in particular involving new SNP data.
Recall that in Section 3 we use the estimated prediction error (EPE) of a prediction algorithm f_PA, which is based on K-fold cross-validation. In contrast to the binary LR, our version (more precisely, TLR) involves normalization (cf. (1)), i.e. taking the observations with weights dependent on the proportions of cases and controls in the subsample. An easy computation yields that minimizing the normalized score over h ∈ M is equivalent to the normalized maximum likelihood estimation of q*.
For a tree T its complexity C(T) is the number of leaves. The complexity C(F) of a forest F is the maximal complexity of the trees constituting F.

Table 1.
The EPE dropped significantly after additional non-genetic factors were taken into account (the error is 0.247 if additional non-genetic factors are taken into account and 0.343 if not).

Table 3.
The EPE in Model 1 for CHD was only 0.19. For the same model we also performed a fast search for the optimal forest, which was much more time-efficient and gave a reasonable approximation. The application showed that non-genetic factors play an important role in CHD genesis, as classification based on non-genetic factors only gave an error less than 0.23, while the usage of SNPs only (Model 4) let the error grow to 0.34. Model 1 gave the minimal EPE.

The analogous procedure is done for the other columns (containing the values of other SNPs) and the errors found are given in Table 4 (recall that the EPE equals 0.19 if no permutation is done). It is seen that the error increases considerably when the values of GPIa and Cx37 are permuted. The statement that they are the main sources of risk agrees with what was obtained above.
Myocardial infarction. For the MI dataset, under the same notation as above, the results obtained for our four models are given in the corresponding table. To comment on them we should first note that non-genetic risk factors play a slightly less important role compared with the CHD risk: if they are used without genetic information, the error increases by 0.09, see Models 1 and 3 (while the same increase for CHD was 0.03).