Transition Logic Regression Method to Identify Interactions in Binary Longitudinal Data

Logic regression is an adaptive regression method that searches for Boolean (logic) combinations of binary variables that best explain the variability in the outcome, and thus reveals interaction effects associated with the response. In this study, we extended logic regression to longitudinal data with a binary response and proposed the “Transition Logic Regression Method” to find interactions related to the response. In this method, interaction effects over time were found by a simulated annealing algorithm with AIC (Akaike Information Criterion) as the score function of the model. First and second order Markov dependence were allowed, to capture the correlation among successive observations of the same individual in the longitudinal binary response. The performance of the method was evaluated in a simulation study under various conditions. The proposed method was used to find interactions of SNPs and other risk factors related to low HDL over time in data from 329 participants of the longitudinal TLGS study.


Introduction
Regression analysis is an important tool for evaluating the functional relationship between a dependent variable and a set of independent variables. In most applications, regression models relate only the main effects of the predictor variables to the response variable, and the evaluation of interaction effects rarely goes beyond two-way or at most three-way interactions, because of the complexity of such interactions.
In order to consider such interactions in regression models, combinations of explanatory variables can be constructed and used as new predictors instead of the individual variables.
"Logic Regression" is a generalized regression and classification method based on logic combinations of binary variables; it builds Boolean combinations of the original binary explanatory variables in order to reveal interactions [1]. Logic regression is different from logistic regression, which uses the "logit" link function and is a member of the generalized linear model family for modeling response variables with a binomial distribution. Although interactions can be evaluated using logistic regression, they need to be known in advance and entered as input variables in the model. By contrast, Logic Regression is applicable to any type of response, as long as the predictors are binary. Interactions of interest need not be known in advance; on the contrary, the detection of important variable interactions is the main aim of logic regression [2]. Logic regression was introduced and used for case-control or cohort studies with independent observations [2].
Furthermore, this model has been extended in several ways. Multinomial Logic Regression has been developed for multinomial categorical responses [2]. Trio Logic Regression, based on a conditional Logic Regression model, has been proposed to analyze data of case-parent trios [3]. Monte Carlo Logic Regression has been developed to generate a list of predictors related to the response [4]. Logic FS has been introduced and used to identify different Logic Regression models associated with the response [5]. Genetic programming for association studies [6] has been proposed for classification settings and uses genetic programming as the search algorithm.
On the other hand, a longitudinal study is defined as an investigation in which subjects' responses are recorded at multiple follow-up times. A longitudinal study yields "repeated measurements" on each subject. Compared to cross-sectional studies, longitudinal studies have benefits such as measurement of individual change in outcomes, separation of time effects, and control for cohort effects [7].
As in other kinds of regression models, interactions among predictors are important in the modelling of longitudinal data. In addition, one of the goals of longitudinal studies is to examine whether the relationship between the response and the predictors changes over time, in other words, whether there is any interaction between the variables and time. It seems that logic regression theory can be used to assess interactions in the modelling of longitudinal data. To find such time-dependent interactions for a quantitative longitudinal response, a "logic mixed model", based on the linear mixed model, has recently been proposed and used to assess the interactions of SNPs associated with longitudinal quantitative cholesterol level [8]. However, Logic Regression has not yet been developed for the analysis of correlated binary observations from longitudinal studies.
Because of the importance of interactions related to such responses, in this paper we propose the "Transition Logic Regression" model as an extension of logic regression to detect and assess higher order interactions over time in longitudinal data with a binary response. Furthermore, we carried out a simulation study to evaluate the performance of our model in different settings and to compare it with the standard model. In addition, as an application, we assessed the effects of some SNPs and other risk factors on having a low level of HDL over time using the proposed Transition Logic Regression model.
The present paper was initially motivated by an SNP dataset with potentially important interactions among SNPs related to a binary longitudinal response.

Logic Regression
Logic Regression is a generalized regression and classification method that identifies interactions by using Boolean combinations of the original binary variables as new independent variables. We try to find Boolean statements involving the binary predictors that enhance the prediction of the response. These Boolean combinations are logic expressions such as $(X_1 \wedge X_3) \vee (X_5 \wedge X_7^c)$. This means that if the response is binary as well (which is not required in general), we attempt to find decision rules such as "if $X_1$ and $X_3$ are true, or $X_5$ but not $X_7$ is true, then the response is more likely to be in class 0". Let $X_1, \ldots, X_k$ be binary predictors, $Y$ be a response variable and $Z_1, \ldots, Z_p$ be quantitative covariates. Logic Regression models are of the form
$$g(E[Y]) = \beta_0 + \sum_{i=1}^{p} \gamma_i Z_i + \sum_{j} \beta_j L_j,$$
where $g$ is a link function for the response, $\gamma_i$ and $\beta_j$ are regression coefficients, and $L_j$ is a Boolean combination of the binary predictors $X_i$. Logic regression is an adaptive algorithm that, for a given model, selects those $L_j$ that minimize the score function of the model. The Logic Regression framework includes many forms of regression (such as linear and logistic regression and the Cox proportional hazards model). For every model type a score function is defined that indicates the "quality" of the model. In general, any type of model can be considered, as long as a scoring function (such as a deviance or likelihood) is defined [2].
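
As a minimal sketch of how a Boolean combination becomes a new binary predictor, the following Python snippet evaluates one illustrative logic term, (X1 and X3) or (X5 and not X7), on a few hypothetical observations. The function name `logic_term`, the example rows and the chosen expression are assumptions made for illustration; they are not taken from the LogicReg package.

```python
from typing import List, Sequence


def logic_term(x: Sequence[int]) -> int:
    """Evaluate (X1 and X3) or (X5 and not X7) for one observation.

    x holds the binary predictors X1..Xk at positions 0..k-1.
    """
    x1, x3, x5, x7 = x[0], x[2], x[4], x[6]
    return int((x1 and x3) or (x5 and not x7))


# Hypothetical observations with seven binary predictors each.
rows: List[List[int]] = [
    [1, 0, 1, 0, 0, 0, 0],  # X1 and X3 are true
    [0, 0, 0, 0, 1, 0, 0],  # X5 is true and X7 is false
    [0, 0, 0, 0, 1, 0, 1],  # X5 is true but X7 is true as well
]
L_values = [logic_term(r) for r in rows]
```

The resulting 0/1 column `L_values` would then enter the regression model in place of the individual predictors.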

Simulated Annealing for Logic Regression
The number of logic expressions that can be built from a given set of binary predictors is huge, and there is no straightforward method to enumerate all logic terms that yield different scores. It is therefore infeasible to assess all logic terms exhaustively and select the best model. To solve this problem in Logic Regression, simulated annealing, a stochastic search algorithm, is used to search for the best logic combinations and estimate the $\beta_j$ [1].
Logic regression theory defines a set of permissible moves, such as alternating a predictor, alternating an operator, deleting a predictor and so on. These moves are used in the annealing algorithm to generate new logic expressions in the search for the best logic regression model according to a score function. For more information about permissible moves see [1]. In each iteration of the simulated annealing algorithm, a new logic term is proposed by randomly executing a move from the set of permissible moves, and the corresponding new Logic Regression model is fitted. The acceptance probability for the new logic term is based on the score functions of the new and current models and on a simulated annealing parameter called temperature [2].
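
The acceptance step can be sketched with the Metropolis criterion, a common choice in simulated annealing. The Python snippet below assumes that lower scores are better (as with AIC or deviance); the function name `accept_move` and the exact form of the probability are assumptions for illustration, and [1] [2] should be consulted for the rule actually used in logic regression.

```python
import math
import random


def accept_move(score_current: float, score_new: float,
                temperature: float, rng: random.Random) -> bool:
    """Decide whether to accept a proposed logic term.

    An improvement (lower score) is always accepted. A worse model is
    accepted with probability exp((score_current - score_new) / temperature),
    which shrinks as the temperature is lowered.
    """
    if score_new <= score_current:
        return True
    prob = math.exp((score_current - score_new) / temperature)
    return rng.random() < prob
```

At high temperature almost every move is accepted, which lets the search escape local optima; as the temperature decreases, the search settles on models with low scores.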

Transition Model: Marginal Modelling of Binary Longitudinal Data Using Markov Chains
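In order to extend Logic Regression to longitudinal studies, we considered a transition model for binary longitudinal data introduced by Gonçalves [9]. This model is a marginal model for binary longitudinal data that uses Markov chains. It is briefly described below.
For notation, let $y_{it}$ be the binary response of subject $i$ at time $t$, with mean $E(Y_{it}) = \Pr(Y_{it} = 1) = \theta_{it}$. For each subject at each time, let $X_{it}$ be a set of $p$ covariates, the first of which can be a constant equal to one to include an intercept term. The logistic regression model that marginally connects the probability distribution of the response and the covariates is
$$\mathrm{logit}(\theta_{it}) = X_{it}^{\top} \beta,$$
where $\beta$ is a $p$-vector of unknown parameters. To take into account the correlation among successive observations of the same individual, the model considers a Markovian dependence structure of first order ($\psi_1$) or of second order ($\psi_2$). For the sake of simplicity, the subject subscript $i$ is dropped, since individuals are assumed to be independent of each other.
In the first order binary Markov chain model, the joint distribution is determined by the distribution of $Y_1$ and a set of conditional probabilities
$$p_j = \Pr(Y_t = 1 \mid Y_{t-1} = j), \quad j = 0, 1.$$
For a pair of successive observations $(Y_{t-1}, Y_t)$, the marginal mean $\theta_t$ is already assigned. For binary data, the odds ratio is the preferred measure of dependence between observations:
$$\psi_1 = \mathrm{OR}(Y_{t-1}, Y_t) = \frac{\Pr(Y_{t-1} = 1, Y_t = 1)\,\Pr(Y_{t-1} = 0, Y_t = 0)}{\Pr(Y_{t-1} = 0, Y_t = 1)\,\Pr(Y_{t-1} = 1, Y_t = 0)}. \quad (1)$$
The transition probabilities are obtained by solving the following equations with respect to $p_0$ and $p_1$:
$$\theta_t = \theta_{t-1}\, p_1 + (1 - \theta_{t-1})\, p_0, \qquad \psi_1 = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}.$$
If $\psi_1 = 1$, the variables are independent and $p_j = \theta_t$. If $\psi_1 \neq 1$, the solution is
$$p_j = \frac{\delta - 1 + (\psi_1 - 1)(\theta_t - \theta_{t-1})}{2(\psi_1 - 1)(1 - \theta_{t-1})} + j\, \frac{1 - \delta + (\psi_1 - 1)(\theta_t + \theta_{t-1} - 2\theta_t \theta_{t-1})}{2(\psi_1 - 1)\,\theta_{t-1}(1 - \theta_{t-1})},$$
with
$$\delta^2 = 1 + (\psi_1 - 1)\left\{ (\theta_t - \theta_{t-1})^2 \psi_1 - (\theta_t + \theta_{t-1})^2 + 2(\theta_t + \theta_{t-1}) \right\}.$$
Similarly, in the second order binary Markov chain model, for $(Y_{t-2}, Y_{t-1}, Y_t)$, the transition probabilities are
$$p_{hj} = \Pr(Y_t = 1 \mid Y_{t-2} = h, Y_{t-1} = j), \quad h, j = 0, 1.$$
First and second order dependence are:
$$\psi_1 = \mathrm{OR}(Y_{t-1}, Y_t), \qquad \psi_2 = \mathrm{OR}(Y_{t-2}, Y_t \mid Y_{t-1}). \quad (2)$$
The $p_{hj}$ can be calculated using these equations, for $j = 0, 1$:
$$\frac{p_{1j} / (1 - p_{1j})}{p_{0j} / (1 - p_{0j})} = \psi_2, \qquad \sum_{h=0}^{1} \Pr(Y_{t-2} = h \mid Y_{t-1} = j)\, p_{hj} = p_j.$$
The model is scored by the Akaike Information Criterion,
$$\mathrm{AIC} = -2\,\ell + 2q, \quad (3)$$
where $\ell$ is the maximized log-likelihood of the sample and $q$ equals the number of parameters in the model.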
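
A short worked example may help to see how the first order transition probabilities follow from the marginal means and the odds ratio. The numbers are chosen for illustration only and do not come from the paper. Take $\theta_{t-1} = \theta_t = 0.5$ and $\psi_1 = 4$, with $p_j = \Pr(Y_t = 1 \mid Y_{t-1} = j)$. The marginal constraint $\theta_t = \theta_{t-1} p_1 + (1 - \theta_{t-1}) p_0$ gives

$$0.5 = 0.5\, p_1 + 0.5\, p_0 \;\Rightarrow\; p_0 + p_1 = 1,$$

so that $1 - p_0 = p_1$ and $1 - p_1 = p_0$. The odds ratio constraint then becomes

$$\psi_1 = \frac{p_1 (1 - p_0)}{p_0 (1 - p_1)} = \left(\frac{p_1}{p_0}\right)^2 = 4 \;\Rightarrow\; p_1 = 2 p_0,$$

and therefore $p_0 = 1/3$ and $p_1 = 2/3$. An observation preceded by a success is thus twice as likely to be a success as one preceded by a failure, while the marginal probability of success remains $0.5$ at both times.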

Our Proposed Method: Transition Logic Regression
In this paper, the first and second order Markov chain transition model described above, with AIC (Equation (3)) as the score function of the model, was used to extend Logic Regression to longitudinal data. "Transition Logic Regression" was therefore defined as follows: for the binary response $y_{it}$ with mean $E(Y_{it}) = \theta_{it}$,
$$\mathrm{logit}(\theta_{it}) = Z_{it}^{\top} \gamma + L_{it}^{\top} \beta,$$
where $Z_{it}$ is a vector of quantitative covariates, $L_{it}$ is a vector of Boolean expressions of the binary predictors $X_{it}$, and $\gamma$ and $\beta$ are vectors of unknown parameters. To take into account the correlation among successive observations of the same individual, the model considers a Markovian dependence structure of first order ($\psi_1$) or of second order ($\psi_2$) (Equations (1) and (2)).
The search for the best $L_{it}$, such that the fitted model has a low AIC, was done using the annealing algorithm. That is, the annealing algorithm searched for the Boolean combinations with the lowest AIC, which therefore gave the best fit in the Transition Logic Regression model. The Transition Logic Regression program was written in FORTRAN 77 and added to the "LogicReg" package [1]. The modified "LogicReg" package was recompiled and installed in R (version 2.15.3) to analyze the data.
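
The score evaluated for each candidate model can be sketched as follows for the first order case. This Python snippet is an illustration under stated assumptions, not the FORTRAN 77 code added to LogicReg: the function names are hypothetical, the marginal means `theta` are taken as given (in practice they come from the fitted logit model), and the transition probabilities are obtained by solving the marginal and odds ratio constraints for $\Pr(Y_{t-1} = 1, Y_t = 1)$.

```python
import math
from typing import List, Sequence, Tuple


def transition_probs(theta_prev: float, theta_cur: float,
                     psi1: float) -> Tuple[float, float]:
    """Return (p0, p1), where p_j = Pr(Y_t = 1 | Y_{t-1} = j)."""
    if abs(psi1 - 1.0) < 1e-12:  # odds ratio 1 means independence
        return theta_cur, theta_cur
    a, b, k = theta_prev, theta_cur, psi1 - 1.0
    s = 1.0 + k * (a + b)
    # Joint probability Pr(Y_{t-1} = 1, Y_t = 1) from the odds ratio equation.
    p11 = (s - math.sqrt(s * s - 4.0 * psi1 * k * a * b)) / (2.0 * k)
    return (b - p11) / (1.0 - a), p11 / a


def loglik_subject(y: Sequence[int], theta: Sequence[float],
                   psi1: float) -> float:
    """Log-likelihood of one subject's binary sequence (first order chain)."""
    ll = math.log(theta[0] if y[0] == 1 else 1.0 - theta[0])
    for t in range(1, len(y)):
        p0, p1 = transition_probs(theta[t - 1], theta[t], psi1)
        p = p1 if y[t - 1] == 1 else p0
        ll += math.log(p if y[t] == 1 else 1.0 - p)
    return ll


def aic(loglik: float, q: int) -> float:
    """AIC = -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * loglik + 2.0 * q


# Hypothetical data: two subjects, three time points each.
ys: List[List[int]] = [[1, 1, 0], [0, 0, 1]]
thetas: List[List[float]] = [[0.6, 0.6, 0.4], [0.3, 0.3, 0.5]]
total = sum(loglik_subject(y, th, psi1=2.0) for y, th in zip(ys, thetas))
score = aic(total, q=3)
```

In an annealing search, `score` would be recomputed for every proposed set of Boolean combinations, and the proposal with the lower value preferred.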

Simulation Study
A simulation study was done to assess the performance of the proposed model and to compare it with the standard model. Data were produced from a binomial distribution with a first order Markov chain dependence structure for three time points.
Given a specific sample size, for each sample at time $t$, ten binary covariates $X_{1t}, \ldots, X_{10t}$ were simulated from a Bernoulli(0.5) distribution. The simulated model assumed an interaction effect $L_t$ between predictors at time $t$. For each sample, three repeated measurements were constructed as the response variable $Y_t$, each with a predetermined probability of success $\theta_t$ related to the interaction $L_t$ via the logit link function. Starting with the first response, $y_1$ was produced from a Bernoulli distribution with mean $\theta_1$. The transition probabilities are $p_j = \Pr(Y_2 = 1 \mid Y_1 = j)$, $j = 0, 1$. These first order transition probabilities were calculated from the desired values of $\theta_t$ and $\psi_1$. If $y_1$ equals one, $y_2$ was produced from a Bernoulli distribution with mean $p_1$; if $y_1$ equals zero, $y_2$ was simulated from a Bernoulli distribution with mean $p_0$. In order to produce $y_3$ under the desired conditions, we calculated the transition probabilities $\Pr(Y_3 = 1 \mid Y_2 = j)$, $j = 0, 1$, in the same way.
The simulation study was done for various sample sizes (number of cases: 50, 200, 500, 1000), first order Markov chain dependences ($\psi_1 = 0.2, 0.5, 1, 2, 5$), and coefficients of the interaction term ($\beta = 0, 0.5, 1.5, 3$). With respect to the simulated interaction term, we considered all covariates as the search space and one combination with two variables as the model size in the annealing algorithm setting. For this simulation study, 500 datasets were generated for each condition.
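
The data-generating steps described above can be sketched in Python as follows. Several details are assumptions made for the sketch rather than facts from the paper: the covariate success probability of 0.5, the interaction $L_t = X_{1t} \vee X_{2t}$, the intercept `beta0`, and all function names.

```python
import math
import random
from typing import List, Tuple


def expit(v: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-v))


def transition_probs(theta_prev: float, theta_cur: float,
                     psi1: float) -> Tuple[float, float]:
    """Return (p0, p1), where p_j = Pr(Y_t = 1 | Y_{t-1} = j)."""
    if abs(psi1 - 1.0) < 1e-12:  # independence
        return theta_cur, theta_cur
    a, b, k = theta_prev, theta_cur, psi1 - 1.0
    s = 1.0 + k * (a + b)
    p11 = (s - math.sqrt(s * s - 4.0 * psi1 * k * a * b)) / (2.0 * k)
    return (b - p11) / (1.0 - a), p11 / a


def simulate_subject(beta0: float, beta: float, psi1: float,
                     rng: random.Random, n_times: int = 3,
                     n_cov: int = 10) -> Tuple[List[List[int]], List[int]]:
    """Simulate covariates and a first order Markov binary response."""
    xs: List[List[int]] = []
    thetas: List[float] = []
    ys: List[int] = []
    for t in range(n_times):
        x = [rng.randint(0, 1) for _ in range(n_cov)]  # assumed Bernoulli(0.5)
        interaction = int(x[0] or x[1])                # assumed L_t = X1 or X2
        theta = expit(beta0 + beta * interaction)      # marginal mean at time t
        if t == 0:
            p = theta                                  # y_1 ~ Bernoulli(theta_1)
        else:
            p0, p1 = transition_probs(thetas[-1], theta, psi1)
            p = p1 if ys[-1] == 1 else p0              # condition on previous y
        ys.append(int(rng.random() < p))
        xs.append(x)
        thetas.append(theta)
    return xs, ys


# One hypothetical dataset with 200 subjects.
rng = random.Random(2024)
dataset = [simulate_subject(0.0, 1.5, 2.0, rng) for _ in range(200)]
```

Because the transition probabilities are solved from the marginal constraint, each $Y_t$ keeps the marginal mean $\theta_t$ whatever the value of $\psi_1$, so the dependence can be varied without changing the regression part of the model.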
The percentage of datasets in which the exact simulated interaction was identified was taken as the measure of performance of the Transition Logic Regression model. Also, the AIC of Transition Logic Regression was compared with the AIC of the transition model as the standard model, which includes all ten covariates only as main effects.
In addition, the MSE and 95% empirical confidence intervals of the estimators were calculated for the models that identified the interaction correctly. The lower bound of an empirical confidence interval is the 0.025 quantile and the upper bound is the 0.975 quantile of the estimated values of the parameter.
The results of the simulation study are shown in Tables 1-4. According to these tables, as expected, the rate of identification of the true interaction increases with increasing sample size and coefficient of the interaction term. For example, for $n = 200$ and $\beta = 3$ the method was able to find the true interaction term in all 500 datasets. The value of the first order dependence did not have a considerable effect on the performance of the method.
Likewise, the MSE and confidence intervals of the estimates improve with increasing sample size. In small samples, the size of the interaction coefficient and the first order dependence affect the MSE of $\psi_1$, so that with a strong interaction effect or strong first order dependence the MSE of $\psi_1$ is large.
The maximum type I error was 0.01, where the method found $L = X_1 \vee X_2$ as the interaction effect when there was no such interaction in the data ($\beta = 0$).
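
The two summary measures can be computed as in the following Python sketch. The function names and the linear interpolation rule for the quantiles are assumptions; the paper does not state which quantile definition was used.

```python
from typing import List, Sequence, Tuple


def mse(estimates: Sequence[float], true_value: float) -> float:
    """Mean squared error of the estimates around the true parameter."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)


def empirical_ci(estimates: Sequence[float], lower: float = 0.025,
                 upper: float = 0.975) -> Tuple[float, float]:
    """Empirical interval from the lower and upper sample quantiles."""
    s: List[float] = sorted(estimates)

    def quantile(q: float) -> float:
        pos = q * (len(s) - 1)           # fractional index into sorted values
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    return quantile(lower), quantile(upper)
```

For example, applying `empirical_ci` to the estimates of $\beta$ from the simulated datasets in which the interaction was found gives the interval whose coverage of the true $\beta$ can then be checked.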

Application of Proposed Model on TLGS Data
Interactions usually play an important role in SNP (single-nucleotide polymorphism) association studies. High order interactions of SNPs are supposed to explain the differences between low- and high-risk groups [10]. In addition to the main effects of SNPs, their interactions are assumed to be responsible for low HDL. SNP interactions can be time-dependent. The aim of this study was therefore to investigate SNP interactions related to low HDL over time. Subjects in this study were selected from among the participants of the Tehran Lipid and Glucose Study (TLGS). TLGS is a prospective study to determine the risk factors and outcomes of non-communicable disease [11]. The TLGS design has been explained elsewhere [12]. Longitudinal data from the three phases of the TLGS study were analyzed to assess the association of some related polymorphisms and other risk factors with low levels of HDL over time. In order to assess this association, Transition Logic Regression models with first and second order Markov chains were fitted.
A first order Markov chain Transition Logic Regression model with three logic trees (Boolean combinations) and 8 leaves (predictor variables) was fitted.
A total of 329 subjects (127 (38.6%) men and 202 (61.4%) women) who were present in phases I, II, and III of the TLGS study, were aged ≥20 years, and had no missing values in the evaluated variables were randomly selected and included in the current study.
Low HDL-C level was defined as <40 mg/dL for men and <50 mg/dL for women. High waist circumference (WC) was defined as WC ≥95 cm for Iranian men and women [13]. High triglyceride (TG) level was defined as TG ≥150 mg/dL. Subjects who had blood pressure (BP) ≥130/85 mmHg or used anti-hypertensive drugs were considered as having high BP, and subjects with fasting blood sugar (FBS) ≥110 mg/dL or users of anti-diabetic drugs were considered as having high FBS [14] [15]. Subjects who smoked daily or occasionally were considered smokers. Phase of study was considered as time.
Table 5 summarizes the demographic characteristics and the clinical and lipid profiles of these subjects in the three phases of the study. The highest prevalence of low HDL (79.3%) was seen in phase 2 of the study.
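
These definitions translate directly into binary variables, as in the Python sketch below. The function and argument names are hypothetical, and reading "≥130/85 mmHg" as systolic ≥130 or diastolic ≥85 is an assumption.

```python
from typing import Dict


def dichotomize(sex: str, hdl: float, wc: float, tg: float,
                sbp: float, dbp: float, fbs: float,
                on_bp_drug: bool = False,
                on_diabetes_drug: bool = False) -> Dict[str, int]:
    """Code one subject's measurements at one phase as 0/1 risk factors."""
    hdl_cutoff = 40.0 if sex == "male" else 50.0   # mg/dL, sex-specific
    return {
        "low_hdl": int(hdl < hdl_cutoff),
        "high_wc": int(wc >= 95.0),                # cm
        "high_tg": int(tg >= 150.0),               # mg/dL
        # Assumed: either pressure above its cutoff, or drug use, counts.
        "high_bp": int(sbp >= 130.0 or dbp >= 85.0 or on_bp_drug),
        "high_fbs": int(fbs >= 110.0 or on_diabetes_drug),
    }
```

Binary coding of this kind is what allows the risk factors to enter Boolean combinations alongside the SNP predictors.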
The polymorphisms of the ApoA1M1, ApoA1M2, ApoB, ApoAIV, ApoCIII, ABCA1, SRB1 and ApoE genes, which have been shown to be associated with HDL-C disorders [16]-[20], were investigated. Table 6 shows the genotype distributions and allele frequencies. The +/+ genotype of the ApoA1M2 gene had the highest prevalence (91.2%) and the TT genotype of the ApoAIV gene had the lowest frequency (0.3%).
Each SNP was considered as a random variable taking values 0, 1, and 2 corresponding to the nucleotide pairs. We coded each of these variables into two binary dummy variables corresponding to a dominant and a recessive effect. With this approach, we generated 2p binary predictors out of p SNPs to form interaction terms for Logic Regression [1].
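
The dominant and recessive coding can be sketched in Python as follows. It assumes that the value 0, 1 or 2 counts the copies of the minor allele, which is the usual convention but is not stated explicitly in the text; the function name `code_snp` is hypothetical.

```python
from typing import List, Tuple


def code_snp(genotype: int) -> Tuple[int, int]:
    """Map a genotype in {0, 1, 2} to (dominant, recessive) indicators.

    dominant  = 1 if at least one copy of the minor allele is present
    recessive = 1 if both copies are the minor allele
    """
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    return int(genotype >= 1), int(genotype == 2)


def code_subject(genotypes: List[int]) -> List[int]:
    """Turn p genotypes into 2p binary predictors for one subject."""
    predictors: List[int] = []
    for g in genotypes:
        predictors.extend(code_snp(g))
    return predictors
```

With this coding, "carrier of the minor allele" corresponds to the dominant indicator being 1, and "homozygous for the minor allele" to the recessive indicator being 1.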
The results of Transition Logic Regression with a first order Markov chain show that subjects with high triglyceride and high waist circumference have an odds ratio of 2.29 for having a low level of HDL. Also, (being in phase 2 and ((carrier of the minor allele of ApoA1M1) or (being homozygous for the common allele of ApoCIII))) was associated with increased odds of having low HDL (OR = 2.30). The odds ratio for having a low level of HDL in subjects with the combination ((high blood pressure and male) or (being homozygous for the minor allele of SRB1)) is 0.38. The first order Markov chain dependence between adjacent observations of the response was estimated as 2.5, indicating strong dependence between successive observations. The AIC for this model was 100.72.
A second order Markov chain Transition Logic Regression model with three logic trees and 8 leaves was also fitted. Its results were fairly similar to those of the first order model. According to this model, the odds ratio for having a low level of HDL in subjects with the combination ((high blood pressure and male) or (being homozygous for the minor allele of SRB1)) is 0.37, and being in phase 2 or being homozygous for the minor allele of ApoCIII was associated with increased odds of having low HDL. Also, subjects with high triglyceride who have high waist circumference or high blood pressure have an odds ratio of 2.51 for having a low level of HDL. The first order Markov chain dependence between adjacent observations of the response was estimated as 2.32 and the second order Markov chain dependence as 1.65. The AIC for this model was 974.96. The results of the first and second order Transition Logic Regression models are shown in Table 7.

Discussion
In the first part of the paper, we extended Logic Regression and proposed a model that allows first and second order Markov dependence in longitudinal binary data, in which the marginal probability of success is modeled via a form of Logic Regression. In the second part, a simulation study evaluated the performance of the proposed model under different conditions. The simulation study indicated satisfactory behavior of the proposed model: in all conditions, the AIC of the Transition Logic Regression models was lower than the AIC of the transition models with only main effects. Moreover, Transition Logic Regression was able to find moderate or strong interaction effects in nearly all datasets for sample sizes larger than 50. For sample size 50, the quality of the estimators was poor. At this sample size, the MSE of $\psi_1$ is not acceptable, especially for a strong interaction effect and high dependence. Also at this sample size, the confidence intervals of the estimators of $\beta$ for $\beta = 0.5$ did not contain the true value of the parameter. With increasing sample size, the MSE of the estimates of $\psi_1$ and $\beta$ decreased, so that for the other sample sizes the performance of the method and the quality of the estimators are acceptable.
In the last part of the paper, the proposed models were applied to data from the TLGS study, and some interactions among SNPs and other covariates related to low HDL were identified. The results of the first and second order Markov chain models were fairly similar and both contained similar combinations. The AIC of the Transition Logic Regression model with second order dependence was lower than that of the model with first order dependence, so it can be concluded that the second order model fits the data better than the first order model.
In this study, we had to work only with the complete dataset because the missing data problem has not yet been solved in Logic Regression methodology. It would be helpful if missing data were addressed in Logic Regression in future research.

Conclusion
To identify interactions in longitudinal studies with a binary response, Transition Logic Regression was introduced and used to find interactions influencing low HDL over time, and the most important interactions were identified.
The log-likelihood function for the entire sample is obtained by summing the log-likelihoods of all subjects [9]. The simulation study was done for various sample sizes (number of cases: 50, 200, 500, 1000) and first order Markov chain dependences ($\psi_1 = 0.2, 0.5, 1, 2, 5$).

Table 5. Demographic characteristics and clinical and lipid profiles of subjects in the phases of the study. Entries are mean ± SD for age and number (%) for the remaining categorical variables. * denotes a time-dependent variable; † denotes a time-independent variable.

Table 6. Genotype and allele frequencies of Apo E, Apo A1M1, Apo A1M2, Apo B, Apo AIV, Apo CIII, and SRB1 in the study population.

Table 7. Results of the Transition Logic Regression model with 3 Boolean combinations of 8 binary predictor variables, for first and second order Markov chain dependence structures, to study the interaction effects of SNPs and other risk factors on having a low level of HDL.