Logic regression is an adaptive regression method that searches for Boolean (logic) combinations of binary variables that best explain the variability in the outcome, thereby revealing interaction effects associated with the response. In this study, we extended logic regression to longitudinal data with a binary response and proposed the “Transition Logic Regression” method to find interactions related to the response. In this method, interaction effects over time are found by a simulated annealing algorithm with AIC (Akaike Information Criterion) as the score function of the model. First- and second-order Markov dependence is allowed in order to capture the correlation among successive observations of the same individual. The performance of the method was evaluated in a simulation study under various conditions. The proposed method was then used to find interactions of SNPs and other risk factors related to low HDL over time in data from 329 participants of the longitudinal TLGS study.

Regression analysis is an important tool for evaluating the functional relationship between a dependent variable and a set of independent variables. In most applications, regression models relate only the main effects of the predictor variables to the response, and the evaluation of interaction effects rarely goes beyond two-way or, at most, three-way terms because of the complexity of such interactions.

To account for such interactions in regression models, combinations of explanatory variables can be constructed and used as new predictors in place of the individual variables.

“Logic Regression” is a generalized regression and classification method that builds Boolean combinations of the original binary explanatory variables in order to reveal interactions [

Furthermore, this model has been extended in several ways. For example, Multinomial Logic Regression has been developed for multinomial categorical responses [

A longitudinal study, on the other hand, is an investigation in which each subject's responses are recorded at multiple follow-up times, so that it yields “repeated measurements” on each subject. Compared with cross-sectional studies, longitudinal studies have benefits such as measurement of individual change in outcomes, separation of time effects, and control for cohort effects [

As in other kinds of regression models, interactions among predictors are important in modelling longitudinal data. In addition, one of the goals of longitudinal studies is to examine whether the relationship between the response and the predictors changes over time, in other words, whether there is any interaction between the variables and time. Logic regression theory therefore appears suitable for assessing interactions in the modelling of longitudinal data. To find such time-dependent interactions for a quantitative longitudinal response, a “logic mixed model”, based on the linear mixed model, has recently been proposed and used to assess SNP interactions associated with longitudinal quantitative cholesterol level [

Given the importance of interactions related to such responses, in this paper we propose the “Transition Logic Regression” model as an extension of logic regression to detect and assess higher-order interactions over time in longitudinal data with a binary response. We also carried out a simulation study to evaluate the performance of the model in different settings and to compare it with the standard model. As an application, we assessed the effects of some SNPs and other risk factors on having a low level of HDL over time using the proposed Transition Logic Regression model.

The present paper was initially motivated by an SNP dataset with potentially important interactions among SNPs related to a binary longitudinal response.

Logic Regression is a generalized regression and classification method that identifies interactions by using Boolean combinations of the original binary variables as new independent variables. The aim is to find Boolean statements involving the binary predictors that improve prediction of the response. These Boolean combinations are logic expressions such as

Let

Logic Regression models are of the form:

where g is a link function for response and

Logic regression is an adaptive algorithm which for a given model selects those

The number of logic expressions that can be built from a given set of binary predictors is huge, and there is no straightforward way to enumerate all logic terms that yield different scores. It is therefore infeasible to assess all logic terms exhaustively and select the best model. To solve this problem, Logic Regression uses simulated annealing, a stochastic search algorithm, to search for the best logic combinations and estimate the

Logic regression theory defines a set of permissible moves, such as alternating a predictor, alternating an operator, deleting a predictor and so on. These moves are used in the annealing algorithm to generate new logic expressions in the search for the best logic regression model according to a score function. For more information about permissible moves see [
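The search described above can be illustrated with a deliberately small sketch. It is not the authors' implementation: it assumes a cross-sectional binary response, restricts every expression to exactly two leaves joined by AND or OR, uses only three move types (change a predictor, flip a negation, swap the operator), and scores each candidate by the AIC of a logistic model with a single binary predictor. All function names, the cooling schedule, and the toy data are hypothetical.

```python
import math
import random

import numpy as np


def evaluate(expr, X):
    """Evaluate (possibly negated X_i) AND/OR (possibly negated X_j) per row."""
    i, neg_i, op, j, neg_j = expr
    a = X[:, i] ^ neg_i          # XOR with 1 negates a 0/1 column
    b = X[:, j] ^ neg_j
    return (a & b) if op == "and" else (a | b)


def aic_score(expr, X, y):
    """AIC of a logistic model with an intercept and one binary predictor L.

    With a single binary predictor the fitted probabilities are the observed
    proportions of y within L = 0 and L = 1, so no optimizer is needed.
    """
    L = evaluate(expr, X)
    loglik = 0.0
    for level in (0, 1):
        group = y[L == level]
        if group.size == 0:
            continue
        p = min(max(group.mean(), 1e-9), 1 - 1e-9)   # guard against log(0)
        k = group.sum()
        loglik += k * math.log(p) + (group.size - k) * math.log(1 - p)
    return -2.0 * loglik + 2 * 2                     # q = 2 parameters


def propose(expr, n_vars, rng):
    """Apply one random move to the current expression."""
    i, neg_i, op, j, neg_j = expr
    move = rng.choice(["predictor", "negation", "operator"])
    if move == "predictor":
        if rng.random() < 0.5:
            i = rng.choice([v for v in range(n_vars) if v != j])
        else:
            j = rng.choice([v for v in range(n_vars) if v != i])
    elif move == "negation":
        if rng.random() < 0.5:
            neg_i = 1 - neg_i
        else:
            neg_j = 1 - neg_j
    else:
        op = "or" if op == "and" else "and"
    return (i, neg_i, op, j, neg_j)


def anneal(X, y, n_iter=2000, t_start=5.0, t_end=0.01, seed=0):
    """Simulated annealing with a geometric cooling schedule (an assumption)."""
    rng = random.Random(seed)
    current = (0, 0, "and", 1, 0)
    current_score = aic_score(current, X, y)
    best, best_score = current, current_score
    for step in range(n_iter):
        temp = t_start * (t_end / t_start) ** (step / (n_iter - 1))
        cand = propose(current, X.shape[1], rng)
        cand_score = aic_score(cand, X, y)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_score = cand, cand_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


# Toy data: the response depends on X3 AND (NOT X8), columns 2 and 7.
data_rng = np.random.default_rng(1)
X = data_rng.integers(0, 2, size=(500, 10))
L_true = X[:, 2] & (1 - X[:, 7])
y = data_rng.binomial(1, np.where(L_true == 1, 0.8, 0.2))
best, best_score = anneal(X, y)
print(best, round(best_score, 1))
```

Tracking the best expression seen, rather than returning the final state, keeps the result from degrading if the chain wanders late in the run. A full logic regression search would also grow and prune trees, which this sketch omits.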

To extend Logic Regression to longitudinal studies, we considered a transition model for binary longitudinal data introduced by Gonçalves [

For notation,

where β is a p vector of unknown parameters. To take into account the correlation among successive observations of the same individual, the model considers a Markovian type of first order (

For a pair of successive observations

For binary data, the odds ratio is the preferred measure of dependence between observations:

Solving the following equations with respect to p_{0} and p_{1}:

yields:

where

If

First and second order dependence are:

Likelihood inference is based on a sample of n subjects who are assumed to be independent of each other. If

where

Clearly, the log-likelihood function for the entire sample is obtained by summing the log-likelihoods of all n subjects [

The AIC statistic for the model is calculated as:

where q equals the number of parameters in the model.
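For reference, the usual definition of AIC, which is consistent with the description of q given here, can be written as follows. The symbol \ell for the maximized log-likelihood of the whole sample is an assumed notation and may differ from the one used in the original equation.

```latex
\[
  \mathrm{AIC} = -2\,\ell + 2q ,
\]
% \ell : maximized log-likelihood of the fitted model (assumed symbol)
% q    : number of parameters in the model
```

Smaller values indicate a better trade-off between fit and model size, which is why AIC can serve as a score to be minimized in the annealing search.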

In this paper, the first- and second-order Markov chain transition model described above, with AIC (Equation (3)) as the score function, was used to extend Logic Regression to longitudinal data. “Transition Logic Regression” was therefore defined as:

Searching to find best

A simulation study was carried out to assess the performance of the proposed model and to compare it with the standard model. Data were generated from a binomial distribution with a first-order Markov chain dependence structure for three time points.

For a given sample size, ten covariates were simulated for each subject at each time t from Bernoulli (5):

The simulated model assumed

Starting with the first response,

With respect to our desired values of

In order to produce

To simulate_{2} equals to one,

The simulation study was run for various sample sizes (number of cases: 50, 200, 500, 1000), first-order Markov chain dependences

Given the simulated interaction term, we used all covariates as the search space and set the model size in the annealing algorithm to one combination of two variables. For this simulation study, 500 datasets were generated for each condition.
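The data-generating step described above can be sketched as follows. Several details are assumptions rather than the authors' settings: the covariates are drawn with success probability 0.5, the interaction is taken to be X1 AND X2, the marginal model is a logit with a hypothetical intercept and interaction coefficient, and the transition probabilities p_{0} and p_{1} are obtained numerically from two constraints (the marginal mean at time t and a fixed odds ratio psi between successive responses).

```python
import numpy as np


def transition_probs(theta_prev, theta_t, psi, tol=1e-12):
    """Return (p0, p1), where p_j = P(y_t = 1 | y_{t-1} = j).

    Assumed constraints:
      theta_t = p1 * theta_prev + p0 * (1 - theta_prev)   (marginal mean)
      psi     = [p1 / (1 - p1)] / [p0 / (1 - p0)]         (odds ratio)
    The second constraint gives p1 as a function of p0; the first is then
    increasing in p0, so bisection on [0, 1] finds the unique solution.
    """
    def p1_from(p0):
        return psi * p0 / (1.0 - p0 + psi * p0)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        gap = p1_from(mid) * theta_prev + mid * (1.0 - theta_prev) - theta_t
        if gap < 0:
            lo = mid
        else:
            hi = mid
    p0 = 0.5 * (lo + hi)
    return p0, p1_from(p0)


def simulate(n, beta0, beta_int, psi, n_times=3, seed=0):
    """Simulate covariates X (time, subject, 10) and responses y (time, subject)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_times, n, 10))   # Bernoulli(0.5), assumed
    logic = X[:, :, 0] & X[:, :, 1]                 # hypothetical interaction
    theta = 1.0 / (1.0 + np.exp(-(beta0 + beta_int * logic)))
    y = np.empty((n_times, n), dtype=int)
    y[0] = rng.binomial(1, theta[0])                # first response: marginal only
    for t in range(1, n_times):
        for i in range(n):
            p0, p1 = transition_probs(theta[t - 1, i], theta[t, i], psi)
            y[t, i] = rng.binomial(1, p1 if y[t - 1, i] == 1 else p0)
    return X, y


X, y = simulate(n=200, beta0=-1.0, beta_int=1.5, psi=2.0)
print(X.shape, y.shape, y.mean(axis=1))
```

Solving for p_{0} by bisection avoids committing to a particular closed-form expression; setting psi = 1 recovers independent successive responses, with p_{0} = p_{1} equal to the marginal probability.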

The percentage of runs in which the exact simulated interaction was identified was taken as the measure of performance of the Transition Logic Regression model. In addition, the AIC of Transition Logic Regression was compared with the AIC of the transition model taken as the standard model, which includes only the ten covariates as main effects.

In addition, the MSE and 95% empirical confidence intervals of the estimators were calculated for the models that identified the interaction correctly. The lower bound of each empirical confidence interval is the 0.025 quantile and the upper bound is the 0.975 quantile of the estimated parameter values.
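As a small illustration of these summaries, the sketch below computes the MSE and the empirical 95% interval from a vector of replicate estimates. The function name and the toy numbers are hypothetical, and NumPy's default linear interpolation between order statistics is an assumption about how the quantiles were obtained.

```python
import numpy as np


def summarize_estimates(estimates, true_value):
    """Return (MSE, 0.025 quantile, 0.975 quantile) of replicate estimates."""
    estimates = np.asarray(estimates, dtype=float)
    mse = np.mean((estimates - true_value) ** 2)
    lower, upper = np.quantile(estimates, [0.025, 0.975])
    return mse, lower, upper


# Toy example with three replicate estimates of a parameter whose true value is 2.
mse, lower, upper = summarize_estimates([1.0, 2.0, 3.0], true_value=2.0)
print(mse, lower, upper)
```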

The results of the simulation study are shown in Tables 1-4. As expected, the rate of identification of the true interaction increases with sample size and with the coefficient of the interaction term. For example, in

Similarly, the MSE and confidence intervals of the estimates improve as sample size increases. In small samples, the size of the interaction coefficient and the first-order dependence affect the MSE of

Maximum type I error was 0.01 that method had found

Interactions usually play an important role in SNP (single-nucleotide polymorphism) association studies. High-order interactions of SNPs are thought to explain the differences between low- and high-risk groups [

A first-order Markov chain Transition Logic Regression model with three logic trees (Boolean combinations) and 8 leaves (predictor variables) was fitted.

A total of 329 subjects (127 (38.6%) men and 202 (61.4%) women) aged ≥20 years who were present in phases I, II, and III of the TLGS study and had no missing values in the evaluated variables were randomly selected and included in the current study.

Low HDL-C level was defined as <40 mg/dL for men and <50 mg/dL for women. High waist circumference (WC) was defined as WC ≥95 cm for Iranian men and women [

The polymorphisms of ApoA1M1, ApoA1M2, ApoB, ApoAIV, ApoCIII, ABCA1, SRB1 and ApoE genes that have been shown to be associated with HDL-C disorder [

Each SNP was treated as a variable taking the values 0, 1, and 2 corresponding to the nucleotide pairs. We coded each of these variables into two binary dummy variables corresponding to a dominant and a recessive effect. With this approach, we generated 2p binary predictors from p SNPs to form interaction terms for Logic Regression [
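The dominant/recessive coding can be sketched as below. The sketch assumes that the values 0, 1, and 2 count copies of the minor allele, so that the dominant dummy is 1 for at least one copy and the recessive dummy is 1 for two copies; the function name and the column ordering are hypothetical.

```python
import numpy as np


def code_snps(genotypes):
    """Turn an (n, p) array of 0/1/2 genotypes into an (n, 2p) binary array.

    Columns 0..p-1 hold the dominant coding, columns p..2p-1 the recessive one.
    Assumes the genotype value counts copies of the minor allele.
    """
    g = np.asarray(genotypes)
    dominant = (g >= 1).astype(int)    # carrier of at least one minor allele
    recessive = (g == 2).astype(int)   # homozygous for the minor allele
    return np.hstack([dominant, recessive])


# Two subjects, two SNPs.
coded = code_snps([[0, 1], [2, 0]])
print(coded)
```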

The results of Transition Logic Regression with a first-order Markov chain show that subjects with high triglyceride and high waist circumference have an odds ratio of 2.29 for a low level of HDL. Also, (being in phase 2 and ((carrier of the minor allele of ApoA1M1) or (being homozygous for the common allele of ApoCIII))) was