^{1}

^{*}

^{2}

^{3}

^{3}

^{3}

Binary logistic regression models are commonly used to assess the association between outcomes and covariates. Many covariates are inherently continuous, and have a variety of distributions, including those that are heavily skewed to the left or right. Existing theoretical formulas, criteria, and simulation programs cannot accurately estimate the sample size and power of non-standard distributions. Therefore, we have developed a simulation program that uses Monte Carlo methods to estimate the exact power of a binary logistic regression model. This power calculation can be used for distributions of any shape and covariates of any type (continuous, ordinal, and nominal), and can account for nonlinear relationships between covariates and outcomes. For illustrative purposes, this simulation program is applied to real data obtained from a study on the influence of smoking on 90-day outcomes after acute atherothrombotic stroke. Our program is applicable to all effect sizes and makes it possible to apply various statistical methods, logistic regression and related simulations such as Bayesian inference with some modifications.

Logistic regression models have been used to determine the association between risk factors and outcomes in various fields, including medical and epidemiological research [

Theoretical formulas, criteria, and software applications have been developed to enable the accurate determination of sample size and statistical power in a binary logistic regression model [

We consider a case-control study in which the binary response variable y denotes each patient’s disease status (y = 1 for cases and y = 0 for controls). For each subject, we have a set of p covariates X_{1}, X_{2} , _{p}. Let the conditional probability that an outcome is present be denoted by. The logit of the multiple logistic regression model is

in which case the logistic regression model is

where β is an unknown parameter.

We replace the linear term associated with the covariate X_{1} in the standard binary logistic regression model with a two-segment function containing a change-point. The relationship between the logit outcome and X_{1} is different either side of this change-point. The two-segment logistic regression model shown in

where represents the value of the change-point and; α_{1}, α_{2} are the unknown regression coefficients of X_{1}.

Our Monte Carlo simulation program is written in the SAS/STAT/IML language; the source code is given in the appendix. The program consists of three parts: data generation, parameter estimations, and statistical power calculation. Users should modify and add to these conditions according to their specific purposes and interests.

Continuous distributions are generated by specifying the mean, standard deviation, skewness, kurtosis, and correlation, or by assigning frequencies in each designated interval of a continuous variable (see

In the proposed program, users assign values for the proportion of events, sample size, Type I error, regression coefficients, distribution type (dichotomous, polytomous, or continuous), and distribution shape, as well as the quantile number for the categorization approach. The output of this program shows the average and standard error of each coefficient, as well as the power. A flowchart describing this program is shown in

The validity of the program is confirmed by comparing its results with those given by Hsieh’s program. The output of our program is almost the same as that from Hsieh’s program. For example, when the event proportion, sample size, and regression coefficient were set to 0.01, 12,580, and −0.223, respectively, our program estimated a power of 0.81, whereas Hsieh’s gave a result of 0.8. When the event proportion, sample size, and regression

Input parameter | Explanation |
---|---|

SEED | Random number seed (should be a positive integer) |

ALEVEL | Significance level of the statistical test (Type I error) |

P | Event proportion (response probability) |

NITER | Number of iterations performed |

N_REPEAT | Number of iterations performed |

^{*}NITER and N_REPEAT should be the same number | |

PATH | Directory in which results are saved |

TABLE | |

R | Number of categorized groups |

Example: continuous = 1, median = 2, tertile = 3, quantile = 4 | |

^{*}If model includes nominal variables, R should be >1 | |

CHANGE_POINT | Change point (see |

Regression coefficients for the covariates in the full model, except for predictors and intercept, specified as: | |

MODEL_1 | %NRSTR(α_{1}^{*}X1+,_{i}X_{i}) |

MODEL_2 | %NRSTR(α_{1}^{*}(the value of change_point) + α_{2}^{*}(X1 − (the value of change_point) +,_{i}X_{i}) |

α and β are the given regression coefficient values | |

^{*}If model is linear, the regression coefficients α_{1} and α_{2} are the same | |

Sample size, mean, standard deviation, skewness, kurtosis, and correlation are specified as: | |

Example DATA a (type=CORR); LENGTH _TYPE_ $40; INPUT _NAME_ $_TYPE_$ X1 X2 ; IF TRIM(LEFT(_TYPE_))=’N’ THEN call symput(‘NSP’, X1); CARDS; . MEAN 70 50 . STD 4 5 . N 300 300 X1 CORR 1 0 X2 CORR 0 1 ; RUN; | |

^{*}If only one covariate is defined, the correlation should be set to 1. The sample size of all covariates should be the same. | |

SKW_KRT | %NRSTR ({skewness 1 kurtosis 1, skewness 2 kurtosis 2, |

^{*}If covariates are normally distributed, both skewness and kurtosis are set to 0. | |

LIST_VARNAME | %NRSTR (X1, X2, |

MIN | Minimum value of a continuous variable |

MAX | Maximum value of a continuous variable |

SUB_GROUP | Number of subgroups |

CATEGORIZATION | %NRSTR (list of covariates to be categorized) |

CATEGORIZATION_R | %NRSTR (list of new covariate names after categorization) |

CONTI_MODEL | %NRSTR (list of covariates in a continuous logistic regression model) |

^{*}Even if some parameters are not needed, please assign all parameters and specify necessary variables in a logistic regression model. |

To assign the desired number of observations to each subgroup, as shown |
---|

For sample size, n_{k}, observation values of the k^{th} subgroup are extracted from a randomly generated U(a_{k}, b_{k}). U1 corresponds to the lowest interval subgroup, and U2 corresponds to the next lowest interval subgroup. For example, the minimum value, maximum value and number of subgroups are set to 1, 21 and 5, respectively. Therefore, the subgroups are (1, 5), (5, 9), (9, 13), (13, 17), (17, 21). The subgroups are assigned frequencies of 0.55, 0.05, 0.2, 0.15 and 0.05, respectively. &NSP denotes the total sample size; ID is the observation identification. |

%MACRO RATIO; |

IF 1=< ID <&NSP.*0.55 THEN _H1=U1; |

ELSE IF &NSP.*0.55 = |

ELSE IF &NSP.*0.6 =< ID <&NSP.*0.8 THEN _H1=U3; |

ELSE IF &NSP.*0.8 =< ID <&NSP.*0.95 THEN _H1=U4; |

ELSE _H1=U5; |

%MEND RATIO; |

If model includes discrete variables, then specify the model in part of PROC LOGISTIC in %MODEL_CATEGORICAL, and ensure the input parameter R is greater than 1. |

%MACRO MODEL_CATEGORICAL; |

ODS OUTPUT PARAMETERESTIMATES=PARAM_&R CONVERGENCESTATUS=STATUS_&R TYPE3=TYPE3_&R; |

PROC LOGISTIC DATA=G&R; |

/*******modification**********************/ |

CLASS C1(PARAM=REF REF="0") D1(PARAM=REF REF="0") ; |

MODEL Y(EVENT='1')= C1 X1 X2 X3 D1 |

/*********************************/ |

/TECH=NR MAXITER=8 XCONV=0.01; |

BY STRATA; RUN; |

%MEND; |

coefficient were set to 0.5, 225, and 0.405, respectively, our program estimated a power of 0.89, which compares well with Hsieh’s result of 0.9 [

Non-normal or normal multivariate continuous variables are generated by specifying the mean, standard deviation, kurtosis, skewness, and correlation through a procedure in the %COEFF and %CONTINOUS SAS modules. A detailed explanation can be found in a book on SAS^{®} for Monte Carlo Studies [

A continuous variable is divided into l subgroups of equal intervals as shown in

The length of the interval of each subgroup is

The k^{th} subgroup ranges from a_{k}_{ }to b_{k} (k = 1, _{k} and b_{k} can be expressed as

Random numbers from a uniform distribution on the interval (0, 1) are converted to a uniform distribution on the interval (a_{k}, b_{k}) with the equation a_{k} + (b_{k} ? a_{k}) × (generated number). The k^{th} subgroup, consisting of n_{k} observations in the interval (a_{k}, b_{k}), is denoted by the variable H_{1}.

If the covariate is assumed to follow a probability distribution, the RAND function can be inserted into a macro PDF module. In the example given for this program, nominal variables are generated using the SAS TABLE function.

The individual probability of event occurrences is calculated from the assigned parameters and generated covariates using a logistic regression model. The initial intercept value is set to zero, and then the average is calculated. The intercept is determined from

Intercept =

After determining the intercept, the individual probability π(X_{i}) (for i = 1, _{i}) and random numbers from a uniform distribution on the interval (0, 1). If π(x_{i}) is less than the corresponding random number, Y_{i} = 1 (denoting that the event occurred); otherwise, Y_{i} = 0. Finally, we have a dataset consisting of a covariate and response variable (Y). For skewed distributions, the event proportion of the generated dataset might not be the same as the input value. However, our program outputs the event proportion of this dataset. This difference can be adjusted by changing the input parameters of the event proportion.

We conducted a logistic regression analysis in a model including continuous and/or design variables to obtain maximum likelihood estimates of and the significance level, or p-value, for the null hypothesis with population regression coefficient β = 0.

There is a possibility of non-convergence if the data are completely or partially separated. This is because one or more parameters in the model become theoretically infinite, and it may not be possible to obtain reliable maximum likelihood estimates [

The test module outputs the mean of the asymptotic standard error, and the statistical power. The proportion of tests in which the p-value is less than the Type I error level is defined as the power.

Variable H_{1} was set to be the National Institute of Health Stroke Scale (NIHSS, a tool used by healthcare providers to objectively quantify the impairment caused by a stroke). The minimum, maximum, and the number of subgroups were set as 1, 21, and 5, respectively. Therefore, the subgroups ranged from (1, 5), (5, 9), (9, 13), (13, 17), and (17, 21). The frequency of each subgroup was assumed to be 0.55, 0.05, 0.2, 0.15, and 0.05, respectively. The generated numbers were rounded off, and the event proportion was set to 0.2. Regression coefficient parameters (,) were taken as (0.00, 0.00), (0.06, 0.06), and (0.06, 0.15), and the change point was set to 4. H_{1} was set to be either continuous, divided at median or tertile points, or categorized into three groups: 1 - 4, 5 - 15, and ≥16. We executed the logistic model for these values of H_{1}, and present the results in

We used age and systolic arterial pressure as continuous variables X_{1 }and X_{2}, respectively. The mean and standard deviation of X_{1} were 70 and 8, and the skewness and kurtosis were set to combinations of 0 and 0, −0.5 and 0.5, and −1.0 and 1.0. The regression coefficient of X_{1} was 0.05 under a linear relationship. The mean and standard deviation of X_{2} were 160 and 25, and the skewness and kurtosis were set to combinations of 0 and 0, 0.4 and 0.3, and 0.8 and 0.6. The regression coefficient of X_{2} was set to 0.02 as a linear relationship. The correlation between the variables was set to 0, 0.3, and 0.6. The binary variable D_{1} denotes smoking or non-smoking. The proportion of non-smokers and smokers was 0.5 and 0.5, and the regression coefficient was 0.83. The sample size was set to 300 and 500. We executed the logistic model for X_{1}, X_{2}, and D_{1}. The results are shown in

α_{1} | α_{2} | Categorization | Coefficient | SE | Power |
---|---|---|---|---|---|

0.00 | 0.00 | Continuous | 0.001 | 0.027 | 0.061 |

Median | 0.006 | 0.294 | 0.055 | ||

Tertile | 0.013 | 0.367 | 0.049 | ||

1 - 4 | 0.013 | 0.368 | |||

5 - 15 | −0.001 | 0.314 | 0.043 | ||

≥16 | −0.033 | 0.546 | |||

0.06 | 0.06 | Continuous | 0.062 | 0.026 | 0.682 |

Median | 0.616 | 0.293 | 0.545 | ||

Tertile | 0.238 | 0.397 | 0.459 | ||

1 - 4 | 0.768 | 0.374 | |||

5 - 15 | 0.530 | 0.313 | 0.516 | ||

≥16 | 0.904 | 0.477 | |||

0.06 | 0.15 | Continuous | 0.153 | 0.027 | 1.000 |

Median | 1.527 | 0.309 | 0.999 | ||

Quantile | 0.588 | 0.456 | 1.000 | ||

1 - 4 | 1.896 | 0.417 | |||

5 - 15 | 1.323 | 0.326 | 1.000 | ||

≥16 | 2.287 | 0.468 |

X_{1} X_{2} | X_{1} | Left-skewed covariate | X_{2} | Right- skewed covariate | D_{1 }binary | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

(skewness, kurtosis) | Mean | SD | (S, K) Coefficient | SE | Power | Mean | SD | (S, K) Coefficient | SE | Power | Coefficient SE | SE | Power |

70 | 8 | β_{1} = 0.05 | 160 | 25 | β_{2} = 0.02 | β_{2} = 0.83 | |||||||

Correlation of (X_{1}, X_{2}) = 0.0 | |||||||||||||

(0.0, 0.0) (0.0, 0.0) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.019 | 0.774 | 160.0 | 24.9 | (0.0, 0.0) 0.020 | 0.006 | 0.931 | 0.860 | 0.304 | 0.801 |

(−0.5, 0.5) (0.0, 0.0) | 70.0 | 8.0 | (−0.5, 0.5) 0.052 | 0.020 | 0.754 | 160.0 | 24.9 | (0.0, 0.0) 0.020 | 0.006 | 0.933 | 0.864 | 0.304 | 0.815 |

(−1.0, 1.0) (0.0, 0.0) | 70.0 | 8.0 | (−1.0, 1.0) 0.053 | 0.021 | 0.727 | 160.0 | 24.9 | (0.0, 0.0) 0.020 | 0.006 | 0.933 | 0.864 | 0.304 | 0.818 |

(0.0, 0.0) (0.4, 0.3) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.019 | 0.764 | 160.0 | 24.9 | (0.4, 0.3) 0.020 | 0.006 | 0.948 | 0.861 | 0.305 | 0.805 |

(0.0, 0.0) (0.8, 0.6) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.019 | 0.766 | 160.0 | 24.9 | (0.8, 0.6) 0.020 | 0.006 | 0.957 | 0.861 | 0.306 | 0.805 |

(−0.5, 0.5) (0.4, 0.3) | 70.0 | 8.0 | (−0.5, 0.5) 0.052 | 0.020 | 0.753 | 160.0 | 24.9 | (0.4, 0.3) 0.020 | 0.006 | 0.948 | 0.863 | 0.305 | 0.812 |

(−1.0, 1.0) (0.8, 0.6) | 70.0 | 8.0 | (−1.0, 1.0) 0.053 | 0.021 | 0.723 | 160.0 | 24.9 | (0.8, 0.6) 0.020 | 0.006 | 0.955 | 0.861 | 0.305 | 0.816 |

Correlation of (X_{1}, X_{2}) = 0.3 | |||||||||||||

(0.0, 0.0) (0.0, 0.0) | 70.0 | 8.0 | (0.0, 0.0) 0.051 | 0.020 | 0.742 | 160.0 | 25.0 | (0.0, 0.0) 0.021 | 0.007 | 0.910 | 0.871 | 0.305 | 0.820 |

(−0.5, 0.5) (0.0, 0.0) | 70.0 | 8.0 | (−0.5, 0.5) 0.052 | 0.021 | 0.704 | 160.0 | 25.0 | (0.0, 0.0) 0.021 | 0.007 | 0.910 | 0.871 | 0.304 | 0.830 |

(−1.0, 1.0) (0.0, 0.0) | 70.0 | 8.0 | (−1.0, 1.0) 0.052 | 0.022 | 0.665 | 160.0 | 25.0 | (0.0, 0.0) 0.021 | 0.007 | 0.906 | 0.875 | 0.306 | 0.826 |

(0.0, 0.0) (0.4, 0.3) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.020 | 0.747 | 160.0 | 25.0 | (0.4, 0.3) 0.020 | 0.006 | 0.922 | 0.873 | 0.303 | 0.827 |

(0.0, 0.0) (0.8, 0.6) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.020 | 0.755 | 160.0 | 25.0 | (0.8, 0.6) 0.020 | 0.006 | 0.935 | 0.874 | 0.305 | 0.823 |

(−0.5, 0.5) (0.4, 0.3) | 70.0 | 8.0 | (−0.5, 0.5) 0.052 | 0.021 | 0.704 | 160.0 | 25.0 | (0.4, 0.3) 0.020 | 0.006 | 0.926 | 0.874 | 0.305 | 0.829 |

(−1.0, 1.0) (0.8, 0.6) | 70.0 | 8.0 | (−1.0, 1.0) 0.052 | 0.023 | 0.662 | 160.0 | 25.0 | (0.8, 0.6) 0.020 | 0.006 | 0.938 | 0.874 | 0.305 | 0.829 |

Correlation of (X_{1}, X_{2}) = 0.6 | |||||||||||||

(0.0, 0.0) (0.0, 0.0) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.024 | 0.587 | 160.0 | 25.0 | (0.0, 0.0) 0.021 | 0.008 | 0.779 | 0.873 | 0.305 | 0.817 |

(−0.5, 0.5) (0.0, 0.0) | 70.0 | 8.0 | (−0.5, 0.5) 0.052 | 0.025 | 0.542 | 160.0 | 25.0 | (0.0, 0.0) 0.020 | 0.008 | 0.780 | 0.875 | 0.304 | 0.823 |

(−0.1, 1.0) (0.0, 0.0) | 70.0 | 8.0 | (−1.0, 1.0) 0.052 | 0.027 | 0.508 | 160.0 | 25.0 | (0.0, 0.0) 0.021 | 0.008 | 0.805 | 0.875 | 0.303 | 0.819 |

(0.0, 0.0) (0.4, 0.3) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.024 | 0.595 | 160.0 | 25.0 | (0.4, 0.3) 0.020 | 0.007 | 0.812 | 0.873 | 0.307 | 0.822 |

(0.0, 0.0) (0.8, 0.6) | 70.0 | 8.0 | (0.0, 0.0) 0.052 | 0.024 | 0.608 | 160.0 | 25.0 | (0.8, 0.6) 0.020 | 0.007 | 0.844 | 0.873 | 0.308 | 0.815 |

(−0.5, 0.5) (0.4, 0.3) | 70.0 | 8.0 | (−0.5, 0.5) 0.052 | 0.025 | 0.557 | 160.0 | 25.0 | (0.4, 0.3) 0.020 | 0.007 | 0.815 | 0.872 | 0.305 | 0.818 |

(−1.0, 1.0) (0.8, 0.6) | 70.0 | 8.0 | (−1.0, 1.0) 0.052 | 0.026 | 0.530 | 160.0 | 25.0 | (0.8, 0.6) 0.020 | 0.007 | 0.878 | 0.869 | 0.305 | 0.812 |

S, skewness; K, kurtosis.

(b)

S, skewness; K, kurtosis.

The mean, standard deviation, skewness, and kurtosis in the generated variables were almost equal to those of their input parameters. An increase in sample size leads to higher power, whereas a higher correlation produces a lower power. Our results clearly illustrate that the power will differ depending on the shape of the distribution. Negatively skewed distributions exhibit low power, whereas a positive skew results in high power. There is an inverse relationship between the logit outcome and the covariates. For example, when the skewness of X_{1} was changed from −0.5 to −1.0, the power decreased from 0.75 to 0.73. When the skewness of X_{2} was changed from 0.4 to 0.8, the power increased from 0.81 to 0.84 or 0.88 for a sample size of 300 and correlation of 0.0.

It is important to establish that the results observed in the above simulations hold for real data. For this purpose, we used data from a study of the influence of smoking on 90-day outcomes after acute atherothrombotic stroke in 292 Japanese men [

Input parameter | Explanation |
---|---|

SEED | 9 |

ALEVEL | 0.05 |

P | 0.15 |

NITER | 1000 |

PATH | C:\ |

TABLE | Table_samplerun_3 |

R | 2 |

CHANGE_POINT | 4 |

MODEL_1 | %NRSTR(0.04*H1+0.8*D1+ 0.06*X1 + 0.02*X2 + 1.1*X3) |

MODEL_2 | %NRSTR(0.04*4+0.15*(H1-4)+0.8*D1+0.06*X1+0.02*X2 +1.1*X3) |

DATA a (type=CORR); LENGTH _TYPE_ $40; INPUT _NAME_ $_TYPE_$ X1 X2 X3 ; IF TRIM(LEFT(_TYPE_))=’N’ THEN call symput(‘NSP’,X1); CARDS; . MEAN 70 160 36.4 . STD 8 25 0.5 . N 292 292 292 X1 CORR 1 -0.1 -0.1 X2 CORR -0.1 1 0.1 X3 CORR -0.1 0.1 1 ; RUN； | |

SKW_KRT | %NRSTR({-0.5 0.5, 0.4 0.3, -0.08 0.7}) |

LIST_VARNAME | %NRSTR(X1 X2 X3) |

CONTI_MODEL | %NRSTR(X1 X2 X3 C1) |

Min= | 1 |

Max= | 21 |

SUB_GROUP= | 5 |

CATEGORIZATION | %NRSTR(H1) |

CATEGORIZATION_ | %NRSTR(H1_R) |

Risk factor | Results of epidemiological study | Results of simulation | ||
---|---|---|---|---|

Coefficient | p value | Coefficient | Power | |

Smoker | 0.82 | 0.019 | 0.83 | 0.686 |

Age (years) | 0.06 | 0.014 | 0.06 | 0.792 |

Systolic arterial pressure | 0.02 | 0.0096 | 0.02 | 0.847 |

Body temperature | 1.18 | 0.0013 | 1.14 | 0.893 |

NIHSS score at admission | ||||

5 - 15 | 1.40 | 0.001 | 1.33 | 1.000 |

≥16 | 2.25 | 0.001 | 2.30 |

Event proportion = 0.206.

Estimating the sample size or inference of statistical power is critical. If the sample size is too low, the experiment will lack the precision needed to provide reliable answers to the questions it is investigating. If the sample size is too large, time and resources will be wasted, often for minimal gain [

Our program is flexible enough to accommodate any number or type (continuous or discrete) of covariate and categorization, continuous distribution shapes and correlations, and the association level between logit outcome and covariates, although some modifications may be necessary. This program can also be applied to other statistical methods, logistic regression and Bayesian inference. The SAS/STAT/IML program written for the simulations and a user manual are available upon request [

The authors wish to thank Dr. Motonori Hatta for helpful advice on the study.

The authors have no conflicts of interest to declare.

K.A. and N.K. were responsible for the study conception. N.K. wrote the program and drafted this manuscript. N.K., K.A., Y.O., H.K., and Y.H. made revisions to the manuscript. All authors approved the final version of the manuscript.

%LET SEED=;

%LET ALEVEL=;

%LET PATH=;

%LET TABLE=;

%LET NITER=;

%LET SKW_KRT=%NRSTR({});

%LET LIST_VARNAME=%NRSTR();

%LET MODEL_1=%NRSTR();

%LET MODEL_2=%NRSTR();

%LET CATEGORIZATION=%NRSTR();

%LET CATEGORIZATION_R=%NRSTR();

%LET CONTI_MODEL=%NRSTR();

%LET CHANGE_POINT=;

%LET MAX=;

%LET MIN=;

%LET SUB_GROUP=;

%LET P=;

/*EXAMPLE*/

DATA A (TYPE=CORR);

LENGTH _TYPE_ $40;

INPUT _NAME_ $ _TYPE_$ X1 X2 ;

IF TRIM(LEFT(_TYPE_))='N'THENCALL SYMPUT('NSP', X1);

CARDS;

. MEAN 70 160

. STD 8 25

. N 300 300

X1 CORR 1 0

X2 CORR 0 1

;

RUN;

%DATASET(N_REPEAT=);

%LR(R=);

/************************%COEFF and %CONTINUOUS **************/

/*%COEFF and %CONTINUOUS generate random variables following a Multivariate /*Normal distribution with given means, standard deviations, and correlation matrix, /*and then transform each variable to the desired distributional shape with specified /*population univariate skewness and kurtosis

/*%COEFF

/*Macro COEFF calculates coefficients of the Fleishman’s power transformation

/*Equation X= A + B*C1 + C*C2^2 +D*C3^3 where A=-C

/*Parameters

/*SKW_KRT; %NRSTR({skewness1 kurtosis1, skewness2 kurtosis2,…, });

/*LIST_VARNAME; list of variable names that define the skewness and kurtosis.

/*OUT the name of the output file (name of COEFF) that has thecoefficient values (A B C) of each variable.

/********************************************************************/

%MACROCOEFF;

PROC IML;

/* COEFFICIENTS OF B, C, D FOR FLEISHMAN'S POWER TRANSFORMATION*/

SKEWKURT=&SKW_KRT;

MAXITER=25;

CONVERGE=.000001;

START FUN;

C1=COEF [

C2=COEF [

C3=COEF [

F= (C1**2+6*C1*C3+2*C2**2+15*C3**2-1)//

(2*C2*(C1**2+24*C1*C3 +105*C3**2+2)-SKEWNESS)//

(24*(C1*C3+C2**2*(1+C1**2+28*C1*C3)+C3**2*

(12+48*C1*C3+141*C2**2+225*C3**2))-KURTOSIS);

FINISH FUN;

START DERIV;

J= ((2*C1+6*C3) || (4*C2) || (6*C1+30*C3))//

((4*C2*(C1+12*C3)) || (2*(C1**2+24*C1*C3+105*C3**2+2))||

(4*C2*(12*C1+05*C3)))//((24*(C3+C2**2*(2*C1+28*C3)+48*C3**3))||

(48*C2*(1+C1**2+28*C1*C3+141*C3**2))||

(24*(C1+28*C1*C2**2+2*C3*(12+48*C1*C3+141*C2**2+225*C3**2)

+C3**2*(48*C1+450*C3))));

FINISH DERIV;

START NEWTON;

RUN FUN;

DO ITER = 1 TO MAXITER

WHILE (MAX(ABS(F))> CONVERGE);

RUN DERIV;

DELTA=-SOLVE(J,F);

COEF=COEF+DELTA;

RUN FUN;

END;

FINISH NEWTON;

DO;

NUM=NROW (SKEWKURT);

DO VAR = 1 TO NUM;

SKEWNESS = SKEWKURT [VAR,1];

KURTOSIS = SKEWKURT [VAR, 2];

COEF = {1.0, 0.0, 0.0};

RUN NEWTON;

COEF = COEF`;

SK_KUR= SKEWKURT [VAR,];

COMBINE=SK_KUR||COEF;

IF VAR = 1 THEN RESULT=COMBINE;

IF VAR >1 THEN RESULT=RESULT//COMBINE;

END;

END;

RESULT=RESULT`;

CREATE _COEF_ FROM RESULT [COLNAME={&LIST_VARNAME}];

APPEND FROM RESULT;

DATA _COEF;

SET _COEF_;

LENGTH _TYPE_ $40;

MARK = _N_; _TYPE_="COEFF";

FORMAT MARK;

RUN;

DATA COEFF (DROP= MARK);

SET _COEF;

IF MARK >2 THEN OUTPUT COEFF;

RUN;

%MEND COEFF;

/***********************%CONTINUOUS********************************/

/*This program generates random variables following a Multivariate Normal /*distribution with given name, standard deviation, and correlation matrix, and then /*transforms each variable to the desired distributional shape with Fleishman’s /*coefficient.

/*Parameter

/*N_Repeat; the number of iterations

/*SEED; seed of the random number generator

/*DATA the name, A, of the input file that determines the characteristics of the random /*numbers to be generated. The file specifies the mean, standard deviation, number of /*observations of each random number, and the correlation coefficients between the /*variables. It must be a TYPE=CORR file, and its structure must comply with that of /*such files. The file has _Type_=MEAN, STD, N, CORR. Its variables are _TYPE_, /*_NAME_ and the variables to be generated. The number of observations should be /*the same value. In this file, the sample size 'NSP' should be specified as a parameter, /*using IF TRIM(LEFT(_TYPE_))='N' THEN CALL SYMPUT('NSP', X1 (one of the /*variable names)).

/*

/*Example

/*DATA A (TYPE=CORR);

/*LENGTH _TYPE_ $40;

/* INPUT _NAME_ $ _TYPE_$ X1 X2 ;

/* IF TRIM(LEFT(_TYPE_))='N' THEN CALL SYMPUT('NSP', X1);

/* CARDS;

/* . MEAN 70 160

/* . STD 8 25

/* . N 300 300

/* X1 CORR 1 0

/* X2 CORR 0 1

/* ;

/* RUN;

/*OUT random variables generated according to the file given in parameter DATA and observation identification number (ID)

/***************************************************************/

%MACROCONTINUOUS;

PROC CONTENTS DATA=A (DROP=_TYPE_ _NAME_)

OUT=_DATA_ (KEEP=NAME) NOPRINT;

RUN;

/*SUPPOSE WE HAVE X1,......, X_{P} VARIABLE IN DATASET A WHICH IS AN INPUT DATASET.

WE ASSIGN THESE VARIABLES AS NAME OF V1,..., V_{P} MACRO REFERENCE OF &NV IS ASSIGNED THE NUMBER OF TOTAL VARIABLES*/

DATA _DATA_;

SET _LAST_ END=END;

RETAIN N 0;

N=N+1;

V=COMPRESS('V' || COMPRESS(PUT (N, 6.0)));

CALL SYMPUT(V, NAME);

IF END THEN CALL SYMPUT('NV', LEFT(PUT (N, 6.0)));

RUN;

%LET VNAMES=&V1;

%DO I=2%TO&NV.;

%LET VNAMES=&VNAMES &&V&I;

%END;

/*OBTAIN THE MATRIX OF FACTOR PATTERNS AND OTHER STATISTICS.*/

PROC FACTOR DATA=A NFACT=&NV NOPRINT

OUTSTAT=PATTERN_(WHERE=(_TYPE_ IN('MEAN','STD','N','PATTERN')));

RUN;

DATA _PATTERN_;

SET COEFF PATTERN_;

RUN;

/*GENERATE THE RANDOM NUMBERS.*/

%LET NV2=%EVAL(&NV.*&NV.);

%LET NV3=%EVAL(3*&NV.);

DATA B&REPEAT. (KEEP=&VNAMES);

SET _PATTERN_ (KEEP=&VNAMES _TYPE_ RENAME=(

%DO I=1%TO&NV;

&&V&I = V&I

%END;

)) END=LASTFACT;

RETAIN;

/*SET UP ARRAYS TO STORE THE NESSESARRY STATISTICS.*/

ARRAY VCOEFF(3,&NV) C1-C&NV3;

ARRAY FPATTERN(&NV,&NV) F1-F&NV2;

ARRAY VSTD(&NV) S1-S&NV;

ARRAY VMEAN(&NV) M1-M&NV;

ARRAY V(&NV)V1-V&NV;

ARRAY VTEMP(&NV)VT1-VT&NV;

LENGTH LBL $40;

/* READ AND STORE THE MATORIX OF FACTOR PATTERNS. */

IF _TYPE_='PATTERN' THEN DO;

DO I=1 TO &NV;

FPATTERN(_N_ -6, I)=V(I);

END;

END;

IF _TYPE_='COEFF' THEN DO;

DO I=1 TO &NV;

VCOEFF(_N_,I) =V(I);

END;

END;

/* READ AND STORE THE MEANS */

IF _TYPE_ = 'MEAN' THEN DO;

DO I=1 TO &NV;

VMEAN(I)=V(I); END;

END;

/* READ AND STORE THE STD. */

IF _TYPE_ = 'STD' THEN DO;

DO I=1 TO &NV;

VSTD(I) =V(I); END;

END;

/* READ AND STORE THE NUMBER OF OBSERVATIONS.*/

IF _TYPE_ = 'N' THEN NNUMBERS=V(1);

IF LASTFACT THEN DO;

/* SET UP LABELS FOR THE RANDOM VARIABLES. THE LABELSARE STORED IN MACRO VARIABLES LBL1, LBL2,.... AND USED IN THE SUBSEQUENT PROC DATASETS.*/

%DO I=1%TO&NV;

LBL="ST.NORMAL VAR. ,M-"||COMPRESS(PUT(VMEAN(&I), BEST8.))||",STD="||COMPRESS(PUT(VSTD(&I),BEST8.));

CALL SYMPUT("LBL&I",LBL);

%END;

DO K=1 TO NNUMBERS;

DO I =1 TO &NV;

SEED=(&SEED.+&REPEAT.+1);

VTEMP(I)=RANNOR(SEED);

END;

/* IMPOSE THE INTERCORRELATION ON EACH VARIABLE. THE

TRANSFORMED VARIABLES ARE STORED ARRAY 'V'.*/

DO I=1 TO &NV;

V(I)=0;

DO J=1 TO &NV;

V(I) = V(I) + VTEMP(J)*FPATTERN(J, I);

END;

END;

/* TRANSFORM THE RANDOM VARIABLES SO THEY HAVE

MEANS AND STANDARD DEVIATIONS AS REQUESTED. */

DO I=1 TO &NV;

V(I)= VCOEFF(2, I)*(-1)+V(I)*VCOEFF(1, I) +VCOEFF(2,I)*V(I)*V(I)+VCOEFF(3, I)*V(I)*V(I)*V(I);

V(I) = VSTD(I) *V(I) + VMEAN(I);

END;

OUTPUT;

END;

END;

RENAME

%DO I=1%TO&NV;

V&I = &&V&I

%END;

;

RUN;

DATA BB&REPEAT.;

SET B&REPEAT.;

ID=_N_;

FORMAT ID;

RUN;

%MEND CONTINUOUS;

/***************** HISTGRAM and RATIO *************************/

/* This program generate random variables as in

/*%Macro HISTGRAM

/*Parameters

/*SEED= seed of the random number

/*N_REPEAT= the number of iterations

/*NSP= total sample size, which is already defined as an input parameter of the DATA /*file used to execute %macro CONTINUOUS.

/*MAX=Maximum value of an original variable

/*MIN=Minimum value of an original variable

/*SUB_GROUP= the number of subgroups

/*OUT

/*The name of the output file (name of HH&REPEAT) containing the random variable, /*H1 and ID number.

/*

/*%RATIO assigns the frequencies of each subgroup, using an IF function.

/*&NSP.*0.55 = (sample size × accumulated percentage)=frequencies of subgroup.

/*Example IF 1=< ID <&NSP.*0.55 THEN _H1=U1;

/*U1 indicates random number for the lowest subgroup.

/**************************************************************/

%MACROHISTGRAM;

DATA HH&REPEAT.;

DO ID=1 TO &NSP.;

CALL STREAMINIT(&REPEAT. + &NSP. + &SEED. + 2.);

SCALE=%EVAL((&MAX.-&MIN.)/&SUB_GROUP.);

%DO I=1%TO&SUB_GROUP.;

_U&I.=RAND("UNIFORM");

U&I.=&MIN.+(&I.-1)*SCALE+SCALE*_U&I.;

%END;

%RATIO;

OUTPUT;

END;

RUN;

%MEND HISTGRAM;

%MACRORATIO;

IF 1=

ELSE IF &NSP.*0.55 =

ELSE IF &NSP.*0.6 =

ELSE IF &NSP.*0.8=

ELSE _H1=U5;

H1=INT(_H1);

IF H1=<4 THEN C1=0;

ELSE IF 4< H1 =<15 THEN C1=1;

ELSE IF H1 >15 THEN C1=2;

%MEND RATIO;

/************************** PDF ***********************************/

/*This macro generates random variables from the RAND function.

/*RAND function generates random numbers with certain probability distributions.

/*Parameter

/*SEED= seed of the random number

/*N_REPEAT= the number of iterations

/*NSP= total sample size, which is already defined as an input parameter of the DATA file used to execute %macro CONTINUOUS.

/*RAND function.

/*OUT

/*The name of the output file (name of CC&REPEAT) containing the random variable /*defined by the probability distributions given by the RAND function and ID number.

/**************************************************************/

%MACROPDF;

DATA CC&REPEAT.;

DO ID=1 TO &NSP.;

CALL STREAMINIT( &REPEAT. + &NSP. + &SEED. + 3);

STRATA=&REPEAT.;

UN=RAND("UNIFORM");

/*INSERT RAND FUNCTION TO GENERATE RANDOM NUMBER USING RAND FUNCTION*/

X=RAND("NORMAL", 0,1);

_D1=RAND("TABLE", 0.5 , 0.5);

IF _D1=1 THEN D1=0;

ELSE IF _D1=2 THEN D1=1;

OUTPUT;

END;

%MEND PDF;

/********************** Merge ***************************/

/*This program merges all datasets including randomly generated variables specified /*in %macro CONTINUOUS (BB&REPEAT), %HISTGRAM (HH&REPEAT) /*and %macro PDF(CC&REPEAT) by ID number.

/*OUT file name of _D&REPEAT.

/**************************************************************/

%MACRO MERGE;

PROC SORT DATA=BB&REPEAT.; BY ID; RUN;

PROC SORT DATA=HH&REPEAT.; BY ID; RUN;

PROC SORT DATA=CC&REPEAT.; BY ID; RUN;

DATA _D&REPEAT.;

MERGE BB&REPEAT. CC&REPEAT. HH&REPEAT.;

BY ID;

RUN;

DATA DATASET0;

SET DATASET0;

DATA DATASET&REPEAT.;

SET DATASET%EVAL(&REPEAT. -1) _D&REPEAT.;

RUN;

%MEND MERGE;

/*************************** OUTCOME ***************/

/*This program generates outcome variable, y, from the individual probability of event /*occurrence. Individual probability is calculated using two segment logistic regression /*model.

/*Parameter

/*CHANGE_POINT= flexion point of two segment logistic regression model

/*MODEL_1 = logistic regression model when values of covariates ?values of /*CHANGE_POINT

/*MODEL_2 = logistic regression model when values of covariates > values of /*CHANGE_POINT

/*NITER= number of final datasets

/*P = event proportion

/*OUT DATASET For logistic regression model

/********************************************************************/

%MACROOUTCOME;

DATA _D_&NITER.;

SET DATASET&NITER.;

%IF H1 =<&CHANGE_POINT.%THEN%DO;

G=&MODEL_1;

%END;

%IF H1 >&CHANGE_POINT.%THEN%DO;

G=&MODEL_2;

%END;

RUN;

PROC SUMMARY DATA=_D_&NITER.;

VAR G;

OUTPUT OUT= PROCMEAN&NITER. MEAN=;

RUN;

DATA M&NITER. (KEEP=INT ID NITER);

SET PROCMEAN&NITER.;

DO ID=1 TO &NSP;

MEAN=%SCAN(G, 1);

INT=LOG(&P/(1-&P))-MEAN;

NITER=&NITER.;

OUTPUT;

END;

RUN;

PROC SORT DATA=M&NITER.;BY ID;RUN;

PROC SORT DATA=_D_&NITER.;BY ID;RUN;

DATA D_&NITER.;

MERGE M&NITER. _D_&NITER.;

BY ID;

RUN;

DATA D&NITER. ;

SET D_&NITER.;

PRO=EXP(INT+ G)/(1 +EXP(INT+ G));

IF 0=

ELSE Y=0;

RUN;

PROC SORT DATA=D&NITER. ;BY STRATA;RUN;

%MEND OUTCOME;

/********************************************************************/

/*This program performs a stratified continuous logistic regression model and produces a repeated number of parameters (coefficient, its standard error, and p value), then calculates the average coefficient value, average standard error, and power.

/*Parameter

/*NITER=number of final datasets

/*CONTI_MODEL=a continuous logistic regression model

/*ALEVEL=significance level of the statistical test (Type I error)

/*NITER=specify final dataset

/*PATH=directory in which results are saved

/*TABLE=table name for saved results

/*OUT=Result (excel format)

/*Results include event proportion, mean, standard deviation, skewness, and kurtosis of a variable average coefficient and average standard error of logistic regression model and power

/********************************************************************/

%MACROCONTINUOUSLR;

/****MODEL**********************************************************/

ODS OUTPUT PARAMETERESTIMATES=PARAM CONVERGENCESTATUS=STATUS;

PROC LOGISTIC DATA=D&NITER. ;

MODEL Y (EVENT='1')=&CONTI_MODEL

/TECH=NR MAXITER=8 XCONV=0.01 ;

BY STRATA;

RUN;

/******************************************************************/

PROC SORT DATA=PARAM

OUT=PARAM2

(RENAME=(ESTIMATE=ESTIMATION STDERR=STANDARDERRORS));

BY STRATA;

RUN;

PROC SORT DATA=STATUS;

BY STRATA;

RUN;

DATA RESULT;

MERGE PARAM2 STATUS ;

BY STRATA;

RUN;

DATA RESULT_CONTINUOUS E;

SET RESULT;

IF 0=< PROBCHISQ<&ALEVEL. THEN POWER=1;

ELSE POWER=0;

IF STATUS=0 THEN OUTPUT RESULT_CONTINUOUS;

ELSE OUTPUT E;

RUN;

ODS HTML PATH="&PATH" BODY="&TABLE..XLS";

PROC TABULATE DATA=D&NITER. OUT=J;

VAR Y &CONTI_MODEL;

TABLE (&CONTI_MODEL)*(MEAN STD SKEWNESS KURTOSIS)/MISSTEXT = 'NO DATA';

RUN;

PROC FREQ DATA=D&NITER;

TITLE 'PROPORTION';

TABLE Y/NOCOL NOROW;

RUN;

PROC FREQ DATA=RESULT_CONTINUOUS;

TITLE 'POWER';

TABLE VARIABLE*(POWER)/NOCOL NOPERCENT;

RUN;

PROC TABULATE DATA=RESULT_CONTINUOUS;

TITLE 'MEAN OF COEFFICIENT AND THEIR MEAN OF STANDARD ERROR';

CLASS VARIABLE ;

VAR ESTIMATION STANDARDERRORS;

TABLE VARIABLE,(ESTIMATION STANDARDERRORS)*(N MEAN*F=8.4)/MISSTEXT = 'NO DATA';

RUN;

ODS HTML CLOSE;

%MEND CONTINUOUSLR;

/************************* CONTINUOUSLR ***************************/

/*Continuous variables are divided into categorical groups by quantile, and then a stratified logistic regression model is executed. Users specify the model in %MACRO CATEGORICAL_MODEL. Then, parameters (coefficient, standard error, and p value) and average coefficient values, average standard error of each group of a variable are calculated, and the power is calculated.

/*Parameter

/*R=Number of categorized groups

/*Example: Continuous=1, median=2, tertile=3, quantile=4,

/*CATEGORIZATION=%NRSTR(List of covariates to be categorized)

/*CATEGORIZATION_R=%NRSTR(List of new covariate names after /*categorization)

/*NITER=number of final datasets

/*CONTI_MODEL=a continuous logistic regression model

/*ALEVEL=significance level of the statistical test (Type I error)

/*NITER= specify final dataset

/*PATH=directory in which results are saved

/*TABLE=table name for saved results

/*OUT= Result (excel format)

/*Results include average coefficient and average standard error of logistic regression model and power for each categorized group and overall power of a variable.

/********************************************************************/

%MACROCATEGORICAL_MODEL;

ODS OUTPUT PARAMETERESTIMATES=PARAM_&R CONVERGENCESTATUS=STATUS_&R TYPE3=TYPE3_&R.;

PROC LOGISTIC DATA=G&R.;

CLASS C1(PARAM=REF REF="0") X1_R(PARAM=REF REF="0") ;

MODEL Y(EVENT='1')= C1 X1_R X2 /TECH=NR MAXITER=8 XCONV=0.01;

BY STRATA;

RUN;

%MEND CATEGORICAL_MODEL;

%MACROCATEGORICALLR;

PROC RANK DATA= D&NITER. GROUPS=&R. OUT=G&R.;

VAR &CATEGORIZATION;

RANKS &CATEGORIZATION_R ;

BY STRATA ;

RUN;

%CATEGORICAL_MODEL;

PROC SORT DATA=PARAM_&R

OUT=P_&R (RENAME=( ESTIMATE=ESTIMATION STDERR=STANDARDERRORS));

BY STRATA;

RUN;

PROC SORT DATA=STATUS_&R OUT =S_&R(KEEP = STRATA STATUS) ;

BY STRATA;

RUN;

PROC SORT DATA=TYPE3_&R OUT =_TYPE3_&R;

BY STRATA;

RUN;

DATA _POWER_&R;

MERGE _TYPE3_&R S_&R;

BY STRATA;

RUN;

DATA POWER_&R E_&R;

SET _POWER_&R;

RENAME;

TYPE3_WALDCHISQ=WALDCHISQ;

TYPE3_PROBCHISQ=PROBCHISQ;

LABEL TYPE3_PROBCHISQ="P VALUE OF TYPE3";

TYPE3_WALDCHISQ="CHISQ OF TYPE 3";

IF 0 =< TYPE3_PROBCHISQ <&ALEVEL. THEN POWER=1;

ELSE IF TYPE3_PROBCHISQ >= &ALEVEL. THEN POWER=0;

IF STATUS=0 THEN OUTPUT POWER_&R;

ELSE OUTPUT E_&R;

KEEP POWER STATUS TYPE3_WALDCHISQ TYPE3_PROBCHISQ EFFECT;

RUN;

DATA _RESULT_CATEGORICAL_&R;

MERGE P_&R S_&R;

BY STRATA;

RUN;

DATA RESULT_CATEGORICAL_&R E_&R;

SET _RESULT_CATEGORICAL_&R;

IF CLASSVAL0=. THEN CLASSLEVEL=0;

ELSE CLASSLEVEL=CLASSVAL0;

IF 0=< PROBCHISQ<&ALEVEL. THEN GROUP_POWER=1;

ELSE IF PROBCHISQ>= &ALEVEL. THEN GROUP_POWER=0;

DROP CLASSVAL0;

IF STATUS=0 THEN OUTPUT RESULT_CATEGORICAL_&R;

ELSE OUTPUT E_&R;

RUN;

ODS HTML PATH="&PATH" BODY="GROUP_&R._&TABLE..XLS";

PROC FREQ DATA=POWER_&R;

TITLE 'POWER OF Β';

TABLE EFFECT*(POWER)/NOCOL NOPERCENT;

RUN;

PROC TABULATE DATA= RESULT_CATEGORICAL_&R;

CLASS VARIABLE CLASSLEVEL;

VAR ESTIMATION STANDARDERRORS;

TABLE VARIABLE*CLASSLEVEL,(ESTIMATION STANDARDERRORS ) *(N MEAN*F=8.4) /RTS=20 MISSTEXT = 'NO DATA';

RUN;

PROC TABULATE DATA= RESULT_CATEGORICAL_&R;

CLASS VARIABLE CLASSLEVEL GROUP_POWER;

TABLE VARIABLE*CLASSLEVEL,( GROUP_POWER )*(N ROWPCTN) /RTS=20 MISSTEXT = 'NO DATA';

RUN;

ODS HTML CLOSE;

%MEND CATEGORICALLR;

/********************************************************************/

Dataset generation

One dataset is created from each iteration of %COEFF, %CONTINUOUS, %HISTGRAM, %PDF, and %MERGE.

This dataset is accumulated until the iterations are complete and the iteration time is identified as strata.

/********************************************************************/

%MACRO DATASET(N_REPEAT=);

%COEFF;

%DO REPEAT=1%TO&N_REPEAT.;

%CONTINUOUS;

%HISTGRAM

%PDF;

%MERGE;

%END;

%OUTCOME;

%MEND DATASET;

/********************************************************************/

The parameter estimations and statistical power calculation.

/********************************************************************/

%MACRO LR(R=);

%DO R=1%TO&R;

%IF&R=1%THEN%DO;

%CONTINUOUSLR;

%END;

%ELSE%DO;

%CATEGORICALLR;

%END;

%END;QUIT;

%MEND LR;