Profile Likelihood Tests for Common Risk Ratios in Meta-Analysis Studies

It is well known that the power of Cochran’s Q test to assess the presence of heterogeneity among treatment effects in a clinical meta-analysis is low because of the small number of studies combined. Two modified tests (PL1, PL2) are proposed by substituting the profile maximum likelihood estimator (PMLE) into the variance formula of the logarithm of the risk ratio in the standard chi-square test statistic for testing the null hypothesis of a common risk ratio across all k studies ($i = 1, \ldots, k$). A simple naive test (SIM) is included as a further comparative candidate. The performance of the tests, in terms of type I error rate under the null hypothesis and power under the random effects hypothesis, was evaluated via a simulation plan with various combinations of significance levels, numbers of studies, sample sizes in the treatment and control arms, and true risk ratios as the effect sizes of interest. The results indicated that for moderate to large study sizes ($k \ge 16$) in combination with moderate to large sample sizes ($n_i^T, n_i^C \ge 50$), three tests (PL1, PL2, and Q) could control type I error rates in almost all situations. The two proposed tests (PL1, PL2) performed best, with the highest power, when $k \ge 16$ and sample sizes were moderate ($n_i^T, n_i^C = 50, 100$); they are therefore recommended for use in such practical situations. Meanwhile, the standard Q test performed best when $k \ge 16$ and sample sizes were large ($n_i^T, n_i^C \ge 500$). Moreover, no test was reasonable for small sample sizes ($n_i^T, n_i^C \le 10$), regardless of study size k. The simple naive test (SIM) is recommended when k = 4 in combination with large sample sizes ($n_i^T, n_i^C \ge 500$).


Introduction
In a clinical trial with binary outcomes, the risk ratio (RR) as an intervention effect is defined as the ratio of the probabilities (risks) of having an adverse event in a treatment group and in a control group [1] [2]. Let $x^T$ and $x^C$ be the numbers of events out of $n^T$ and $n^C$, the total numbers of persons (or the total person-time of exposure) in the treatment arm and the control arm, respectively. Then the maximum likelihood estimate of RR is obtained as $\widehat{RR} = \hat{p}^T / \hat{p}^C = (x^T / n^T) / (x^C / n^C)$ [3] [4].
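As a small illustration of the estimate above, the following sketch computes the risk ratio for a single study. The function name and the event counts are hypothetical and are not taken from the paper.

```python
def risk_ratio(x_t, n_t, x_c, n_c):
    """Maximum likelihood estimate of RR = (x_t / n_t) / (x_c / n_c).

    x_t, x_c: numbers of events in the treatment and control arms.
    n_t, n_c: total numbers of persons (or person-time) in each arm.
    """
    if n_t <= 0 or n_c <= 0:
        raise ValueError("arm sizes must be positive")
    if x_c == 0:
        raise ValueError("RR is undefined when the control arm has no events")
    return (x_t / n_t) / (x_c / n_c)


# Hypothetical counts: 15 events out of 100 treated, 30 out of 100 controls.
print(risk_ratio(15, 100, 30, 100))
```

A value below 1 indicates a lower estimated risk in the treatment arm than in the control arm.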
A meta-analysis of study size k is a statistical approach that combines the results from k studies, conducted on the same topic and with similar methods, into a single summary result. In clinical trials, meta-analysis is an essential tool for obtaining a better understanding of how well treatments work. The two most popular statistical models are the fixed effect model and the random effects model. Under the fixed effect model, we assume that all studies share a common effect size. This means that there is no heterogeneity between the studies: there is only one true effect size over all k independent trials, and the observed effect is determined by the common true effect plus the sampling error (within-study error). In contrast, under the random effects model, the true effect is not the same in all studies; we allow a distribution of true effect sizes. It follows that the combined estimate is not an estimate of a single value but rather of the mean of the distribution of true effects. Hence, there are two levels of error (within-study error and between-study error).
Consequently, the observed effect is determined by the mean of all true effects plus the within-study error and the between-study error. In this sense, heterogeneity refers to true effect sizes that vary from study to study, and one can incorporate this heterogeneity into a random effects model. Alternatively, heterogeneity in the effect sizes from different studies may be explained by a set of covariates, such as characteristics of the studies, type of treatment, average or aggregate characteristics of patients, or even publication bias; a meta-regression approach may therefore be used to account for the variation that such covariates induce among the heterogeneous effects.
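The two models described above can be written compactly as follows. The symbols are assumptions introduced here for illustration and are not the paper's notation: $\hat{\theta}_i$ is the observed effect in study $i$ (for example the log risk ratio), $\varepsilon_i$ is the within-study error with variance $\sigma_i^2$, and $\zeta_i$ is the between-study error with variance $\tau^2$.

```latex
% Fixed effect model: one common true effect \theta for all k studies
\hat{\theta}_i = \theta + \varepsilon_i,
\qquad \varepsilon_i \sim (0, \sigma_i^2),
\qquad i = 1, \ldots, k

% Random effects model: true effects are scattered around their mean \mu
\hat{\theta}_i = \mu + \zeta_i + \varepsilon_i,
\qquad \zeta_i \sim (0, \tau^2),
\qquad \varepsilon_i \sim (0, \sigma_i^2)
```

In this notation, homogeneity corresponds to $\tau^2 = 0$, in which case the random effects model reduces to the fixed effect model with $\theta = \mu$.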
Traditionally, before combining the effects of separate studies under either the fixed effect model (homogeneity) or the random effects model (heterogeneity), the conventional Cochran's Q test is used to test whether the treatment effects are homogeneous. Unfortunately, it is widely known that the standard Q test may be inaccurate for testing the null hypothesis of homogeneous effect sizes, in the sense that it has low power. Kulinskaya and Dollinger [5] and Boissel et al. [6] stated that Cochran's Q test had low power in most situations, especially when the number of studies (k) was small. The work of Kulinskaya, Dollinger, and Bjørkestøl [7], Lipsitz et al. [1], and Lui [2] also confirmed the low power of Cochran's Q test. Because of this low power, [8] recommended using a cut-off significance level of 0.1 rather than the usual 0.05, and this has become customary practice for the Cochran's Q homogeneity test in meta-analysis. Increasing the power is equivalent to reducing the chance of a type II error.
But this reduction in the chance of a type II error also increases the chance of a type I error. When the low power problem is mitigated by using a 10% cut-off as the significance criterion, a new problem arises: the type I error may no longer be maintained at the conventional level of significance. Additionally, Shadish and Haddock [9] stated that when the sample sizes in each study were very large, the null hypothesis of equal population effects might be rejected even if the individual effect estimates did not really differ much.
Profile likelihood estimation, as described by Ferrari et al. [10] and Böhning et al. [11], deals with the elimination of nuisance parameters. Generally, let the log-likelihood ( )

Motivational Applications
Two examples of meta-analysis are presented to illustrate the implementation of the related Q test and to demonstrate how to set the parame- of meta-analysis is created with the R package provided by Schwarzer et al. [13], http://meta-analysis-with-r.org/.
Mottillo et al. [14] considered data from a meta-analysis of 16 trials on the metabolic syndrome and cardiovascular risk. The value of Cochran's Q-test

Deriving Profile Likelihood Tests for Common Risk Ratio
The purposes of this study are 1) to derive profile likelihood tests for testing a null common risk ratio RR across k studies, which is equivalent to homogeneity of treatment effects over all k studies ($i = 1, \ldots, k$), and 2) to compare the tests based on the different formulas for the variance estimate of the logarithm of the risk ratio with the conventional Cochran's Q test for testing a null common risk ratio RR across k studies ($H_0: RR_i = RR$ for all $i$, against $H_1: RR_i$ has a specific distribution).
We followed the work and the notation of Böhning et al. [11] and further proposed some profile likelihood tests by modifying the standard $\chi^2$ test for homogeneity through various ways of estimating the variance of the logarithm of the risk ratio at the $i$-th study.

Profile Likelihood Estimator under a Fixed Effect Point for a Common Risk Ratio across Studies
The work of Böhning et al. [11] under the profile likelihood concept provides a fixed-effect point RR for all k studies ($i = 1, \ldots, k$), leading to the iterative processes of the profile maximum likelihood estimator

Some Tests Based on Various Formulas of Variance Estimate of Logarithmic RRi
We test the null hypothesis that the true relative risks ($RR_i$) are the same in all k centers/studies against the alternative that at least one of the effect sizes ($RR_i$) differs from the remainder.
Alternatively, it is reasonable to assume under the null hypothesis that the parameters of all centers to be combined are summarized by a single underlying population parameter, against the alternative that the parameters differ among centers and are random with a specific distribution. Our proposed tests are based on modifying a standard $\chi^2$ test for homogeneity in the following form:
2) Profile likelihood $\chi^2$ test (PL1): this has the same form as above but a different formula, because the variance is estimated under the null hypothesis as
3) Profile likelihood $\chi^2$ test (PL2): this is also obtained by using a different formula for the variance estimate, following [11] under the profile likelihood concept.

4) Cochran's Q test, the weighted sum of squares, is distributed as a chi-square statistic with k − 1 degrees of freedom under the null hypothesis of homogeneity of treatment effects across k studies, denoted as
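For orientation, Cochran's Q in its usual textbook form on the log risk ratio scale can be sketched as below. This is a minimal sketch and not necessarily the exact expression used by the authors: it assumes inverse-variance weights with the standard large-sample variance of the log risk ratio, $1/x^T - 1/n^T + 1/x^C - 1/n^C$, and the function names and study counts are hypothetical. The PL1 and PL2 tests described above would replace this variance estimate with profile likelihood based versions, which are not reproduced here.

```python
import math


def log_rr_and_variance(x_t, n_t, x_c, n_c):
    """Log risk ratio and its usual large-sample variance estimate.

    Assumes nonzero event counts and at least one arm with x < n,
    so that the logarithm and the weight 1 / variance are defined.
    """
    log_rr = math.log((x_t / n_t) / (x_c / n_c))
    variance = 1.0 / x_t - 1.0 / n_t + 1.0 / x_c - 1.0 / n_c
    return log_rr, variance


def cochran_q(studies):
    """Return (Q, df) for a list of (x_t, n_t, x_c, n_c) tuples.

    Q is the inverse-variance weighted sum of squared deviations of the
    study log risk ratios from their weighted mean; df = k - 1.
    """
    effects, weights = [], []
    for study in studies:
        log_rr, variance = log_rr_and_variance(*study)
        effects.append(log_rr)
        weights.append(1.0 / variance)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, len(studies) - 1


# Hypothetical data for k = 4 studies: (x_t, n_t, x_c, n_c).
example = [(12, 100, 20, 100), (30, 250, 45, 240), (8, 60, 15, 65), (25, 150, 24, 150)]
q, df = cochran_q(example)
print(f"Q = {q:.3f} on {df} degrees of freedom")
```

The statistic is compared with the upper critical value of the chi-square distribution with k − 1 degrees of freedom; homogeneity is rejected when Q exceeds it.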

Monte Carlo Simulation
We perform two simulation plans. The first concerns the type I error for testing a null common risk ratio, RR, over all k studies, in other words for testing the null hypothesis of homogeneity. The second compares the tests to identify the one with the highest power, after all test statistics have been shown to keep the empirical type I error within the same limit range. The type I error of each test is assessed by comparing the actual (estimated) type I error ($\hat{\alpha}$) with the nominal level of significance ($\alpha$). The departure of the estimated type I error from the nominal level of significance must not exceed a precise limit. In this study, the evaluation for two-sided tests in terms of the probability is based on the Bradley limit [15]   where $U_m$ is uniform over (−mm, mm) for a given mm = 0.2, 0.4, 0.6, and $U$ is uniform over (0, 1). Baseline risks
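The first simulation plan can be sketched for the standard Q test as follows. This is only an illustrative sketch under stated assumptions and not the authors' simulation code: the baseline risk, sample sizes, and number of replications are hypothetical; Bradley's liberal criterion ($0.5\alpha \le \hat{\alpha} \le 1.5\alpha$) is assumed for the acceptance range, since the exact limit used in the paper is not shown here; and adding 0.5 to each cell of the 2×2 table when the statistic would otherwise be undefined is an assumed guard.

```python
import math
import random

# Upper 5% critical value of the chi-square distribution with 3 degrees of
# freedom (k - 1 for k = 4 studies), taken from standard tables.
CHI2_CRIT_DF3 = 7.815


def cochran_q(studies):
    """Cochran's Q on the log risk ratio scale for (x_t, n_t, x_c, n_c) tuples."""
    effects, weights = [], []
    for x_t, n_t, x_c, n_c in studies:
        if x_t == 0 or x_c == 0 or (x_t == n_t and x_c == n_c):
            # Assumed guard: add 0.5 to every cell of the 2x2 table.
            x_t, x_c, n_t, n_c = x_t + 0.5, x_c + 0.5, n_t + 1.0, n_c + 1.0
        effects.append(math.log((x_t / n_t) / (x_c / n_c)))
        weights.append(1.0 / (1.0 / x_t - 1.0 / n_t + 1.0 / x_c - 1.0 / n_c))
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def binomial(rng, n, p):
    """Draw one Binomial(n, p) count as a sum of Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def empirical_type1(k, n_t, n_c, p_c, rr, crit, reps, seed):
    """Share of replications rejecting homogeneity when all studies share rr."""
    rng = random.Random(seed)
    p_t = min(1.0, p_c * rr)  # treatment risk implied by the common RR
    rejections = 0
    for _ in range(reps):
        studies = [(binomial(rng, n_t, p_t), n_t, binomial(rng, n_c, p_c), n_c)
                   for _ in range(k)]
        if cochran_q(studies) > crit:
            rejections += 1
    return rejections / reps


def within_bradley_liberal(alpha_hat, alpha):
    """Bradley's liberal criterion: 0.5 * alpha <= alpha_hat <= 1.5 * alpha."""
    return 0.5 * alpha <= alpha_hat <= 1.5 * alpha


if __name__ == "__main__":
    alpha_hat = empirical_type1(k=4, n_t=100, n_c=100, p_c=0.3, rr=1.0,
                                crit=CHI2_CRIT_DF3, reps=2000, seed=1)
    print(alpha_hat, within_bradley_liberal(alpha_hat, 0.05))
```

A test is carried forward to the power comparison only if its estimated type I error falls inside the chosen range; other values of k require the critical value for k − 1 degrees of freedom.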

Results
Since it is difficult to present all of the results from the simulation study, we illustrate only some instances: the 0.05 level of significance, common true relative risk values of 1 and 2, and both equal and unequal sample sizes.

Comparing Powers of Tests
• The power comparisons are conducted only after all candidate tests have been shown to maintain the type I error within the same limit range.
• Table 2 shows that the PL1 and PL2 tests are best, with the highest power, when the study size is moderate to large ($k \ge 16$) and sample sizes are moderate ($n_i^T, n_i^C = 50, 100$), for every degree of variation (mm = 0.2, 0.4, 0.6) and for RR = 1, 2. In more detail, PL2 seems better than PL1, with higher power.
• When the study size is moderate to large ($k \ge 16$) and sample sizes are large ($n_i^T, n_i^C \ge 500$), the Q test is best, with the highest power, for every degree of variation (mm = 0.2, 0.4, 0.6) and for RR = 1, 2.
• When the number of studies is small (k = 4) in combination with large sample sizes ($n_i^T, n_i^C \ge 500$), the SIM test performs best, since it is the only test that meets the criterion of controlling the type I error.

Studying Type I Errors
• Table 3 considers RR = 2 and moderate to large study sizes ($k \ge 16$). The SIM test cannot control the type I error for every sample-size configuration.
• Table 4 highlights small study sizes only (k = 4). For small study sizes (k = 4), the SIM test seems to control the type I error at least when one of the treatment-group sample sizes is large. Both the PL1 and PL2 tests can control the type I error when one of the treatment-group sample sizes is small. The Q test can rarely control the type I error at any sample size when the study size is small.

Studying Power of Tests
• Table 5 indicates that for moderate to large study sizes ($k \ge 16$) in combination with moderate sample sizes ($n_i^T = 50$, $n_i^C = 100$), the two proposed tests (PL1, PL2) perform best and are quite close together.
• For moderate to large study sizes ($k \ge 16$) in combination with at least one treatment arm having a large sample size ($n_i^T = 50$, $n_i^C = 500$), the Q test seems to perform best, with the highest power, followed by the PL2 and PL1 tests.

Discussion
In this study, we focus on comparing the performance of four statistical tests: the simple naive test (SIM), the conventional null approach of profile likelihood (PL1), the full profile likelihood approach (PL2), and Cochran's Q test.
• [16]; they stated that estimating between-study heterogeneity in meta-analysis of a small number of sample sizes ( ,
• The work of Willis and Riley [17] also confirmed that the Q test is a good test when the number of studies is large (50 studies or more), but that it has low power for fewer studies.
• We have attempted to propose new or modified tests to bridge the gaps left by the limitations of the Q test. This paper shows how to use two proposed tests (PL1, PL2), based on substituting profile maximum likelihood estimates into the different variance formulas, to obtain modified standard chi-square tests of heterogeneity.
• Our profile likelihood tests (PL1 and PL2) outperform the Q test for moderate to large study sizes, with higher power, after all tests have been held to the same range of type I error limits.
• The work of Bagheri, Ayatollahi and Jafari [18] and Viechtbauer [19] also evaluated the influence of the number of centers (k) and the sample sizes ($n_i^T, n_i^C$) on the type I error and the power for testing the null hypothesis of homogeneity in some situations. This means that investigators should continue their attempts to find new or modified tests.
• In contrast, although the two proposed tests (PL1, PL2) perform well in the above situations, they cannot outperform the Q test when the number of studies is moderate to large ($k \ge 16$) in combination with large sample sizes ($n_i^T, n_i^C \ge 500$). Additionally, in unbalanced cases, for moderate to large study sizes ($k \ge 16$) in combination with one moderate and one large sample size ($n_i^T = 50$, $n_i^C = 500$), the Q test performs best, with the highest power, followed by the PL2 and PL1 tests.

Conclusion
In summary, the idea of substituting profile likelihood estimates into the variance formulas of the logarithm of the relative risks works well when

Recommendation
Two proposed tests (PL1, PL2) based on substituting profile maximum likelihood estimates ...
In contrast, although the two proposed tests (PL1, PL2) perform well, with high power, in the above situations, they cannot outperform the Q test when the number of studies is moderate to large ($k \ge 16$) in combination with large sample sizes ($n_i^T, n_i^C \ge 500$), in both balanced and unbalanced cases. This result leads to the suggestion to use the Q test in these situations. Further investigation is needed to find a new appropriate test that fills the gaps left by the low power of the Q test in such situations.