Mixed-Effects Parametric Proportional Hazard Model with Generalized Log-Logistic Baseline Distribution

Abstract

Clustered survival data are widely observed in a variety of setting. Most survival models incorporate clustering and grouping of data accounting for between-cluster variability that creates correlation in order to prevent underestimate of the standard errors of the parameter estimators but do not include random effects. In this study, we developed a mixed-effect parametric proportional hazard (MEPPH) model with a generalized log-logistic distribution baseline. The parameters of the model were estimated by the application of the maximum likelihood estimation technique with an iterative optimization procedure (quasi-Newton Raphson). The developed MEPPH models performance was evaluated using Monte Carlo simulation. The Leukemia dataset with right-censored data was used to demonstrate the model’s applicability. The results revealed that all covariates, except age in PH models, were significant in all considered distributions. Age and Townsend score were significant when the GLL distribution was used in MEPPH, while sex, age and Townsend score were significant in MEPPH model when other distributions were used. Based on information criteria values, the Generalized Log-Logistic Mixed-Effects Parametric Proportional Hazard model (GLL-MEPPH) outperformed other models.

Share and Cite:

Peter, M. , Mwalili, S. , Wanjoya, A. and Muse, A. (2023) Mixed-Effects Parametric Proportional Hazard Model with Generalized Log-Logistic Baseline Distribution. Journal of Data Analysis and Information Processing, 11, 81-102. doi: 10.4236/jdaip.2023.112006.

1. Introduction

Many different contexts have observed clustered survival data. Data analysis from various medical studies or belonging to various clusters identified by geographical or administrative regions are common today. Any survival model must consider the clustering and grouping of data accounting for between-cluster variability that creates this correlation to avoid underestimating the parameter estimators’ standard deviations. In hazard regression, neglecting unobserved heterogeneity caused by clustering impacts model specification since the marginal model produced by integrating out random effects might inconsistent with the model neglecting the random clustering [1] [2] . As a result, the hazard function and effects of covariates should be carefully interpreted without considering the possibility of unobserved covariates.

Fitting a stratified survival model that includes a frailty factor is a straightforward technique to explain clustered survival data [3] . One can more explicitly model the clustering effect by using random effects, which provide for unobserved variation or stratification between clusters and capture the hierarchical nature of the data. Mixed-effect models are a subset of regression models, particularly survival models, which include random effects. The most popular modeling method in survival analysis (Cox proportional hazards regression) [4] , requires no specification of a central piece of its formulation, the baseline hazard function h 0 ( t i j ) . This freedom from parameterizing the baseline hazard function is one of the main reasons that proportional hazards regression is widely used, however, a fully parametric proportional hazard (PH) model needs a distribution assumption [5] .

Unfortunately, hierarchical modeling methods typically require such parameterization. With or without making the assumption a probability distribution for survival times, regression models play a crucial role in survival analysis. Distributional presumptions on survival times result to parametric models, while distribution-free assumptions result to semi-parametric cox or non-parametric methods [6] .

A universal proportional hazards model with random effects was proposed by [7] [8] [9] for managing clustered survival data, similar to how random effects are treated in linear, generalized linear, and non-linear mixed models. The most prevalent extensions of popular regression survival models for time-to-event data are proportional regression models [10] [11] . By including random effects or Gaussian processes to the linear predictor function in hazard-based regression models, a similar class of mixed-effects models is created [12] [13] .

A Bayesian semi-parametric temporally stratified PH model with spatial flaws was created by [14] . County-level frailties were used to introduce the geographical correlation, while the proportional hazard model’s segmentation was used to introduce the temporal effect. A Bayesian semiparametric method to the enhanced hazard model was developed by [15] , with generalization to wide spatially clustered data. The normal transformation approach made some allowance for a geographical correlation at the county level. They expanded parametric frailty models, such as Gompertz, Weibull and exponential proportional hazards (PH) models, as well as log-normal, generalized gamma and log-logistic, accelerated failure time models, in the study cited in [16] so that any number of normally distributed random effects could be used to look into baseline risk and covariate effects heterogeneity. To examine data on loblolly pine survival, [17] suggested a semi-parametric PH model with time-dependent covariates.

In a Bayesian semi-parametric framework, [18] created a complete, unified method for modeling spatial survival data that has been arbitrarily censored by adding random effects to models of proportional hazards, proportional odds, and accelerated failure time. Incorporating random effects into the linear predictor of various fixed effects hazard-based regression models in order to capture the clustering effect is what distinguishes these model classes. In an unique mixed-effects general hazard (MEGH) structure which incorporates random effects at both the hazard scale and the temporal scale, [19] established a hazard-based regression model. They considered the Power Generalized Weibul (PGW) and Log-Logistic (LL) as baseline hazard distributions.

Despite all the various advancements in incorporating random effects to regression models, especially survival models. Majority of studies have focused on non-parametric and semi-parametric models that are closed on classical distributions within the proportional hazards (PH) framework, such as Gompertz, Weibull and exponential distributions as well as gamma, log-normal, and log-logistic accelerated failure time (AFT) models. Only a few of them are closed under the proportional hazard model, and none of them is highly adaptable enough to reflect a variety of survival data [20] . These models aren’t sufficiently flexible to capture non-monotone hazard rate functions. As a result, parametric survival models that can accommodate both non-monotone and monotone hazard rate functions are needed for clustered survival analysis. By parametrically modeling the baseline hazard function to calculate absolute risk metrics, this is immediately achievable.

There are several applications for the Log-Logistic (LL) and log-normal distributions in the study of survival data. They are members of the accelerated failure time family and are useful in modeling uni-modal survival data sets. Weibull hazard function is monotonic and closed under both AFT and PH at the same time. The cumulative hazard function (chf) and failure rate function of the log-logistic distribution, however, have desirable quality belonging to all classes of parametric hazard-based regression models of the survival analysis when they are generalized. A generalized log-logistic was proposed by [20] as an expansion of the closed log-logistic model that takes the PH connection into account. In the limit, it approaches the Weibull and displays log-logistic traits. With applications to right-censored healthcare data sets, [21] proposed a Bayesian inference for the generalized log-logistic proportional hazard model.

In this study, we develop a mixed effects parametric proportional hazard model with generalized log-logistic as baseline distribution that provides a reliable and flexible method for analyzing non-normal, skewed, and clustered survival data and takes into account unobserved variability or heterogeneity between clusters to account for both non-monotone and monotone hazard rate functions. The model’s parameters are estimated via Maximum Likelihood Estimation (MLE).

The remainder of the paper is structured as follows: Section 2, outlines the conceptualization of the mixed-effects parametric proportional hazard model (MEPPH) and PH model and discusses the baseline distribution and its sub-models. In Section 3, we provide a general framework for the likelihood function. Section 4 discusses a comprehensive simulation exercise to illustrate the model’s performance and demonstrate how to simulate using the specified MEPPH. In Section 5, we apply real data to the time-to-event settings to demonstrate the developed model. Finally, in Section 6, we briefly discuss the key findings for this study before offering some suggestions for further research and drawing a conclusion.

2. Model Formulation

We describe the formulation of the developed MEPPH model and the PH model, which is referred to as a subclass of MEPPH, in this segment. Instead of the cumulative distribution function (CDF) and probability density function (PDF), the most broadly used parametric hazard-based regression models are frequently interpreted using hazard rate function (hrf) and cumulative hazard function (chf).

2.1. Mixed Effects Parametric Proportional Hazard (MEPPH)

In survival analysis, the proportional hazard (PH) model is frequently employed. The baseline hazard distribution is left parametric and a parametric PH model with time-dependent variables is used in this study. To study the spatial heterogeneity in patient survival times, we additionally include dimensional random effects in the survival model.

Let S i j , where i = 1 , 2 , , m represent the cluster and j = 1 , 2 , 3 , , n i symbolize the individual, be a sample of time-to-event interest and x i j = vector of covariates connected with the ijth individual. Let C i j be the right censoring time and t i j = M i n ( S i j , C i j ) be the observed survival time. Let δ i j = 1 ( S i j < C i j )

be the vital status indicator. Let N = i = 1 m n i be the sample size across m clusters.

We develop a mixed-effects parametric proportional hazard model, which is defined via the individual hazard function such that;

h ( t i j | x i j , b i ) = h 0 ( t i j ) exp { x i j Τ α + b i } , (2.1)

where h 0 ( . ) is a baseline hazard, α = ( α 1 , α 2 , α 3 , , α m ) Τ and b i ~ i i d N ( 0, σ b 2 ) is the random effects. The cumulative hazard rate function (chf) conforming to the hazard rate function (hrf) in Equation (2.1) is stated as follows:

H MEPPH ( t i j | x i j , b i ) = 0 t i j h ( u ) d u = 0 t i j f ( u ) S ( u ) d u = 0 t i j d S ( u ) S ( u ) d u = log e { S ( t i j ) } = log e { [ S 0 ( t i j ) ] exp ( x i j Τ α + b i ) } (2.2)

Here, we take the link function exp ( x i j Τ α + b i ) to be equivalent to exp ( η i j ) = ψ ( x ) . Some of the other frequent lifetime functions describing the developed MEPPH model include;

1) The Survival function of the MEPPH. Considering the following expression, the survival function of the MEPPH is defined as;

S MEPPH ( t i j | x i j , b i ) = exp { H ( t i j | x i j , b i ) } = exp { 0 t i j ψ ( x ) h 0 ( t i j ) d t i j } = exp { ψ ( x ) 0 t i j h 0 ( t i j ) d t i j } = [ exp { 0 t i j h 0 ( t i j ) d t i j } ] ψ ( x ) = [ S 0 ( t i j ) ] ψ ( x ) = [ S 0 ( t i j ) ] exp ( x i j Τ α + b i ) (2.3)

2) The cumulative distribution function (cdf) of the MEPPH model, also referred as the lifetime distribution function, is denoted by

F MEPPH ( t i j | x i j , b i ) = 1 S ( t i j ) = 1 exp { H ( t i j ) } = 1 [ S 0 ( t i j ) ] ψ ( x ) = 1 [ S 0 ( t i j ) ] exp ( x i j Τ α + b i ) (2.4)

3) The probability density function (pdf) of the MEPPH is calculated as follows;

f ( t i j | x i j , b i ) = h ( t i j | x i j , b i ) S ( t i j | x i j , b i ) = h ( t i j | x i j , b i ) exp { H ( t i j | x i j , b i ) }

Therefore, the pdf becomes

f MEPPH ( t i j | x i j , b i ) = h 0 ( t i j ) ψ ( x ) [ S 0 ( t i j ) ] ψ ( x ) = f 0 ( t i j ) S 0 ( t i j ) exp ( x i j Τ α + b i ) [ S 0 ( t i j ) ] exp ( x i j Τ α + b i ) = f 0 ( t i j ) exp ( x i j Τ α + b i ) [ S 0 ( t i j ) ] exp ( x i j Τ α + b i ) 1 (2.5)

2.2. Proportional Hazard (PH) Model

The proportional hazards assumption states that a covariate will have an effect on the hazard by a proportionate amount that does not depend on t. By using the link function without the random effect so that b i = 0 and the hazard function with the covariate vector x (fixed/time dependent), the developed model (MEPPH) simplifies to PH;

h ( t ; x ) = h 0 ( t ) ψ ( x Τ α ) = h 0 ( t ) exp ( x Τ α ) = h 0 ( t ) e x Τ α . (2.6)

It is clear from Equation (2.6) that the hazard ratio comparing any two formulations of the covariates, say x and x * , is

h ( t ; x ) h ( t ; x * ) = exp [ ( x x * ) α ] (2.7)

which remains over time. In other words, the risk for one person is inversely correlated with the risk for any other person, with the proportionality constant being independent of time. Other important lifetime functions of PH are also given as follows:

1) Survival function, which is given as

S PH ( t ) = [ S 0 ( t ) ] exp ( x Τ α ) (2.8)

2) The lifespan distribution, sometimes referred to as the cumulative density fuction of the PH model, has the function denoted by

F PH ( t ) = 1 S ( t ) = 1 exp { H ( t ) } = 1 [ S 0 ( t ) ] exp ( x Τ α ) (2.9)

3) The probability density function (pdf) of the PH is calculated as;

f PH ( t ) = h 0 ( t ) ψ ( x ) [ S 0 ( t ) ] ψ ( x ) = f 0 ( t ) S 0 ( t ) exp ( x Τ α ) [ S 0 ( t ) ] exp ( x Τ α ) = f 0 ( t ) exp ( x Τ α ) [ S 0 ( t ) ] exp ( x Τ α ) 1 (2.10)

By assuming that the baseline hazard function h 0 ( t ) is random and undefined, Equation (2.6) can be used to create a PH model. As a result, the popular Cox PH model [4] is obtained, which does not depend on every distributional assumptions for the dependent variables. It is also possible to construct a fully parametric PH model by presuming that h 0 ( t ) has a parametric form.

2.3. Baseline Distribution

The hazard shapes that the survival model (2.1) can apprehend depend on the choice of the parametric baseline distribution. For instance, the Weibull hazard function can increase or decrease and is closed both under AFT and PH, whereas log-normal and log-logistic hazard functions are uni-modal and not closed under PH. Other (three-parameter) distributions, such as the generalized log-logistic, Generalized Weibull (GW), Power Generalized Weibull (PGW), Exponentiated Weibull (EW) and Generalized Gamma (GG) distributions, can represent the fundamental shapes of the hazard (growing, decreasing, uni-modal, and bathtub). We considered Generalized Log-Logistic (GLL) as the parametric baseline hazard distribution based on its numerical tractability and flexibility, which is closed under PH. The generalized log-logistic distribution is a three-parameter continuous probability distribution with three-shape parameters ( β , κ , γ ), is sufficiently resilient and competent of supporting a range of rate of failure functions. This is one of the characteristics that set the GLL apart from the two-parameter log-logistic distribution. The GLL also shows characteristics that are comparable to those of the log-logistic distribution. The generalized distribution additionally has the benefit of having a limit that is close to the desired Weibull distribution, which is a desirable characteristic. This model was created up by [20] and is suitable for including various hrf forms, including constant, monotonic (increasing and decreasing), and non-monotonic (unimodal and bathtup).

Assuming that the random variable T has a positive sign, probability density function (PDF) and cumulative distribution function (CDF) of the three-parameter generalized log-logistic distribution are given respectively as follows:

f ( t ; x ) = β κ ( κ t ) β 1 [ 1 + ( γ t ) β ] κ β γ β + 1 , t 0, κ , β , γ > 0, (2.11)

F ( t ; x ) = [ 1 + ( γ t ) β ] κ β γ β 1 [ 1 + ( γ t ) β ] κ β γ β , t 0 , κ , β , γ > 0 , (2. 12)

where β > 0 ; κ > 0 ; γ > 0 are unknowable distributional parameters and x = ( β , κ , γ ) .

Some other important functions which are obtained from the GLL are:

1) Survival function S ( t ; x ) .

S ( t ; x ) = 1 F ( t ; x ) = [ 1 + ( γ t ) β ] κ β γ β , t 0 , κ , β , γ > 0. (2.13)

2) The hazard rate function (hrf).

The hazard rate function of the GLL model is given by:

h ( t ; x ) = f ( t ; x ) S ( t ; x ) = β κ ( κ t ) β 1 1 + ( γ t ) β , t 0 , κ , β , γ > 0. (2.14)

3) The reversed hazard rate function r ( t ; x ) which is given by:

r ( t ; x ) = f ( t ; x ) F ( t ; x ) = β κ ( κ t ) β 1 [ 1 + ( γ t ) β ] κ β γ β + 1 [ 1 + ( γ t ) β ] , t > 0 , κ , β , γ > 0. (2.15)

4) The cumulative hazard rate function of GLL model H ( t ; x ) is calculated as follows:

H ( t ; x ) = 0 t h ( u ) d u = 0 t f ( u ) S ( u ) d u = 0 t d S ( u ) S ( u ) d u = log { S ( t ; x ) } = κ β γ β log [ 1 + ( γ t ) β ] , t 0, κ , β , γ > 0. (2.16)

Figure 1. Different hazard rate forms of the GLL distribution with varying parameter values are represented visually.

The GLL distribution’s potential hazard rate shapes are shown in Figure 1 for various parameter values. According to [21] , it can capture the constant, monotone, and non-monotone shape.

5) The Quantile function of the GLL (which is the inverse of the CDF) which crucial in statistical and quantitative data analysis is given by:

1 [ 1 + ( γ t ) β ] ( k β / γ β ) = p T q = F 1 ( q ; κ , β , γ ) = [ 1 + ( γ t ) β ] ( κ β / γ β ) = 1 p = 1 [ 1 + ( γ t ) β ] κ β / γ β = 1 p = [ 1 + ( γ t ) β ] κ β / γ β = 1 1 p

= 1 + ( γ t ) β = ( 1 1 p ) γ β / κ β = ( γ t ) β = ( 1 1 p ) γ β / κ β 1 = γ t = { [ 1 1 p ] γ β / κ β 1 } 1 / β q = { [ 1 / ( 1 p ) ] γ β / κ β 1 } 1 / β γ (2.17)

The upper quartile, the median and the lower Quartile are obtained by replacing P by the values 3/4, 1/2 and 3/4 in the quartile function respectively:

Hence, the upper quaetile is:

T q 3 = { 4 γ β / κ β 1 } 1 / x γ , (2.18)

The median is:

T q 2 = Median = { 2 γ β / κ β 1 } 1 / β γ , (2.19)

And the lower quartile is

T q 1 = { [ 4 / 3 ] γ β / κ β 1 } 1 / β γ . (2.20)

2.4. Sub Models

Other distributions which can be obtained from the baseline hazard distribution include :

1) Log-logistic

The hazard rate function form of a log-logistic is obtained by substituting κ = γ in Equation (2.14) for the two parameters of ( κ , β ) in the log-logistic distribution.

The GLL hrf is

h ( t ; x ) = β κ ( κ t ) β 1 1 + ( γ t ) β , t 0, κ , β , γ > 0.

Equating κ = γ , we get:

h ( t ; x ) = β k ( k t ) β 1 1 + ( k t ) β , t 0, k , β > 0. (2.21)

2) Weibull Distribution

It is obtained by taking γ β 0 of Equation (2.14).

The hazard rate function of Equation (2.14) is given by

h ( t ; x ) = β κ ( κ t ) β 1 1 + ( γ t ) β , t 0, κ , β , γ > 0.

h ( t ; x ) = β k ( k t ) β 1 , t 0, k , β > 0. (2.22)

3) Equation (2.14) simplifies to the hrf of standard log logistic (SLL), which is obtained by substituting κ = γ = 1 in the equation.

h ( t ; x ) = β t β 1 1 + t β , t 0, β > 0. (2.23)

4) Two-parameter Burr-X11 (BX11-2) distribution obtained by replacing γ = 1 in Equation (2.14) simplifies to the hrf of the BX11-2 distribution, which is

h ( t ; x ) = β κ ( κ t ) β 1 1 + t β , t 0, κ , β > 0. (2.24)

2.5. Developed Model

The generalized log-logistic distribution, which is parametric and closed under proportional hazard (PH), can be used to replace the baseline hazard in Equation (2.1) and capture good hazard rate shapes, such as constant, monotone (growing and decreasing), and non-monotone rates (unimodal and bathtup). As a result, it has been discovered that when applied to clustered survival data, parametric models perform better than their semiparametric counterparts [22] . Therefore, in order to estimate these distributional parameters effectively, higher sample sizes and greater censoring rates are often needed [19] [23] [24] [25] . The developed model has a hazard, and cumulative hazard functions, respectively expressed as follows

h ( t i j | x i j , b i ) = β κ ( κ t i j ) β 1 1 + ( γ t i j ) β exp { x i j Τ α + b i } , (2.25)

H ( t i j | x i j , b i ) = κ β γ β log [ 1 + ( γ t i j ) β ] exp { x i j Τ α + b i } . (2.26)

3. Maximum Likelihood Estimation (MLE)

Here, we discuss the conditional and marginal likelihood functions utilized by the marginal maximum likelihood estimators to calculate the parameters of the suggested model, which maximizes the likelihood of the observed data. For the purpose of characterizing the likelihood for the ith cluster under the MEPPH, we will use the variables t i = ( t i 1 , , t i n ) and x i = ( x i 1 , , x i n ) to represent the sample and the corresponding matrix of covariates, respectively. After integrating out the random effects b i , the cluster-specific marginal probability function related to the ith cluster is provided by

L i ( θ , η , b i ) = { j = 1 n i f ( t i j , δ i j | x i j , u i , θ ) } f ( b i | θ ) d b i = { j = 1 n i [ h ( t i j | θ , x i j , b i ) S ( t i j | θ , x i j , b i ) ] δ i j [ S ( t i j | θ , x i j , b i ) ] 1 δ i j } f ( b i | θ ) d b i = j = 1 n i [ h ( t i j | θ , x i j , b i ) ] δ i j S ( t i j | θ , x i j , b i ) f ( b i | θ ) d b i = j = 1 n i [ h ( t i j | θ , x i j , b i ) ] δ i j exp [ H ( t i j | θ , x i j , b i ) ] f ( b i | θ ) d b i = j = 1 n i [ h ( t i j | θ , x i j , b i ) exp ( η i j ) ] δ i j exp [ [ H ( t i j | θ , x i j , b i ) ] exp ( η i j ) ] f ( b i | θ ) d b i (3.1)

where θ i = vector of distributional parameter with the baseline hazard and η = ( t i j , σ i j , x i j ) denotes the observed data, such that t i j is survival time, σ i j is censoring time and x i j is covariates. The random effects are thought to follow a multivariate normal distribution, such that;

f ( u i | θ ) = ( 2 π | Σ | ) G / 2 exp ( u i Σ 1 u i 2 ) (3.2)

with variance-covariance matrix Σ and number of random effects G. Equation (3.1) is analytically intractable due to integrating over the random effects; hence it requires a numerical technique to evaluate. An iterative optimization process (Newton-Raphson algorithm) was used to obtain the maximum likelihood estimation θ ^ of θ . Hence, the log-likelihood function conditional on the random effects b i is given by

log ( L i ( θ , η , b i ) ) = log ( j = 1 n i [ h ( t i j | θ , x i j , b i ) exp ( η i j ) ] δ i j exp [ [ H ( t i j | θ , x i j , b i ) ] exp ( η i j ) ] f ( b i | θ ) ) = i = j n i δ i j log [ h ( t i j | θ , x i j , b i ) + η i j ] i = j n i ( [ H ( t i j | θ , x i j , b i ) exp ( η i j ) ] + log ( f ( b i | θ ) ) ) (3.3)

4. Simulation Study

This section contains a full simulation study that shows how effective the developed MEPPH is. We simulated individuals survival time using the PH model. The Generalized log-logistics (GLL) distribution was taken as the baseline hazard’s distribution. Four covariates were considered in the simulation study; x1 = sex, x2 = age, x3 = white blood cell count, and x4 = Townsend score were present. Three continuous covariates (x2, x3 and x4) were generated using the standard normal distribution, while one binary covariates x1 was generated using the Bernoulli (0.5) distribution. Our assumption was that cluster random effect b i have a normal distribution with a mean of 0 and three standard deviations σ b = ( 0.2 , 0.5 , 1.2 ) . For the GLL hazard function, we selected κ = 1.0 , β = 1.5 , and

λ = 1.0 , h ( t ; x ) = β κ ( κ t ) β 1 1 + ( γ t ) β and X . In accordance with estimates produced

from the acute myeloid leukemia data application, as seen in Section 5, X = ( 0.7,1.4,1.0,1.5 ) were selected. Using the reversed transform technique, the Exponentiated Weibull (EW) distribution was used to simulate lifetime data from the PH model framework [26] .

The following are steps for executing the developed MEPPH model:

1) Set the initial values of the model’s parameters.

2) Utilize the reversed transform technique to create the lifetime data by inverting the cumulative hazard rate function of the developed model.

3) Utilize the various estimates to evaluate the estimations’ values.

4) Analyze the inferential properties of the estimates, taking into account the estimates, biases and mean square errors.

Using various scenarios, we first assessed MEPPH model’s statistical inference and parameter estimation accuracy in the simulations. We used the following formulas to calculate the Average Biases (ABs) and Mean squared Errors (MSEs).

B i a s θ [ θ ^ ] = E x | θ [ θ ^ ] θ = E x | θ [ θ ^ θ ] (4.1)

M S E θ [ θ ^ ] = E x | θ ( ( θ ^ θ ) 2 ) (4.2)

We also looked into the impact of disregarding random effects. In the simulation, various situations were considered for models that included random effects and those that did not (n = 250,500 and 1000). In particular, the parameter estimates, censoring rates on inference, the effect of sample size and the capacity to recover the baseline hazard shapes were all illustrated. A censoring rates of 25% for all scenarios was used and 250 replications was conducted in each scenario. In accordance with real data application in Section 5, we analyzed both the mixed hazard structure MEPPH and the hazard structure PH models in the simulations.

h ( t i j | x i j , b i ) = h 0 ( t i j ) exp { α 1 × s e x i j + α 2 × a g e i j + α 3 × w b c i j + α 4 × t p i i j + b i } , (4.3)

where the covariates x i j Τ = ( s e x i j , a g e i j , w b c i j , t p i i j ) are per the acute myeloid leukemia data in Section 5, and the baseline hazard we are focusing on in this article, h 0 ( ) , has the parameters η = ( κ , β , γ ) .

From the simulation results based on Tables 1-3 it can be seen that for both the GLL-PH and GLL-MEPPH models, the parameter estimations tend to true values and the MSEs approach zero as the number of samples grows. It is clear that both the GLL-MEPPH model with mixed hazard structures and the model neglecting random effects provide estimates with significantly less bias. Owing to the existence of random effects being disregarded, the bias of the model excluding them is higher for the majority of parameters. However, as stated in [19] , we pointed out that the estimates produced from models with or neglecting random effects are not always comparable. The model neglecting random effects is marginal, but the model including random effects is conditional on random effects. Additionally, it can be seen from the findings that most parameter estimates are low when the variance is high, or σ b = 1.2 , as opposed to when the variance is low, or σ b = 0.2 .

5. Real Data Application

We used the developed model to examine acute myeloid leukemia survival in northwest England. 1043 patients were documented between 1982-1998, making up the data. This data set was first analyzed by [27] and is available in the R package spBayesSurv [18] [19] [28] . Numerous earlier analyses of this data set have recommended that there is indication of variability in survival across in this region, as shown in the non-parametric Kaplan-Meier estimators of the survival curves by district shown in Figure 2. These studies include [18] [19] [27] and many more.

White blood cells at diagnosis, truncated at 500 (Wbc), survival time (in years), sex (1-Male,0-Female), the Townsend score (tpi), for which higher values indicate less affluent areas, age (in years), vital status at the end of the follow-up (1-dead, 0-right censored) and administrative districts of residence (24 districts) are all used. Figure 3 displays the TTT plots, box plot, violin plot and histogram for the data set. From the TTT plot and histogram, the hazard rate function seems to have a bathtub shape.

We fitted the MEPPH models with Generalized Log-Logistic (GLL), Power Generalized Weibull (PGW), Modified Kumaraswamy Weibull (MKW), and Weibull (W) baseline distributions and normal distribution for random effects. For all fitted models, the hazard effects x i j considered are sex, age, white blood cells and Townsend score. The random effects were also defined in terms of the variable administrative districts. Equation (5.1) contains the mixed model that was fitted to the data as provided below

h ( t i j | x i j , b i ) = h 0 ( t i j ) exp { α 1 × s e x i j + α 2 × a g e i j + α 3 × w b c i j + α 4 × t p i i j + b i } , (5.1)

where the subscripts j and i denote the individual level and the region (districts) level, respectively, and measure the random effects, which are districts. We also fitted models ignoring the random effects for comparison such that

Table 1. Simulation results for the GLL-PH and GLL-MEPPH models when σ b = 0.2 .

Table 2. Simulation results for the GLL-PH and GLL-MEPPH models when σ b = 0.5 .

Table 3. Simulation results for the GLL-PH and GLL-MEPPH models when σ b = 1.2 .

Figure 2. Data exploratory using Kaplan-Meier estimators of survival by district of acute myeloid leukemia data.

Figure 3. Plots for the survival time data of patients with acute myeloid leukemia that are non-parametric.

h ( t i j | x i j , b i ) = h 0 ( t i j ) exp { α 1 × s e x i j + α 2 × a g e i j + α 3 × w b c i j + α 4 × t p i i j } . (5.2)

The results in Table 4 show the leukemia data application on PH models for four different baseline distributions. It is clear that from the results, all covariates were significant in all PH models which was considered except age, whose p-value is greater than 0.05. However, in terms of models including the random effect as shown in Table 5, it is observed that age and tpi were significant when GLL-MEPPH model was used and had an influence on the survivor’s time which coincided with results obtained by [19] . This was difficult to be obtained from the PH models ignoring the random effect. However, using PGW-MEPPH, MKW-MEPPH and W-MEPPH models it was observed that sex, age and tpi were significant and had an influence to survivor’s time and white blood cells covariate had no effects.

The GLL-MEPPH, MKW-MEPPH and W-MEPPH variance component was significant with their p-value being less than 0.05, hence, GLL-MEPPH gave better results with the smallest standard error for variance component when compared to those other competing models. This finding implied that the random effect influences survival time and should be included in the model. This suggested that after taking into account the data on the observed covariates, there would be a large impact on between-cluster variability. Consequently,

Table 4. Hazard ratio, standard errors, p-values and 95% confidence intervals of fitting the GLL-PH, PGW-PH, W-PH and MKW-PH models without random effects.

Table 5. Results from the fitted mixed effects parametric proportional hazard models to Lukemia data.

survival gaps between different districts are explained by a combination of complex factors, including different age and Townsend distribution, a conclusion that could have been harder to obtain with fully non-parametric methods that require data stratification.

Model Comparison

The competitor models’ best-fitting features are found using specific analytical metrics. To choose the most suited ones, the values of the Hannan-Quinn Information Criterion (HQIC), Bayesian Information Criterion (BIC), Corrected Akaike Information Criterion (CAIC) and Akaike Information Criterion (AIC), were employed.

Table 6. Model comparison without random effect.

Table 7. Model comparison with random effect.

AIC = 2 m 2 log ( l ^ ) (5.3)

BIC = m ln ( n ) 2 log ( l ^ ) (5.4)

CAIC = 2 n m n m 1 2 log ( l ^ ) (5.5)

HQIC = 2 m ln ( ln ( n ) ) 2 log ( l ^ ) (5.6)

where n = number of observations, m = number of parameters, and l ^ = likelihood function’s maximum value. The AIC, BIC, CAIC, and HQIC values of the ideal model are the lowest. We observed that GLL-MEPPH and GLL-PH produced the best match and fitting when comparing all the models with and models neglecting random effects since they have the lowest values of the measured analytical tools. Additionally, as seen in Table 6 and Table 7, GLL-MEPPH offered better outcomes with the minimum value of AIC compared to the model ignoring random effects.

6. Conclusion

Despite the fact that parametric survival models are gaining popularity because it has been discovered that they outperform their semiparametric counterparts when applied to clustered survival data [22] . A generalized log-logistic, an expansion of the closed log-logistic model that takes the PH connection into account, was provided by [20] , in which the limit approaches the Weibull and displays log-logistic traits. The contribution for this paper is the addition of random effects to the framework used in [20] along with an adjustable parametric method that permits us to demonstrate asymptotic conclusions under usual regularity constraints. Our simulation studies showed that the developed models’ estimates, which are based on marginal likelihood, have good frequentist properties. Additionally, they demonstrated that ignoring random effects results in an underestimation of the standard error estimates, which leads to incorrect result interpretation of the covariates that affect survival times. The model was compared to other models using various baseline distributions, and the generalized log logistic mixed-effects proportional hazard model (GLL-MEPPH) performed better than the other models in comparison, according to the results of the information criterion.

Author Contribution

Conceptualization: MWP and SMM, methodology: MWP, SMM, AKW and AHM, software: MWP and AHM, validation MWP, SMM, AKW, data curation MWP and AHM, formula analysis MWP, SMM, AKW and AHM, resources: MWP, SMM, AKW and AHM, writing original draft preparation: MWP, SMM and AKW, writing review and editing MWP, supervision SMM and AKW, project administration MWP.

Data Availability

The real data used to illustrate the developed model is within the manuscript.

Acknowledgment

The corresponding author would like to express her appreciation to the Pan African University, Institute for Basic Sciences, Technology and Innovation (PAUSTI) for supporting this work.

Conflicts of Interest

All authors have read and agreed to the published version of the manuscript.

References

[1] Hougaard, P. (1995) Frailty Models for Survival Data. Lifetime Data Analysis, 1, 255-273.
https://doi.org/10.1007/BF00985760
[2] Omori, Y. and Johnson, R.A. (1993) The Influence of Random Effects on the Unconditional Hazard Rate and Survival Functions. Biometrika, 80, 910-914.
https://doi.org/10.1093/biomet/80.4.910
[3] Gutierrez, R.G. (2002) Parametric Frailty and Shared Frailty Survival Models. The Stata Journal, 2, 22-44.
https://doi.org/10.1177/1536867X0200200102
[4] Cox, D.R. (1972) Regression Models and Life-Tables. Journal of the Royal Statistical Society: Series B (Methodological), 34, 187-202.
https://doi.org/10.1111/j.2517-6161.1972.tb00899.x
[5] Klein, J.P. and Moeschberger, M.L. (2003) Survival Analysis: Techniques for Censored and Truncated Data, Volume 1230. Springer, Berlin.
https://doi.org/10.1007/b97377
[6] Kalbfleisch, J.D. and Prentice, R.L. (2002) The Statistical Analysis of Failure Time Data. John Wiley & Sons, Hoboken.
https://doi.org/10.1002/9781118032985
[7] Ripatti, S., Larsen, K. and Palmgren, J. (2002) Maximum Likelihood Inference for Multivariate Frailty Models Using an Automated Monte Carlo EM Algorithm. Lifetime Data Analysis, 8, 349-360.
https://doi.org/10.1023/A:1020566821163
[8] Ripatti, S. and Palmgren, J. (2000) Estimation of Multivariate Frailty Models Using Penalized Partial Likelihood. Biometrics, 56, 1016-1022.
https://doi.org/10.1111/j.0006-341X.2000.01016.x
[9] Vaida, F. and Xu, R.H. (2000) Proportional Hazards Model with Random Effects. Statistics in Medicine, 19, 3309-3324.
https://doi.org/10.1002/1097-0258(20001230)19:24<3309::AID-SIM825>3.0.CO;2-9
[10] Banerjee, S. and Carlin, B.P. (2003) Semiparametric Spatio-Temporal Frailty Modeling. Environmetrics: The Official Journal of the International Environmetrics Society, 14, 523-535.
https://doi.org/10.1002/env.613
[11] Xu, R.H. (2004) Proportional Hazards Mixed Models: A Review with Applications to Twin Models. Advances in Methodology and Statistics, 1, 205-212.
https://doi.org/10.51936/lmzi2020
[12] Austin, P.C. (2017) A Tutorial on Multilevel Survival Analysis: Methods, Models and Applications. International Statistical Review, 85, 185-203.
https://doi.org/10.1111/insr.12214
[13] Li, Yi and Ryan, L. (2002) Modeling Spatial Survival Data Using Semiparametric Frailty Models. Biometrics, 58, 287-297.
https://doi.org/10.1111/j.0006-341X.2002.00287.x
[14] Hanson, T.E., Jara, A. and Zhao, L.P. (2012) A Bayesian Semiparametric Temporally Stratified Proportional Hazards Model with Spatial Frailties. Bayesian Analysis, 7, 147-188.
https://doi.org/10.1214/12-BA705
[15] Li, L., Hanson, T. and Zhang, J.J. (2015) Spatial Extended Hazard Model with Application to Prostate Cancer Survival. Biometrics, 71, 313-322.
https://doi.org/10.1111/biom.12268
[16] Crowther, M.J., Look, M.P. and Riley, R.D. (2014) Multilevel Mixed Effects Parametric Survival Models Using Adaptive Gauss-Hermite Quadrature with Application to Recurrent Events and Individual Participant Data Meta-Analysis. Statistics in medicine, 33, 3844-3858.
https://doi.org/10.1002/sim.6191
[17] Li, J., Hong, Y.L., Thapa, R. and Burkhart, H.E. (2015) Survival Analysis of Loblolly Pine Trees with Spatially Correlated Random Effects. Journal of the American Statistical Association, 110, 486-502.
https://doi.org/10.1080/01621459.2014.995793
[18] Zhou, H.M. and Hanson, T. (2018) A Unified Framework for Fitting Bayesian Semiparametric Models to Arbitrarily Censored Survival Data, Including Spatially Referenced Data. Journal of the American Statistical Association, 113, 571-581.
https://doi.org/10.1080/01621459.2017.1356316
[19] Alvarez, F.J.R. and Drikvandi, R. (2022) MEGH: A Parametric Class of General Hazard Models for Clustered Survival Data. Statistical Methods in Medical Research.
[20] Khan, S.A. and Khosa, S.K. (2016) Generalized Log-Logistic Proportional Hazard Model with Applications in Survival Analysis. Journal of Statistical Distributions and Applications, 3, 1-18.
https://doi.org/10.1186/s40488-016-0054-z
[21] Muse, H.A., Ngesa, O., Mwalili, S., Alshanbari, H.M. and El-Bagoury, A.-A. (2022) A Flexible Bayesian Parametric Proportional Hazard Model: Simulation and Applications to Right-Censored Healthcare Data. Journal of Healthcare Engineering, 2022, Article ID: 2051642.
https://doi.org/10.1155/2022/2051642
[22] Prenen, L., Braekers, R. and Duchateau, L. (2017) Extending the Archimedean Copula Methodology to Model Multivariate Survival Data Grouped in Clusters of Variable Size. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 483-505.
https://doi.org/10.1111/rssb.12174
[23] Alvares, D. and Rubio, F.J. (2021) A Tractable Bayesian Joint Model for Longitudinal and Survival Data. Statistics in Medicine, 40, 4213-4229.
https://doi.org/10.1002/sim.9024
[24] Rossell, D. and Rubio, F.J. (2022) Additive Bayesian Variable Selection under Censoring and Misspecification. Statistical Science, 38, 13-29.
https://doi.org/10.1214/21-STS846
[25] Rubio, F.J., Remontet, L., Jewell, N.P. and Belot, A. (2019) On a General Structure for Hazard-Based Regression Models: An Application to Population-Based Cancer Research. Statistical Methods in Medical Research, 28, 2404-2417.
https://doi.org/10.1177/0962280218782293
[26] Leemis, L.M., Shih, L.-H. and Reynertson, K. (1990) Variate Generation for Accelerated Life and Proportional Hazards Models with Time Dependent Covariates. Statistics & Probability Letters, 10, 335-339.
https://doi.org/10.1016/0167-7152(90)90052-9
[27] Henderson, R., Shimakura, S. and Gorst, D. (2002) Modeling Spatial Variation in Leukemia Survival Data. Journal of the American Statistical Association, 97, 965-972.
https://doi.org/10.1198/016214502388618753
[28] Fahrmeir, L., Kneib, T., et al. (2011) Bayesian Smoothing and Regression for Longitudinal, Spatial and Event History Data. Oxford University Press, Oxford.
https://doi.org/10.1093/acprof:oso/9780199533022.001.0001

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.