L0 Regularization for the Estimation of Piecewise Constant Hazard Rates in Survival Analysis ()
1. Introduction
In survival analysis, when interest lies on the estimation of the hazard rate, an attractive and popular model is the piecewise constant hazard model. This model is easy to interpret as the hazard rate is supposed to be constant on some pre- defined time intervals and plotting the hazard rate gives a quick sense of the evolution of the event of interest through time. Many epidemiological studies use this model to represent the hazard rate function either because it provides an interesting way to fit the hazard function or because the data are not available on the individual level. See for instance Table 1 of [1] where the authors displayed the incidence of breast cancer on ten-year intervals for different subpopulations.
While this model can be used in a nonparametric setting, it is often used in combination with covariates effects. This is the case for instance for the popular Poisson regression model (see [2] [3] ) which assumes a proportional effect on the covariates and a piecewise constant hazard model for the baseline hazard. This model is widely used in practice typically when dealing with register data. On one hand it allows to perform survival analysis with large computational savings (and save considerable data storage requirements) and, on the other hand, it allows to easily estimate the baseline hazard rate as a piecewise constant function and to give a very easy interpretation of the baseline hazard rate. Among many practical examples, we refer the reader to [4] [5] [6] . In practice, as noticed by [7] for Poisson regression, “the choice of time intervals should generally be guided by subject matter aspects, but the numbers of events and numbers at risk within intervals may also be considered when specifying the number and lengths of the intervals. A study of a rare event and/or a small exposure group may require longer intervals.” While this might be true, it is clear that in some situations there might be no a priori knowledge for the choice of the time intervals and then they are usually arbitrarily chosen. This is the case for example in [1] where the time intervals in Table 1 were arbitrarily chosen as ten years length.
When modeling covariates effect through a proportional hazard model, [8] proposed an estimator that allows the baseline to stay unspecified. In this model, the baseline is taken as a function that only puts mass on the observed events
and the likelihood simplifies into the Cox partial likelihood where the regression effect can be estimated separately from the baseline. While this is a very interesting aspect of the Cox model, this nice separation between baseline estimation and regression effect estimation does not hold anymore in many extensions of this model. For instance, in frailty models (see among many other authors [9] [10] [11] [12] ) keeping a non-parametric baseline makes the estimation method much more complicated since baseline and regression parameters must be estimated simultaneously. As a consequence, the literature on estimation proce- dures in the frailty context is vast. As a matter of fact the estimation procedures in [13] [14] and [12] all lead to similar but still different estimates. Importantly, in [14] it is said that Poisson regression and Cox models give results that tend to be very similar, with or without frailties.
In the joint modeling framework one wants to model the association between a longitudinal variable and a time to event response through a random effect (see [15] [16] ). Only parametric baseline functions are implemented in the widely used jm R package (see [17] ). As a matter of fact, the author in [16] recommends either to use the piecewise constant baseline hazard or a spline basis baseline hazard which he says “often work quite satisfactorily in practice” (see page 53 of the book). The R frailtypack package (see [18] ) deals with more survival analysis situations involving a random effect such as nested frailty models (see [19] ) or joint inference of recurrent and terminal events (see [20] ). In this package, the possible baseline hazard functions are the piecewise constant hazard, Weibull hazard and spline functions. In the last case, the authors introduce a penalized likelihood estimation method that allows to obtain smooth estimates of the baseline hazard function. However the use of spline baseline functions requires to specify in advance the number of knots used in the esti- mation and therefore can be seen as a smoothed version of the piecewise constant hazard functions where one must choose in advance the number of cuts.
Other contexts where the partial likelihood approach does not work anymore include the cure models framework (see for instance [21] and [22] ) and the analysis of interval-censoring data (see [23] for instance). In the latter case, the nonparametric maximum likelihood estimator for the cumulative hazard or the survival function is known to be slow with a convergence rate of order
and the limiting distribution is not Gaussian (see [24] for current status data and [25] for case II intervals censored data). This problem pertains in the regression framework (see sections 5.2.3 and 6.2.2 in [23] for instance). On the other hand, using parametric baseline functions such as the piecewise hazard functions allows to obtain classical parametric rate of convergence and makes the estimation procedure much more stable. In this article, we only consider the nonparametric setting of estimating the baseline hazard function in a piecewise constant hazard model in the situation of right-censored data. We propose a new method to automatically find the appropriate number and location of the cuts used in this model. Our algorithm is based on the recent work from [26] where starting from a large set of possible cut points an L0 penalty on the likelihood of the model forces many successive cuts to be equal providing a parsimonious estimate of the hazard function. The procedure is data-driven and inference taking into account both the variability from the estimates and the cut points positions can be derived.
In Section 2, the piecewise constant hazard model is recalled and the adaptive ridge estimator is applied to this model. Section 3 proposes two different pro- cedures to choose the penalty term involved in the estimation procedure. Section 4 proposes a bootstrap based method to obtain valid inference for survival dis- tribution quantities such as the survival function. A simulation study is conduc- ted in Section 5, where the efficiency of the estimation method is evaluated and the two different procedures to choose the penalty term are compared. The method is applied to the Mayo Clinic trial on primary biliary cirrhosis in Section 6 and a small discussion concludes the paper in Section 7.
2. Model and Estimation Procedure
2.1. The Piecewise Constant Hazard Rate Model
Let
represent the survival time of interest. In practice
might be censored by a random variable
so that we observe
. The data consist of
independent replications
. We aim at estimating the hazard function defined for
by:
In the following, this hazard function is assumed to be piecewise constant on
cuts represented by
, with the convention that
and
. Let
. We suppose that
for
. Note that the exponential baseline hazard is obtained
from
in the piecewise constant hazard family. Let
re-
presents the cumulative hazard function and denote by
the model parameter we aim to estimate.
In order to make inference on the model parameter we will assume inde- pendent right censoring and non-informative censoring (see [27] for instance for a complete review of these assumptions). Estimation is then carried out using classical likelihood arguments. Let
represents the log-likelihood of the model. We have:
where the equality holds true up to a constant that does not depend on the model parameter
. For computational purpose, it is interesting to note that the log-likelihood can be written in a Poisson regression form. Introduce
, the total time individual
is at risk in the
th interval
,
, the number of events for individual
in the
th subinterval. Also
and
are sufficient statistics and estimation can be carried out using only these two statistics. The log- likelihood can then be written again as (see [3] p. 223-225 for more details):
(1)
Since
is concave, the maximum likelihood estimator has an explicit solu- tion, obtained from derivation of the log-likelihood: for
,
(2)
2.2. The Adaptive Ridge Regression
For computational purpose, introduce
such that
and
the vector parameter we aim to estimate. Using the L0 penalty from [26] , we propose the following penalized likelihood:
(3)
where
are non-negative weights that will be iteratively updated in order for the weighted ridge penalty term to approximate the L0 penalty.
The score vector is denoted
and its
th com- ponent,
, is equal to:
with the convention
. Now introduce
, the opposite of the Hessian matrix.
is a
non-negative definite band matrix whose bandwidth equals 1. Its diagonal elements are equal to
other elements next to the diagonal are defined for
by
and all other elements are equal to zero, that is for
such that
,
.
The vector parameter
is updated using the Newton-Raphson algorithm. For a given sequence of weights
obtained at the
th step, the
th Newton Raphson iteration step is obtained from the equation
The inversion of the band matrix is performed through a fast (linear com- plexity) C++ implementation of the well-known LDL algorithm (variant of the LU decomposition for symmetric matrices). Initialization of the Newton Raph- son algorithm can be obtained from the classical unpenalised estimator of the piecewise constant hazard model, that is
. See [3] for de- tails about this estimator.
On the other hand, following the recommendation from [26] , the weights can be updated from the equation
for
with
. Briefly, this form of the weights is motivated by the fact that
is close to 0 when
and close to 1 when
. Hence the penalty term tends to approximate the L0 norm. The weights are initialized by
, which gives the standard ridge estimate of
.
3. Choice of the Penalty Term
In this section we propose two different ways to choose the correct penalty term. The first one is based on a standard cross-validation criterion while the second one is based on a BIC criterion.
In order to choose the right penalty term, one must first define a large grid of penalty values such that the criterion (cross-validation or BIC) will be evaluated at each of these penalty terms. For that purpose, the algorithm can benefit from a warm start of the penalty weights. Indeed, instead of initializing the weights to 1 for each penalty value, one can take the final weights of the previous (smaller) penalty as a starting point for the next (larger) penalty. In this way, full re- gularization path similar to those of the LASSO can be generated very efficiently. Note, however, that this warm-starting is not necessary since it is always possible to initialize the algorithm with neutral weights of value 1. A preliminary set of cut positions must also be chosen. For simplicity we recommend to take a large set of equally spaced points including the range of the observed time point values. See Sections 5 and 6 to see how this works in practice.
3.1. A Cross-Validation Criterion
Split the data in
pieces and define
as the maximizer of the penalized likelihood in Equation (3) when part
is left out from the data. Then the
- fold cross validated log-likelihood is defined by:
where
represents the unpenalized log-likelihood as in Equation (1) but computed only in part
of the data. Maximizing this quantity with respect to
gives the optimal penalty term.
Note that unlike the Cox regression framework where the baseline is left unspecified, this cross-validated criterion is well defined since in our case the hazard rate is constructed as a continuous function of time. Also, the relation
holds such that our criterion is completely equivalent to the cross-validated criterion developed by [28] and [29] in the standard Cox regression framework.
In order to improve efficiency and time speed in the computation programs, the 10-fold cross validation is recommended in practice.
3.2. A BIC Criterion
The following criterion can be used as an alternative to the choice of the penalty term. It is defined as a balance between good fit of the data and low complexity of the model parameters. It is fast to compute and has the following expression:
The parameter estimator
is defined as the maximizer of the penalized likelihood in Equation (3) while
represents the model complexity. It is equal to the number of distinct consecutive values of the
s in
:
with the convention
.
The performance in the choice of the penalty term by both criteria is inves- tigated in the simulation study in Section 5.
4. Statistical Inference for the Time to Event Distribution
In practice it is of interest to derive confidence intervals for marginal quantities directly related to the time to event variable such as the cumulative hazard function or the survival function. Asymptotic properties of the piecewise-cons- tant hazard model for a given set of cut points is straightforward and has been already derived (see for instance [3] ). However, the adaptive ridge estimator involves data driven choice of the cut points and using standard results to derive pointwise confidence intervals for the survival function for instance would lead to an overfitting of this function. This is of major concern and one should interpret such confidence intervals with caution.
One way to take into account the uncertainty in the choice of the cut points is to use a resampling technique where for each sample a different penalty term is chosen from the cross-validated or BIC criterion. This will provide a new hazard estimate with a different set of cut points for each sample. Taking the adequate quantile at each time point allows us to obtain pointwise confidence intervals of the correct order for the quantity of interest.
Interestingly, this resampling technique also allows us to compute an alternative pointwise estimate of the survival function (or of any marginal distribution quan- tity) by taking the pointwise medians of each bootstrap sample. This provides a very smooth estimate function and, in that sense, this kind of estimate can be seen as a smooth non-parametric estimate of the survival function.
This bootstrap procedure is illustrated in Sections 5 and 6 to derive con- fidence intervals and smooth estimates for the survival function.
5. Simulation Study
5.1. Simulations under a Piecewise Constant Hazard Model
We illustrate the proposed method to estimate the following hazard function:
The censoring distribution is simulated as a uniform distribution over the time interval [70,90] which gives on average
of observed failures. On average,
of the observations fall into the interval (20,40],
of the observations fall into the interval (40,50],
of the observations fall into the interval (50,70] and
of the observations fall into the interval
.
We start with a single sample of size 100 generated from this model. Using the true cuts, the classical unpenalized hazard estimator derived from Equation (2) is computed on Figure 1. The estimation is quite accurate on each cut interval. Figure 2 presents the two extreme situations where the penalized hazard estimate is computed using a very small penalty term on the left panel and using a very large penalty term on the right panel. We see that in the left panel the hazard function is overfitted while the right panel corresponds to the exponential model. A good choice of the penalty term should provide a good compromise between
Figure 1. True hazard rate function (dashed line) and unpenalized hazard rate estimate computed at the true cuts (solid line).
these two situations. The set of all possible cuts was chosen as all the integer values ranging from 1 to 100 and the set of penalty terms was taken, on the log scale, as the set of 100 equally spaced values ranging from
to
. On this sample the BIC and cross-validation criteria respectively chose the penalty values equal to 0.95 and 1.15 which both gave the same estimate. Figure 3 shows the regularization path for the penalty term and the penalized estimated hazard obtained from the penalty equal to 0.95. We see that both criteria find only three cuts in the estimation of the hazard function, and the cut interval
is not found by the method on this example. As an indicator of the estimation accuracy, the total variation distance between the true hazard and the penalized hazard estimate is computed on the time interval
. On our data example, the total variation is approximately equal to 0.29. Finally, confidence intervals are derived for the survival function using the resampling technique presented in Section 4. The curves are plotted in Figure 4 from 100 bootstrap samples. Our method shows very little difference from the classical Kaplan-Meier estimate and its pointwise confidence interval. Interesting- ly, our survival estimator and its pointwise confidence intervals have a smooth shape in contrast with the stepwise shape of the Kaplan-Meier estimator.
In order to assess the good performance of our penalized estimator, we also conducted Monte-Carlo simulations from the model scenario presented in this section with 600 sample replications. We considered samples of size 100, 400 and 1000 and in each case we computed the probability distribution of the number of cuts found by the BIC method and by the cross-validation method. The results are reported in Table 1. We also computed the total variation distance between true hazard and penalized hazard estimates in each case and reported the results in Table 2. We see that for
both methods tend to be overpenalized as they find a majority of three breakpoints instead of four. As
Figure 2. Penalized hazard rate estimates computed using a penalty equal to 0.1 (a) and a penalty equal to 1000 (b). Dashed line: true hazard rate. Solid lines: penalized hazard rate estimates.
the sample size increases, the proportion of times the four breakpoints are found increases. Looking at the total-variation distance, we see that for both methods, the estimate becomes more and more accurate as the sample size increases. In general, the BIC criterion outperforms the cross-validation criterion both in terms of breakpoints detection and fitting of the hazard function.
One should note that the simulation scenario presented here makes it difficult to estimate the hazard function due to the low value of the hazard rates for
. For a moderate sample size,
for instance, very few observations will fall in each cut interval (only
in the interval
for example) and therefore the method has difficulties to find all the cuts. The problem
Figure 3. Regularization for the choice of the penalty term using either the BIC or cross-validation criterion (a). Dashed line: penalty term obtained from both criteria. Penalized hazard rate estimate (b).
disappears as we increase the sample size. We considered other simulation settings where the proportion of observations falling into each cut interval was more balanced. This resulted in a very good performance of the estimator for small samples, both to detect the true number of cuts and to accurately fit the hazard function.
5.2. Simulations under a Weibull Hazard Model
We now consider the following Weibull model, where this time, the true hazard is a continuous function of time:
where
is the shape
Figure 4. Estimates of the survival function for the piecewise constant hazard scenario. Dashed line: Kaplan Meier estimator along with its 95% pointwise confidence interval (dotted lines). Solid line: bootstrapped adaptive ridge estimator along with its 95% pointwise confidence interval (dot dash lines).
(a) (b)
Table 1. Proportions of the number of cuts found by the BIC (a) and cross-validation (b) criteria for different sample sizes.
Table 2. Mean total variation distance between true hazard and penalized estimated hazard obtained from the BIC and cross-validation (CV) criteria for different sample sizes in the piecewise constant hazard scenario.
parameter and
is the scale parameter. This gives an average time value of 55 and a time standard deviation of 12.6. The censoring distribution is also simulated as a Weibull variable but with shape parameter equal to 30 and a scale parameter equal to 60. This gives the same average percentage of observed failures
as in the previous simulation setting.
As before we start with a single sample of size 100 generated from this model and we compute our adaptive ridge estimator using the same grid of cut points and the same grid of penalty values as in the previous scenario. The penalty value was chosen equal to 0.95 from the BIC criterion. Since we are estimating a continuous function of time it seems of interest to see how a smoother estimate would perform on this Weibull distribution. Our penalized likelihood can be easily modified to get a ridge estimate of the hazard by putting all the weights
equal to 1 in Equation (3). This gives a simpler algorithm where the weights do not need to be updated and only a Newton-Raphson agolrithm is performed on the parameter vector
. However no simple criterion can be proposed to choose the penalty value in this setting and we arbitrarily chose a large value equal to 40 in order to force the estimator to be smooth. Plots of our adaptive ridge estimator, our ridge estimator and the true Weibull hazard are displayed in Figure 5. It is seen that two cuts are chosen for the adaptive ridge estimator which gives a fairly good fit of the true curve. However, as one would expect, the ridge estimator captures much more accurately the fluctuations of the curve. The resampling technique was used as before (100 samples) to compute the survival function along with its
confidence interval in Figure 6. The time range for the figure was deliberately set to
even though no times were observed beyond 60 due to censoring. The fit of the survival estimate is very accurate for the whole time range. After time 60 the piecewise constant modeling allows to interpolate the estimate which provides a good fit of the Weibull distribution with slightly larger confidence intervals. The Kaplan-Meier estimator is not shown on this figure because it gives similar result as for the piecewise constant hazard simulation scenario: a very similar fit to the curve and almost identical confidence intervals on the restricted time range
. One should note however that our resampled estimator provides a much smoother fit than the stepwise shape of the Kaplan-Meier estimator and no interpolations can be provided after time 60 for the Kaplan-Meier estimator.
Finally, Monte-Carlo experiments were conducted to assess the quality of fit of our estimators for the Weibull hazard function. This was measured as before
Figure 5. Penalized hazard rate estimates for the Weibull scenario. Dashed line: true hazard. Solid line: adaptive ridge estimator. Dotted lines: ridge estimator.
Figure 6. Estimates of the survival function for the Weibull scenario. Dashed line: true hazard. Solid line: bootstrapped adaptive ridge estimator along with its 95% pointwise confidence interval (dot dash lines).
in terms of total variation distance between the true hazard and the adaptive ridge or the ridge estimator on the time interval
. As an illustration, on the sample example of size 100 of Figure 5, the total variation distance equals 0.37 for the adaptive ridge estimate and 0.13 for the ridge estimate. It is important to note that a fixed penalty was used for every sample for the ridge estimator (equal to 40 as before) while the penalty was adaptively chosen from the BIC criterion as described in Section 3 for the adaptive ridge estimator. In terms of comparison this gives an initial advantage to the adaptive ridge esti- mator. Nevertheless the results reported in Table 3 show a clear advantage for the ridge estimator for every sample size. For
the total variation error is 1.7 times bigger for the adaptive ridge estimator and for larger sample sizes it gets approximately 2 times bigger. These results indicate that if one aims at deriving smooth and accurate estimates of the hazard function, for prediction purposes for instance, one should favor the ridge version of our hazard esti- mator.
6. A Real Data Analysis
We consider here the dataset from the Mayo Clinic trial in primary biliary cirrhosis (pbc) presented in [30] . This dataset is available through the survival package of the R software. We focus our interest on time to death for the 424 patients of the dataset. The time variable was measured in days from inclusion until either death or liver transplantation or lost to follow-up. Only
of deaths are observed such that
of the observations are censored. The time variable ranges from 41 to 4795 days, so we decided to take as the set of all possible cuts the sequence of values going from 1 to 4800 by step of 10. As in the simulation study, the set of penalty terms was taken on the log scale, as the set of 100 equally spaced values ranging from
to
. The penalty terms chosen from the BIC and cross-validation criteria are respectively equal to 1.23 and 1.63. This leads to one cut point for the BIC criterion and no cut point for the cross-validation criterion. Following the results from the simulation study, we decided to choose the former criterion. The corresponding estimate has one cut point at time 3050 such that the hazard estimate equals
for
and equals
for
. The estimate and the regulation path for the penalty term are displayed on Figure 7.
Finally, the boostrap procedure is used to derive the survival estimate with its
pointwise confidence interval for the time to death. The curves are displayed on Figure 8 along with the Kaplan-Meier estimator and its
pointwise confidence interval. As in Section 5, the result from our estimator shows very little difference with the Kaplan-Meier estimator. With our bootstrap estimator, the median death time is estimated to approximately 3390 days and the
confidence interval for the survival at this time is approximately
. The 25th percentile is estimated to approximately 1501 days and the
confidence interval for the survival at this time is approximately
. With the Kaplan-Meier estimator the median is estimated at 3395 and the
confidence interval for the survival at this time is approximately
, the 25th percentile is estimated to approximately 1462 days and the
Table 3. Mean total variation distance between true hazard and penalized estimated hazard obtained from the adaptive ridge estimator and the ridge estimator for different sample sizes in the Weibull scenario.
Figure 7. Regularization for the choice of the penalty term using the BIC criterion on the pbc data (a). Dashed line: penalty term obtained from this criterion. Penalized hazard rate estimate of death on the pbc data (b).
confidence interval for the survival at this time is approximately
.
7. Concluding Remarks and Extensions
In this article, we proposed an innovative method to estimate the hazard rate in a piecewise constant model. The estimator is defined as the maximum of a penalized likelihood and allows to automatically detect the number and cuts location of the model and to estimate the hazard on each cut interval. A boot- strap procedure was also proposed in order to derive valid statistical inference
Figure 8. Estimates of the survival function on the pbc data. Dashed line: Kaplan Meier estimator along with its 95% pointwise confidence interval (dotted lines). Solid line: bootstrapped adaptive ridge estimator along with its 95% pointwise confidence interval (dot dash lines).
taking both into account the variability of the estimate and the variability in the choice of the cut points. In order to select the penalty term we recommend using the BIC criterion as it seems to outperform the cross-validation criterion and it is also very fast to compute. Finally if one is interested in obtaining a smooth estimate of the hazard function, a small modification of the original estimator allows to derive a ridge version which has been shown to provide a very good fit to continuous survival distributions.
This work was established in the nonparametric setting of right censored data but many extensions can be considered. Including covariates in the model through a Poisson regression modeling for instance should be straightforward. As a matter of fact, since the method uses a penalized likelihood approach, no explicit estimators are available and even in the nonparametric setting the estimator is derived from the Newton-Raphson algorithm. In the nonparametric and regres- sion settings, by modifying the likelihood formula, the method should also readily extend to truncation and to other types of censoring such as interval censoring. More difficultly it would be interesting to see how the penalized like- lihood approach works in a frailty, joint modeling or cure model context. Using the L0 approach in these contexts amounts to fit a penalized parametric model which makes our method very appealing due to the nice properties of parametric models. Besides, our resampling method allows to derive smooth estimates of time dependent quantities of interest. As a result it is seen that our method nicely combines both the advantages of a parametric implementation and non- parametric fit of survival quantities.
The L0 approach was used to constrain two consecutive cuts in the piecewise constant hazard model to be equal. Interestingly, a different model could be pro- posed where straight lines connect the consecutive cuts. In that model, the L0 approach could be derived by constraining two consecutive slopes of lines to be equal. In the same idea, spline hazard functions could also be constructed by penalizing further order derivatives of polynomial functions. All these extensions are left to future research.