A standard approach for the analysis of survival data is the Cox proportional hazards model. It assumes that covariate effects are constant over time, i.e. that the hazards are proportional. With longer follow-up times, though, the effect of a variable often gets weaker and the proportional hazards (PH) assumption is violated. In recent years, several approaches have been proposed to detect and model such time-varying effects. However, comparison and evaluation of the various approaches is difficult. A suitable measure is needed that quantifies the difference between time-varying effects and enables judgement about which method is best, i.e. which estimate is closest to the true effect. In this paper we adapt a measure originally proposed for the area between smoothed curves of exposure to the setting of time-varying effects. This measure is based on the weighted area between curves of time-varying effects relative to the area under a reference function that represents the true effect. We introduce several weighting schemes and demonstrate the application and performance of this new measure in a real-life data set and a simulation study.

The Cox proportional hazards model [

The first proposal for an extension of the Cox model is given by Cox [

In recent years, several more sophisticated approaches have been proposed. The variety of underlying techniques for modelling time-varying effects among these approaches is broad. They include splines [

In a Cox model the PH assumption is often acceptable for several variables, but may be critical for some of them. Plotting scaled Schoenfeld residuals [

In addition, direct comparison between the different approaches does not answer the question of which one is best. To answer this question, we have to evaluate the similarity of each approach to the truth (e.g. the true effect in simulations or the SSSRs in real data). This stresses the need for a quantitative measure of the difference between the truth and the estimated function(s) under investigation.

For time-varying effects, we propose to adapt the technique for quantifying the area between smoothed curves proposed by Govindarajulu et al. [

In Section 2 we briefly introduce two approaches for modelling time-varying effects, which are used for illustrative purposes. The ABCtime measure itself is presented in Section 3. In Section 4 we introduce a real data example and a simulation study, followed by an illustration of the use of ABCtime in these examples (Section 5) and a discussion (Section 6).

The Cox proportional hazards model [

is the standard tool in survival analysis. In some situations, though, the proportional hazards (PH) assumption may be violated due to the presence of time-varying effects. A potential violation of the PH assumption is often explored based on the Schoenfeld residuals [

They should reflect the true time-varying behaviour of effects and thus be flexible enough to capture the underlying time-varying shape, including possible short-term changes, but on the other hand should not be too noisy. Therefore, we chose a symmetric nearest neighbour smoother with a span of 0.75, i.e.

The Fractional Polynomial Time (FPT) approach is part of the Multivariable Fractional Polynomial Time (MFPT) algorithm [

with potentially non-linear functional forms of covariates

FPT is the part of MFPT which focuses on the selection of a time-varying effect for one covariate using a function selection procedure based on fractional polynomials [

A time-varying effect based on an FPT1 is of type

There are many examples showing that the time-varying effect decays over time and the log transformation is often used successfully to estimate the functional form. In the FPT algorithm it is used as the default function for a variable with a time-varying effect.
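As a minimal sketch of such a function, assume the usual fractional polynomial convention in which a first-degree time function has the form beta(t) = beta0 + beta1 * t^p and the power p = 0 denotes log(t). The power set and all names below (`FP_POWERS`, `fp_transform`, `fpt1_effect`) are illustrative assumptions, not the authors' implementation.

```python
import math

# Power set commonly used for fractional polynomials (an assumption here).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(t, p):
    """Return t**p, with the FP convention that p = 0 denotes log(t)."""
    if t <= 0:
        raise ValueError("time must be positive")
    return math.log(t) if p == 0 else t ** p


def fpt1_effect(t, beta0, beta1, p=0):
    """Time-varying effect beta(t) = beta0 + beta1 * t**p.

    The default p = 0 gives the log transformation, i.e. an effect that
    changes linearly in log(t).
    """
    return beta0 + beta1 * fp_transform(t, p)
```

With `beta0 = 1.0` and `beta1 = -0.3`, for example, the effect equals 1.0 at t = 1 and decays slowly afterwards, which is the kind of shape the log transformation is meant to capture.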

To select a time-varying effect, a hierarchical closed test procedure based on deviance differences is applied:

1) Test the best-fitting FPT2 function

2) Test

3) Test

Scheike and Martinussen [

In multivariable analyses, they recommend starting with the fully non-parametric model [

where

Estimation and tests are based on the cumulative regression functions

as they are consistent and converge at a faster rate than

The test on a time-varying effect for covariate

where

The area between two curves can be used to quantify the distance between these curves. Depending on the application, this technique may be adapted and the area may be weighted according to the specific requirements.
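As a generic formalisation (the notation is introduced here for illustration only and is not taken from the cited work), the weighted area between two curves f and g on an interval [a, b], together with its approximation by rectangles on a grid a = t_0 < t_1 < ... < t_n = b, can be written as:

```latex
A_w(f, g) \;=\; \int_a^b w(t)\,\lvert f(t) - g(t) \rvert \,\mathrm{d}t
\;\approx\; \sum_{i=1}^{n} w(m_i)\,\lvert f(m_i) - g(m_i) \rvert\,(t_i - t_{i-1}),
\qquad m_i \in [t_{i-1}, t_i].
```

Here w is a non-negative weight function and m_i is the point at which the curves are evaluated in the i-th interval. Setting w = 1 gives the unweighted area.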

Govindarajulu et al. [

precision in the estimates across the range of exposure, Govindarajulu et al. [

where

25 to 200 bootstrap samples will usually be needed to obtain a reasonable estimate of variance [

julu et al. [

The area difference is then calculated as

and

To assess how close two curves are, the area difference between the curves is presented as percent of the average area under the two curves. Consequently, the closer the curves, the smaller the percentage.
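The percentage described above can be sketched as follows. This is a simplified, unweighted illustration under stated assumptions: the curves are evaluated at interval midpoints on an equidistant grid, and areas under the curves are taken in absolute value. The function name and arguments are hypothetical and do not reproduce the original weighted procedure.

```python
def percent_area_difference(f, g, lower, upper, n_intervals=100):
    """Area between f and g as a percentage of the average area under both.

    Areas are approximated by rectangles evaluated at interval midpoints.
    """
    width = (upper - lower) / n_intervals
    between = under_f = under_g = 0.0
    for i in range(n_intervals):
        x = lower + (i + 0.5) * width  # midpoint of the i-th interval
        between += abs(f(x) - g(x)) * width
        under_f += abs(f(x)) * width
        under_g += abs(g(x)) * width
    average_under = 0.5 * (under_f + under_g)
    return 100.0 * between / average_under
```

For two constant curves at heights 2 and 1 on [0, 10], the area between them is 10 and the average area under them is 15, so the function returns about 66.7%.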

The concept of the area between curves is not restricted to smoothed curves of exposure but is applicable to a wider range of functions as, for example, time-varying effects. Transfer to this setting, though, requires some modifications of the original method proposed by Govindarajulu et al. [

The variety of different approaches proposed for modelling time-varying effects is broad, ranging from explicit functional forms for time-varying effects to estimates that are available only at certain specified time points or as step-functions [

Hence, ABCtime should be applicable to continuous functions, as well as to right- and left-continuous step functions to cover the broad variety of potential approaches for modelling time-varying effects. Consequently, using the left endpoint of intervals for calculation of ABCtime may lead to biased results. If the left endpoint for an ABCtime rectangle is identical to the left endpoint of a left-continuous step function, the “wrong” function value will be used. The same applies for right-continuous step functions and the right endpoint. Hence, the middle of the intervals is the best choice in this situation to avoid systematic use of “wrong” function values (see
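The endpoint problem can be illustrated with a small, self-contained sketch. The two step functions and the grid are toy assumptions chosen so that the interval borders coincide with the jump points; both functions have a true area of 5 on [1, 3].

```python
import math


def left_continuous_step(t):
    """Step function taking the value k on the interval (k - 1, k]."""
    return float(math.ceil(t))


def right_continuous_step(t):
    """Step function taking the value k + 1 on the interval [k, k + 1)."""
    return float(math.floor(t) + 1)


def rectangle_area(f, grid, rule):
    """Sum of rectangle areas over consecutive grid intervals.

    rule selects the evaluation point: "left", "mid" or "right".
    """
    position = {"left": 0.0, "mid": 0.5, "right": 1.0}[rule]
    total = 0.0
    for a, b in zip(grid[:-1], grid[1:]):
        total += f(a + position * (b - a)) * (b - a)
    return total


grid = [1.0, 2.0, 3.0]  # interval borders coincide with the jump points
for name, f in [("left-continuous", left_continuous_step),
                ("right-continuous", right_continuous_step)]:
    areas = {rule: rectangle_area(f, grid, rule) for rule in ("left", "mid", "right")}
    print(name, areas)
```

In this toy setting the left endpoint rule gives 3 instead of 5 for the left-continuous function, the right endpoint rule gives 7 instead of 5 for the right-continuous function, and the midpoint rule gives 5 in both cases.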

Govindarajulu et al. [

In addition, the focus of the individual analysis may also influence the preferences about the weights. Yet, the choice of weights is rather subjective and dispensing with a bootstrap component in the weights speeds up calculation of ABCtime considerably. Especially with computer-intensive approaches and/or in simulation studies, we favour simple non-bootstrap alternatives.

An unweighted version, i.e. equal weights

with

Alternatively, the weights could be based on the inverse mean variance of the competitive approaches. If

where

Another possibility is to base the weights on the number of patients at risk. Such weights (i.e. logrank like weights)

reflect the importance of deviations in the time-varying effects over time. Here,
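The simple, non-bootstrap weighting schemes discussed in this section could be sketched as below. The rescaling of weights to a mean of 1 is an assumption made here so that weighted and unweighted areas stay on a comparable scale; the exact normalisation, the interpretation of "inverse mean variance" as the inverse of the variance averaged over approaches, and all function names are illustrative and may differ from the authors' definitions.

```python
def normalise(raw):
    """Rescale non-negative raw weights so that their mean is 1 (assumption)."""
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def equal_weights(n):
    """Unweighted version: the same weight for every calculation interval."""
    return [1.0] * n


def inverse_variance_weights(variances):
    """Weights proportional to 1 / variance, e.g. of the reference function."""
    return normalise([1.0 / v for v in variances])


def inverse_mean_variance_weights(variances_by_approach):
    """Weights proportional to the inverse of the variance averaged over approaches."""
    n_approaches = len(variances_by_approach)
    n_intervals = len(variances_by_approach[0])
    mean_var = [
        sum(v[i] for v in variances_by_approach) / n_approaches
        for i in range(n_intervals)
    ]
    return inverse_variance_weights(mean_var)


def logrank_like_weights(n_at_risk):
    """Weights proportional to the number of patients at risk."""
    return normalise([float(r) for r in n_at_risk])
```

For example, `logrank_like_weights([100, 50, 50])` gives more weight to the first interval, where more patients are still at risk, than to the two later ones.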

A potential drawback of variance based weighting schemes (e.g.

Analogously, the evaluation period for ABCtime has to be chosen with care. It is often sensible to focus on a relevant region or a region of interest by restricting the calculation of ABCtime to the time range

The ABCtime is then calculated in the style of Equation (12) in Govindarajulu et al. [

with

To improve interpretation of the values of ABCtime, we further calculate ABCtime as a percentage of the weighted area under the reference function (termed pABCtime):

$$\mathrm{pABCtime} = \frac{\mathrm{ABCtime}}{\mathrm{AUR}} \times 100,$$

where the area under the reference curve (AUR) is calculated in analogy to ABCtime as the (weighted) area between the reference curve and the
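The calculation of ABCtime, AUR and pABCtime can be sketched as follows. This is a minimal illustration under stated assumptions: both functions are evaluated at the midpoints of the grid intervals, the same weights are used for ABCtime and AUR, and areas are taken in absolute value. The names `weighted_area` and `pabc_time` are hypothetical.

```python
def weighted_area(values, weights, widths):
    """Weighted sum of rectangle areas |value| * width."""
    return sum(w * abs(v) * d for v, w, d in zip(values, weights, widths))


def pabc_time(estimate, reference, grid, weight=lambda t: 1.0):
    """Return (ABCtime, AUR, pABCtime in %) on the time range spanned by grid.

    estimate, reference and weight are functions of time; grid holds the
    borders of the calculation intervals within the evaluation period.
    """
    mids = [(a + b) / 2 for a, b in zip(grid[:-1], grid[1:])]
    widths = [b - a for a, b in zip(grid[:-1], grid[1:])]
    weights = [weight(m) for m in mids]
    abc = weighted_area([estimate(m) - reference(m) for m in mids], weights, widths)
    aur = weighted_area([reference(m) for m in mids], weights, widths)
    return abc, aur, 100.0 * abc / aur
```

As a toy example, take a reference effect that decreases linearly from 1 to 0 over ten time units and a constant estimate of 0.5. With `grid = [float(i) for i in range(11)]` and equal weights, ABCtime is 2.5, AUR is 5.0 and pABCtime is 50%.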

Although we argue for simple weights for calculation of ABCtime itself, we will in the sequel use bootstrap techniques to assess the stability of this measure. For this purpose, we calculate the standard deviation and 95% bootstrap percentile intervals based on the pABCtime values calculated in bootstrap samples. All functions under investigation are reestimated in
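Given pABCtime values that have already been computed in bootstrap samples, the summary step could look like the sketch below. The refitting of the models in each bootstrap sample is not shown, and the quantile definition (linear interpolation between order statistics) is an assumption; other definitions give slightly different interval limits.

```python
import statistics


def percentile(sorted_values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def bootstrap_summary(pabc_values, level=0.95):
    """Standard deviation and percentile interval of bootstrap pABCtime values."""
    ordered = sorted(pabc_values)
    alpha = (1.0 - level) / 2.0
    return {
        "sd": statistics.stdev(ordered),
        "lower": percentile(ordered, alpha),
        "upper": percentile(ordered, 1.0 - alpha),
    }
```

A wide interval indicates that the estimated effect, and hence its distance to the reference, changes considerably with slight changes in the data.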

This section introduces the real-life data set and the simulated data that are used in the sequel to demonstrate the application of ABCtime. Here, we restrict investigation to univariate settings. However, all techniques can be applied to multivariable settings analogously.

The Rotterdam breast cancer series includes data on patients treated at the Erasmus MC Daniel den Hoed Cancer Center for primary breast cancer between 1978 and 1993 [

Sauerbrei, Royston and Look [

To investigate the performance of the ABCtime method in a setting where the truth is known, simple data sets with one variable are generated. Survival times are simulated using a generalised inversion method (see [

To illustrate the use of the ABCtime measure, we apply it to the CoxPH, FPT, and Timecox models in the Rotterdam breast cancer series and the simulated data. Time-varying effects are selected with a conservative significance level of

using pABCtime. The restricted period used for the calculation of pABCtime is from about 3 months to 9.7 years in this data set.

The basis for calculation of ABCtime is the area difference between the SSSRs and the CoxPH, FPT and Timecox effects, respectively, as shown in

The influence of the different weights on the pABCtime can most easily be demonstrated by the CoxPH results. While in the unweighted version (i.e. with equal weights) the weights, of course, remain constant over the complete observation period, the weights in the three other weighting schemes change with the calculation intervals of ABCtime, i.e. each rectangle

Similar tendencies can be observed for the FPT and Timecox effects. Yet, they are less pronounced in this example, since both estimates are very similar to the SSSRs and the pABCtime is relatively small, with 4.5% and 6.2% for FPT and Timecox, respectively, in the original data and up to about 20% for both in the bootstrap samples. CoxPH, in contrast, showed pABCtime values of 21.6% in the original data and up to 40% in the bootstrap samples. The small pABCtime values are caused mainly by a large AUR, i.e. a rather large effect size. Compared with this effect size, differences between the estimated effects and the SSSRs are relatively small, which is reflected by a small pABCtime (i.e. the percentage difference relative to the AUR).

According to a visual comparison of the effect estimates obtained by FPT and Timecox in the original data (

Thus, pABCtime provides a straight-forward quantification of the similarity to the reference curve, which is often hard to evaluate using graphical displays only (e.g. in simulation studies). In addition, pABCtime measurements in the individual bootstrap samples and corresponding bootstrap percentile intervals enable evaluation of variability/stability of effect estimates in real-life data sets. For

A striking pattern in

Although in a multivariable context the proportional hazards assumption seems questionable for the hormonal therapy (see

Note that the effect size (and thus the SSSRs) is rather small, resulting in a small AUR. Consequently, even small absolute differences between the estimated effects of interest and the SSSRs result in a relatively large pABCtime.

In addition to the two prognostic factors, we use the prognostic index

A more intuitive application of ABCtime is the comparison of different modelling alternatives in a simulation study. Because the true effect is known, application of ABCtime is straight-forward. The CoxPH, FPT and Timecox models are fitted to the data and are compared via pABCtime with logrank like weights in each of the 1000 simulated data sets. The significance level for testing on time-varying effects in FPT and Timecox is chosen to be 1%.

effect and thus results in very smooth, easy to interpret effect estimates, but at the expense of some deviating effects (e.g. artefacts). Timecox, on the contrary, uses a local fitting approach which results in very flexible but also potentially wiggly estimates. Thus, short-term changes in time-varying effects can be estimated very well, but with the potential disadvantage of blurring the global trend. However, these details on individual effects are difficult to see from graphical displays as individual curves are hidden by the mass but may cover a broad variety of different shapes (see

In the setting with linearly decreasing effect, pABCtime reflects and refines the visual comparison of estimated effects equally well. Here, the shadow plots (

In applications with time-varying effects, a measure is required which quantifies, for example, the benefit of a time-varying effect compared to a standard CoxPH effect or a time-varying effect obtained from a different analysis method. If such a measure is available, and assuming that the smoothed scaled Schoenfeld residuals (SSSRs) reflect the true data well, they can be used to assess the fit of a function. Furthermore, in simulation studies the fit of two or more functions can be compared to the known true effect. With simulation studies in mind, we developed the ABCtime measure by adapting the measure of Govindarajulu et al. [

In applications with known true effects, ABCtime is a straight-forward measure to quantify the distance between (time-varying) effects and the true effect. We conducted a simulation study to verify that ABCtime reflects the similarity to the true effect as it is supposed to do. When comparing time-varying and time-constant effects, ABCtime was able to detect effects that were closer to the true effect. Results are, of course, influenced by the choice of weights. Our slight preference for the logrank like weights resulted in a limited ability to detect time-varying behaviour in regions with less data support. In real data examples, though, we believe this is not a disadvantage, as deviations from PH in such regions are often a result of overfitting the data and/or of minor importance.

In the data example where the true effects are unknown, SSSRs were used as reference function to describe the underlying time-varying pattern of covariate effects. ABCtime enabled comparison between a constant CoxPH effect and two time-varying effects of different complexity. It clearly showed that the time-varying effects were considerably closer to the reference function than the CoxPH effect. This fact could also be revealed by graphical comparison of estimated effects. Differentiating between different time-varying effects such as the FPT and Timecox functions based on graphics, however, may be difficult, especially if their shapes are different, but none of them appears to be definitely closer to the reference. ABCtime yields a quantitative measure of the similarity which enables a comparison of different functions with the opportunity of up- or downweighting specific (time) regions of interest via the choice of weights. In our examples, ABCtime revealed that the time-varying approaches gave a better approximation to the SSSRs than the standard CoxPH model in all investigated settings, while the two of them were judged to perform similarly well, differing mainly in their variability. The variability of approaches is assessed by means of bootstrap percentile intervals of pABCtime, which also reflect the sensitivity of effect estimates obtained by the different approaches to slight changes in the data (e.g. their sensitivity to produce artefacts or “strange” shapes).

Thus, the ABCtime helps in specifying how well an estimated effect reflects the true or reference effect and gives an easily interpretable quantification of the “similarity” or distance between selected (time-varying) effects.

Although we restricted our investigation to the FPT and Timecox approaches, ABCtime is not limited to these methods, but is applicable to a broad variety of different approaches for modelling time-varying effects. One important aspect, however, is the specification of an appropriate reference function. The reference function has a very strong influence on the ABCtime measure and thus should be chosen with care. In simulation studies, the true effect itself naturally defines the reference. Despite minor problems in more extreme situations (e.g. very large covariate effects or extreme covariate distributions with outlying values [

In this paper we investigated four different weighting schemes. In applications, the choice of weights should be motivated by the aim of the analysis and the choice of reference. In our examples, we have a slight preference for the logrank like weights, because they downweight differences at later times where few subjects are at risk. For tests of two survival functions, logrank like weights are the typical choice because they have good properties in the two sample case. These weights, as well as the inverse reference variance based weights, are straight-forward choices independent of the approaches to be compared. The inverse mean variance based weights, on the contrary, adjust for uncertain estimates in the approaches under investigation, but are simultaneously subject to artefacts. Equal weights are the simplest choice, since they do not depend on the specific data and make interpretation easy. Yet, they do not adjust for regions with larger uncertainty or less data support.

Like many other flexible functions, the FP class is prone to produce artefacts at both ends. Therefore, we truncated the edges of the time scale from the calculation of ABCtime to reduce distortion of the measure by such artefacts and uncertain estimates. This is a kind of extreme downweighting of edges with weight = 0. Here, we define these edges as times beyond the 1% and 99% quantiles of event times.
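The truncation of the time scale can be sketched as follows. The quantile definition used here (`method="inclusive"` of the Python standard library) is an assumption, as the exact definition used in the analyses is not stated, and the function name is hypothetical.

```python
import statistics


def evaluation_range(times, events, lower=1, upper=99):
    """Return the lower% and upper% quantiles of the observed event times.

    times are survival times, events are indicators (1 = event, 0 = censored).
    Censored observations do not contribute to the quantiles.
    """
    event_times = [t for t, d in zip(times, events) if d == 1]
    cuts = statistics.quantiles(event_times, n=100, method="inclusive")
    return cuts[lower - 1], cuts[upper - 1]
```

ABCtime would then be calculated only on a grid between the two returned limits, which corresponds to a weight of 0 outside this range.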

Another possibility to avoid a distortion by artefacts is to reduce them already in the estimation process. Many approaches for time-varying effects, including the FPT algorithm, are sensitive to extreme survival times. If a data set contains many extremely small or large survival times, these time points may distort functional forms strongly, resulting in an inappropriate functional form or artefacts at the edges. This problem is already known from modelling FP functions of covariates. Royston and Sauerbrei [

The ABCtime measure is helpful in comparing time-varying effects. In some applications, though, especially when decisions about different methods for modelling time-varying effects are to be made, the resulting measure of distance may not give sufficient information to decide which approach is most suitable for the underlying problem. In these cases not only the raw distance of curves may be of interest, but also whether the shape of selected effects is correct. For example, if the true effect is strictly decreasing, the selected effect may only be acceptable if it is also decreasing and not, for instance, unimodal. On the contrary, if the true effect is unimodal, the position and/or size of the mode might be of great importance. In this case, an estimated non-unimodal effect would be unacceptable, though it might give a good ABCtime. Thus a qualitative measure as proposed by Binder, Sauerbrei and Royston [

We thank Maxime Look and John Foekens (Rotterdam breast cancer series) for making the data publicly available. We thank Clemens Wachter for his help in the preparation of the manuscript. Willi Sauerbrei and Anika Buchholz gratefully acknowledge the support from Deutsche Forschungsgemeinschaft (SA 580/8-1).

Many approaches for time-varying effects are sensitive to extreme survival times. If a data set contains many extremely small or large survival times, these time points may distort functional forms strongly, resulting in an inappropriate functional form or artefacts at the edges.

Royston and Sauerbrei [

where

with

| Variable | p value of Grambsch-Therneau test | | | |
|---|---|---|---|---|
| Age | 0.7935 | 0.8308 | 0.6698 | 0.7721 |
| T2 or T3 tumour | 0.0224 | 0.0382 | 0.0577 | 0.0301 |
| T3 tumour | 0.2178 | 0.1249 | 0.1023 | 0.1491 |
| Grade | 0.9687 | 0.9494 | 0.8563 | 0.9573 |
| Transformed nodes | 0.0052 | 0.0009 | 0.0006 | 0.0015 |
| Transformed progesterone | | | | |
| Hormonal therapy | 0.0035 | 0.0005 | 0.0020 | 0.0016 |
| Chemotherapy | 0.0298 | 0.0022 | 0.0017 | 0.0067 |
| Prognostic Index | | | | |

| Variable: weights | AUR^{SSSR} | ABCtime (SD): CoxPH | ABCtime (SD): FPT | ABCtime (SD): Timecox | pABCtime in % (95% bootstrap percentile interval): CoxPH | pABCtime in % (95% bootstrap percentile interval): FPT | pABCtime in % (95% bootstrap percentile interval): Timecox |
|---|---|---|---|---|---|---|---|
| | 13.749 | 2.966 (0.734) | 0.616 (0.322) | 0.85 (0.38) | 21.574 (9.42, 34.56) | 4.479 (3.5, 12.79) | 6.182 (3.84, 15.34) |
| | 14.289 | 2.056 (0.491) | 0.667 (0.201) | 0.656 (0.294) | 14.385 (7.61, 22.29) | 4.666 (3.28, 9.18) | 4.591 (3.32, 11.48) |
| | 14.958 | 2.666 (0.605) | 0.644 (0.276) | 0.909 (0.349) | 17.825 (9.67, 26.54) | 4.306 (3.42, 11.24) | 6.078 (3.88, 13.13) |
| | 15.426 | 2.384 (0.535) | 0.631 (0.251) | 0.874 (0.335) | 15.452 (8.77, 22.5) | 4.09 (3.25, 9.91) | 5.666 (3.68, 12.66) |
| hormon: equal weights | 2.648 | 0.818 (0.441) | 0.818 (0.678) | 0.817 (0.621) | 30.867 (16.72, 86.28) | 30.867 (14.28, 99.34) | 30.863 (21.09, 171.78) |
| hormon: inverse reference variance based weights | 2.695 | 0.854 (0.373) | 0.854 (0.321) | 0.853 (0.389) | 31.671 (13.93, 79.86) | 31.671 (13.03, 71.96) | 31.666 (19.18, 117.57) |
| hormon: logrank like weights | 2.509 | 0.845 (0.403) | 0.845 (0.396) | 0.845 (0.45) | 33.692 (16.81, 86.02) | 33.692 (15.76, 85.03) | 33.69 (23.14, 145.48) |
| hormon: inverse mean variance based weights | 2.648 | 0.818 (0.378) | 0.818 (0.318) | 0.817 (0.4) | 30.867 (15.88, 87.52) | 30.867 (14.5, 83.73) | 30.863 (21.74, 123.32) |
| PI: equal weights | 8.055 | 2.438 (0.388) | 0.429 (0.165) | 0.592 (0.184) | 30.264 (18.85, 42.29) | 5.326 (4.31, 12.51) | 7.35 (4.21, 13.48) |
| PI: inverse reference variance based weights | 8.523 | 1.7 (0.263) | 0.505 (0.113) | 0.465 (0.147) | 19.947 (13.06, 27.47) | 5.931 (4.25, 9.76) | 5.451 (3.69, 10.65) |
| PI: logrank like weights | 9.041 | 2.151 (0.321) | 0.473 (0.134) | 0.605 (0.175) | 23.79 (16.12, 31.4) | 5.227 (4.11, 10.19) | 6.689 (4.02, 11.75) |
| PI: inverse mean variance based weights | 9.494 | 1.935 (0.29) | 0.503 (0.122) | 0.568 (0.168) | 20.376 (14.05, 26.7) | 5.3 (4.1, 9.62) | 5.982 (3.69, 10.75) |