Marginal Distribution Plots for Proportional Hazards Models with Time-Dependent Covariates or Time-Varying Regression Coefficients

Given a sample of regression data from ( ) , Y Z , a new diagnostic plotting method is proposed for checking the hypothesis 0 H : the data are from a given Cox model with the time-dependent covariates Z . It compares two estimates of the marginal distribution Y F of Y . One is an estimate of the modified expression of Y F under 0 H , based on a consistent estimate of the parameter under 0 H , and based on the baseline distribution of the data. The other is the Kaplan-Meier-estimator of Y F , together with its confidence band. The new plot, called the marginal distribution plot, can be viewed as a test for testing 0 H . The main advantage of the test over the existing residual tests is in the case that the data do not satisfy any Cox model or the Cox model is mis-specified. Then the new test is still valid, but not the residual tests and the residual tests often make type II error with a very large probability.

It is often that the external time-dependent covariate i Z can be written as model with k cut-points 1 , , k a a  [3].  [9] are then introduced. These methods provide more symmetric plots and are useful in identifying the outliers.
Moreover, one may use Schoenfeld residuals or the cumulative sums form [10].
The residual plotting methods also induce several residual tests for the PH models.
We provide a new diagnostic plotting method for the PH model. The paper is organized as follows. In Section 2, we propose the MD plot and other supplementary diagnostic plots. In Section 3, we present simulation results on the performance of the plot. We also compare the MD plot to the current residual plots. In Section 4, we apply the new diagnostic plot to the long-term breast cancer follow-up data analyzed in Wong et al. [11]. Section 5 is a concluding remark, where we explain that the MD plot can also be served as a naive hypothesis test for the PH models.

The Main Results
The assumption and notations are given in § 2.1. The idea of the marginal approach is introduced in § 2.2. The method is explained in § 2.3 and § 2.4.

Assumptions and Notations
Let ph Θ be the collection of all PH models specified by In view of Proposition 1, hereafter, WLOG, we can assume that AS1. The zero vector 0 is a mode of U and it satisfies that Otherwise, let o U be a mode of U and define . If one fits the TIPH model The proof of Lemma 1 is trivial and is skipped.   (2) is true. We thus call this plot the marginal distribution plot.

The Marginal Distribution Plot
Since the main issue of the MD plot is the estimator ˆY S * and it is not trivial (3), the main focus of the paper is to introduce how to construct ˆY S * . We shall explain in details how to obtain ˆY S * through various ( ) g t . We also give ˆY S * for the general piecewise continuous ( ) .
For simplicity, we shall first explain our method when U or Z is a univariate We introduce the generalization to the case of a covariate vector (or matrix) in Remark 3 and to the time-dependent model for general Since 0 is a mode of the covariate U by assumption AS1, a consistent estimator of o S is the KME, denoted by ˆo S , based on the data satisfying ), and r is a given positive number (e.g., the inter-quartile-range or the standard deviation of i U 's). WLOG, let the first o n observations be all the observations satisfying For ease of explanation, we only consider the case of complete data in this section. The extension to the right censored data is straightforward and we present simulation results with right-censored data in Section 3. For the complete data, the KME of o S based on the first o n ob- It means that the estimator Y S  using 0 S  will always suggest that the given TIPH model fits the regression data even though the data set may not satisfy the TIPH model, and hence 0 S  is not a proper choice. Simulation study in § 3.1 suggests that under the assumption in Example 1, Y S  converges to ( ) Y S t in probability and using Y S  fails to identify the wrong model assumption.

Estimation of ( )
, , , We shall first illustrate the main idea through three typical cases when β is a ., the LDPH model. We also discuss the general case of ( ) , , and , where β is a consistent estimator of β under the selected PH model, e.g, the MPLE.
Remark 3. Even though U in (4) and (5) is a random variable, it is easy to extend to the case that U is a vector. Assume  is a pdimensional random vector, n is much larger than p and U is bounded. We can define In the simulation study, we know the mode of U . In applications, the sample mode is not well defined. We choose a "sample mode" of i U 's, say p c ∈  as follows: 1) Select a proper radius r and choose points q such that in a neighborhood of q with rasius r, , there are more than 20 (or 2) Case of ( ) ( ) ( ) ( ) , , , , .
where i a may depend on the i -th observation. And ( ) ( ) ( ) ( ) 1 ,1 g t t a t a ′ = < ≥ corresponds to a special case of the PWPH model with one cut-point. For the PWPH model with more than one cut-point, the derivation of ˆY S * is similar. Typically for the PWPH model with two cut-points, , , , , , , , where β is an estimator of β .

Supplementary Diagnostic Plots
The MD plot needs to know ( ) g t . One possible way to conjecture the form of the function ( ) g t for a time-dependent covariate is to extend the PWPH plotting method in Wong et al. [11] as follows: lot ln , ln for 0 and check whether it is piecewise linear, r is a positive constant and 1 z belongs to the support of Z . For instance, for a PWPH model defined in (6), y a b x = + , and the cut-points can be determined from the PWPH plot (see Figure 3 in Section 3).
Let 1 be all the distinct exact observations. In view of the expression in (7) under the PH model with ( ) ( ) ( ) where 1 S and 1 z are given in (11) and (12). Thus another diagnostic plotting method is and check whether it appears as a two-piecewise-linear curve: one is 0 y = and another one is where a is the intersection of the two line segments. We shall call the plotting method the LDPH plot, as ( ) ( ) ( ) 1 g t t a t a = − ≥ corresponding to the LDPH model. The advantage of the PWPH plot and the LDPH plot is that they provide clues on the cut-points, which are needed in the MD plot, unless the cut-point is given.
Remark 4. If the cut-points in the PWPH or TDPH model vary from observation to observation, then the PWPH plot as in (11) and LDPH plot as in (13) where a is a predetermined reference cut-point and 1 r and 2 r are positive constants.

Simulation Study
In order to compare the MD plot to the other plots, we present three sets of simulation results in § 3.1, § 3.2 and § 3.3, respectively. The mode is 0 as assumed in AS1. o S can be uniform or exponential distributions. The covariate can be discrete or continuous. The data do not need to be from a PH model.
There is no unit in time t due to the simulation.

Simulation Study under the Assumptions in Example 1
Two random samples of The sample of size 300 n = is only used for the MD plots in panels (1,3) and

Defined in Remark 2
To show that the consistent estimator of o S is the key in the MD approach, we Y W 's, respectively. It is seen that the curves of "ph est" and ˆY S almost coincide for both data sets even when 300 n = . These plots suggest that the curve of "ph est" is always close to the curve of the KME , regardless of whether the pre-assumed model fits the data. Thus Y S  is not a proper choice for the diagnostic plot.

About Residual Plots
In panels (2,1) and (4,1) of Figure 1, the residual plots using the score process by the cumulative martingale residuals method are plotted with data ( ) respectively. There is no big difference in pattern between these two pairs of plots. Thus these two types of residual plots can not tell whether the data satisfy the TIPH model in this example.

About Residual Tests
We also carried out simulation study on the testing

Simulation Based on Data from the TIPH Model
A sample of complete data with 300 n = is generated from the TIPH model: Thus the covariate is correctly specified.

Residuals Plots
Panels (1,1) and (2.1) in Figure 2 display the residual plots using the score process by the cumulative martingale residuals method by fitting data from the first and the second TIPH model respectively into ( )

Residuals Tests
In fact, with the data from the first mis-specified model and with a moderate sample sizes 50 n ≥ , our simulation study with a replication of 5000 suggests that the residual test in the existing R package (e.g., ( ) cox.zph ) would not reject the mis-specified TIPH model for more than 70% of the time. Thus it is not surprised that the residual plots cannot detect the mis-specified TIPH model.

MD Plots
The MD plot in panel (1,3) fits the mis-specified TIPH model with data from the first sample. It successfully identifies that the functional form of the covariates Z is mis-specified for the first data set, as ˆY S * is almost totally outside the 95% confidence band of ˆY S . In other words, the MD approach suggests that the first data set does not follow the PH model ( ) On the other hand, the MD plot in panel (2,3) successfully identifies that the functional form of the covariates Z is correct for the second data set, as ˆY S * is totally inside the 95% confidence band of ˆY S .
Based on the second sample, the modified PWPH plot and the LDPH plot are displayed in panels (3,2) and (4,2); the MD plots under the PWPH and LDPH Models are displayed in panels (3,3) and (4,3). The PWPH plot in panel (3,2) suggests that the data are either from a TIPH Model, or from a PWPH model Both 1 β and 2 β are not significantly different from β , as their differences from β are within two SEs. Both MD plots in panels (2,3) and (3,3) suggest that the PWPH model with at most one cut-point fits the data, as expected, as both curves of ˆY S * are totally inside the 95% confidence band of ˆY S .
The LDPH plot in panel (4,2) suggests that the LDPH Model may fit the data, but it is seen from panel (4,3) that even within the interval [0,1], only less than 30% of the curve of ˆY S * lies inside the confidence band of ˆY S . Thus the MD plot suggests that the data are not from the LDPH Model, as expected. It is seen that the LDPH plot performs not as good as the MD plot.

Simulation on the LDPH Model
The modified PWPH plots and the LDPH plot are given in panels (1,1), (2,1) and (3,1) in Figure 3. The MD plots and the qqplots for the TIPH, PWPH and LDPH Models are given in the second and the third columns of Figure 3.
Both the PWPH plot and the MD plots with corresponding qqplot in panels (1,1), (1,2) and (1,3) suggest that the data are not from the TIPH Model. We However, the MD plot in panel (2,2) suggests that the data are not from the PWPH model. This again indicates that the MD plot performs better than the modified PWPH plot.
The LDPH plot in panel (3,1) in Figure 3 suggests that the data are from the LDPH Model with the cut-point a in ( ) 0.2, 0.4 . The MD plot and the corresponding qqplot in panels (3,2) and (3,3) suggest that the data are from the LDPH Model, as expected. Thus the LDPH plot has the advantage that it can suggest the cut-point a in the case that a is not given in the LDPH Model.
Since U is discrete in this case, the PWPH plot and the LDPH plot perform better than those in the previous simulation studies when U is continuous.

Data Analysis
A common situation that will involve the use of the PH model is a long-term clinical follow-up study. In such a study, the impact of a prognostic variable may change at different time periods. This is the case in the breast cancer data analyzed in Wong et al. [11]. The data are obtained from an Institutional Review One objective of the study is to investigate whether tumor diameter is significant in predicting early or late relapse. Then the relapse time Y is the response and the tumor diameter Z is the covariate. Clinical consideration and survival plots suggest late failure can be considered at time greater than 5 years from initial breast cancer surgery. Data analysis based on a PWPH model with two cut-points at 2 years and 5 years with covariate Z is carried out in Wong et al. [11]. We apply our diagnostic plotting methods to check whether such a PWPH model fits our data in the case that Z is continuous and Z is discretized as a dichotomized variable ( 2 > cm or 2 ≤ cm), as is done in Wong et al. [11]. In the latter case, we try the log-log plots as well.
We apply our data to the TIPH model with covariate Z (which is basically continuous), the regression coefficient is ˆ0 .35 β = and is significant. We shall only present our new methods for the case that the tumor diameter Z is continuous in panel (2,1) of Figure 4. The modified log-log plots in panel (1,1) seems to suggest that the TIPH Model fits the data, but the MD plot in panel (2,1) suggests that the TIPH Model does not fit our data, as the curve of ˆY S * lies almost totally outside the 95% confidence band of ˆY S . Thus the partial likelihood estimate ˆ0 .35 β = is not valid.
Wong et al. [11] suggest to do data analysis by dichotomizing Z according . Then the covariate is discrete, and the standard log-log plots, as well as the PWPH plot proposed by Wong et al. [11], is applicable. These two plots are given in panels (1,2) and (1,3) of Figure 4. In view of the graph, the log-log plots suggest that the TIPH Model does not fit the discretized data. Moreover, it does not present useful information on the other possible PH models. However, the PWPH plot in panel (1,3) does suggest that the PWPH model with two cut-points close to 1.5 years and 6 years fits the discretized data. Wong et al. [11] suggest to do data analysis using a PWPH model with the cut-points at 2 and 5 years (which are close to 1. 5 Figure 4, we provide the MD plots based on the discrete and continuous covariates, respectively, for fitting the PWPH model. In particular, since the curve of ˆY S * lies totally inside the 95% confidence band of ˆY S in panel (2,2), the MD plot suggests that the PWPH model considered in Wong et al. [11] fits the discretized data. Thus their data analysis is valid. However, the MD plot in panel (2,3) indicates that the original data set with the continuous covariate Z does not fit a PWPH model, as the curve of ˆY S * lies totally on one edge of the 95% confidence band of ˆY S .

Concluding Remarks
Our simulation results and the data analysis suggest that the MD plot has certain advantages over the existing residual plots, especially when the null hypothesis 0 H in (2) is mis-specified or the data are not from any PH model. Our MD plot does not involve residuals studied in the literature, and this is the first difference between the residual approaches and the MD approach.

Drawback and Remedy of the MD Plot
The MD approach is closely related to , where the parameter β is unknown but is given. The MD plot is applicable to all PH models with and with all types of covariates U , provided that ( ) ( ) is given. The assumption of a given ( ) ( ) may be viewed as a drawback of the MD plot if one wants to find a certain PH model to fit the data. There are several ways to overcome this drawback.
1) For the case that , we also propose a modified PWPH plot and LDPH plot in § 2.4 for inferring the functional form of ( ) g t . 2) One can apply the MD approach to several possible typical models. For instance, in § 3.2 and § 3.3, given a data set, the MD approach finds which of the 3 semi-parametric PH models fits.
3) Of course, one can also make use of the existing residual approaches in the literature for guessing ( ) g t . There is no harm to inspect all diagnostic plots available based on the data. On the other hand, the MD plot can further check the validity of the function forms suggested by the existing residual plots. As illustrated in the paper, with the help of the confidence band of the KME ˆY S , it is more reliable and more informative than the residual plots on whether the model suggested by the residual plots is appropriate for the data. Thus the MD plot is a nice complement to the existing diagnostic methods, not a replacement to them.

A Naive but Valid Model Test
As seen from the simulation results, when the data are not from any PH model (see Example 1 in § 3.1) or 0 H is mis-specified (see simulation on the TIPH Model in § 3.2), the MD plots can successfully reject