Copy Mean: A New Method to Impute Intermittent Missing Values in Longitudinal Studies

Longitudinal studies are those in which the same variable is repeatedly measured at different times. These studies are more likely than others to suffer from missing values. Since missing values may have an important impact on statistical analyses, they must be dealt with properly. In this paper, we present “Copy Mean”, a new method to impute intermittent missing values. We compared its efficiency with that of eleven other imputation methods dedicated to the treatment of missing values in longitudinal data. All these methods were tested on three markedly different real datasets (stationary, increasing, and sinusoidal pattern) with complete data. For each of them, we generated nine types of incomplete datasets that include 10%, 30%, or 50% of missing data using either a Missing Completely at Random, a Missing at Random, or a Missing Not at Random missingness mechanism. Our results show that Copy Mean is highly effective, exceeding or equaling the performance of the other methods in almost all configurations. The effectiveness of linear interpolation is highly data-dependent. The Last Occurrence Carried Forward method is strongly discouraged.


Introduction
Longitudinal studies are those in which the same variable is repeatedly measured at different times. They are more likely than others to suffer from missing values [1][2][3]. Indeed, subjects frequently miss a clinical visit or fill out a questionnaire incompletely. Missing data have been classified into three main categories [1]: Missing Completely at Random (MCAR) when the missingness probability is independent of the variables, Missing at Random (MAR) when the missingness probability depends only on the observed variables, and Missing Not at Random (MNAR) when the missingness probability may depend on unobserved variables.
When the main analysis involves statistical modeling of the change over time of the longitudinal variable using, for instance, mixed models, the model parameters are generally estimated by maximum likelihood, and it is well known that maximum likelihood estimation is robust to MAR data [2,4,5]. However, selection models and pattern-mixture models have been proposed when the data are MNAR or when a sensitivity analysis to this assumption is performed [2,4-7].
This paper focuses on situations where the main analysis does not involve modeling or likelihood-based methods, such as descriptive studies, exploratory analyses, non-parametric clustering, etc. These kinds of analyses are very sensitive to missing data, even when the missingness mechanism is MAR; imputation methods are then very useful.
Twisk [8] and Engels [3] compared several imputation methods for longitudinal studies. Twisk proposed a classification of imputation methods into two categories: "cross-sectional" methods that impute missing values at time t using information available at time t, and "longitudinal" methods that impute the missing values of an individual i using all the non-missing values of i. Engels suggested four categories: 1) "no personal data" methods do not use information available on individual subjects; 2) "baseline data" methods use the information present at baseline but no time-dependent information; 3) "before data only" methods consider all the information available before the occurrence of the missing value; and 4) "before and after" methods impute the missing values using all available information.
Regarding the evaluation of performance, Engels proposed different indices to compare imputation methods. These indices are mainly based on the difference between the imputed values and the actual values [3].
The present article aims at comparing different imputation methods for missing values in longitudinal studies. Section 2 provides the general framework and the methodology: a formal definition of the concept of missingness, a presentation of the imputation methods, and the criteria used to measure performance. This section reviews the classical methods and presents an original method called Copy Mean. Section 3 presents the design of the simulation study and Section 4 presents the results. A discussion is provided in Section 5.

Notations
Let us consider a set S of n subjects. For each subject, an outcome variable Y is measured at t different times. The value of Y for subject i at a specific time l is noted $y_{il}$. For subject i, the sequence $y_{i\cdot} = (y_{i1}, y_{i2}, \dots, y_{it})$ is called a trajectory. For a specific time l, the vector $y_{\cdot l} = (y_{1l}, y_{2l}, \dots, y_{nl})$ is called a cross-sectional measurement. When $y_{il}$ is missing, the value obtained by using a given imputation method IM is noted $y_{il}^{IM}$.

Classification of Missingness
In their founding documents, Rubin and Little distinguished three kinds of missingness [9,10]: MCAR, MAR, and MNAR. In the MNAR case, the probability that an observation $y_{il}$ is missing at time l typically depends on the current value of Y at time l. For example, if patients who expect to perform badly at time l refuse to be tested at time l, the data will be MNAR.
The impact of the mechanism of missingness on the imputation of the missing values was examined by Molenberghs [11]. In the particular case of longitudinal data, the missing values were also classified according to their position within the trajectory:
• Intermittent missing data are missing within a trajectory. Formally, $y_{il}$ is an intermittent missing value if there exist a and b, with $a < l < b$, such that $y_{ia}$ and $y_{ib}$ are not missing.
• Monotone missing data are missing either at the beginning or at the end of a trajectory. This includes the case of left- or right-censored follow-ups. If a value is missing, then all the following (respectively, preceding) values are also missing. Formally, $y_{il}$ is a (right) monotone missing value if, for all $d > l$, $y_{id}$ is missing.
Some imputation techniques, such as the Linear Interpolation or the Copy Mean (see Sections 2.3.3 and 2.3.4), are not compatible with monotone missingness (at the beginning or at the end of a trajectory). In this article, we will focus on intermittent missing data, either MCAR, MAR, or MNAR.
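The distinction between intermittent and monotone missing values can be illustrated with a short sketch. The function name `classify_missing` and the use of `None` to mark a missing value are assumptions made for this illustration; they do not come from the original study.

```python
def classify_missing(trajectory):
    """Label each time point of one trajectory.

    A missing value (None) is 'intermittent' when an observed value exists
    both before and after it, and 'monotone' otherwise (missing at the
    beginning or at the end of the trajectory).
    """
    observed = [k for k, value in enumerate(trajectory) if value is not None]
    labels = []
    for k, value in enumerate(trajectory):
        if value is not None:
            labels.append("observed")
        elif observed and observed[0] < k < observed[-1]:
            labels.append("intermittent")
        else:
            labels.append("monotone")
    return labels


# Hypothetical trajectory with left monotone, intermittent, and right monotone gaps.
example = [None, 2.0, None, None, 3.5, None]
print(classify_missing(example))
```

Only the positions labeled "intermittent" fall within the scope of this article.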

Imputation Methods
Herein, 12 imputation methods are compared. They were grouped according to the information necessary for their implementation and are summarized in Table 1.

No Information
Only the complete-case method does not require information.
1) Complete case method: This method removes any trajectory with one or several missing values [10]. Particularly radical, it is also the easiest method to implement. Nevertheless, it has serious drawbacks [12], including a major loss of information and biases as soon as the data are not MCAR.

Cross-Sectional Imputation
These methods use only data collected at a given time (the time at which the value is missing). A missing value at time l is imputed from the values of the other individuals observed at time l, i.e., from the cross-sectional measurement $y_{\cdot l}$.
In the iterative linear regression approach, each cross-sectional measurement $y_{\cdot l}$ is imputed using a predictive model. Then the process is iterated: a new model is constructed for $y_{\cdot 1}$, whose values are again calculated, then for $y_{\cdot 2}$, and so on. Each iteration allows a little more precision in estimating the missing values. After a predetermined number of iterations, the process stops. In this article, the initialization was done using Cross Mean and the process was iterated 10 times.
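The iterative process described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes ordinary least squares (with a tiny ridge term for numerical stability) instead of the predictive mean matching of the mice package, it uses `None` to mark missing values, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - partial) / m[r][r]
    return x


def fit_least_squares(rows, targets, ridge=1e-8):
    """Return regression coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) + (ridge if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predictors(row, l):
    """All values of one trajectory except the one at time l."""
    return [value for k, value in enumerate(row) if k != l]


def regression_impute(data, n_iter=10):
    """Impute None entries: Cross Mean initialization, then iterated regressions."""
    n, t = len(data), len(data[0])
    missing = [[value is None for value in row] for row in data]
    filled = [row[:] for row in data]
    # Initialization: replace each missing value by the mean observed at that time.
    for l in range(t):
        seen = [data[i][l] for i in range(n) if not missing[i][l]]
        for i in range(n):
            if missing[i][l]:
                filled[i][l] = sum(seen) / len(seen)
    # Iterations: regress each cross-sectional measurement on all the others.
    for _ in range(n_iter):
        for l in range(t):
            train = [i for i in range(n) if not missing[i][l]]
            if len(train) == n:
                continue  # nothing to impute at this time
            beta = fit_least_squares([predictors(filled[i], l) for i in train],
                                     [filled[i][l] for i in train])
            for i in range(n):
                if missing[i][l]:
                    x = predictors(filled[i], l)
                    filled[i][l] = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return filled
```

In this sketch, the models are always fitted on the subjects whose value at time l was actually observed, while the predictors may contain previously imputed values; this is one possible reading of the procedure and is stated here as an assumption.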

Cross-Sectional and Longitudinal
Imputation using Covariates (External)
Finally, it is possible to use all the information, including some covariates measured at baseline:
13) Linear Regression, External: the principle is the same as for the internal linear regression (iterative process on all cross-sectional variables), but the predictive model for $y_{\cdot l}$ is a function of both the other cross-sectional measurements $y_{\cdot l'}$ ($l' \neq l$) and some covariates.

Simulation

Data Generation
The present simulation study was performed using three existing datasets with complete data. Several incomplete datasets were obtained by generating missing values according to different schemes. To be as general as possible, we worked on three datasets with very different characteristics.

The Three Datasets
Pregnanediol: The first dataset (Figure 2(a)) comes from a study on human menstrual cycles [17]. The initial aim of the study was a search for biomarkers for accurate prediction of ovulation. One hundred and two women were recruited from eight natural family planning clinics.
Fish: The second dataset (Figure 2(b)) comes from a study on an automatic pattern recognition system applied to the monitoring of fish migration [18]. It included 350 individuals. The main variable is continuous in the range [−1.83; 1.95] (overall mean: 0.16; overall standard deviation: 0.89). The trajectories present some large variations and are close to sinusoidal functions. The dataset has no missing values but the covariates were not accessible; thus, methods that use covariates were not tested on this dataset.
Alcohol: The third dataset (Figure 2(c)) comes from the Quebec Longitudinal Study of Child Development led by the GRIP [19]. In this study, 1831 participants were interviewed retrospectively; thus, the data show a very low rate of missingness. The monthly alcohol consumption was rated on a four-point scale (0 to 4, overall mean: 1.18; overall standard deviation: 1.09). The main feature of this study is the stability of the values over time. Three trajectories had missing values (0.16% of total); they were removed from the study. The covariates selected were: sex, happiness scores, income, tobacco consumption, and expenditure on tobacco.

Generation of Missing Values
Several methods may be used to generate missing values [20]. In the present article, for each of the 3 complete datasets, we generated 9 (3 × 3) types of incomplete datasets that included 10%, 30%, or 50% missing data using either an MCAR, a MAR, or an MNAR missingness mechanism. This process was repeated 500 times. Thus, 13,500 datasets (3 × 9 × 500) were simulated. The incomplete datasets on pregnanediol and alcohol were analyzed with the 12 imputation methods. The incomplete datasets on fish were analyzed with only the 11 methods that do not require external data.
C. GENOLINI ET AL.
To generate intermittent missing values in a complete dataset, we defined the probability that each value is missing (the first and last values of each trajectory were always observed). In the MCAR case, this probability is independent of Y. In the MAR case, the probability depends on the last observed value preceding $y_{il}$. Finally, in the MNAR case, the probability depends on the current value $y_{il}$.
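The three generation schemes can be sketched as below. The exact probability functions of the study are not reproduced here; the logistic form in `missing_probability`, the function names, and the use of `None` for a missing value are assumptions chosen only to illustrate the dependence structure (lower values being more likely to be missing in the MAR and MNAR cases).

```python
import math
import random


def missing_probability(value, rate, center, scale):
    """Hypothetical probability function: lower values are more often missing."""
    return min(1.0, 2.0 * rate / (1.0 + math.exp((value - center) / scale)))


def generate_missing(data, rate, mechanism, seed=0):
    """Return a copy of data with intermittent missing values (None).

    data      -- list of complete trajectories (lists of numbers)
    rate      -- target probability of missingness for interior values
    mechanism -- "MCAR", "MAR", or "MNAR"
    """
    rng = random.Random(seed)
    values = [v for row in data for v in row]
    center = sum(values) / len(values)
    variance = sum((v - center) ** 2 for v in values) / len(values)
    scale = math.sqrt(variance) or 1.0
    result = []
    for row in data:
        new_row = list(row)
        last_observed = row[0]
        # The first and last values are always observed.
        for l in range(1, len(row) - 1):
            if mechanism == "MCAR":
                p = rate  # independent of Y
            elif mechanism == "MAR":
                # depends on the last observed value preceding the current one
                p = missing_probability(last_observed, rate, center, scale)
            elif mechanism == "MNAR":
                # depends on the current (possibly unobserved) value
                p = missing_probability(row[l], rate, center, scale)
            else:
                raise ValueError("mechanism must be 'MCAR', 'MAR', or 'MNAR'")
            if rng.random() < p:
                new_row[l] = None
            else:
                last_observed = row[l]
        result.append(new_row)
    return result
```

Because the first and last values are never removed, the overall proportion of missing values in this sketch is slightly below `rate`.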

Imputation Quality Comparison Criteria
To assess the quality of the different imputation methods, we considered the deviation, which is the difference between the true and the imputed value [3]. The deviation then leads to three criteria: 1) the Bias is the mean of the deviations; 2) the Mean Absolute Deviation (MAD) is the average of the absolute deviations; and 3) the Root Mean Square Deviation (RMSD) is the square root of the mean of the squared deviations. When $y_{il}$ is the real value that method IM imputed as $y_{il}^{IM}$, and $d_{il}$ denotes the corresponding deviation, the Bias is $\frac{1}{m}\sum d_{il}$, the MAD is $\frac{1}{m}\sum |d_{il}|$, and the RMSD is $\sqrt{\frac{1}{m}\sum d_{il}^{2}}$, the sums running over the missing values and m being the total number of missing values.
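The three criteria can be computed as in the following sketch. The function name is hypothetical, and the sign convention for the deviation (imputed value minus true value) is an assumption; it affects only the sign of the Bias, not the MAD or the RMSD.

```python
import math


def imputation_criteria(true_values, imputed_values):
    """Return (bias, mad, rmsd) over the m imputed values.

    Assumed convention: deviation = imputed value - true value.
    """
    deviations = [imp - true for true, imp in zip(true_values, imputed_values)]
    m = len(deviations)
    bias = sum(deviations) / m
    mad = sum(abs(d) for d in deviations) / m
    rmsd = math.sqrt(sum(d * d for d in deviations) / m)
    return bias, mad, rmsd


# Illustrative values only (not taken from the study).
bias, mad, rmsd = imputation_criteria([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.0, 4.0])
print(bias, mad, rmsd)
```

In this toy example, the positive and negative deviations partly cancel in the Bias but not in the MAD or the RMSD, which is why the three criteria are reported separately.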

Methods and Software
All the analyses were performed with R software [21]. Classical and new imputation methods have been programmed and published in the package longitudinalData on CRAN [22]. The spline imputation method was programmed using the stats package [13,23]. Imputations needing linear regression used the function mice (mice package) with the method "predictive mean matching" [24].
The analysis of the results showed that the missingness mechanism and the type of dataset had an impact on the performance of the methods, whereas the percentage of missing data did not. Thus, for brevity, only the tables relative to 30% missing data are presented in the main text. The full results are given in the Appendix.

Mean Absolute Deviation Results
The Mean Absolute Deviation (MAD) is the average of the absolute deviations between the real values and the imputed values. Table 2 presents the mean result for each method according to the missingness mechanism and the type of dataset. For better readability, the results were standardized: in each case (each column), the performance of the best method (the lowest MAD) was set to 1 so that all other results are multiples of this reference value. There were no marked differences between MCAR, MAR, and MNAR. Only the Spline Interpolation method performed poorly with MAR on the Alcohol dataset. This was probably because, with MAR, long series of contiguous missing values are more likely; in such cases, the Spline Interpolation method imputes by polynomials whose values lie far from the original curve.

Root Mean Square Deviation Results
Table 3 presents the Root Mean Square Deviation results. Here too, the results were standardized: the performance of the best method (the lowest RMSD) was set to 1 so that all other results are multiples of this reference value. In Table 3, the best performance values (1.4 or lower) are highlighted in bold. The threshold of 1.4 was chosen arbitrarily. The results with the Root Mean Square Deviation were close to those obtained with the MAD criterion. They are detailed in the Appendix.
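The standardization used for the MAD and RMSD tables can be sketched as follows. The function names and the numbers are purely illustrative (they are not results of the study), and the sketch assumes strictly positive scores so that the division is defined.

```python
def standardize_columns(scores):
    """Divide each column by its best (lowest) value.

    scores -- dict mapping a method name to its list of raw scores,
              one score per configuration (column); lower is better.
    """
    methods = list(scores)
    n_columns = len(scores[methods[0]])
    best = [min(scores[m][c] for m in methods) for c in range(n_columns)]
    return {m: [scores[m][c] / best[c] for c in range(n_columns)] for m in methods}


def good_methods(standardized, threshold):
    """For each column, list the methods whose standardized score is <= threshold."""
    n_columns = len(next(iter(standardized.values())))
    return [[m for m, row in standardized.items() if row[c] <= threshold]
            for c in range(n_columns)]


# Hypothetical raw scores for three methods in two configurations.
raw = {"Method A": [0.50, 2.0], "Method B": [1.00, 1.0], "Method C": [0.55, 3.0]}
standardized = standardize_columns(raw)
print(standardized)
print(good_methods(standardized, threshold=1.2))
```

With this convention, the best method in each column scores exactly 1, and a threshold such as 1.2 or 1.4 selects the methods that are within 20% or 40% of the best one.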

Bias Results
Table 4 presents the results for bias. The "good methods" (bias between −0.03 and +0.03) are highlighted in bold. The thresholds of −0.03 and +0.03 were arbitrarily chosen. Most methods had little or no bias: 60.2% had a bias ranging between −0.03 and +0.03 and 69.9% a bias between −0.05 and +0.05. There were important differences in bias between the MCAR, MAR, and MNAR mechanisms. The bias was slightly larger with MAR than with MCAR and even larger with MNAR (see Table 4). This is because, in the MAR and MNAR mechanisms, low values are the most likely to be missing.

Summary
Table 5 summarizes the results obtained with all the methods and criteria. Each column shows how many times a method performed particularly well according to the above-defined criteria (Tables 2-4).

Discussion
In this article, we compared different methods for imputing trajectories. Missing data were generated according to three different mechanisms (MCAR, MAR, and MNAR) in three datasets exhibiting strong structural differences. Eleven conventional methods and one original technique were compared according to three performance criteria: the Mean Absolute Deviation, the Root Mean Square Deviation, and the Bias. Because the evaluation criteria are numerous, it is difficult to conclude such a study by asserting that a given method is superior to all others. Still, in many cases, this study showed the particular efficiency of Copy Mean. This method was the only one that gave correct results in all configurations. Linear Interpolation also exhibited good results but showed some weakness on some types of data. In agreement with previous studies [25,26], the well-known LOCF should be avoided as often as possible because it achieved a correct performance only when the data were fairly constant over time. In all other cases, it showed poor performance. Finally, some other techniques also gave rather poor results and should be avoided: the linear regressions and the conventional techniques (Spline Interpolation, Traj Median, Traj Hot Deck, Cross Mean, Cross Hot Deck, Traj Mean, Cross Median, LOCF).
Figure 3 gives an intuitive idea of the relative performance of some representative methods. The cross-sectional method (Cross Mean in the example) was not effective when the individual trajectories were far from the average trajectory of the population. Conversely, Linear Interpolation gave good results except with the Fish dataset (Figure 3(b)). This is mainly because it ignores the global variations of the population. LOCF had low performance in all situations. Finally, Copy Mean performed as well as the best techniques in all settings (close to Linear Interpolation in Figures 3(a) and 3(c), as good as Cross Mean in Figure 3(b)).
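The contrast between LOCF and Linear Interpolation discussed above can be illustrated with a short sketch. It assumes equally spaced measurement times and uses `None` to mark missing values; the function names are chosen for this illustration only.

```python
def locf(trajectory):
    """Replace each missing value by the last value preceding it."""
    filled = list(trajectory)
    for k in range(1, len(filled)):
        if filled[k] is None:
            filled[k] = filled[k - 1]
    return filled


def linear_interpolation(trajectory):
    """Fill intermittent gaps by joining the two surrounding observed values.

    Assumes equally spaced times; monotone gaps are left untouched.
    """
    filled = list(trajectory)
    observed = [k for k, value in enumerate(trajectory) if value is not None]
    for a, b in zip(observed, observed[1:]):
        slope = (trajectory[b] - trajectory[a]) / (b - a)
        for k in range(a + 1, b):
            filled[k] = trajectory[a] + slope * (k - a)
    return filled


# Hypothetical increasing trajectory: LOCF stays flat, interpolation follows the trend.
increasing = [1.0, None, None, 4.0]
print(locf(increasing))
print(linear_interpolation(increasing))
```

On a constant trajectory the two methods give the same values, which is consistent with LOCF being acceptable only when the data are fairly constant over time.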

Limitations
In the present study, we used three datasets with marked differences in terms of shape, number of individuals, number of repeated measurements, and type of the outcome variable.Nevertheless, because these datasets were only examples, a generalization of our results to other datasets should be examined with caution.
Besides, the present results are valid only for intermittent missingness. As mentioned above, the Copy Mean and the Linear Interpolation techniques are not applicable to monotone missingness patterns. It is, of course, possible to extend them in different ways (the longitudinalData package proposes four solutions to extend these methods to monotone missingness), but their effectiveness in this setting has not been studied yet. It would be interesting to check whether the present results (high efficiency of the Copy Mean and partial efficiency of Linear Interpolation) can be confirmed in the case of monotone missingness.

A1. MAD
A1.1. Set Pregnanediol
The eight natural family planning clinics were located in Aix-en-Provence, Dijon, and Lyon (France), Milano and Verona (Italy), Düsseldorf (Germany), Liège (Belgium), and Madrid (Spain). Urine pregnanediol-3α-glucuronide was measured before ovulation. This variable is continuous, in the range [0.05; 26.6] mg/L (overall mean: 11.5 mg/L; overall standard deviation: 18.3). The trajectories of this variable are non-stationary and increasing. Of the 102 trajectories, two (1.96% of total) had missing values. These trajectories were removed from the present study. Because some imputation methods require the use of covariates, we chose five covariates more or less correlated with the longitudinal variable under study: weight, size, age at menarche, number of children, and current age.

Figure 1. Copy Mean imputation. The individual trajectory $y_{i\cdot}$ is in black; the mean trajectory $\bar{y}$ is in red. The dotted lines are the values imputed by linear interpolation. The dashed lines are the values imputed by Copy Mean.

Figure 2. Graphical representations of the three datasets. Individual trajectories are in black. The overall mean trajectories are in red. (a) Pregnanediol; (b) Fish; (c) Alcohol.

Figure 3. Illustration of the strengths and weaknesses of four representative methods. Real trajectories are in black. Real values that have been removed from the trajectory and that should be imputed are in dotted black. Values imputed by the four methods are in color: green = Linear Interpolation; red = Copy Mean; dark blue = LOCF; light blue = Traj Mean.
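MNAR: A value is Missing Not at Random if the probability that $y_{il}$ is missing depends on the unobserved values $Y^{MISS}$, such as the true value $y_{il}^{TRUE}$.
MAR: A value is Missing at Random if the probability that $y_{il}$ is missing is independent of the unobserved values $Y^{MISS}$.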

Table 1. Imputation methods and their characteristics.

Cross-Sectional and Longitudinal Imputation (Cross & Long)
Because the other cross-sectional measurements may also contain missing values, the process is iterative by gradual approximation:
• Initially, all the missing values are imputed (by one of the methods described above).
• A model regressing $y_{\cdot 1}$ as a function of $y_{\cdot 2}, y_{\cdot 3}, \dots, y_{\cdot t}$ is built. Missing values in $y_{\cdot 1}$ are replaced by the values predicted by the model.
• In the same way, all the $y_{\cdot l}$ are imputed using a predictive model.

In Table 2, the performances of the "good methods" are highlighted in bold. The "good methods" are those whose values are between 1 and 1.2. The threshold of 1.2 was chosen arbitrarily. With Pregnanediol data, Copy Mean, Linear Interpolation, LOCF, Traj Median, and Traj Mean were the best. With Fish data, the most effective methods were Copy Mean, Linear Regression Internal, Cross Median, and Cross Mean. All methods that use only longitudinal information performed poorly with this dataset, which is characterized by a strong non-linear trend with low inter-subject variability (see Figure 2(b)). With Alcohol data, Linear Interpolation and Copy Mean gave the best results.