Comparison of Spatiotemporal Fusion Models for Producing High Spatiotemporal Resolution Normalized Difference Vegetation Index Time Series Data Sets

Combining multi-source data with different spatial and temporal resolutions is of great significance for producing Normalized Difference Vegetation Index (NDVI) time series data sets with high spatiotemporal resolution. In this study, four spatiotemporal fusion models were analyzed and compared: the spatial and temporal adaptive reflectance fusion model (STARFM), the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM), the flexible spatiotemporal data fusion model (FSDAF), and the spatiotemporal vegetation index image fusion model (STVIFM). The objectives of this study are to: 1) compare the four fusion models using Landsat-MODIS NDVI images from Banan District, Chongqing; 2) analyze the prediction accuracy quantitatively and visually. The results indicate that STVIFM is more suitable for producing NDVI time series data sets.


Introduction
The Normalized Difference Vegetation Index (NDVI) is a widely used vegetation index (VI) that provides a way of evaluating biophysical and biochemical information related to vegetation growth [1]. Long-term NDVI time-series datasets have been widely used for monitoring ecosystem dynamics. However, because of sensor constraints, it is difficult to obtain NDVI data with both high spatial and high temporal resolution from a single remote sensing instrument [4]. In addition, long periods of cloud cover in some regions aggravate this problem [5]. Thus, spatiotemporal fusion techniques, which combine NDVI data from multiple sensors to achieve both high spatial and high temporal resolution, are a feasible way to acquire remote sensing time series for monitoring surface vegetation dynamics [6] [7].
To date, several spatiotemporal fusion models have been proposed. Gao et al. [8] proposed the spatial and temporal adaptive reflectance fusion model (STARFM) to blend MODIS and Landsat images and produce a synthetic surface reflectance product at 30 m spatial resolution. Building on STARFM, Zhu et al. [9] developed the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM), which introduces a conversion coefficient between pixels and improves the prediction accuracy. Zhu et al. [10] proposed the flexible spatiotemporal data fusion model (FSDAF), which performs better in predicting abrupt land cover changes. Liao et al. [11] developed the spatiotemporal vegetation index image fusion model (STVIFM) to generate NDVI time series images with high spatial and temporal resolution in heterogeneous regions. In this study, we compare the STARFM, ESTARFM, FSDAF, and STVIFM methods using Landsat and MODIS data acquired over the same site, and we quantitatively assess the accuracy of the image predicted by each fusion model.

Study Site and Data Preparation
The study area, shown in Figure 1, is located in Banan District (29˚34'10''N, 106˚57'35''E), Chongqing, and is used to compare the spatiotemporal fusion models. We selected MODIS daily surface reflectance images and Landsat-8 images acquired on three dates in 2015: April 28, August 02, and October 21. All images were pre-processed and converted to NDVI data. The scene subset is shown in Figure 2.
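The NDVI conversion mentioned above can be sketched as follows. This is a minimal illustration only, assuming the red and near-infrared surface reflectance bands are available as nested lists of equal shape; the function name `ndvi`, the sample values, and the `fill` value used where the denominator is zero are assumptions of this sketch, not details reported in this study.

```python
def ndvi(red, nir, fill=0.0):
    """Compute NDVI = (NIR - Red) / (NIR + Red) pixel by pixel.

    red, nir: 2-D lists of surface reflectance with the same shape.
    fill: value used where NIR + Red is zero (an assumption of this sketch).
    """
    out = []
    for red_row, nir_row in zip(red, nir):
        row = []
        for r, n in zip(red_row, nir_row):
            total = n + r
            row.append((n - r) / total if total != 0 else fill)
        out.append(row)
    return out


# Hypothetical 2 x 2 reflectance samples (not data from this study).
red_band = [[0.10, 0.20], [0.05, 0.0]]
nir_band = [[0.50, 0.20], [0.45, 0.0]]
ndvi_image = ndvi(red_band, nir_band)
```

Dense vegetation reflects strongly in the near-infrared and weakly in the red band, so its NDVI approaches 1, while pixels with equal reflectance in both bands give 0.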

STARFM
STARFM is based on a moving-window technique and requires at least one pair of high- and coarse-resolution images at the base time and one coarse-resolution image at the prediction time. A weight function based on the spectral, temporal, and spatial differences determines the contribution of each pixel in the window to the central pixel. A synthetic image with high spatiotemporal resolution, F(t2), is then predicted from the high- and coarse-resolution data through this weight function. The model can be written as Equation (1):
$$F(t_2) = \sum_{i} W_i \times \left[ M(t_2) + F(t_1) - M(t_1) \right] \qquad (1)$$
where F(t1) and M(t1) denote the high- and coarse-resolution data on the base date, M(t2) is the coarse-resolution data on the prediction date, and Wi is the weight function.
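As a rough illustration of the weighting idea described above, the sketch below predicts the central pixel of one moving window as a weighted sum of M(t2) + F(t1) - M(t1) over candidate pixels. It is a simplified sketch under stated assumptions, not the published implementation: the function name `starfm_predict`, the `scale` constant, and the `eps` offset are hypothetical, the inverse-product form of the weights is one common way to combine the three differences, and the selection of spectrally similar pixels is omitted.

```python
def starfm_predict(neighbors, scale=25.0, eps=1e-6):
    """Predict the fine-resolution value of the window's central pixel.

    neighbors: list of (f_t1, m_t1, m_t2, dist) tuples, one per candidate
        pixel in the moving window (dist = distance to the central pixel).
    scale: hypothetical constant controlling the spatial-distance penalty.
    eps: small offset that avoids division by zero (assumption).
    """
    inverse_costs = []
    for f_t1, m_t1, m_t2, dist in neighbors:
        spectral = abs(f_t1 - m_t1) + eps   # fine vs. coarse difference at t1
        temporal = abs(m_t2 - m_t1) + eps   # coarse change from t1 to t2
        spatial = 1.0 + dist / scale        # grows with distance to the centre
        inverse_costs.append(1.0 / (spectral * temporal * spatial))
    total = sum(inverse_costs)
    weights = [c / total for c in inverse_costs]  # normalized to sum to 1
    # Weighted sum of M(t2) + F(t1) - M(t1) over the window.
    return sum(w * (m_t2 + f_t1 - m_t1)
               for w, (f_t1, m_t1, m_t2, _) in zip(weights, neighbors))


# Hypothetical window with two candidate pixels (not data from this study).
window = [(0.5, 0.4, 0.6, 0.0), (0.3, 0.3, 0.5, 10.0)]
prediction = starfm_predict(window)
```

Because the weights are normalized, the prediction always lies between the smallest and largest per-pixel estimates in the window.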

ESTARFM
ESTARFM needs at least two pairs of high- and coarse-resolution images at the base times and one coarse-resolution image at the prediction time. Compared with STARFM, this method not only considers the spatial and spectral similarity between pixels but also introduces a conversion coefficient, which is derived by linear regression from the high- and coarse-resolution data over the observation period. The final high-resolution prediction is computed as in Equation (2):
$$F(t_2) = F(t_1) + \sum_{i} W_i \times V_i \times \left[ M(t_2) - M(t_1) \right] \qquad (2)$$
where F(t1) and M(t1) denote the high- and coarse-resolution data on the base date, M(t2) is the coarse-resolution data on the prediction date, and Wi and Vi denote the weight function and the conversion coefficient, respectively.
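The role of the conversion coefficient can be sketched as follows. This is a minimal illustration, assuming the coefficient is the least-squares slope of fine-resolution values regressed on coarse-resolution values and that normalized weights are already available; the function names, the 1:1 fallback, and the sample numbers are assumptions of this sketch, and the temporal weighting of the predictions from the two base dates is omitted.

```python
def conversion_coefficient(coarse, fine):
    """Least-squares slope of fine-resolution values regressed on coarse ones."""
    n = len(coarse)
    mean_c = sum(coarse) / n
    mean_f = sum(fine) / n
    var_c = sum((c - mean_c) ** 2 for c in coarse)
    if var_c == 0:
        return 1.0  # assumption: fall back to a 1:1 conversion
    cov = sum((c - mean_c) * (f - mean_f) for c, f in zip(coarse, fine))
    return cov / var_c


def estarfm_predict(f_t1, neighbors):
    """Add the weighted, converted coarse change to the base fine value.

    neighbors: list of (weight, v, m_t1, m_t2) tuples for the similar
        pixels in the window; the weights are expected to sum to 1.
    """
    return f_t1 + sum(w * v * (m_t2 - m_t1) for w, v, m_t1, m_t2 in neighbors)


# Hypothetical samples (not data from this study).
v = conversion_coefficient([0.2, 0.4, 0.6], [0.1, 0.5, 0.9])
predicted = estarfm_predict(0.5, [(0.5, v, 0.4, 0.5), (0.5, 1.0, 0.4, 0.6)])
```

A coefficient above 1 means that the fine pixel changes faster than the coarse pixel that contains it, which is typical of a mixed coarse pixel that only partly covers the changing land cover.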

FSDAF
FSDAF uses one pair of high- and coarse-resolution images at the base time and one coarse-resolution image at the prediction time, and it also requires a land cover map. The model integrates STARFM, the linear unmixing method [12], and the thin plate spline (TPS) interpolator, which maintains land cover change signals and local variability. It combines the temporal prediction from linear unmixing with the spatial prediction from TPS and distributes the residuals to the fine pixels to obtain the final prediction, which can be written as Equation (3):
$$F(t_2) = F(t_1) + \sum_{i} W_i \times \Delta F \qquad (3)$$
where F(t1) and F(t2) denote the high-resolution images at the base time and the prediction time, respectively, ΔF is the change between t1 and t2 computed by the linear unmixing method and TPS, and Wi is the weight function.
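The residual distribution step described above can be sketched for a single coarse pixel as follows. This is a simplified sketch under stated assumptions: the function name `distribute_residual` is hypothetical, and the per-pixel weights are supplied by the caller (equal by default), whereas the published model derives them from the TPS prediction and the local homogeneity, which is not reproduced here.

```python
def distribute_residual(delta_coarse, delta_fine_tp, weights=None):
    """Distribute the coarse-pixel residual to the fine pixels inside it.

    delta_coarse: observed change of the coarse pixel between t1 and t2.
    delta_fine_tp: temporal predictions of change for the fine pixels
        inside that coarse pixel (for example, from linear unmixing).
    weights: optional non-negative weights, one per fine pixel; equal
        weights are used when omitted (an assumption of this sketch).
    """
    m = len(delta_fine_tp)
    # Residual: coarse change not explained by the mean fine prediction.
    residual = delta_coarse - sum(delta_fine_tp) / m
    if weights is None:
        weights = [1.0] * m
    total = sum(weights)
    # Each fine pixel receives a share of m * residual proportional to its weight.
    return [d + m * residual * (w / total)
            for d, w in zip(delta_fine_tp, weights)]


# Hypothetical changes for four fine pixels (not data from this study).
fine_changes = distribute_residual(0.10, [0.05, 0.07, 0.12, 0.08])
```

Whatever weights are used, the mean of the corrected fine changes equals the observed coarse change, so the fine-resolution prediction stays consistent with the coarse image.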

STVIFM
STVIFM requires two pairs of high- and coarse-resolution images acquired at the base times and one coarse-resolution image at the prediction time. On the one hand, this model links the mean NDVI change of the high-resolution pixels to the mean NDVI change of the coarse-resolution pixels within a moving window. On the other hand, it also considers the difference in NDVI change rates at different growing stages. The final prediction can be written as Equation (4).
where NDVI(t2) and NDVI(t1) are the high-resolution data at the prediction time and the base time, respectively, ΔNDVI denotes the change between t1 and t2 calculated by this model, and Wi is the weight function.

Assessing Prediction Accuracy
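The prediction performance of each model is evaluated quantitatively with two representative metrics, the correlation coefficient (r) and the root mean squared error (RMSE), which measure the agreement and the difference between the predicted image and the actual image. With x_i and y_i denoting the predicted and actual NDVI values of pixel i, x̄ and ȳ their means, and N the number of pixels, these metrics are defined as follows:
$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}$$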
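For readers who want to reproduce these two metrics, a minimal sketch is given below. It assumes that r is the Pearson correlation coefficient and that the predicted and actual NDVI values have been flattened into equal-length sequences; the function names and sample values are hypothetical and are not data from this study.

```python
import math


def pearson_r(predicted, actual):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


# Hypothetical NDVI samples (not data from this study).
predicted_ndvi = [0.20, 0.40, 0.60, 0.80]
actual_ndvi = [0.25, 0.35, 0.65, 0.75]
print(pearson_r(predicted_ndvi, actual_ndvi), rmse(predicted_ndvi, actual_ndvi))
```

A higher r indicates that the predicted image follows the spatial pattern of the actual image more closely, while a lower RMSE indicates smaller pixel-level errors; the two metrics do not necessarily rank models in the same order.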

Prediction Performance
We use the August 02 Landsat NDVI image as the validation source and use the April 28 and October 21 images to predict the August 02 image. Figure 3 shows the actual NDVI image and the NDVI images predicted by the four spatiotemporal fusion models for August 02, 2015. On visual comparison, all the predicted NDVI images are consistent with the actual image, and water boundaries and clear land are distinctly reproduced, which demonstrates the practicality of these spatiotemporal fusion models.

Quantitative Assessment
The scatter plots in Figure 4 compare the actual and predicted NDVI values for August 02, 2015. The NDVI values predicted by all four spatiotemporal fusion models fall close to the 1:1 line, which shows that all four models can capture phenological changes. The predictions of ESTARFM and STVIFM, which use two input pairs, are more accurate than those of STARFM and FSDAF, which use one input pair, because two input pairs provide more spatial detail.
To better assess the accuracy of the predictions, the metrics r and RMSE are reported in Table 1. All four methods add the change details to the base-date image to obtain the prediction. STVIFM gives the lowest RMSE (r = 0.864, RMSE = 0.1191), slightly better than ESTARFM (r = 0.867, RMSE = 0.1247), whose r is nearly the same. The images predicted by STARFM (r = 0.804, RMSE = 0.1626) and FSDAF (r = 0.810, RMSE = 0.1446) are also reasonably accurate, but these two models produced inaccurate predictions for some pixels (Figure 3(b), Figure 3(d)), which demonstrates that predictions using two input pairs are relatively more accurate.