Development of Global Cropland Agreement Level Analysis by Integrating Pixel Similarity of Recent Global Land Cover Datasets

Global cropland monitoring is important when considering tactical strategies for achieving food sustainability. Different global land cover (GLC) datasets providing cropland information have already been published and they are used in many applications. The different data input methods, classification techniques, class definitions and production years among the different GLC datasets make them all independently useful sources of information. This study attempted to produce a cropland agreement level (CAL) analysis based on the integration of several cropland datasets to more accurately estimate cropland area distribution. Estimating cropland area and how it has changed on a national level was done by converting the level of cropland agreement into percentages with an existing cropland fraction map. A pre-analysis showed that the four GLC datasets used in the 2005 and 2010 groups had similar year input data acquisitions. Therefore, we placed these four datasets (GlobCover, MODIS LC, GLCNMO and ESACCI LC) into 2005 and 2010 year-groups and selected them to process dataset integration through a CRISP approach. The results of this process proposed four agreement levels for this CAL analysis, and the model correlation was converted into percentage values. The cropland estimate results from the CAL analysis were observed along with FAO data statistics and showed the highest accuracy, with a 0.70 and 0.71 regression value for 2005 and 2010 respectively. In the cropland area change analysis, this CAL change analysis had the highest level of accuracy when describing the total size of cropland area change from 2005 and 2010 when compared to other individual original GLC datasets.


Introduction
Global trends of the last 25 years show that for the developing world as a whole, the share of undernourished people among the total population has decreased significantly.Some important factors that have affected this reduction have been stability in economic growth and the massive development of irrigation systems, which have caused an escalation in agricultural productivity [1].Sustainable development goals (SDGs), as part of the development of a global pledge, are targeting to eradicate hunger by 2030 and ensure sustainable food production systems [2].To achieve this target, the availability of accurate information about global and regional cropland distribution that can be monitored periodically is becoming important for the construction of strategies for achieving food sustainability targets [3].Satellite-derived cropland dataset can be one alternative source of information since it directly correlates with food resource distribution [4] and water requirements [5].
Over the last two decades, many institutions have published different global land cover (GLC) datasets that provide cropland class information.Since the first one released in 1993, the trend in the improvement of the production of GLC datasets has been in an increase of spatial resolution and accuracy [6].
However, each GLC dataset has a different production year, data input method, classification technique, class definition and accuracy distribution value [7] [8].
These datasets are released as independent datasets, which make them incomparable, especially for a multitemporal analysis [9].
To evaluate the differences between each GLC dataset, researchers have tried to analyze land cover class agreement using a relative pixel comparison approach.The results of this kind of analysis produce a spatial agreement analysis that is effective for determining regions with levels of high confidence as a reference [10] [11].Previous pixel comparison analyses were carried out by comparing GLC datasets, which does not take into consideration when the data was collected.This condition makes the differences that result from a comparison analysis ambiguous as the differences may be a result of real physical changes to the cropland area during that period of time (not because of the different characteristics or classification system of the dataset).Therefore, the method for selecting GLC datasets for a comparison analysis must be reliable and consider time.
The goal of this research was to integrate the combined cropland classes from current GLC datasets to produce a 1 km cropland agreement level (CAL) analysis and its changes (year 2005 and 2010) through two main processes of dataset harmonization and pixel comparison.This study also focused on analyzing the potential use of a cropland agreement level for cropland area estimate and cropland change phenomena.
The presented model of cropland agreement was, to our knowledge, the first study to compare the four current versions of GLC datasets (GlobCover, MODIS LC, GLCNMO and ESA CCI LC) that were focused on worldwide cropland classes while taking into consideration data collection time.There was no previous GLC dataset comparison [12] [13] that considered data collection time and used similar two-year term GLC datasets.It was the first study to estimate total cropland area on a national level by converting the levels of agreement into percentage values using a correlation model between the CAL analysis and an IIASA cropland fraction map.

Global Land Cover (GLC) Datasets
During this study, we explored all existing GLC datasets in the low spatial resolution category (300 -1000 m) and distributed those GLC datasets according to data collection time.After distributing the GLC datasets on a timeline, we grouped the data based on the proximity of data collection time and got five groups of GLC datasets.Figure 1 shows the five groups of data in the timeline of GLC datasets.This grouping strategy made those datasets more comparable for pixel comparison and integration analysis.Within each group, the maximum difference in data collection time was two years.The reason for setting two years as the maximum time difference was to minimize any extreme changes that could occur in cropland fields over long periods.
To create this CAL analysis, the GLC datasets were put through two processes, which were a harmonization and pixel comparison.In the harmonization process, we standardized the analysis depth for all GLC datasets within five groups.Those five groups consisted of 14 GLC datasets from the seven GLC dataset versions listed below: 1) IGBP GLCC v 2.0 (1993 dataset) using the International Geosphere-Biosphere Programme (IGBP) classification system produced by the United States Geological Survey (USGS) [14]; 2) UMd (1993 dataset) using the simplified IGBP classification system developed by the University of Maryland [15];   For the pixel comparison process, we used datasets from the 2005 and 2010 groups from the same GLC dataset versions.These GLC datasets were the GlobCover, MODIS LC,GLCNMO and ESA CCI LC datasets.The same GLC datasets from each group were paired to produce two CAL analyses for further analysis of cropland change.Table 1 shows the characteristics of each of the GLC datasets in the 2005 and 2010 groups that were used in the pixel comparison process.

Reference Data
This study used a relative accuracy assessment, which was the use as a reference of another dataset that was considered to use accurate datasets.We explored the reference data in order to convert agreement levels into percentage values.As reference data, a 1 km global IIASA-IFPRI cropland percentage map with a 2005 baseline year [22] was developed by integrating a number of individual global and regional cropland maps.This IIASA-IFPRI cropland percentage map was validated by high-resolution satellite imagery via Geo-Wiki (http://www.geo-wiki.org)[23] with an overall accuracy result of 82.4%.The reason for choosing the IIASA data was because this dataset was a recent one that gave a highly accurate overview of cropland area distribution in percentage values.

Methods
To achieve the study's first goal, there were two main methods for producing the CAL analysis, which were the harmonization and pixel comparison processes.
The different characteristics and classification systems for each dataset made these two processes important for getting reliable CAL analysis results.During the harmonization process, we standardized all GLC datasets to produce comparable cropland classes for the pixel comparison process.The result of the cropland harmonization process was evaluated by comparing the result to each group.A balanced distribution verified that the data was standardized and comparable.Finally, datasets within the same group were overlaid using the CRISP approach for a pixel comparison processes to produce the CAL analysis As the fourteen GLC datasets from each group had different mapping standards, we applied the following cross-walking methods to get comparable products:

Re-Projection
To facilitate different projections, these data sets were co-registered and re-projected to a geographic (latitude-longitude) image with map datum WGS 84, which is a pseudocylindrical equal-area map projection [24].This projection has been used in many GLC datasets.GlobCover, GLCNMO and ESA CCI-LC are GLCs that have used this projection.Its use was also supported by the use of this projection system in similar studies of GLC datasets comparisons [25].

Rescaling Method
The resampling process had an important role in standardizing the different pixel resolutions of the GLC datasets because during the comparison process the GLC datasets needed to have the same pixel size.Different rescaling processes were a special concern in this study as a mistake in the resampling process could have caused changes in the class area.When considering the goals of this study, which was to resample all GLC datasets to a 1 km resample target, this study attempted to adopt and combine some important parts of the resampling process from another study with an aim of minimizing errors from the resampling process.
For the cropland rescaling process, we used two resampling processes, which were the nearest neighbor and maximum area methods.We used these two techniques because both would not change the value of cells.For datasets that have coarser resolutions, direct resampling using the nearest neighbor method might have caused a non-ignorable disagreement between the original and the rescaled dataset [9].This is based on the fact that the footprint of the sensor is not at the same location during each revisit [26].The maximum area method, however, has been proven a more powerful approach for aggregating discrete land cover data [27].Furthermore, it also tends to give a smoother result than the nearest neighbor method because the new value resulting from this approach is obtained based on the most common values around the pixel.The GLC datasets that already have a 1 km resolution did not need a rescaling process, which minimized the changes from resampling the results data [26].
Taking into consideration the facts from this previous explanation, the steps that were taken in the rescalling process during this study were as follows: 1) Global land cover datasets that have a large resolution, such as the GlobCover and ESA CC-LC (300 m), MODIS LC (500 m), and GLMNO 2008 V.2 (500 m) datasets, were resampled on a grid with a 250 m resolution using the nearest neighbor method.2) After that, the entire GLC 250 m grid was aggregated using a majority area method to 1 km for all datasets.3) For GLMNO 2003 v1 since they were already in a 1 km spatial resolution, the datasets were kept in their Journal of Environmental Protection original resolution without a rescaling process.

Legend Harmonization
For this study, legend harmonization played a crucial role because the focus of this study was on analyzing cropland classes among several GLC datasets.Differences in the definitions and characteristics of each class correlating with the cropland in each individual GLC would produce ambiguities among the comparisons of the GLC datasets.To overcome this problem, we evaluated all the classes used by the different GLC datasets and then analyzed those classes to determine how they correlated with a standardized cropland classification system, which was the Land Cover Classification System (LCCS).The rationale for choosing LCCS rather than IGBP was: 1) LCCS was developed after an analysis of existing relevant FAO nomenclature documents [26] that can explain some categories of cropland classes in detail.2) MODIS LC v 6, one of the GLC products that use IGBP, would soon change to LCCS with their next dataset [18].
One step in the legend harmonization process was to convert the original class numbers to LCCS-labels.Table 2 shows the conversion results for some of the original class numbers to cropland LCCS-labels.The class definition followed the LCCS hierarchy based on the dichotomous phase and the modular-hierarchical phase [28].As in Table 2, the cropland class of the MODIS LC, LCCS-label is "A11-A3".A11 is the dichotomous phase of "Cultivated and Managed Land" and A3 is the modular-hierarchical phase of "Herbaceous Crop".The fact that the original class descriptions from the GLC datasets had a great impact on choosing the type of dichotomous phase and modular-hierarchical phase, we adopted the descriptions of the original classes from the GLC datasets from a previous study [29] [30].For this conversion, we grouped all classes that correlate with cropland into a new cropland class, including mosaic vegetation cropland that had a smaller cropland percentage when compared to vegetation.Since all the GLC datasets in the five year groups would be harmonized for comparison proposes, Table 2 shows information about the type of class that was categorized as "cropland" based on the LCCS-label conversion status of seven GLC datasets.

Thematic Similarity
Following the M. C. Hansen and B. Reed [31] strategy, we evaluated the harmonization results for the cropland classes by analyzing cropland pixel similarity in each dataset group.A successful result in the harmonization process would produce a balanced proportion among the GLC datasets in the same groups of GLC data.Figure 2 shows the proportion of the cropland classes from the harmonization results and Figure 3 shows the spatial distribution of harmonized cropland area.In general, four out of five groups showed a balanced proportion of cropland classes.The subtraction value of the highest from the lowest percentage in each group were 7%, 6%, 6%, 8%, 8% for 2000, 2005, 2010 and 2013 group respectively.However, in the 1992 group, there was a large difference in cropland classes between UMd and GLCC of 38%.The absence of mosaic agricultural classes in the UMd datasets was the cause of this difference [15].This shows that the differences in the goals and focus of an analysis caused conspicuous differences between cropland and vegetation by classifying more area as a vegetation class.This explanation shows that the harmonization result was acceptable especially for the 2005 and 2010 groups.So, both were eligible to be analyzed in the next step.

Cropland Level Agreement (CAL) Analysis
After assigning the harmonized dataset target and proving the balance proportion value, we compared and observed for pixel similarity within the four GLC datasets in the two year-groups by overlaying the datasets using the CRISP approach, which is based upon cross-walking between classes [9].Since we only focused on one cropland class (not analyzing multiple classes), utilizing a CRISP approach that matches using one-to-one mapping was sufficient [12] [32].
Another technique is the Fuzzy approach, which can allow an overlap between legend definitions to be taken into consideration.It requires expert knowledge to quantify uncertainty in the classification and transition zones of boundaries [33] [34].In the previous study explained by McCallum et al. [12], the result of this pixel comparison technique could indicate an agreement level among these datasets.The comparison result of these cropland classes produced four agreement levels, which in this study were also referred to as the Cropland Agreement Level (CAL) analysis.The four levels that were obtained from the CAL analysis in this study were as follows (Figure 4): Level 1: No agreement for pixels with a unique aggregated class in each data set.
Level 2: Low agreement for pixels where only two of the four data sets were in agreement.
Level 3: Medium agreement for pixels where three of the four data sets were in agreement.
Level 4: Full agreement where all the four data sets within a pixel were in agreement.
Since we focused on analyzing cropland class similarity within four GLC datasets in two groups of data from 2005 and 2010, we produced two CAL analyses from those years.
To study this CAL analysis more deeply, we divided the study area into seven test sites according to size.Those study areas were South America, North America, Europe, Africa, Australia, Russia and Asia.Table 3 shows the percentage calculation result for four CAL analysis levels for seven test sites.
Four GLC datasets were used and the underlined value shows the highest percentage value whereas the bolded value shows the lowest percentage for the site area.
Comparison results for the seven sites in the two years of analysis showed that Europe was indicated as the area with highest full agreement level whereas  Australia, Russia and Africa had the three highest no agreement level areas.A short time period for cultivation and a small cropland area combined with a large area of vegetation were the causes of a mix in cropland classes.This triggered different classification results in each GLC dataset.Besides this, there were also random changes in levels two and three of the 2005 and 2010 CAL analyses.

Correlation Factor between CAL and Existing Cropland Fraction
To obtain a correlation model between the CAL analysis and cropland percentage, first we analyze the trend correlation between the CAL analysis with the original cropland classes from the four selected GLC datasets (Table 4).The dominant distribution of "cropland" classes (class numbers 10 and 20 for ESA CCI-LC, numbers 11 and 14 for GlobCover, numbers 11 and 12 for GLCNMO and number 12 for MODIS LC) is in level 4, compared with the "mosaic cropland" that is dominant in level 1 and level 2, indicating that the four levels in the CAL analysis correlate with the amount of actual cropland area in one pixel.
To investigate the meaning of each CAL analysis level in percentage values, we analyze the pixel correlation between the CAL analysis and the IIASA-IFPRI cropland percentage map by using a 2D Scatter Plot (Figure 3).Results show that level 4 of the CAL analysis mainly corresponds with 80% of cropland fraction and with the same approach, the following correlations were also obtained: level 3 with 40%, level 2 with 20% and level 1 with 10% (Table 5).Figure 5 shows a comparison between the CAL model and IIASA-IFPRI map using a 2D Scatter Plot.To estimate cropland area from the CAL analysis, the pixel area from each agreement level is multiplied by the correlation percentage.

Comparison of Cropland Area Estimates from CAL with the FAO Data
To evaluate cropland estimation results on the national level, cropland areas derived from the CAL analysis are compared to the cropland area estimates from  FAO-stat 2005 as a statistical data reference.Based on the definition from FAO statistics, cropland is defined as "arable land and permanent crops" [35].Overall correlation also is observed for 2005 and 2010 respectively and shows some proximity between the CAL analysis and the FAO with a 0.70 and 0.71 regression value (Figure 6).
We also analyze the accuracy of the four selected GLC datasets to FAO statistics.We divide countries into two groups based on the subtraction value of the cropland area.The groups are 1) small, which have a cropland area from 5000 to 140,000 km 2 , and 2) medium to large, which have cropland area from 180,000 to 1,700,000 km 2 (Figure 7).In this framework, the regression value and relative error to the FAO (%) have been observed as 0.574; 42.2 for small cropland area countries and 0.858; 29.8 for medium to large cropland area countries between FAO-stat and the CAL analysis (Table 6).Those values promote the CAL analysis as the most accurate dataset compared to FAO-stat within all datasets in estimating cropland area.Good correlation of the results to FAO statistics is still not enough for quality assessment of the product.To provide accurate validation it is worth to compare the product with some more precise country level cropland map which have higher resolution.Globland30 [36] and Unified Cropland layer [37] as one of the alternative for the analysis in next study.

Cropland Agreement Level (CAL) Change Analysis
The CAL change analysis is used to study the potentiality of cropland area changes monitored between 2005 and 2010.Figure 8 shows global cropland agreement level change.The area that experience the increase and decrease in Figure 6.Regression model of cropland area estimates derived from two years of the CAL analysis with FAO cropland area statistics.
CAL change analysis especially around southern Africa actually categorized as mixed cropland with pasture area.It can be happened since some GLC mentioned that area as mosaic cropland class (Figure 3).
Almost all of the CAL analysis and the four GLC datasets are not able to produce similar cropland area change data comparable to the cropland change statistical data from the FAO.However, the CAL analysis has the highest proximity of cropland change compared to the other GLC datasets.
To give a visual description of cropland area changes, we chose Bali Island, Indonesia as a test site.According to FAO statistics, Indonesia is a country that has had massive cropland area expansion (Figure 9).Bali was chosen because within five years (from 2005-2010) it had large changes in cropland area, both in regards to an expansion of cropland area and shrinkage in cropland area caused by changes in land use.Massive development in the tourism sector caused changes in land use from cropland to housing, especially in Denpasar [38].
Moreover, the development of three new reservoirs in Bali and East Java built within the 2005-2010 period expanded the cropland area around the reservoirs [39].the results from the four GLC datasets, and it also accommodates an area with an extreme level of changes.

Conclusion
This study shows that integrating recent GLC datasets can be considered for es-

Figure 1 .
Figure 1.Timeline of the GLC datasets in five groups of data, based on the closeness of time collection, making the pixel comparison process more reliable in regards to time.

Figure 2 .
Figure 2. Proportion value of cropland analyses as the result of the harmonization of the five groups.

Figure 3 .
Figure 3. Spatial distribution of harmonized cropland area (cropland and mosaic cropland) from four selected GLC datasets in the two year-groups of 2005 and 2010.Only groups of GLC datasets that are located within these two year-groups are used for the cropland class comparison process.

Figure 4 .
Figure 4. Cropland Agreement Level (CAL) analysis 2005, based on pixel comparison analysis between four selected recent GLC datasets.

Figure 5 .
Figure 5.Comparison between the CAL analysis and IIASA cropland fraction using scatter plot.

Figure 9
shows that in ESA CCI LC and GlobCover, almost all of the cropland area in 2005 and 2010 had no change, whereas in the MODIS data, there is a decline in cropland change.In contrast, GLCNMO shows an expansion in cropland area.Cropland change results from the CAL analysis can accommodate

Figure 7 .
Figure 7.Comparison on the national level between the CAL analysis and the four original GLC datasets with cropland area estimates from the FAO in (a) small cropland area countries and (b) medium to large cropland area countries.

Table 2 .
The conversion results for the original cropland classes converted to LCCS-labels for all GLC datasets.

Table 3 .
Agreement percentage of the CAL analysis across the seven test sites in 2005 and 2010.

Table 4 .
Trend correlation between the CAL analysis with the original cropland classes from the four selected GLC datasets.

Table 5 .
Correlation factor between CAL analysis and IIASA cropland fraction.

Table 6 .
Regression value and relative error of the CAL product and the four GLC datasets to FAO statistical data.