Most GIS databases contain data errors. Spatial data quality is determined by the quality of the data sources, such as traditional paper maps or more recent remote sensing data. Over the past several decades, different statistical measures have been developed to evaluate data quality for different types of data, such as nominal categorical data, ordinal categorical data and numerical data. Although these methods were originally proposed for medical or psychological research, they have been widely used to evaluate spatial data quality. In this paper, we review statistical methods for evaluating data quality, discuss the conditions under which they should be used and how to interpret the results, and then briefly discuss statistical software and packages that can be used to compute these data quality measures.

Spatial data quality is limited by the quality of the data sources such as traditional paper maps or more recent remote sensing data [

There are four levels of measurement scales that are used to capture spatial data: nominal, ordinal, interval and ratio. Nominal and ordinal data are categorical, while interval and ratio data are numerical. In the past several decades, different statistical measures have been developed to evaluate data quality for different types of data. Although these methods were originally developed for medical research or psychological research [

Nominal categorical data label variables without providing any quantitative value, and form the simplest scale of measurement. Unlike ordinal data, nominal data cannot be ordered. For example, land cover/land use can be categorized into “open water”, “residential”, “commercial”, “wetland”, “mixed forest” and “agriculture”, and there is no inherent order among these categories. Without loss of generality, we first consider a simple classification problem with only two categories. For example, we have a map of a certain mineral and we want to evaluate the accuracy of the mineral map. The data can be summarized in a 2-by-2 confusion or error matrix that cross-tabulates the truth and the classification on the map (

Classification on the map | Truth: Mineral present | Truth: Mineral absent |
---|---|---|
Mineral present | a | b |
Mineral absent | c | d |

Accuracy measures | Interpretation | How to calculate |
---|---|---|
True positive (TP) | Mineral correctly identified as present on the map | a |
False positive (FP) | Non-mineral incorrectly identified as present on the map | b |
True negative (TN) | Non-mineral correctly identified as absent on the map | d |
False negative (FN) | Mineral incorrectly identified as absent on the map | c |
Sensitivity | Conditional probability that a true “present” is correctly classified on the map | TP/(TP + FN) = a/(a + c) |
Specificity | Conditional probability that a true “absent” is correctly classified on the map | TN/(TN + FP) = d/(b + d) |

accuracy measures.
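As a minimal sketch of the two conditional probabilities in the accuracy table, the function below computes sensitivity and specificity from the four cells of the 2-by-2 confusion matrix, using TP = a, FP = b, FN = c and TN = d. The function name and the example counts are hypothetical and serve only as an illustration.

```python
def binary_accuracy_measures(a, b, c, d):
    """Accuracy measures from a 2-by-2 confusion matrix.

    Rows are the classification on the map, columns are the truth:
    a = TP, b = FP, c = FN, d = TN.
    """
    tp, fp, fn, tn = a, b, c, d
    return {
        # Share of truly "present" cases that the map shows as present.
        "sensitivity": tp / (tp + fn),
        # Share of truly "absent" cases that the map shows as absent.
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, not taken from the paper.
measures = binary_accuracy_measures(a=40, b=10, c=5, d=45)
print(measures)
```

With these made-up counts, 40 of the 45 truly present cases and 45 of the 55 truly absent cases are mapped correctly.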

For multi-class classification, we can use a one-against-all approach to obtain TP, TN, FP and FN. Suppose we have a map of classification of the likelihood of landslides as shown in

The correct classification rate is the number of correctly classified instances on the map divided by the total number of instances, i.e., the sum of the numbers on the diagonal divided by N, where N is the total number of instances. The misclassification rate is the number of incorrectly classified instances on the map divided by the total number of instances, i.e., the sum of the off-diagonal numbers divided by N.
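The one-against-all counts and the two rates can be sketched as follows, using the landslide confusion matrix from this section (rows: classification on the map, columns: truth). The function names are illustrative assumptions, not part of any particular package.

```python
CLASSES = ["Low", "Moderate", "High"]

# Landslide confusion matrix: rows = classification on the map, columns = truth.
M = [
    [70, 10, 5],
    [8, 67, 20],
    [1, 10, 14],
]


def one_vs_all(matrix, k):
    """TP, FP, FN, TN for class index k, treating all other classes as 'not k'."""
    n = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                 # mapped as k, truly another class
    fn = sum(row[k] for row in matrix) - tp  # truly k, mapped as another class
    tn = n - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def correct_classification_rate(matrix):
    """Sum of the diagonal divided by the total number of instances N."""
    n = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / n


for k, name in enumerate(CLASSES):
    print(name, one_vs_all(M, k))

rate = correct_classification_rate(M)
print("correct classification rate:", rate)
print("misclassification rate:", 1 - rate)
```

For this matrix the diagonal sums to 151 out of N = 205 instances, so the correct classification rate is 151/205 and the misclassification rate is 54/205.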

Kappa index can be used to evaluate attribute accuracy when truth is known [

When the truth data are not available, the Kappa index can be used to evaluate relative agreement between two data sources, or pairwise relative agreement among more than two data sources. If the Kappa index between two data sources is small, we can infer that the data quality of at least one of them is not good. If we have three data sources, two of which have a “good Kappa” with each other but a “bad Kappa” with the third, we can infer that the first two data sources have similar data quality: either both are good or both are bad. In this case, other information needs to be collected to determine the quality of these three data sources.

Classification on the map | Truth: Low | Truth: Moderate | Truth: High |
---|---|---|---|
Low | 70 | 10 | 5 |
Moderate | 8 | 67 | 20 |
High | 1 | 10 | 14 |

Kappa index value | Interpretation |
---|---|
0 | Agreement equivalent to chance |
0.01 - 0.20 | Slight agreement |
0.21 - 0.40 | Fair agreement |
0.41 - 0.60 | Moderate agreement |
0.61 - 0.80 | Substantial agreement |
0.81 - 0.99 | Near-perfect agreement |
1.00 | Perfect agreement |
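As a rough illustration of the Kappa index, the sketch below applies the usual unweighted form, (observed agreement − chance agreement) / (1 − chance agreement), to the landslide confusion matrix above. The function name is an assumption made for this example, and the same code applies when the rows and columns are two data sources rather than map and truth.

```python
def kappa_index(matrix):
    """Unweighted Kappa for a square confusion matrix of counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of instances on the diagonal.
    p_observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of the marginal proportions, summed over classes.
    p_chance = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (p_observed - p_chance) / (1 - p_chance)


# Landslide confusion matrix: rows = classification on the map, columns = truth.
M = [
    [70, 10, 5],
    [8, 67, 20],
    [1, 10, 14],
]

kappa = kappa_index(M)
print(round(kappa, 3))
```

For this matrix the value is about 0.575, which falls in the “moderate agreement” band of the interpretation table.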

Ordinal data is a categorical data type that is not quantitative, but whose categories have a natural order. For example, average temperature can be classified as “very cold”, “cold”, “chilly”, “lukewarm”, “warm”, “hot” and “very hot” on a map, or the landslide incidence of a certain area can be shown on a map with different colors to indicate “low”, “moderate” and “high” likelihood of landslides. In other words, although ordinal data do not represent a quantity, they do have an inherent order.

The Kappa index we discussed previously is not appropriate for ordinal categorical data, because it assumes that all errors in the confusion matrix are of equal importance. For ordinal data, however, the classification errors vary in their importance. In other words, the “costs” of misclassification differ among the ordinal categories. For example, it may be far worse to classify an area with a high likelihood of landslides as low likelihood than to classify it as moderate likelihood. In this scenario, the weighted Kappa is the correct index to use for evaluating data quality [

To calculate the weighted Kappa, we need to create a Weights matrix that contains the weight for each cell. Each diagonal cell in the Weights matrix is 1, indicating full credit for a correct classification. The off-diagonal cells should be assigned values between 0 and 1 by the analyst. A value of 0 means that there is no partial credit for misclassifying one class as the other, and a value of 1 means we give full credit for the misclassification (i.e., we treat it as a correct classification). Any value greater than 0 but less than 1 means there is partial credit for the misclassification.

Classification on the map | Truth: Low | Truth: Moderate | Truth: High |
---|---|---|---|
Low | 1 | 0 | 0 |
Moderate | 0.5 | 1 | 0 |
High | 0.2 | 0.5 | 1 |

With these weights, we give partial credit for classifying “low” as “moderate” (0.5) or as “high” (0.2), and also for classifying “moderate” as “high” (0.5). However, we do not give partial credit for classifying “moderate” as “low” or for misclassifying “high”. In general, the weighted Kappa can be calculated as follows.

We assume we have a k-by-k confusion matrix $M$ and create a proportions matrix $P = M/n$, where $n$ is the total number of observations. Let $p_{ij}$ be the proportion of observations in row $i$, column $j$, $p_{i+}$ be the proportion of mapped data in row (class) $i$, and $p_{+j}$ be the proportion of mapped data in column $j$. Let $w_{ij}$ denote the weight assigned to the $(i,j)$th element in matrix $W$. We further define $p_0^* = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} p_{ij}$ and $p_c^* = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} p_{i+} p_{+j}$. Then the weighted Kappa can be defined as

$$\hat{K}_w = \frac{p_0^* - p_c^*}{1 - p_c^*}.$$
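The definition above can be sketched directly in code. The example pairs the landslide confusion matrix with the Low/Moderate/High weights matrix shown earlier. Applying these weights to that particular matrix is an assumption made for illustration, and the function name is hypothetical.

```python
def weighted_kappa(matrix, weights):
    """Weighted Kappa: (p0* - pc*) / (1 - pc*) for a k-by-k confusion matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Proportions matrix P = M / n with its row and column marginals.
    p = [[matrix[i][j] / n for j in range(k)] for i in range(k)]
    p_row = [sum(p[i]) for i in range(k)]
    p_col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    p0_star = sum(weights[i][j] * p[i][j] for i, j in cells)
    pc_star = sum(weights[i][j] * p_row[i] * p_col[j] for i, j in cells)
    return (p0_star - pc_star) / (1 - pc_star)


# Rows = classification on the map, columns = truth (Low, Moderate, High).
M = [
    [70, 10, 5],
    [8, 67, 20],
    [1, 10, 14],
]
W = [
    [1, 0, 0],
    [0.5, 1, 0],
    [0.2, 0.5, 1],
]
IDENTITY = [[1 if i == j else 0 for j in range(3)] for i in range(3)]

kw = weighted_kappa(M, W)
print("weighted Kappa:", round(kw, 3))
print("identity weights:", round(weighted_kappa(M, IDENTITY), 3))
```

With identity weights the formula reduces to the unweighted Kappa index, which is a convenient sanity check on any implementation.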

Numerical (quantitative) data are measurements that can be represented in numbers. Numerical data can be discrete or continuous. Discrete data represent items that can be counted; they have a finite number of possible values, and the values cannot be subdivided meaningfully. For example, the number of people in a census tract is discrete numerical data, and the number of houses in a certain area is also discrete numerical data. On the other hand, continuous data represent measurements that can be meaningfully subdivided into finer and finer increments, depending upon the precision of the measurement system. For example, annual precipitation and temperature are both continuous data. Bland-Altman analysis [

A Bland-Altman plot is a scatter plot of the difference between two measurements (Y-axis) against the average of the two measurements (X-axis), with 95% limits of agreement. The limits of agreement are calculated as the mean observed difference ± 1.96 × the standard deviation of the observed differences. Consider a situation where we have developed a new image processing algorithm that is computationally more efficient than the standard method. We want to assess the agreement between intensity values from this new algorithm (observed values) and the ground truth from the standard method. The true values, with sample size n = 30, were simulated from a uniform distribution on (0, 255), and the observed values were obtained by adding to the true values noise simulated from a normal distribution with mean 0 and standard deviation 3. The 30 pairs of true and observed values are shown in

Sample# | Truth | Observed | Sample# | Truth | Observed | Sample# | Truth | Observed |
---|---|---|---|---|---|---|---|---|
1 | 53 | 53 | 11 | 41 | 41 | 21 | 187 | 186 |
2 | 74 | 74 | 12 | 35 | 35 | 22 | 42 | 41 |
3 | 115 | 118 | 13 | 137 | 133 | 23 | 130 | 130 |
4 | 182 | 184 | 14 | 77 | 76 | 24 | 25 | 28 |
5 | 40 | 42 | 15 | 154 | 155 | 25 | 53 | 55 |
6 | 180 | 183 | 16 | 100 | 104 | 26 | 77 | 77 |
7 | 189 | 191 | 17 | 144 | 144 | 27 | 3 | 2 |
8 | 132 | 132 | 18 | 198 | 199 | 28 | 76 | 78 |
9 | 126 | 120 | 19 | 76 | 76 | 29 | 174 | 176 |
10 | 12 | 14 | 20 | 155 | 151 | 30 | 68 | 66 |

data points, the variations could be outside these limits. It seems like the new algorithm cannot be used to substitute the standard method. Note that there is no uniform criterion on acceptable values of limits of agreement. This depends on the variables being measured and researchers should use their domain knowledge to make decisions.
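The limits of agreement for the 30 simulated pairs in the table can be reproduced with a short sketch. Taking the difference as observed minus truth and using the sample standard deviation are assumptions made here, since the text does not state either convention, and the function name is hypothetical.

```python
from statistics import mean, stdev

# The 30 simulated pairs from the table, in sample order 1 to 30.
truth = [53, 74, 115, 182, 40, 180, 189, 132, 126, 12,
         41, 35, 137, 77, 154, 100, 144, 198, 76, 155,
         187, 42, 130, 25, 53, 77, 3, 76, 174, 68]
observed = [53, 74, 118, 184, 42, 183, 191, 132, 120, 14,
            41, 35, 133, 76, 155, 104, 144, 199, 76, 151,
            186, 41, 130, 28, 55, 77, 2, 78, 176, 66]


def bland_altman(x, y):
    """Return plot coordinates and 95% limits of agreement for paired data."""
    diffs = [b - a for a, b in zip(x, y)]        # Y-axis: difference (y - x)
    means = [(a + b) / 2 for a, b in zip(x, y)]  # X-axis: average of the pair
    bias = mean(diffs)
    sd = stdev(diffs)                            # sample standard deviation
    return means, diffs, bias, bias - 1.96 * sd, bias + 1.96 * sd


means, diffs, bias, lower, upper = bland_altman(truth, observed)
print("mean difference:", round(bias, 2))
print("limits of agreement:", round(lower, 2), "to", round(upper, 2))
```

The `means` and `diffs` lists are the X and Y coordinates of the Bland-Altman scatter plot, and `lower` and `upper` are the two horizontal limit lines.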

Intra-class correlation coefficient (ICC) is a widely used index to assess agreement between two numerical measures. ICC provides an estimate of overall concordance between data from two or more sources. It is somewhat akin to “analysis of variance”. There are 10 forms of ICCs, depending on the model selection (random effect model vs. mixed effect model) and type of selection (single measurement or multiple measurements), and definition of selection (absolute agreement or consistency). A comprehensive review of selecting and reporting ICCs can be found in Koo and Li [

Importantly, there are no standard values for acceptable reliability based on ICC. A low ICC may be due to a lack of variability among the sampled data, rather than a low degree of agreement between two methods or two raters. Thus, it is suggested to have at least 30 samples when using ICC to evaluate agreement. The interpretation of ICC values is somewhat arbitrary (

Using the simulated data in

When we do not have ground truth data, we can still use ICC to evaluate the agreement between two data sources. A high ICC means high agreement between the two data sources; they may have equally good or equally bad data quality. A low ICC means low agreement between the two data sources, so at least one of them has bad data quality. Similarly, Bland-Altman analysis can also be used to evaluate the agreement between two data sources or between the predictions of two models when the ground truth is not available.

ICC | Interpretation |
---|---|
<0.5 | Poor agreement |
0.5 to <0.75 | Moderate agreement |
0.75 to <0.9 | Good agreement |
0.9 - 1.0 | Excellent agreement |
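To make the ICC discussion concrete, the sketch below computes one of the ICC forms: two-way random effects, absolute agreement, single measurement, built from the mean squares of a two-way analysis of variance. The choice of this form is an assumption for illustration, since the surviving text does not say which form was used, and the function name is hypothetical.

```python
def icc_absolute_agreement(ratings):
    """ICC (two-way random effects, absolute agreement, single measurement).

    ratings: one row per sample, one value per data source (or rater).
    """
    n = len(ratings)     # number of samples
    k = len(ratings[0])  # number of data sources
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# The 30 simulated (truth, observed) pairs from the Bland-Altman example.
pairs = [(53, 53), (74, 74), (115, 118), (182, 184), (40, 42), (180, 183),
         (189, 191), (132, 132), (126, 120), (12, 14), (41, 41), (35, 35),
         (137, 133), (77, 76), (154, 155), (100, 104), (144, 144), (198, 199),
         (76, 76), (155, 151), (187, 186), (42, 41), (130, 130), (25, 28),
         (53, 55), (77, 77), (3, 2), (76, 78), (174, 176), (68, 66)]

icc = icc_absolute_agreement(pairs)
print("ICC:", round(icc, 4))
```

The same function can be applied to two data sources when no ground truth exists. Other ICC forms use the same mean squares with a different denominator, so the form should be chosen before interpreting the value against the table above.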

Some of the statistical measures for data quality evaluation are relatively simple and can be calculated using a traditional “pen and paper” approach. However, as the sample size increases, statistical software is needed to conduct such analysis. In addition, for more complicated methods such as the weighted Kappa index, Bland-Altman plot and ICC, we usually need statistical software to do the calculation. Statistical software such as SAS software [

Most GIS databases contain data errors. The data errors may come from human entry error, the source of data (paper maps or images), or imperfection of the image processing algorithms. These data imperfections have a direct impact on the reliability of spatial analysis results. For example, if objects have slightly different boundaries for a polygon overlay operation, a large number of “slivers” will be produced which will result in errors for downstream analysis [

Different GIS applications require different degrees of detail in spatial data, depending on the purpose of the application. Importantly, there is no “one-size-fits-all” guideline for evaluating spatial data quality. Even if we use the same methods, we may use different “cut-off” values to decide whether the data are accurate enough. It is important to choose the correct methods to evaluate data quality and to interpret the results wisely, so that we have better knowledge of the data quality, which in turn helps us make informed decisions.

The author declares no conflicts of interest regarding the publication of this paper.

Han, X. (2020) On Statistical Measures for Data Quality Evaluation. Journal of Geographic Information System, 12, 178-187. https://doi.org/10.4236/jgis.2020.123011