Multiscalar Geomorphometric Generalization for Soil-Landscape Modeling by Random Forest: A Case Study in the Eastern Amazon ()
1. Introduction
Elements of the landscape control the processes acting on soils; therefore, the soil-landscape approach constitutes one of the most powerful conceptual tools in mapping activities, especially at scales with an intermediate or greater level of detail. The soil-landscape relationship is related to the concept of the catena, coined by Milne [1]. In a catena, variations in soils along a slope are attributed to the translocation of soluble elements and to erosive and depositional processes, not excluding different source materials. Subsequently, the analyses of soil-landscape relations proposed by Hugget [2], contemplated three-dimensional models of the slopes. In the context of digital soil mapping, in the scorpan model paradigm [3], soil-landscape process modeling can be described as an interdisciplinary object of the interface between pedometry-geomorphometry [4].
The scale issues of soil-landscape relationships are related to the complex interactions of both elements and how these processes occur and are perceived. Several geomorphology studies report a time-space coupling between landform size-scales and lifetime [5] [6]. From a soil perspective, different pedological process will manifest influence at short, mid or long time-scales [7] [8]. In contrast, topography studies in soil physics demonstrate a complex water dynamic related to the nested geometry of slopes, considering relief and micro-relief patterns, resulting in trends in the movement of particles and solutes and changes in texture and chemical parameters of soils [9] [10]. Therefore, multiscalar topography influences a particular soil distribution in two general aspects, overlay of pedological processes that occurred at different times, and driving forces in the present time, determined by the sum of forces better correlated with one, several, or many geomorphologic scales.
Some aspects of spatial scaling in digital soil mapping have been summarized in non-exhaustive reviews [11] [12]. The hierarchical definition of scale can be used to understand the soil phenomena, from the soil region, passing through watersheds, catena, pedon, horizons and finally molecular interactions. The characteristics of measurement affect the results of analysis, and in this sense, the modifiable areal unit problem (MAUP) represents a key issue. Information transfer across scales could be classified in upscaling, in less detail, or downscaling, with greater detail, but both of these require must consider bias.
With respect to the scale of covariables for predictive soil mapping, the highest DEM resolutions do not necessarily produce the highest accuracy [13] [14]. Despite the potential of machine learning to produce complex and non-linear predictions, few studies have investigated multiscale perspective in covariables to account for physical process that are not predictable by finer scale environmental information [15]. Some studies propose data-driven techniques for selection of pixel size or neighborhood size for a particular landscape [16] [17], but this approach produces a variety of results in different geomorphic units, which complicate the interpretation of scalar components in the soil-landscape relationship. For a friendly interpretation of scale relationships on soil-landscape models, this study proposed a cartographic-based criterion to formalize the scale correspondence to pixel size for geomorphometric covariables.
The present study tested the hypothesis whether multiscale geomorphic representation, obtained from cartographic generalization of a digital elevation model, can improve pedometric modeling. To achieve this goal, this case study applied the Random Forest algorithm to a multiscale geomorphometric database to predict soil surface attributes.
2. Material and Methods
The procedures described in this section were performed using the open-source software QGIS 3.10; SAGA GIS 2.3; GRASS GIS 7.8; and R Programming 3.5 [18] [19] [20] [21].
2.1. Study Area
The study was conducted in Iripixi Lake (ILW) and Caipuru Lake (CLW) watersheds, with an area of 27.137 and 28.315 ha respectively, located in the Trombetas basin in Oriximiná-Pará in the Eastern Amazon. Pilot areas, consisting of small farms, are distributed in the upper and lower courses in both basins and adjacent basin boundaries, as shown in Figure 1, totaling 697 ha, approximately 1% of the extent of the watersheds.
The study area is in phanerozoic sedimentary basins, in the Alter do Chão geological formation, whose local characteristics were evaluated from a faciological analysis near Óbidos [22], a neighboring municipality. This region is classified as humid equatorial climate with 3 dry months [23]. The geomorphic units of these watersheds are classified as a homogeneous dissection with coarse drainage density and weak incision depth [24]. The pedoenvironment of this study area are in the upper lands of the lower Amazon basin and the most abundant soil classes in the area are LatossoloAmarelo (Ferrasols), LatossoloVermelho-Amarelo (Ferrasols), ArgissoloVermelho-Amarelo (Acrisols) and GleissolosHaplicos (Gleysols) [25].
2.2. Environmental Covariates
In the proposed mapping scale, vegetation and topography factors are the main sources of soil variation. In this study, to represent such factors, the source multispectral images Landsat 8 and Shuttle Radar Topography Mission Digital Elevation Model 30 m (SRTM DEM 30 m), were used, respectively.
Figure 1. Study area location in the Eastern Amazon.
The Landsat 8 images are from September 11, 2017, corrected for surface reflectance with the LaSRC algorithm by USGS [26]. The SRTM DEM is a digital elevation model based on stereoscopic radar survey, and has 30m pixels [27]. The corrections made in SRTM DEM were filling in sinks, and reduction of deforestation effect by the estimated canopy addition method [28].
The topography information was upscaled and organized into generalized multiscale geomorphometric variable groups, as detailed in the next section.
2.3. Multiscale Geomorphometric Generalization
The multiscale geomorphometric generalization (MGG) is an upscaling operation, based on cartographic concepts of generalization of digital maps [29] [30]. This approach can be applied for any geomorphic variable, including elevation models, landforms units, and primary and secondary derivatives.
This operation results in variables at different scales, arranged in groups according to criteria required for the analysis. The framework of these upscaling methods brings to the pedometric perspective the understanding that the soil-landscape relationship occurs through complex and multiscale interactions. In this sense, the formalization of the desirable scales of analysis and modeling occurs both in their definition and in the group arrangements.
For the MGG operation, vector and raster representations demand different approaches for upscaling, because each of them has specific scale transformation problems due to their mathematical structures. Furthermore, it is necessary to have a unique reference for the scales for compatible representation of geomorphic features in both types of variables, thus allowing for joint interpretation. In this study, the concept of minimum mappable area for soil surveys [31] was considered to define pixel sizes in relation to cartographic scale. The detailed descriptions for each of the four scales used are in Table 1. The area equivalence between raster and vector is calculated as a function of a 5 × 5 pixel grid, considered a conservative parameter to determine a geomorphic feature.
The MGG was applied to the following geomorphometric covariables: elevation (Elev), slope, relative slope position (RSP), topographic wetness index (TWI), plan curvature (PlanCurv), profile curvature (ProfCurv), topographic factor of water erosion (LS) and geomorphons.
These geomorphic variables, at different scales, were obtained from SRTM DEM from two upscaling methods, as illustrated in Figure 2. Using local averages on covariable elevation, in 2 × 2, 3 × 3, 4 × 4 windows, for resolutions 60 m, 90 m and 120 m, respectively, and subsequent derivatives covariable calculation. Classification of geomorphons [32] was followed by the exclusion of polygons smaller than the minimum mappable area for each scale. Such methods correspond
Table 1. Correspondence between scale and pixel size for Multiscale Geomorphometric Generalization (MGG), using the concept of minimum mappable area.
aFor a 5 × 5 window. bRatio pixel area (pa) by minimal mappable area (mma), in percentage.
Figure 2. Methodology flowchart of MGG for the topography covariables.
to cartographic generalization applied to general geomorphometry and specific geomorphometry, respectively [33].
In this case study was used a machine learning approach to identify and select the optimum scales of variables for modeling. It was therefore necessary to provide a multiple database for training and evaluation each scale. In this sense, the variables were organized from the combination of the set of topography variables, arranged in all possible combinations.
2.4. Soil Sampling and Analysis
Soil sampling was performed in 9 pilot areas, considering covariables to evaluate the effect of topography related to variation in soil distributions. Each pilot area was sampled at 10 points, a sufficient density for semi-detailed soil surveys, compatible with the 1:25,000 scale soil maps [31]. The sample points were distributed according to a stratified random arrangement by the conditioned latin hypercube method [34] [35], with raster topography covariates, described in the previous section, at a 1:25.000 scale. At the total of 90 sample points, the morphological description of the A horizon was performed [36] and soil samples were collected in the 0 - 30 cm depth for physical and chemical analyses [37] [38]. To evaluate variances and patterns in the sample dataset, principal component analysis [39] was conducted.
2.5. Modeling by Random Forest
The modeling of the soil-landscape was done using Random Forest [40], a machine learning algorithm frequently used to produce digital soil maps [41]. Some characteristics of this algorithm that are worth mentioning are that it can handle categorical and continuous variables, it can do regression and classifications, it is robust for overfitting problems, and is feasible for interpretation of variable relationships, including linear and nonlinear systems [42].
First, the set of training cases and those intended for validation were defined. The selection was made randomly, with proportions of 70% and 30%, respectively. The training was done for every group of geomorphometric covariables, one group at a time, for modeling of some soil attributes, namely, A horizon thickness, pH, silt and sand content. Next, for the prediction of the multiscale generalized geomorphometric groups, the groups with the best adjustment were evaluated and selected based on the highest values of % variation explained by the model. Finally, the modeling structure and results of original scale geomorphometric and generalized geomorphometrics groups were compared. For evaluation of these predictions, visual analysis of digital maps and multi-way plots of forest structures and effect of variables on prediction were used, calculated with Random Forest Explainer in the R package [43].
The prediction of soil texture was made by considering the silt and sand raster layers using the Brazilian soil classification system [44]. For evaluation of this prediction, the confusion matrix was calculated with the Kappa index, and the user’s and producer’s accuracy [45].
3. Results and Discussion
3.1. Generalized Multiscale Geomorphometrics
The multiscale geomophometric generalization produces a dataset of covariables at each scale. For evaluation of that operation, in this section, the tendency of geomorphometry distributions and a geomorphology interpretation of this generalized data will be described.
The application of MGG to the original DEM database resulted in 28 continuous geomorphometric variables, the distributions of which are illustrated in Figure 3. Some distributions had smaller changes, like TWI and LS, with a reduction of occurrence of the extreme values and some scatter reduction in the upscaling direction. This could be explained by the lowest representation of smaller geomorphic features that contributed to counting of upper and lower limits on dynamics of the water on the slope. Some variable distributions had highly significant changes, such as slope and geomorphic curvatures, which had a sharp and progressive reduction of scatter related to the smoothest generalized surface. Considering the PlanCurv, it is observable that the upscaled derivative better describes the features of the valleys and spikes in the study area, as illustrated in Figure 4(b).
Some studies test upscaling effects on digital topographic information, with comparable results. The contextual spatial modeling, using gaussian space scale rates to produce a set of coarse resolution DEMs, had a similar result of smooth
Figure 3. Distributions of geomorphometry raster variables at original and generalized scales.
Figure 4. Comparative 3D view of original and generalized geomorphometric covariables: (a) Relative slope position; (b) Plan curvature.
slope and geomorphic curvatures [46]. In contrast, other studies have proposed sophisticated calculations for generalization of DEM considering questions of feature preservation, and could be tested with the MGG framework. The Feature Preserving DEM Smoothing (FPDEMS) method reduces the complexity of the surface at the detailed spatial scales at which roughness dominates, while not significantly altering the topographic complexity at larger spatial scales [47]. Other approaches use a multi-point algorithm to rapidly and accurately retrieve the critical points, using drainage-constrained TIN, to produce coarser-resolution DEMs [48] [49].
The distribution of relative slope positions had the most complex variation, and was related to dissection process acting on a flattened surface of phanerozoic sedimentary basins, which results in organization of this type of landform, a homogeneous dissection with tabular top, coarse drainage density and weak incision depth [24] [50]. A comparative 3D view of original and generalized relative slope positions is illustrated in Figure 4(a). This variable at its original scale identifies the variation at a sub-watershed level, with 0 - 1 interval values that occur at each drainage division. In contrast, at 1:100,000, this variable shows variation at the watershed level, with high values only at the most elevated ridgeline section. In this perspective, the upscaled variable can identify upper and lower courses of basins, a relevant information for soil mapping, because it allows analysis of base level, exposition time to weathering and stratigraphic differentiation on sedimentary geological formation [5].
The geomorphon scale transformations resulted in incorporation to the valley portions of some local landform segments which originally were classified in topo and slope, as illustrated in Figures 5(a)-(c). Therefore, at a 1:100,000 scale, some geomorphic features were disconnected from the main patterns and were omitted and substituted by the local prevalent geomorphon type. Regarding the original scale geomorphons, the results show consistent valley-slope-top landforms sequences in the most of mapping site. However, at same scale of soil mapping in South Africa, the geomorphons were compared with expert manual delineation with a high degree of mismatch (57% of the area), despite reasonable prediction results [51]. Alternatively, considering the low importance of geomorphons in decision trees, as we will see in the next section, it is possible to test other approaches to generalization, including the variation of search parameters in the definition of relief units.
Other approaches to classification of landforms, like the topographic position index [52], and k-median clustering [53], also had parameters that could be adjusted for the proposed correspondence of scales. Therefore, these techniques can be included and tested with the MGG framework in future research.
3.2. Evaluation of Soil-Landscape Models
This section presents the relationship of soil variables to geomorphometric covariables, in the context of MGG and will focus on whether the decision tree forest structures have an expected pedological meaning, and if a greater number of topographic variables could mask other factors.
The main soil variation is related to particle size distribution, as shown by the contribution of this variable to the first component of PCA, as illustrated in Figure 6. The A horizon thickness has an opposite tendency with respect to silt,
Figure 5. Comparative 3D view of original and generalized geomorphometric covariables: (a) Geomorphons 1:25,000; (b) Geomorphons 1:100,000; (c) Digital elevation model at original scale.
Figure 6. (a) Vectors and (b) variable loadings of principal components for soil attributes.
and is only slightly in agreement with sand content, revealing the weathering dynamics related to translocation of organic matter in sandy rainforest soils. Organic matter has a relevant influence on soil differentiation and is directly related to pH, as expected in this pedoenvironment, a upper land of the lower Amazon basin [25].
Also, exchangeable aluminum has only a slight agreement with clay content, and this is probably associated with clay mineralogy dominated by kaolinite, common in that fraction in acid soils [54]. Studies indicate that predominance of kaolinite at the surface soil is explained by reduced Silica leaching by cycling of Si promoted by the forest vegetation, and the resulting soil properties patterns was found on Amazon soils located on sedimentary domains with high exchangeable aluminum content, like this study area [55] [56].
The variation in the A horizon is strongly related to landcover and topographic factors [57]. For modeling the A horizon, the MGG group 15 had the best adjustment, with 40.5% of variation explained, in contrast to 30.2% for the original scale DEM, indicating the influence of coarse variables. This could mean that there is prevalence of older pedological processes associated with larger and older geomorphic features.
The decision tree forest structures, as illustrated in Figure 7, clearly show the main controllers of tendencies, represented by Elev and RSP, such as the high portion of watersheds that have conserved and currently present long-term pedological processes of soil development that have resulted in an overall greater thickness, in contrast with the dissected portion. The water content and dynamics, represented by TWI, PlanCurv, ProfCurv, influence the variation in the A horizon thickness, and landcover also has a significant importance in the model, represented by surface reflectance variables.
These results show similarity with another study that reported increased accuracy of soil prediction by Random Forest on four datasets across the globe with addition of a coarser scale DEM [58]. These authors also proposed an approach with a variogram of soil properties to a priori approximations of the effective scale for modelling. In the context of MGG, this can be used for scale definition and group arrangement for soil attributes on specific pedoenvironment.
Figure 7. MGG decision tree forest structures for A horizon thickness, geomorphometric group 1:75,000 plus 1:100,000 scales.
Soil pH influences microbiota, nutrient uptake, root growth and therefore plant development in general [59]. Soil pH is strongly related to landuse, or in scorpan, the organism factor. As expected, the most important variable for modeling of this soil attribute, at the original scale, is the surface reflectance, as shown in Figure 8(a). When considering the importance plot of the MGG group 22, illustrated in Figure 8(b), the same five variables have with highest importance index. In this sense, both models show topographic factors with less importance then vegetation factors, in the same proportion, despite twice the number of geomorphometric variables in the MGG.
3.3. Prediction of Superficial Layer Soil Texture
This section discusses the Random Forest prediction of particle size, which is able to produce soil textural classifications for land users and stakeholders and will focus on the question of whether model adjustment can be improved by the MGG, and if the soil particle size maps result in a more accurate soil textural classification.
For prediction of particle size of surface layer, sand and silt fractions were used because of the low contribution of clays in the Alter do Chão lithology. This could be explained by local characteristics of that geological formation, with an overall fine to medium grained sandstone content in the upper portion, and medium and coarse sandstones with small contribution of red claystones in the lower portion [22].
The Random Forest models for silt and sand, at original scale geomorphometrics, had a poor adjustment, as shown in Table 2. When considering the best adjusted MGG groups, despite the considerable portion of randomness related
Figure 8. Variable prediction importance for pH at (a) original geomorphometric scale and (b) multiscale generalized geomorphometrics.
Table 2. Model adjustment for silt and sand.
to the heterogeneity of soils, the % variation explained is reasonably higher and mean of squared residuals is reasonably lower. The model’s predictors have a principal contribution from variables at 1:75,000 and 1:100,000 and can identify tendencies at the watershed scale.
The most significant covariables of particle size prediction are shown in Figure 9(a), Figure 9(b): Elev and RSP at coarser scales, associated with stratigraphy and long-term hillslope transportation; ProfCurv at coarser scales, and PlanCurv at original and coarser scales, related with accumulation, transit and dissipation zones [60].
The prediction of sand and silt content, at original and multiscale generalized geomorphometrics, is illustrated in Figure 10. In both variable groups, the MGG has produced maps with less noise and more recognizable patterns related to geomorphic features. These results corroborate the hypothesis that the topography has an influence, in a larger spatial context, and has prevalence on prediction of soil particle size contents in the tested basin. In contrast, a case study with Random Forest with 30 m and 90 m DEM did not achieve significant differences in prediction [61]. Despite some similarity with covariables importance, like Elev and RSP, the modeling is done on single scale datasets. In this sense, we can argue the importance of observing soil-landscape phenomena from a multiscale perspective.
Figure 9. Variable prediction importance: (a) model of sand by geomorphometric group 1:25,000 plus 1:75,000; (b) model of silt by geomorphometric group 1:75,000 plus 1:100,000.
Figure 10. Predictive maps of silt (a, b) and sand (c, d) at original scale geomorphometrics and multiscale generalized geomorphometrics, respectively.
Table 3. Accuracy evaluation for soil texture classification.
The MGG was able to increase the accuracy of superficial layer soil texture classifications, as shown in Table 3. The most significant improvements occur in the franco-arenosa (MeAr) and areia franca (ArMe) soil texture classes, both with a smaller contribution area at the mapping site in relation to the more relevant areia (MAr) class. Also, the user’s accuracy has a considerably higher result, so the MGG increased the reliability of each mapped class. In the same way, the Kappa index also has higher values for MGG geomorphometric variables.
A case study done on a farm in China, with machine learning on single and multiple scales variables [62], also found better results with a range of appropriate scales, even using only local derivatives. In this sense, the MGG framework has greater potential because DEM transformation before derivate calculation allows for use of both local and regional derivatives in a multiscale group arrangement.
4. Conclusions
This study evaluates the multiscale geomorphometric generalization with the purpose of modeling the soil-landscape relationship. The comparison of original scale with multiscale generalized variables was based on Random Forest prediction of soil attributes. Regarding the proposed methodological framework, the following results and issues stand out from this case study:
● The general geomorphometry generalization tends to smooth slopes and curvatures and produce identifiable representations of relative slope position at sub-watershed and watershed level. The specific geomorphometry generalization results in incorporation into valley portions some local landform segments which originally were classified in topo and slope.
● Random Forest modeling with MGG variables did not mask greater importance of the vegetation factor for prediction of soil attributes related to landcover. The forest structures and effect of variables on prediction agree with pedological knowledge. Comparing these results with those from other studies, it could be argued that the soil-landscape phenomena from a multiscale perspective are highly relevant.
● The MGG improved model adjustment for silt and sand particles and also improved the accuracy of metrics of soil texture classification of surface layer, especially for the most unusual classes, with the Kappa Index going from 43% to 62%. Topography influences at a coarser spatial scale and has prevalence on prediction of soil particle size contents in the studied watershed.
● Future development of the MGG framework should address generalization of DEM concerning feature preservation and comparison of landform classification adaptable at multiple scales.