Coupling Discriminating Statistical Analysis and Artificial Intelligence for Geotechnical Characterization of the Kampemba’s Municipality Soils (Lubumbashi, DR Congo)

This study determines the physical and mechanical characteristics of the soils of the Kampemba urban area, in the city of Lubumbashi, from laboratory (in vitro) tests on field samples. The soils were identified according to their parameters, and a geotechnical classification was established by determining their bearing capacity with the group index method, using the results of the identification tests. With the AASHTO classification method (American Association of State Highway and Transportation Officials), the results revealed five general classes of soil (A-2, A-4, A-5, A-6 and A-7) and, more specifically, eight subgroups (A-2-4, A-2-6, A-2-7, A-4, A-5, A-6, A-7-5 and A-7-6) in the area concerned. Statistical analysis and deep learning based on the multilayer perceptron gave the following global values of the physical parameters: 31.77% ± 1.05% for the liquid limit; 18.71% ± 0.76% for the plastic limit; 13.06% ± 0.79% for the plasticity index; 83.00% ± 3.33% passing the 2 mm sieve; 76.22% ± 3.2% passing the 400 μm sieve; 89.07% ± 2.99% passing the 4.75 mm sieve; 70.62% ± 2.39% passing the 80 μm sieve; 1.66 ± 0.61 for the consistency index; −0.67 ± 0.62 for the liquidity index and 8 ± 1 for the group index.


Introduction
To build stable and sustainable roads or civil engineering structures, it is imperative to treat the supporting soil in order to increase its bearing capacity. This treatment can be carried out only if the physical characteristics of the soil are known. The mechanical behavior of materials used in public works has long been of interest to the scientific community ([1]-[9], among others). In geotechnical engineering, several systems are used to classify geomaterials: each assigns a code to a category of materials with similar physical and/or mechanical properties.
The municipality of Kampemba consists of the Kafubu, Bel-Air 1 and 2, Bongonga, Industriel, Kigoma and Kapemba quarters, and has a total area of 9,741,644.63 m². It lies within the square degree of the city of Lubumbashi, between the parallels 11°36'5.60'' and 11°44'44.91'' of Northing (South Latitude) and the meridians 27°29'59.31'' and 27°33'19.57'' of Easting (East Longitude).
The southern part of Katanga, which includes our study area, has a tropical climate with two alternating seasons and a temperate, continental character linked to altitude and to the distance from the Indian and Atlantic oceanic masses to the east and west, respectively [10]. Annual rainfall is estimated at 1200 mm. The local geology is described in Figure 1. This geological map is a necessary but not sufficient document for civil engineering professionals looking for soils that behave well under the structures they erect. Hence, the subject of this article is the detailed study of the soils and of the weathering state of different parts of the rock mass, leading to a geotechnical map. This map will serve as a guide for civil engineers working in the municipality of Kampemba.
Several soil classification studies have already been carried out for some municipalities of the city of Lubumbashi, including Kampemba [7] [11] [12], using different methods. The present study combines inferential statistics with artificial intelligence (deep learning) based on the multilayer perceptron.
This article has three objectives. The first is to establish the geotechnical classification of the soils of the municipality of Kampemba from the laboratory results. The second is to learn the physical characteristics of the soils by deep learning. The third is to identify the relationships between the physical parameters of the soils using statistical analyses and the multilayer perceptron.

Sample Constitution
In situ sampling was based on the following criteria: soil color, moisture, consis-

Determinant Predictive Variables
The sampling campaign was followed by sieve analysis and consistency limit testing according to AFNOR standards (NF, 1997; NF, 2005) [13] to obtain all baseline data. Consistency limits, or Atterberg limits, are determined only for the fine fraction of a soil, i.e. the fraction passing the 400 μm sieve [13] or the 420 μm sieve [14], as these are the only particles on which water acts by changing the consistency of the soil [15]. In this article, the 400 μm sieve was used. In practice, the liquid limit ($W_L$) is defined as the water content at which a groove closes over about 1 cm under 25 blows; the Casagrande cup was used for its determination. Its value was calculated by two methods:
• the analytical method (Equation (1) and Equation (2));
• the graphical method based on the least-squares line.
These techniques were verified using deep learning based on the multilayer perceptron, which is used in artificial intelligence for regression and prediction.
In Equation (1), $\omega$ is the water content and $N$ the number of blows. The second analytical formula, Equation (2), is that of the Washington State Highway Department [16] [17], in which the liquid limit is determined from a single measurement corresponding to a number of blows between 17 and 36:

$$W_L = \frac{\omega}{1.419 - 0.3\,\log_{10} N} \qquad (2)$$

For this study, the two methods give nearly identical results, with a relative error of 1.97%.
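As an illustration, the two single-point estimates can be compared numerically. This is a minimal sketch: the Washington State Highway Department formula is the one quoted above, while the power-law form $W_L = \omega\,(N/25)^{0.121}$ is the widely used one-point formula and is only assumed here to correspond to Equation (1). The function names are hypothetical.

```python
import math


def liquid_limit_power_law(water_content, blows, exponent=0.121):
    """One-point estimate W_L = w * (N / 25) ** 0.121.

    Assumption: this common one-point formula stands in for Equation (1).
    """
    return water_content * (blows / 25.0) ** exponent


def liquid_limit_wshd(water_content, blows):
    """Washington State Highway Department formula, Equation (2)."""
    if not 17 <= blows <= 36:
        raise ValueError("the formula is stated for 17 to 36 blows")
    return water_content / (1.419 - 0.3 * math.log10(blows))


def relative_gap_percent(a, b):
    """Relative difference between two estimates, in percent of their mean."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0


if __name__ == "__main__":
    w, n = 30.0, 20  # hypothetical measurement: 30% water content at 20 blows
    first = liquid_limit_power_law(w, n)
    second = liquid_limit_wshd(w, n)
    print(first, second, relative_gap_percent(first, second))
```

Under these assumptions the two formulas stay within about 1% of each other over the whole 17 to 36 blow range, which is consistent with the small discrepancy reported in the text.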
In practice, the plastic limit ($W_P$) is the water content of a soil thread that breaks into small segments 1 to 2 cm long at the moment its diameter reaches 3 mm. If the thread breaks at different diameters, several successive tests are made and the least-squares line is drawn to determine the water content corresponding to a diameter of 3 mm.
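The least-squares reading described above can be sketched as follows. The example fits a straight line of water content against breaking diameter and evaluates it at 3 mm; the function names and the sample values are hypothetical, and a simple linear fit is assumed.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def water_content_at_diameter(diameters_mm, water_contents, target_mm=3.0):
    """Water content read on the fitted line at the target thread diameter."""
    slope, intercept = least_squares_line(diameters_mm, water_contents)
    return slope * target_mm + intercept


if __name__ == "__main__":
    # Hypothetical tests: breaking diameter (mm) and water content (%)
    diameters = [2.5, 3.5, 4.0]
    contents = [19.0, 17.0, 16.0]
    print(water_content_at_diameter(diameters, contents))
```

The same helper applies to the graphical determination of the liquid limit, with the logarithm of the number of blows on the horizontal axis and a target of 25 blows.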
The plasticity index ($I_P$) is the difference between the liquid limit and the plastic limit, $I_P = W_L - W_P$. It indicates the extent of the plasticity range. With this index, the soil can be classified according to its degree of plasticity.
The sieve analysis made it possible to calculate the following predictive variables: the content of fine particles ($X_{80}$), the fraction passing the 400 μm sieve ($X_{400}$) and the fraction passing the 2 mm sieve ($X_2$).

Geotechnical Classification of Soils
Burmister, in [16], established a classification for soils. This rule was used in this work for the partial classification, as the materials are very heterogeneous and result from the weathering of several lithological and pedogenetic formations. For the global classification of soils, the method chosen was that of the AASHTO [18], which is based on sieve analysis, liquid limit and plastic limit, such that:
• when the test results required for classification are available, the groups are examined from left to right by successive elimination; the first group from the left into which the test data fit is the group sought;
• the plasticity index of subgroup A-7-5 is less than or equal to $W_L - 30$, and the plasticity index of subgroup A-7-6 is higher than $W_L - 30$.
This classification is completed by the group index $I_g$, which is calculated from the results of the sieve analysis, the liquid limit $W_L$ and the plasticity index $I_P$. This index characterizes the bearing capacity of a soil on the basis of its identification tests. It can be used, on the one hand, to refine the classification of soils and, on the other hand, to evaluate the thickness of pavement sub-base layers according to the formula of Steele given in [16]. According to Steele, the thickness of a pavement foundation depends on five factors [17]: nature of the subsoil, drainage, compaction, climate and the safety coefficient. The author gives the following thicknesses as a function of subgrade quality:
• none for a good subgrade ($I_g$ = 0 or 1);
• 10 cm for a fair subgrade ($I_g$ from 2 to 4);
• 20 cm for a poor subgrade ($I_g$ from 5 to 9);
• 30 cm for a very poor subgrade ($I_g$ from 10 to 20).
This thickness will have to be adapted to the conditions in Lubumbashi during the project study.
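The group index calculation and the thickness rule above can be sketched as follows. The group index formula itself is not reproduced in the text; the sketch assumes the commonly cited AASHTO form bounded between 0 and 20, $I_g = 0.2a + 0.005ac + 0.01bd$, with the 80 μm sieve standing in for the fines sieve. Function names are hypothetical.

```python
def clamp(value, low, high):
    """Restrict value to the interval [low, high]."""
    return max(low, min(high, value))


def group_index(passing_80um, liquid_limit, plasticity_index):
    """Group index I_g = 0.2a + 0.005ac + 0.01bd (assumed form, range 0 to 20).

    a: fines percentage above 35, limited to 0..40
    b: fines percentage above 15, limited to 0..40
    c: liquid limit above 40, limited to 0..20
    d: plasticity index above 10, limited to 0..20
    """
    a = clamp(passing_80um - 35.0, 0.0, 40.0)
    b = clamp(passing_80um - 15.0, 0.0, 40.0)
    c = clamp(liquid_limit - 40.0, 0.0, 20.0)
    d = clamp(plasticity_index - 10.0, 0.0, 20.0)
    return 0.2 * a + 0.005 * a * c + 0.01 * b * d


def subbase_thickness_cm(ig):
    """Sub-base thickness from the rounded group index, per the rule quoted in the text."""
    ig = round(ig)
    if ig <= 1:
        return 0
    if ig <= 4:
        return 10
    if ig <= 9:
        return 20
    return 30


if __name__ == "__main__":
    ig = group_index(70.62, 31.77, 13.06)  # global means quoted in the abstract
    print(ig, subbase_thickness_cm(ig))
```

With the global mean values quoted in the abstract ($X_{80}$ = 70.62%, $W_L$ = 31.77%, $I_P$ = 13.06%), this sketch returns about 8.3, which is close to the reported global group index of 8 ± 1, although the index of the means need not equal the mean of the indices.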
To avoid repetitive operations in data processing, a geotechnical mapping computer program based on deep learning (complex artificial neural networks) [7] and built according to the AASHTO classification was used; its user interface is shown in Figure 3.
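For readers who want to check a classification by hand, the left-to-right elimination of the AASHTO table can be written as plain rules. This is not the authors' program, which is based on neural networks; it is a simplified rule-based sketch that assumes the usual AASHTO limits and uses the 2 mm, 400 μm and 80 μm sieves in place of the standard sieves. The function name is hypothetical.

```python
def classify_aashto(passing_2mm, passing_400um, passing_80um,
                    liquid_limit, plasticity_index):
    """Return the AASHTO group by successive elimination (simplified sketch).

    Assumptions: usual AASHTO limits; continuous thresholds (<= 40, > 40)
    instead of the integer pairs 40/41 and 10/11 of the printed table.
    """
    ll, pi = liquid_limit, plasticity_index
    if passing_80um <= 35.0:  # granular materials
        if passing_2mm <= 50.0 and passing_400um <= 30.0 and passing_80um <= 15.0 and pi <= 6.0:
            return "A-1-a"
        if passing_400um <= 50.0 and passing_80um <= 25.0 and pi <= 6.0:
            return "A-1-b"
        if passing_400um > 50.0 and passing_80um <= 10.0 and pi == 0.0:
            return "A-3"  # non-plastic fine sand
        if ll <= 40.0:
            return "A-2-4" if pi <= 10.0 else "A-2-6"
        return "A-2-5" if pi <= 10.0 else "A-2-7"
    # silt-clay materials (more than 35% fines)
    if ll <= 40.0:
        return "A-4" if pi <= 10.0 else "A-6"
    if pi <= 10.0:
        return "A-5"
    # A-7 is split by comparing the plasticity index with W_L - 30
    return "A-7-5" if pi <= ll - 30.0 else "A-7-6"


if __name__ == "__main__":
    # Global mean values quoted in the abstract
    print(classify_aashto(83.00, 76.22, 70.62, 31.77, 13.06))
```

Applied to the global mean values quoted in the abstract, these rules return A-6, the class reported in the paper as the modal class.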

Analysis of the Data
For this article, two families of methods were used: statistical methods and artificial intelligence methods based on complex neural networks (deep learning).

1) Descriptive statistics
The summary statistics used are: mean, first quartile, second quartile (median), third quartile, unbiased standard deviation, coefficient of variation, skewness, kurtosis and 95% confidence interval.
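These summary statistics can be computed with the Python standard library alone, as in the sketch below. Several choices are assumptions: skewness and kurtosis are taken as simple moment ratios (statistical packages often apply small-sample corrections), the quartiles follow the default convention of `statistics.quantiles`, and the Student critical value is passed in by the caller. The function name `describe` is hypothetical.

```python
import math
import statistics


def describe(values, t_critical):
    """Summary statistics of a sample.

    t_critical is the two-sided 95% Student quantile for len(values) - 1
    degrees of freedom, supplied by the caller (e.g. 2.776 for 4 degrees).
    """
    n = len(values)
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    std = statistics.stdev(values)  # unbiased (n - 1) standard deviation
    # Central moments used for the moment-ratio skewness and kurtosis
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    half_width = t_critical * std / math.sqrt(n)
    return {
        "mean": mean,
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "std": std,
        "cv_percent": 100.0 * std / mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
        "ci95": (mean - half_width, mean + half_width),
    }


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # synthetic values, not the study's data
    for key, value in describe(sample, t_critical=2.776).items():
        print(key, value)
```

A coefficient of variation above 15%, a non-zero skewness or a non-zero excess kurtosis in this output corresponds to the heterogeneity, asymmetry and flattening discussed for the Kampemba samples.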

2) Inferential statistics
For this analysis, the following tools were used: box plots for detecting outliers, the Kolmogorov-Smirnov test for normality, and discriminant factor analysis (DFA) to highlight trends. DFA was used to explain and predict the membership of individuals in the geotechnical classes on the basis of quantitative or qualitative explanatory variables.
This method, which is both explanatory and predictive, can be used to:
• check on a two- or three-dimensional graph whether the groups to which the observations belong are distinct;
• identify the characteristics of the groups on the basis of explanatory variables;
• predict the group to which an observation belongs.
To detect multi-collinearity and identify the variables involved, each variable must be regressed linearly on the others. It is then recommended to calculate:
• the $R^2$ of each model: if $R^2$ is 1, there is an exact linear relationship between the dependent variable of the model (Y) and the explanatory variables (X);
• the tolerance of each model, equal to $1 - R^2$: it is used in several methods (linear regression, logistic regression and discriminant factor analysis) as a criterion for filtering variables. If a variable has a tolerance below a fixed threshold (the tolerance is calculated by taking into account the variables already used in the model), it is not allowed to enter the model because its contribution is negligible and it could lead to numerical problems;
• the VIF (Variance Inflation Factor), which is equal to the inverse of the tolerance.
Detecting multi-collinearity within a group of variables is useful in particular:
• to identify structures in the data and to derive operational decisions from them;
• to avoid numerical problems in some calculations.
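The tolerance and VIF defined above can be computed by regressing each variable on all the others, as in the following sketch. It uses only the standard library, solves the normal equations by Gaussian elimination, and runs on synthetic columns that are not the study's data; all names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 of the least-squares regression of y on the predictors, with intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(design[k][i] * y[k] for k in range(n)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def tolerance_and_vif(columns):
    """Map each variable name to (tolerance, VIF) = (1 - R^2, 1 / tolerance)."""
    result = {}
    for name, y in columns.items():
        others = [v for k, v in columns.items() if k != name]
        tolerance = 1.0 - r_squared(y, others)
        result[name] = (tolerance, float("inf") if tolerance == 0 else 1.0 / tolerance)
    return result


if __name__ == "__main__":
    # Synthetic data: "b" is almost exactly 2 * "a", "c" is unrelated
    data = {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "b": [2.1, 3.9, 5.9, 8.1, 10.1, 11.9],
        "c": [5.0, 3.0, 6.0, 2.0, 7.0, 4.0],
    }
    for name, (tol, vif) in tolerance_and_vif(data).items():
        print(name, tol, vif)
```

In this synthetic example the two nearly proportional columns get a VIF far above 10, the threshold used later in the paper to flag collinear variables, while the unrelated column stays close to 1.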

Deep Learning Methods
A neural network is a mesh of several neurons organized in layers. The $S$ neurons of a single layer are all connected to the $R$ inputs; the layer is then said to be fully connected. A weight $w_{i,j}$ is associated with each connection. The first index $i$ (row) designates the number of the neuron in the layer, while the second index $j$ (column) specifies the number of the input. The set of weights in a layer forms a matrix $W$ of dimensions $S \times R$ (Figure 4). The mathematical model of an artificial neuron, shown in Figure 4, consists essentially of an integrator that performs a weighted summation of its inputs. The result $n$ of this sum is then transformed by a transfer function $f$, which produces the output $a$ of the neuron. The $R$ inputs of the neuron form an input vector, and a second vector holds the synaptic weights of the neuron [20] [21].
The output $n$ of the integrator is the weighted sum of the inputs combined with $b$, the activation bias or threshold of the neuron. This activation level $n$ is then transformed by a transfer function $f$ that produces the output $a$ of the neuron (Equation (6)).
In these equations, $W$ is the matrix of synaptic weights and $t$ is time.
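The neuron model described in this section can be sketched in a few lines. The sign convention for the bias is an assumption: the sketch adds $b$ to the weighted sum, whereas some texts subtract a threshold. The input vector is called `inputs` here, and all function names are hypothetical.

```python
import math


def neuron_output(weights, inputs, bias, transfer):
    """Output a = f(n) of one neuron, with n = sum_j w_j * p_j + b (assumed sign)."""
    n = sum(w * p for w, p in zip(weights, inputs)) + bias
    return transfer(n)


def layer_output(weight_matrix, inputs, biases, transfer):
    """Outputs of a fully connected layer of S neurons fed by R inputs.

    weight_matrix has S rows (one per neuron) and R columns (one per input),
    matching the S x R matrix described in the text.
    """
    return [
        neuron_output(row, inputs, bias, transfer)
        for row, bias in zip(weight_matrix, biases)
    ]


if __name__ == "__main__":
    weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # S = 2 neurons, R = 3 inputs
    print(layer_output(weights, [1.0, 2.0, 3.0], [0.0, -1.0], math.tanh))
```

Stacking several calls to `layer_output`, each fed by the previous one, gives the multilayer perceptron used later in the paper.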
Several activation functions are used to solve different problems, and those considered here were selected for regression (function approximation), which is the type of problem addressed in this paper. For this type of problem, the multilayer perceptron architecture was chosen.
This is a static model, i.e. one that does not consider time, because only variables describing the nature of the soil were used in the analysis. These variables are intrinsic characteristics of soils.

Statistical Analysis
Before using advanced methods of analysis, it is first necessary to explore the data in order to identify trends, detect anomalies or simply obtain essential information such as the minimum, maximum or average of a sample [22]. Table 1 and Table 2 show the summary values obtained. Without distinction of geotechnical classes, the soils of Kampemba have the geotechnical properties given in Table 1. The confidence interval was calculated with the Student statistic at 95%. Table 2 and Table 3 provide details of all summary statistics of the quantitative and qualitative variables.
The coefficient of variation of all variables is greater than 15%. This reflects significant variability in the sample (the materials are very heterogeneous); hence the mean alone is not a good summary of the whole sample. The distribution of values is asymmetric for all variables:
• the liquid limit, the plastic limit and all the sieve analysis parameters show a left-hand asymmetry;
• the plasticity index and the group index show a right-hand asymmetry.
With respect to the flattening of the curve:
• the liquid limit, the plastic limit, the plasticity index, the fine particle content and the fractions passing the 400 μm and 2 mm sieves show a platykurtic (hyponormal) curve;
• the fraction passing the 4.75 mm sieve shows a leptokurtic (hypernormal) curve;
• the group index shows a normal distribution.
In the box plots of Figure 5, the box represents the distance between Q1 and Q3 of the sample. The horizontal line inside the box represents the median and the "+" represents the mean. The vertical lines on each side of the box extend to the minimum and maximum values of the sample.
The limits beyond which data can be considered potential outliers are the lower limit Q1 − 1.5 (Q3 − Q1) and the upper limit Q3 + 1.5 (Q3 − Q1). These box plots show almost no outliers in our statistical series. The analyzed variables are asymmetric because of the very high heterogeneity of the soils of the studied site.
The geotechnical properties by class are given in Table 3. Soils of class A-6 constitute the modal class. Soils of class A-2 correspond to the lateritic soils of this municipality.
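The box-plot limits used earlier in this section for outlier detection can be computed as in the sketch below. The quartile convention is an assumption (the "inclusive" method of `statistics.quantiles`); other software may interpolate quartiles differently. The values are synthetic and the function names hypothetical.

```python
import statistics


def box_plot_limits(values):
    """Return (lower, upper) = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR), with IQR = Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(values):
    """Values lying outside the box-plot limits."""
    lower, upper = box_plot_limits(values)
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # synthetic values
    print(box_plot_limits(sample), potential_outliers(sample))
```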
Given the very high heterogeneity of these soils, a discriminant factor analysis (DFA) was carried out in relation to the AASHTO classification. The results are shown in Table 4. Regarding normality by soil class:
• Soils A-7-6: the parameters that follow the normal (Gaussian) law are $W_P$, $I_P$, $X_{80}$ and $I_g$;
• Soils A-7-5: the parameters that follow the normal law are $W_L$, $W_P$, $I_P$, $X_{400}$, $X_{80}$ and $I_g$;
• Soils A-4 and A-2-4: no variable follows the normal law.
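A normality screening of this kind can be sketched with the Kolmogorov-Smirnov statistic, the test named in the methods. The sketch below is a simplified version: the mean and standard deviation are estimated from the sample and the statistic is compared with the asymptotic 5% critical value $1.36/\sqrt{n}$, which is conservative in that situation; statistical software typically reports an exact p-value instead. Function names and samples are hypothetical.

```python
import math
import statistics


def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    data = sorted(values)
    n = len(data)
    fitted = statistics.NormalDist(statistics.fmean(data), statistics.stdev(data))
    d = 0.0
    for i, x in enumerate(data):
        cdf = fitted.cdf(x)
        # Compare with the empirical CDF just after and just before x
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def looks_normal(values, coefficient=1.36):
    """True when the statistic is below the asymptotic 5% critical value (assumption)."""
    return ks_statistic_normal(values) < coefficient / math.sqrt(len(values))


if __name__ == "__main__":
    # Synthetic samples: evenly spaced normal quantiles, and a two-cluster sample
    near_normal = [statistics.NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
    two_clusters = [0.0] * 20 + [10.0] * 20
    print(looks_normal(near_normal), looks_normal(two_clusters))
```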
The first test carried out to detect correlations between the variables is the one based on the Pearson correlation matrix presented in Table 5.
Since the value of $R^2$ alone is not sufficient to demonstrate statistically significant correlations, deep learning with artificial neural networks is performed in a later section. These methods make it possible to approximate any function that may exist between the variables, whatever the distribution law of each of them.
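A Pearson correlation matrix such as the one in Table 5 can be computed as in the following sketch, which also illustrates why a linear coefficient alone can miss a relationship: a perfectly determined but non-linear dependence may give a coefficient of zero. The data are synthetic and the function names hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dictionary of pairwise Pearson coefficients between named columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


if __name__ == "__main__":
    x = [-2.0, -1.0, 0.0, 1.0, 2.0]
    data = {
        "x": x,
        "linear": [2.0 * v + 1.0 for v in x],  # exact linear function of x
        "square": [v * v for v in x],          # exact but non-linear function of x
    }
    matrix = correlation_matrix(data)
    print(matrix["x"]["linear"], matrix["x"]["square"])
```

Here the squared variable is fully determined by `x`, yet its Pearson coefficient with `x` is zero, which is the kind of relationship a neural network regression can still capture.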
The results of the global multi-collinearity statistics are shown in Table 6.
These results show that, without distinction of geotechnical soil class, the following variables are collinear:
• the fractions passing the 2 mm and 400 μm sieves (variance inflation factor VIF > 10);
• the liquidity index and the consistency index.
Multi-collinearity analysis by soil class shows that AASHTO classes with an index of 5 have several variables that are more strongly correlated with one another than in the other classes (Table 7).

Deep Learning of the Physical Characteristics of Kampemba's Soils
As defined in [20], a neuron is a bounded nonlinear function. This method made it possible to test different functions. Only the Exponential Linear Unit (ELU), $f(x) = \alpha\,(e^{x} - 1)$ for $x < 0$ and $f(x) = x$ for $x \ge 0$, used in the hidden and output layers, gave a small error between known and predicted values (Table 8). Increasing the number of hidden layers and the number of neurons degrades performance. The selected model is the one with a single hidden layer of eight neurons, based on the AASHTO classification.

[Table 8 (partially recovered): $W_L$, $I_P$, $X_{80}$, $I_g$; 42 after 50 iterations; 38 after 1500 iterations; Figure 10.]

The model in Figure 9 shows that:
• A-4 soils have an inhibitory activity owing to their very low to zero plasticity, which leads to an increase in bias;
• plastic soils (those with index 6 and 7) enhance synaptic activities.

The model in Figure 10 shows that:
• soil consistency characteristics play an important part in the evaluation of the group index;
• granularity has a predominant effect on soil classes with an index of 4;
• network performance is improved by removing data from geotechnical classes that inhibit synaptic activities.

These methods confirmed the correlations between the variables identified by the statistical methods. Figure 11 shows the final geotechnical map of the municipality of Kampemba after all statistical and deep learning analyses.
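The architecture retained in this section (one hidden layer of eight neurons, with ELU in the hidden and output layers) can be sketched as a forward pass. This is only an illustration under stated assumptions: the weights are random rather than trained, the inputs are assumed to be scaled to a comparable range, the choice of three inputs and one output is hypothetical, and the training procedure used by the authors is not reproduced.

```python
import math
import random


def elu(x, alpha=1.0):
    """Exponential Linear Unit: alpha * (exp(x) - 1) for x < 0, x otherwise."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def init_mlp(n_inputs, n_hidden=8, n_outputs=1, seed=0):
    """Random, untrained parameters for a perceptron with one hidden layer."""
    rng = random.Random(seed)

    def random_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": random_matrix(n_hidden, n_inputs),
        "b1": [0.0] * n_hidden,
        "w2": random_matrix(n_outputs, n_hidden),
        "b2": [0.0] * n_outputs,
    }


def forward(params, inputs, alpha=1.0):
    """Forward pass: inputs -> hidden layer (ELU) -> output layer (ELU)."""
    hidden = [
        elu(sum(w * x for w, x in zip(row, inputs)) + b, alpha)
        for row, b in zip(params["w1"], params["b1"])
    ]
    return [
        elu(sum(w * h for w, h in zip(row, hidden)) + b, alpha)
        for row, b in zip(params["w2"], params["b2"])
    ]


if __name__ == "__main__":
    model = init_mlp(n_inputs=3)  # e.g. three scaled soil parameters (hypothetical)
    print(forward(model, [0.32, 0.13, 0.71]))
```

Because ELU is linear for positive arguments and bounded below for negative ones, the same function can serve in the output layer of a regression network, which matches the choice reported above.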

Geotechnical Mapping of the Area
With the help of the AASHTO classification, the soils of Kampemba are divided into eight subgroups, which are detailed in the following geotechnical classes arranged in order of increasing group index. The statistical analyses made it possible to determine the values of the parameters considered in the study, as presented below:
• A-2-4 soils: this material is classified as silty gravels and sands (medium or stiff clay type with the presence of illites and kaolinites). It is a soil of low plasticity and a good subsoil for construction according to its group index, with the following parameter values: $W_L$: 28.27% ± 4.24%; $W_P$: 19.51% ± 4.86%; $I_P$: 8.76% ± 0.82%; $X_2$ (mm): 42.80% ± 13.96%; $X_{400}$ (μm): 34.21% ± 6.62%; $X_{4.75}$ (mm): 68.05% ± 13.01%; $X_{80}$ (μm): 24.88% ± 8.11%; $I_g$: 0.

Figure 11. Final geotechnical map of the area of study.

Conclusions
According to the AASHTO classification, the soils of the municipality of Kampemba are divided into five major groups (A-2, A-4, A-5, A-6 and A-7), which are further subdivided into eight subgroups with the following geotechnical characteristics:
• Soils A-2-4: gravelly sands with silt, with a group index of $I_g$ = 0;
• Soils A-2-6: gravelly sands to clays, with a group index of $I_g$ = 0;
• Soils A-2-7: gravelly sands with active clays, with a group index of $I_g$ = 0;
• Soils A-4: loamy to sandy soils, with a group index of $I_g$ = 0;
• Soils A-5: plastic loamy to sandy soils, with a group index of $I_g$ = 12 ± 13;
• Soils A-6: clayey to sandy soils, with a group index of $I_g$ = 10 ± 1;
• Soils A-7-5: active clayey to sandy soils, with a group index of $I_g$ = 13 ± 9;
• Soils A-7-6: also active clayey to sandy soils, with a group index of $I_g$ = 10 ± 2.
A-2 soils are of good quality, i.e. suitable for supporting the foundations of civil engineering works. Soils of groups A-4 and A-5 are also suitable for foundations but are markedly vulnerable to liquefaction, erosion and leaching, and therefore require treatment to ensure safe operation. Soils of groups A-6 and A-7 (A-7-5 and A-7-6) are not good foundation soils, but may have other advantages, as they form impermeable substrates. It should be noted that A-6 soils are good substrates if their group index is low. Among other uses, these soils can be considered a usable reserve of clay for pottery and a source for the production of ceramic materials, stabilised clay bricks and refractory materials.
Artificial intelligence methods based on deep learning with the multilayer perceptron confirmed the inter-variable correlations identified by the statistical methods. These techniques are now widely used in engineering because they can detect any function linking the variables. Statistical methods are often limited to linear correlations using the Pearson matrix, i.e. to two variables, one as input (X) and the other as output (Y). The artificial intelligence technique has the further advantage of identifying geomaterials from non-quantitative datasets, based on the labels of the training samples.
The mechanical behavior of civil engineering infrastructures is correlated with the behavior of the materials making up the supporting soil. It is therefore very important to identify the supporting soil and to control its behavior. The present work is therefore an important scientific tool in this sense. One of the scientific interests of this study is to contribute to the identification and choice of quality materials that can be used in public works in the area of the study.