Geographical Traceability of Clinacanthus nutans with Near-Infrared Spectroscopy and Chemometrics

In this study, a seed origin discrimination model for Clinacanthus nutans was developed. First, 81 C. nutans samples from three seed origin locations were collected, and their Near-Infrared (NIR) spectra were obtained. Next, Principal Component Analysis (PCA) was performed on the NIR spectra of the 81 samples. Then, multiplicative scatter correction (MSC), standard normal variate (SNV), first derivative, and second derivative pre-treatments were applied to the spectra and combined with the Support Vector Machine (SVM) algorithm for modelling and analysis. Among these methods, first derivative pre-treatment gave the most effective SVM model, with a training set accuracy of 93.44% (57/61) and a test set accuracy of 85.00% (17/20). To further improve the discrimination accuracy of the model, three optimization algorithms, Grid Search (GS), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO), were employed to identify the best c and g parameters for the SVM model. The results demonstrated that PSO yielded the best parameters, c = 0.8343 and g = 57.8741, with a corresponding training set accuracy of 96.36% (60/61) and a test set accuracy of 95.00% (19/20). Therefore, developing a seed origin classification model for C. nutans based on NIR spectroscopy combined with chemometrics is feasible and has the advantages of being simple, rapid, and green.


Introduction
Clinacanthus nutans (C. nutans), known as the alligator flower or Sabah snake grass, is a plant belonging to the genus Clinacanthus in the family Acanthaceae.
It is found primarily in southern and southwestern China as well as Malaysia, Indonesia, and Thailand [1] [2] [3]. Numerous studies of the chemical composition of C. nutans have confirmed that it is a rich source of flavonoids, phenolics, steroids, triterpenoids, cerebrosides, glycoglycerolipids, glycerides, and sulfur-containing glycosides, which make it a useful folk medicine and an interesting health food [4] [5] [6] [7]. Moreover, various compositional and health studies have concluded that C. nutans herbal tea has considerable potential as a natural antioxidant source. In summary, C. nutans may benefit people's health and represents a valuable economic resource.
As the demand for healthy food grows, consumer attitudes are slowly changing, and C. nutans is attracting greater interest because of its benefits. Therefore, accurate determination of the origin of C. nutans seeds is scientifically important and has applications in relevant medicines and health food materials, as well as in establishing product quality standards [8].
Currently, the identification of the origin of C. nutans seeds and determination of C. nutans composition are performed primarily using High-Performance Liquid Chromatography (HPLC) [9] [10] [11] and Gas Chromatography-Mass Spectrometry (GC-MS) [12] [13] [14]. However, high equipment cost, complicated operation, and the need for chemical reagents have restricted their widespread use. Therefore, it is of great significance to develop a rapid, simple, and green method for identifying the origin of C. nutans seed.
Near-Infrared (NIR) spectroscopy primarily reveals the overtone bands and combination bands of fundamental vibrations of X-H functional groups (such as C-H, O-H, and N-H) [15] [16] [17] [18]. It not only provides rich qualitative and quantitative information but is also rapid and simple and does not require chemical reagents. The technique has now been applied in agriculture [19], food science [20], medicine [21], and other fields. Researchers have used NIR spectroscopy combined with chemometrics to confirm the geographical origin of durian and have found good application prospects [22]. Herrero Latorre, Peña Crecente, García Martín, and Barciela García [5] used NIR spectroscopy combined with pattern recognition technology to identify honey samples from different sources, developing a fast and simple food authentication system that distinguishes authentic PGI-Galicia honey samples from commercial honey samples of other origins. C. nutans contains different X-H functional groups with significant absorption in the NIR region. However, there have been few reports on the use of NIR spectroscopy to identify the origin of C. nutans seeds. In this study, we collected and analyzed 81 C. nutans samples from three geographic locations: Malaysia, Hainan (China), and Guangxi (China). By combining NIR spectroscopy and chemometrics, we established a seed origin classification model for C. nutans with high classification accuracy.

Experimental Samples
The 81 C. nutans samples used in the study originated from Malaysia, Hainan, and Guangxi, China, of which 39 originated from Malaysia, 30 originated from Hainan, and 12 originated from Guangxi. All samples were identified by experts from the Institute of Medicinal Plant Development of Guangdong Academy of Agricultural Sciences.

Spectral Acquisition
We employed a NIRS XDS Rapid Content Analyzer with a dispersive grating (FOSS, Denmark) and its diffuse reflectance accessories. The spectrum acquisition range was 400-2500 nm, and the detectors were Si (400-1100 nm) and PbS (polycrystalline lead sulphide; 1100-2500 nm). Spectra were sampled at 2 nm intervals over this range. Each C. nutans sample was scanned three times and the scans were averaged, giving a total of 81 spectra.
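As a rough illustration of the acquisition settings above, the following sketch builds the wavelength axis and averages three replicate scans. It assumes that both endpoints (400 nm and 2500 nm) are included in the sampled grid; the function name `average_scans` and the toy scan values are hypothetical and are not taken from the instrument software.

```python
# Wavelength axis: 400-2500 nm at 2 nm intervals (assumption: both endpoints included).
wavelengths_nm = list(range(400, 2501, 2))


def average_scans(scans):
    """Average replicate scans point by point (all scans must have equal length)."""
    return [sum(values) / len(values) for values in zip(*scans)]


# Toy example with three two-point "scans" (hypothetical absorbance values).
toy_scans = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
toy_mean = average_scans(toy_scans)
```

Under this assumption each spectrum would contain 1051 data points.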

Sample Set Partitioning
Currently, sample selection methods primarily include the random sampling method, the Kennard-Stone (KS) method, the duplex method, and the sample set partitioning based on joint X-Y distance (SPXY) method. The SPXY method is a sample partitioning method based on the KS method that can be effectively applied to the analysis of the spectral calibration model [23]. Compared to the KS method, the SPXY method considers both the x and y variables when calculating the spatial distance between samples. The formula for calculating the spatial distance of the x variable is the same as in the KS method (Equation (1)). Equation (2) gives the formula for calculating the spatial distance of the y variable.
The stepwise selection process of the SPXY method is similar to that of the KS method, except that $d_{xy}(p, q)$ replaces $d_x(p, q)$. Here $d_{xy}(p, q)$ is the standardized xy distance, normalized so that the samples have the same weight in the x- and y-spaces. The formula for this calculation is shown in Equation (3).
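For reference, in the standard formulation of SPXY the three distances referred to as Equations (1) to (3) are usually written as below. The symbols $N$ (number of samples) and $J$ (number of spectral variables) are notation assumed here, and the layout may differ from the authors' original typesetting.

```latex
\begin{align}
d_x(p, q) &= \sqrt{\sum_{j=1}^{J} \left[ x_p(j) - x_q(j) \right]^2},
  \qquad p, q \in [1, N] \tag{1} \\
d_y(p, q) &= \sqrt{(y_p - y_q)^2} = \lvert y_p - y_q \rvert,
  \qquad p, q \in [1, N] \tag{2} \\
d_{xy}(p, q) &= \frac{d_x(p, q)}{\max_{p, q \in [1, N]} d_x(p, q)}
  + \frac{d_y(p, q)}{\max_{p, q \in [1, N]} d_y(p, q)} \tag{3}
\end{align}
```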
In this study, the SPXY method was used to partition the 81 C. nutans samples into a training set and a test set at a 3:1 ratio. There were 61 C. nutans samples in the training set and 20 C. nutans samples in the test set. The details on sample partitioning according to region are shown in Table 1.
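The selection procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `spxy_split` is hypothetical, the y values are assumed to be numeric (for class labels this means integer codes, which imposes an arbitrary ordering), and at least two training samples are requested.

```python
import math


def spxy_split(X, y, n_train):
    """Return (train_indices, test_indices) chosen by an SPXY-style selection."""
    n = len(X)
    # Pairwise distances in x-space (Euclidean) and y-space (absolute difference).
    dx = [[math.dist(X[p], X[q]) for q in range(n)] for p in range(n)]
    dy = [[abs(y[p] - y[q]) for q in range(n)] for p in range(n)]
    # Normalize by the largest distance so x and y carry the same weight.
    max_dx = max(max(row) for row in dx) or 1.0
    max_dy = max(max(row) for row in dy) or 1.0
    dxy = [[dx[p][q] / max_dx + dy[p][q] / max_dy for q in range(n)]
           for p in range(n)]
    # Seed the training set with the two most distant samples.
    pairs = ((p, q) for p in range(n) for q in range(p + 1, n))
    first, second = max(pairs, key=lambda pq: dxy[pq[0]][pq[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Repeatedly add the sample farthest from its nearest selected neighbour.
    while len(selected) < n_train:
        best = max(remaining, key=lambda i: min(dxy[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining


# Toy data: four one-variable "spectra" with numeric labels (hypothetical values).
toy_X = [[0.0], [1.0], [2.0], [10.0]]
toy_y = [0, 0, 1, 2]
train_idx, test_idx = spxy_split(toy_X, toy_y, n_train=3)
```

In the toy example the two extreme samples are picked first, and the sample that duplicates information already in the training set is left for testing.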

Algorithm
Support vector machine (SVM) is a machine learning method based on statistical learning theory. It has many unique advantages in solving small sample, nonlinear, and high-dimensional pattern recognition problems [24].
The sample training set is represented by $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an input sample and $y_i \in \{-1, +1\}$ is its corresponding expected output. SVM can identify the optimal hyperplane $\omega \cdot x + b = 0$ (where $\omega$ is the normal vector of the plane and $b$ is the offset that determines the distance from the plane to the origin) between two categories of data. In cases of linear separability, the plane partitions the data into two categories, and the margin between the two categories is $2 / \lVert \omega \rVert$. The classifier is:

$$f(x) = \operatorname{sign}\{\omega \cdot x + b\}$$

In cases of nonlinearity, SVM maps data from low-dimensional space to high-dimensional space. The classifier is:

$$f(x) = \operatorname{sign}\Big\{\sum_{i=1}^{n} a_i y_i K(x_i, x) + b\Big\}$$

Here, sign{} is the sign function, $a_i$ is a Lagrange multiplier, $x_i$ is a training sample, $x$ is a sample to be classified, and $K(x_i, x)$ is a kernel function. Selecting the most appropriate kernel function is the most important step in developing a high-performance SVM model and usually involves two parts: selecting an appropriate kernel function type, and optimizing the important parameters once the kernel function type has been determined. Studies have found that models developed with the radial basis function (RBF) kernel have good learning ability. Therefore, the RBF kernel function was used in this study to implement SVM modelling. The two important parameters of the RBF kernel function are the penalty parameter c and the kernel function parameter g. These two parameters have significant effects on the complexity, approximation error, and measurement accuracy of the model. Therefore, it is necessary to optimize these two parameters.

Table 1. Partitioning of the C. nutans samples by region.

| Sample set | Malaysia | Hainan | Guangxi | Total |
|---|---|---|---|---|
| Training set | 28 | 23 | 10 | 61 |
| Test set | 11 | 7 | 2 | 20 |
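To make the nonlinear classifier concrete, the sketch below evaluates a two-class decision function with an RBF kernel. It assumes the common RBF form exp(-g * ||u - v||^2), which the text does not spell out; the support vectors, multipliers, and bias are made-up toy values, and the function names are hypothetical.

```python
import math


def rbf_kernel(u, v, g):
    """RBF kernel, assumed here to be exp(-g * squared Euclidean distance)."""
    return math.exp(-g * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decide(x, support_vectors, labels, alphas, b, g):
    """Return +1 or -1 from the sign of the kernel expansion plus bias."""
    score = sum(a * y * rbf_kernel(sv, x, g)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1


# Toy model with one support vector per class (hypothetical values).
toy_svs = [[0.0, 0.0], [2.0, 2.0]]
toy_labels = [1, -1]
toy_alphas = [1.0, 1.0]
near_first = svm_decide([0.1, 0.1], toy_svs, toy_labels, toy_alphas, b=0.0, g=0.5)
near_second = svm_decide([1.9, 1.9], toy_svs, toy_labels, toy_alphas, b=0.0, g=0.5)
```

A larger g makes the kernel decay faster with distance, so each support vector influences a smaller neighbourhood; this is one reason g needs to be tuned together with c.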
Commonly used parameter optimization algorithms include the grid search algorithm (GS), the genetic algorithm (GA), and the particle swarm optimization algorithm (PSO). GS is a traversal algorithm that tries all (c, g) parameter pairs on a grid and then selects the pair with the highest cross-validation accuracy as the optimal parameters [25]. GA is a computational model that simulates the natural selection and genetic mechanisms of Darwin's theory of evolution and is a method of searching for an optimal solution [26]. PSO is a population-based stochastic optimization method. Imitating the swarm behavior of herds, birds, insects, and fish, each member of the group constantly changes its search pattern by learning from its own experience and that of other members [27].
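The difference between GS and PSO can be sketched on a toy problem. The block below is an illustration only: `toy_cv_accuracy` is a hypothetical stand-in for the cross-validation accuracy of an SVM (it simply peaks at c = 1, g = 2), and the swarm size, inertia weight, and acceleration constants are assumed values, not the settings used in this study.

```python
import random


def toy_cv_accuracy(c, g):
    """Hypothetical fitness surface standing in for cross-validation accuracy."""
    return 1.0 / (1.0 + (c - 1.0) ** 2 + (g - 2.0) ** 2)


def grid_search(fitness, c_values, g_values):
    """Try every (c, g) pair on the grid and keep the fittest one."""
    return max(((c, g) for c in c_values for g in g_values),
               key=lambda cg: fitness(*cg))


def pso(fitness, bounds, n_particles=30, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize fitness with a basic global-best particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(*p) for p in pos]
    best = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            fit = fitness(*pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


powers_of_two = [2 ** k for k in range(-3, 4)]
gs_best = grid_search(toy_cv_accuracy, powers_of_two, powers_of_two)
pso_best, pso_fit = pso(toy_cv_accuracy, bounds=[(0.01, 5.0), (0.01, 5.0)])
```

GS can only return points that lie on the grid, whereas PSO searches the continuous range, which is consistent with the non-round parameter values reported for PSO later in the paper.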

Model Evaluation Indicators
Model evaluation is used to measure the parameter space and feature extraction effectiveness of different models. The performance of classification models is generally evaluated by the accuracy on the test set [28]. The closer the accuracy is to 1, the better the classification effectiveness of the model. Classification accuracy is obtained by testing the established model on the test set and is computed as the ratio of the number of correctly classified samples to the total number of samples. In this experiment, the accuracy and the confusion matrix are used to evaluate the performance of the multi-class model, and the calculation formula is as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

In the equation, TP represents the number of positive samples from the pre-training set that were correctly classified by the model, FN represents the number of positive samples from the pre-training set that were wrongly classified by the model, FP represents the number of negative samples from the pre-training set that were wrongly classified by the model, and TN represents the number of negative samples from the pre-training set that were correctly classified by the model.
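The calculation can be illustrated with a small multi-class confusion matrix. The matrix below is hypothetical: its row totals match the test set sizes (11 Malaysia, 7 Hainan, 2 Guangxi) and it contains a single error, but the position of that error is an assumption and is not read from Figure 5. The function names are likewise illustrative.

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy: correctly classified samples (diagonal) over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, TN, FN) when class k is treated as the positive class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # class k samples predicted as other
    fp = sum(row[k] for row in matrix) - tp   # other samples predicted as class k
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


# Rows: true class; columns: predicted class (Malaysia, Hainan, Guangxi).
# Hypothetical matrix with one Hainan sample misassigned to Malaysia.
confusion = [
    [11, 0, 0],
    [1, 6, 0],
    [0, 0, 2],
]
overall = accuracy_from_confusion(confusion)
tp, fp, tn, fn = one_vs_rest_counts(confusion, 1)
hainan_accuracy = (tp + tn) / (tp + fp + tn + fn)
```

With 19 of 20 samples on the diagonal, the overall accuracy is 19/20, which matches the 95.00% test set accuracy reported for the optimized model.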

Spectral Analysis
C. nutans has a complex composition, including saponins, phenolic compounds, flavonoids, diterpenes, and phytosteroids. These substances have different hydrogen-containing groups and can produce specific absorption bands in the NIR spectrum (780-2526 nm), as shown in Figure 1. The peaks at 1452 nm and 1939

American Journal of Analytical Chemistry

Principal Component Analysis (PCA)
Due to collinearity between the NIR spectral signals, the information is redundant, as shown in Figure 1, and the spectra of the 81 samples differ only slightly. Therefore, it is necessary to reduce the dimensionality of the C. nutans NIR spectra to simplify the data. PCA is a statistical method for dimensionality reduction that uses an orthogonal transformation to convert a random vector with correlated components into a new random vector whose components are uncorrelated. This reduces the dimensionality of the multidimensional variable system so that it can be converted into a low-dimensional variable system with high precision (Zou et al., 2006). Figure 2 presents the PCA score plots of the NIR spectra of C. nutans. Figure 2(a) is a two-dimensional score plot for PC1 and PC2; it shows that the samples from the three locations were widely distributed. Compared to the C. nutans samples from Malaysia and Hainan, the samples from Guangxi were more concentrated. Figure 2(b) is a three-dimensional score plot of the first three principal components, showing the projection of the sample points in three-dimensional space.
The cumulative variance explained by the first three principal components was 95.52%, which indicates that the first three principal components reflect most of the characteristic information of the original spectra. The three-dimensional score plot shows that the C. nutans samples from Malaysia are the most dispersed, indicating a large intragroup difference among them. The samples from the three C. nutans seed locations exhibited large areas of overlap on the PCA score plots. Therefore, PCA alone cannot be used to make a clear judgment on the origin of C. nutans seeds, and further algorithmic processing of the C. nutans NIR spectra is needed in order to develop a model with high classification accuracy and good prediction accuracy.
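The idea of explained variance can be sketched with a dependency-free example that extracts only the first principal component by power iteration. This is an illustration under stated assumptions: the function name and toy data are hypothetical, the data are assumed not to be constant, and a real analysis of full spectra would normally rely on a numerical library instead.

```python
import math


def first_principal_component(X, n_iter=200):
    """Return (loading vector, scores, explained-variance ratio) for PC1."""
    n, m = len(X), len(X[0])
    # Mean-center every variable.
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    # Sample covariance matrix of the variables.
    cov = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m))
                     for i in range(m))
    total_variance = sum(cov[i][i] for i in range(m))
    scores = [sum(r[j] * v[j] for j in range(m)) for r in Xc]
    return v, scores, eigenvalue / total_variance


# Toy data lying exactly on a line, so PC1 should capture all of the variance.
toy_data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loading, pc1_scores, ratio = first_principal_component(toy_data)
```

For highly collinear data such as NIR spectra, a few components capture most of the variance in the same way, which is what the cumulative value for the first three components expresses.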

SVM Model Analysis
The SVM has many unique advantages in solving small sample, non-linear, and high-dimensional pattern recognition issues. Thus, the SVM algorithm was used in this study to analyse the NIR spectra of C. nutans, and the three parameter optimization algorithms GS, GA, and PSO were used to optimize the two SVM parameters c and g in order to establish a classification model for C. nutans seed origin with high accuracy and good predictability.
Data pre-processing is an important factor for improving prediction precision in qualitative analysis and modelling. The acquired spectra contain not only the original information of the samples to be tested but also various external interfering information, which can result in some degree of difference between the measured and true values [30]. In order to eliminate errors as much as possible, various data processing methods must be used to reduce the impact of interfering factors, thereby laying the foundation for subsequent data processing.
In this study, multiplicative scatter correction (MSC), standard normal variate transformation (SNV), first derivative, and second derivative were used for pre-treatment of the spectral data. Figure 3 shows the average pre-treated spectra of the C. nutans samples.
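The three simplest of these pre-treatments can be sketched as follows. This is a minimal illustration, not the authors' processing code: SNV is computed with the sample standard deviation, MSC uses the mean spectrum as the reference, and the first derivative is a plain finite difference, whereas the study does not state which derivative algorithm was used (smoothed derivatives such as Savitzky-Golay are common in practice). All function names and toy values are hypothetical.

```python
import math


def snv(spectrum):
    """Standard normal variate: center and scale each spectrum individually."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum."""
    n_points = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_points)]
    ref_mean = sum(ref) / n_points
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        # Least-squares fit s = intercept + slope * ref, then undo both terms.
        s_mean = sum(s) / n_points
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / sxx
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


def first_derivative(spectrum, step_nm=2.0):
    """Finite-difference first derivative on an evenly spaced wavelength grid."""
    return [(b - a) / step_nm for a, b in zip(spectrum, spectrum[1:])]


# Toy spectra (hypothetical): the second is the first scaled by a factor of two.
toy_spectra = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
snv_out = snv(toy_spectra[0])
msc_out = msc(toy_spectra)
deriv_out = first_derivative([1.0, 4.0, 9.0])
```

In the toy case MSC maps both spectra onto the same corrected curve, which shows how it removes purely multiplicative differences between samples.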
In order to compare the effects of different pre-treatment methods on the accuracy of the C. nutans seed origin model, SVM models with default c and g parameters (default value of c was 1, default value of g was 1/k, where k was the number of categories) were established for the four pre-treatment methods and compared with the model built on the original spectra. The modelling results in Table 2 show that the pre-treatment methods affect the results differently. Among them, spectra processed by the first derivative yielded the most effective model, with a training set accuracy of 93.44% and a test set accuracy of 85.00%.
After determining the best pre-treatment method, the parameters c and g were optimized using GS, GA, and PSO. The parameter optimization process and cross-validation results are shown in Figure 4, which includes the PSO optimization results. After 50 iterations, the cross-validation accuracy was stable at 97.62%, with an optimal penalty parameter c = 0.8343 and kernel function parameter g = 57.8741.
After optimizing c and g with the three optimization algorithms GS, GA, and PSO, the cross-validation accuracy was at least 96.72% in every case. In the next step, the optimal values of c and g were used to establish SVM models, and the test set accuracy was used to select the best SVM model. These results are shown in Table 3. The prediction accuracy of the SVM model was greatly improved after optimization of c and g. The prediction accuracy of the test sets for the three optimization algorithms reached 95.00%, of which PSO yielded the best accuracy. Because its value of the penalty parameter c was also the smallest, the parameter pair found by PSO was selected as the optimal parameters. The penalty parameter c = 0.8343 and kernel function parameter g = 57.8741 corresponded to the best SVM model for C. nutans seed origin, with a training set accuracy of 96.36% (60/61) and a test set accuracy of 95.00% (19/20). The specific results are represented by the confusion matrix in Figure 5.

Conclusion
In this study, a classification model for the origin of C. nutans seeds based on