Using Vegetation Indices as Input into Random Forest for Soybean and Weed Classification
1. Introduction
Weed management is a major component of a crop production system. To implement weed management strategies, managers need tools to help them distinguish crops from weeds and, at times, weeds from other weeds. The latter is important in determining which combination of chemical, mechanical, and/or biological control techniques to use. In 2012, agricultural producers in the United States (U.S.) spent $13.7 billion on agricultural chemicals [1]. Targeted chemical spraying reduces the amount of herbicide applied, thus decreasing cost and protecting the environment.
Palmer amaranth, redroot pigweed, and velvetleaf infestations occur in soybean fields throughout the eastern U.S. [2] [3] [4]. The weeds produce numerous seeds and establish large populations in soybean fields, thus reducing yields. Dense populations of Palmer amaranth and redroot pigweed can damage agricultural equipment during harvesting. Populations of Palmer amaranth and redroot pigweed have become resistant to some commonly used herbicides, such as glyphosate. Velvetleaf is also difficult to control with glyphosate. Therefore, managers need effective tools to identify Palmer amaranth, redroot pigweed, and velvetleaf so that they can implement the correct control strategy. Remote sensing systems measuring light reflectance properties of plants have shown good potential for crop and weed discrimination, including soybean and weed discrimination [5] [6]. The success of remotely sensed data for crop and weed discrimination depends on the spectral sensitivity of the recording device and the algorithms used to process the data.
Vegetation indices derived from various mathematical combinations (i.e., ratio, difference, normalized difference) of hyperspectral and multispectral data have shown promise as tools for agricultural applications, including determining canopy water content and water stress of crops [7] [8] [9], assessing insect and disease infestations [10] [11] [12] [13] [14], differentiating crops from weeds [15]-[20], and assessing nutrient status of plants [21] [22]. The advantages of using vegetation indices over single wavebands include enhancing differences in the spectral properties of plants, while diminishing the influence of relief, nonphotosynthesizing elements of plants, atmosphere, soil background, shadow, and viewing and illumination geometry on spectral data [23] [24]. Knowledge gaps remain on the use of vegetation indices for crop and weed discrimination and for weed to weed discrimination, especially in soybean production systems.
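The normalized difference form mentioned above can be illustrated with a minimal sketch. The function name and the reflectance values are hypothetical and are not data from this study; the NDVI band pairing (NIR and red) is the standard one.

```python
def normalized_difference(band_a: float, band_b: float) -> float:
    """Return the normalized difference (band_a - band_b) / (band_a + band_b)."""
    total = band_a + band_b
    if total == 0:
        raise ValueError("band sum is zero; the index is undefined")
    return (band_a - band_b) / total


# Hypothetical leaf reflectance values expressed as fractions (not measured data).
nir, red = 0.50, 0.05

# NDVI is the normalized difference of the near infrared and red bands.
ndvi = normalized_difference(nir, red)
print(round(ndvi, 3))
```

Because the difference is divided by the sum, the index is bounded between -1 and 1 for non-negative reflectance values, which is one reason normalized indices are less sensitive to overall brightness than single bands.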
Random forest has gained popularity as a tool in numerous disciplines, including remote sensing. It is a nonparametric ensemble method that uses numerous decision trees to predict the value (regression) or determine the class (classification) of an unknown sample [25], hence the name random forest. The algorithm selects a random subset of samples from the database provided by the analyst and then uses the samples to develop a decision tree. Sample selection is based on a bootstrap procedure with replacement. The same process is repeated for each tree in the forest; a different model is constructed for each tree.
Samples randomly selected to derive the model for each tree are referred to as “in-bag” samples, and the samples not used to create the model are called “out-of-bag” samples. Typically, about 2/3 and 1/3 of the samples serve as “in-bag” samples and “out-of-bag” samples, respectively. Because of the “in-bag” and “out-of-bag” concept, random forest does not need a separate testing set to evaluate the model.
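The approximate 2/3 to 1/3 split follows from sampling with replacement: for n samples, the expected fraction never drawn is (1 - 1/n)^n, which approaches about 0.368 as n grows. The sketch below checks this by simulation. The function name, seed, number of repetitions, and sample count (120, assuming four classes with 30 replicates each) are illustrative assumptions.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 120                  # assumed: 4 plant classes x 30 replicates

# Repeat the bootstrap many times and average the out-of-bag fraction.
oob_fractions = []
for _ in range(500):
    in_bag, out_of_bag = bootstrap_split(n, rng)
    oob_fractions.append(len(out_of_bag) / n)

mean_oob = sum(oob_fractions) / len(oob_fractions)
print(f"mean out-of-bag fraction: {mean_oob:.3f}")
```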
To test classification or prediction accuracy, each sample is classified or predicted using only the trees for which it was “out-of-bag”, that is, the trees that did not use the sample for model development. The vote of each tree is tallied; then the sample is assigned the class for which it received the most votes. At each split within a tree, a random subset of the variables is considered. Also, a variable importance ranking is derived with the algorithm. Analysts can evaluate the ranking and identify variables that are relevant to the model for prediction or classification problems, leading to a reduction in the number of variables used for the analysis.
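The out-of-bag voting step can be sketched as follows. This is a simplified illustration rather than the internals of any particular random forest implementation; the function name and the vote data are hypothetical.

```python
from collections import Counter


def oob_predict(tree_votes, in_bag_flags):
    """Majority vote over the trees for which the sample was out-of-bag.

    tree_votes[t]   -- class predicted by tree t for the sample
    in_bag_flags[t] -- True if the sample was used to build tree t
    Returns None when the sample was in-bag for every tree.
    """
    votes = [vote for vote, in_bag in zip(tree_votes, in_bag_flags) if not in_bag]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


# Hypothetical votes from five trees for one sample.
votes = ["soybean", "velvetleaf", "soybean", "soybean", "velvetleaf"]
in_bag = [True, False, False, True, False]

# Only trees 2, 3, and 5 may vote: velvetleaf, soybean, velvetleaf.
predicted = oob_predict(votes, in_bag)
print(predicted)
```

Note that a simple majority over all five trees would have returned soybean; restricting the vote to out-of-bag trees is what makes the resulting error estimate independent of the data used to build each tree.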
Random forest ranks very well among other classifiers; recently, it was ranked in the top ten of 100 classifiers tested for classification purposes [26]. It is capable of processing large datasets, can analyze numerous variables without deletion, is robust for analysis of datasets with missing values, and can process unbalanced datasets. Models derived for classification can be used on other datasets. Light reflectance data recorded from sensors on-board satellite, airborne, and ground-based platforms have shown promise as input data for random forest to use for classification [27] [28] [29] [30] and regression [31] [32] problems.
Currently, no information is available on using vegetation indices derived from multispectral data as input into random forest for soybean and weed discrimination. The objective of this study was to evaluate normalized difference vegetation indices as input into random forest to differentiate soybean and three broadleaf weeds: Palmer amaranth, redroot pigweed, and velvetleaf. The study focused on leaf multispectral reflectance data of simulated WorldView-3 satellite sensor bands.
2. Materials and Methods
2.1. Experimental Setup
Two experiments were completed in 2014 in which 30 replicates of soybean variety 4928LL (LL = LibertyLink), Palmer amaranth, redroot pigweed, and velvetleaf were grown in a greenhouse located at USDA-ARS, Stoneville, MS. Planting dates were June 13, 2014, and August 28, 2014, for experiments one and two, respectively. Seeds of the different plants (i.e., soybean variety 4928LL, Palmer amaranth, redroot pigweed, and velvetleaf) were sown into plugs. After emergence, the plants were transferred to two-liter pots. All plant species were exposed to a fourteen-hour photoperiod, and light was supplemented at the beginning and end of the day with sodium vapor lamps. Day/night temperatures of the greenhouse were maintained at 28°C/24°C ± 3°C, respectively. The soybean variety used in the study had an indeterminate growth habit and gray pubescence and was assigned to maturity group 4.9. The weed seeds were obtained from a seed bank maintained at the laboratory.
2.2. Leaf Reflectance Measurements
Leaf reflectance measurements were obtained with a plant contact probe attached to a spectroradiometer (FieldSpec 3, ASD Inc., Boulder, Colorado) sensitive to a spectral range of 350 to 2500 nm. The contact probe has its own light source, allowing the user to collect data at any time during the day or night. Reflectance measurements were collected from the most recently matured leaf of each plant; each recorded measurement was the average of fifteen readings. The spectroradiometer was calibrated at fifteen-minute intervals with a white Spectralon panel. Leaf reflectance measurements were collected before the plants reached 1 foot in height because the goal of weed management strategies is to detect and kill weeds in the vegetative growth stages, before the seeds reach full maturity.
2.3. Post Processing of Reflectance Measurements
Multispectral data simulating the spectral bands of the WorldView-3 satellite sensor were derived from the hyperspectral spectroradiometer data (Table 1). These spectral bands were chosen because they represented data in the visible, red edge, near infrared, and shortwave infrared regions of the light spectrum. Twelve vegetation indices were created with the multispectral bands (Table 2). The selected indices have shown good potential for assessing leaf pigments [33]-[38], leaf internal structure [36], and leaf water content [39].
Table 1. Sixteen spectral bands simulating WorldView-3 satellite sensor spectral bands. They were used as input variables into cforest for soybean and weed discrimination.
Table 2. Vegetation indices used as input into cforest for soybean and weed discrimination.
a C = coastal, B = blue, G = green, R = red, RE = red edge, NIR = near infrared, and SWIR = shortwave infrared.
Broadband and narrowband data have been used to develop vegetation indices. Therefore, the simulated bands with center wavelengths closest to the center wavelengths used in other studies were employed in developing each vegetation index. In some cases, two versions of an index were developed because two simulated band centers were equidistant from the band center used by other investigators. These include the advanced normalized difference vegetation index (ANVI), shortwave infrared water stress index (SIWSI), and structure insensitive pigment index (SIPI). The hsdar package (hyperspectral data analysis in R) of R [40] and base R packages [41] were used to develop the sixteen spectral bands and the twelve vegetation indices, respectively.
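The idea of deriving a broad multispectral band from narrowband spectroradiometer data can be sketched with a simple unweighted average over a band's wavelength range. This is an assumption made for illustration only: the study used the hsdar package in R, which may weight wavelengths by a sensor response function rather than averaging them equally. The function name, the band limits, and the synthetic spectrum below are all hypothetical.

```python
def simulate_band(wavelengths_nm, reflectance, lower_nm, upper_nm):
    """Average reflectance over [lower_nm, upper_nm] (unweighted boxcar)."""
    values = [r for w, r in zip(wavelengths_nm, reflectance)
              if lower_nm <= w <= upper_nm]
    if not values:
        raise ValueError("no wavelengths fall inside the requested band")
    return sum(values) / len(values)


# Synthetic spectrum at 1 nm steps over the instrument range of 350-2500 nm.
# The linear ramp is made up so that the result is easy to verify by hand.
wavelengths = list(range(350, 2501))
reflectance = [0.05 + 0.0001 * (w - 350) for w in wavelengths]

# Illustrative band limits in nm (assumed, not taken from Table 1).
band_limits = {"R": (630, 690), "NIR": (770, 895)}

simulated = {name: simulate_band(wavelengths, reflectance, lo, hi)
             for name, (lo, hi) in band_limits.items()}
print(simulated)
```

For a linear spectrum, the band average equals the reflectance at the band center, which gives a quick sanity check on the averaging step.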
2.4. Classification
The conditional inference version of random forest (cforest) was used for the classifications. Researchers have suggested using the cforest version of random forest if the data are highly correlated [42]. Cforest is more stable in deriving variable importance values in the presence of highly correlated variables, thus providing better accuracy in calculating variable importance [42].
Differences of cforest compared with the original version of random forest are as follows. Cforest employs conditional inference trees as base learners; random forest uses classification and regression trees as base learners [42] [43]. Cforest develops unbiased decision trees based on subsampling without replacement instead of using bootstrap samples. The algorithm uses the conditional permutation scheme described by Strobl et al. [42] [47] to determine variable importance ranking.
Two parameters were adjusted to develop the cforest models: mtry and ntree. The former represents the number of randomly selected variables considered when splitting a node; the latter is the number of trees grown.
Model stability was tested using the following procedure [6] [44]: 1) develop the model using the default mtry (5) and ntree (500) values; 2) tabulate and review the variable importance rankings; 3) change the starting seed (i.e., the value that initializes the random number generator used for sampling) and rerun the model using the default mtry and ntree values; 4) accept the model if the variable importance rankings are consistent between the first and second runs; if not, proceed to step five; 5) increase ntree by 500 and rerun the model following steps one through four. The procedure was repeated until the variable importance rankings were consistent between the first and second runs. The party package [45] [46] [47] of R was used to complete the classifications and to obtain the variable importance values.
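The stability procedure can be expressed as a short loop. The sketch below is not the code used in this study, which relied on the party package in R. Here `rank_variables` is a hypothetical callable standing in for fitting a forest with a given ntree and seed and returning the ranked variable names; `toy_ranker` is a made-up stand-in whose ordering is not a result of this study; and the `max_ntree` cap is an added safeguard that is not part of the published procedure.

```python
from typing import Callable, List, Tuple


def stabilize_ntree(rank_variables: Callable[[int, int], List[str]],
                    start_ntree: int = 500,
                    step: int = 500,
                    max_ntree: int = 5000,
                    seeds: Tuple[int, int] = (1, 2)) -> Tuple[int, List[str]]:
    """Increase ntree until two runs with different seeds give the same ranking."""
    ntree = start_ntree
    while ntree <= max_ntree:
        first = rank_variables(ntree, seeds[0])   # steps 1-2: fit and rank
        second = rank_variables(ntree, seeds[1])  # step 3: new seed, rerun
        if first == second:                       # step 4: accept if consistent
            return ntree, first
        ntree += step                             # step 5: add trees and repeat
    raise RuntimeError("rankings did not stabilize within max_ntree")


def toy_ranker(ntree: int, seed: int) -> List[str]:
    """Made-up stand-in: rankings depend on the seed until ntree reaches 1500."""
    ranking = ["index_a", "index_b", "index_c"]
    if ntree < 1500 and seed % 2 == 0:
        return [ranking[0], ranking[2], ranking[1]]
    return ranking


stable_ntree, stable_ranking = stabilize_ntree(toy_ranker)
print(stable_ntree, stable_ranking)
```

Comparing full rankings for exact equality is the strictest reading of "consistent"; a looser criterion, such as agreement among only the top-ranked variables, would be a different design choice.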
For each date, two datasets were evaluated as input into the cforest algorithm: 1) twelve vegetation indices dataset and 2) sixteen band multispectral dataset. Overall, user’s, and producer’s accuracies, and the kappa coefficient, were tabulated to compare accuracies of the classifications [48] .
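The accuracy measures named above can be computed from a confusion matrix as sketched below. The matrix values and class names are made up for illustration and are not the results in Table 3; the layout (rows as classified data, columns as reference data) is an assumed convention.

```python
def accuracy_metrics(matrix, labels):
    """Overall, user's, and producer's accuracies plus kappa.

    matrix[i][j] = number of samples classified as labels[i] whose
    reference class is labels[j] (rows = classified, columns = reference).
    """
    k = len(labels)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(matrix[i]) for i in range(k)]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    diagonal = [matrix[i][i] for i in range(k)]

    overall = sum(diagonal) / n
    # Chance agreement expected from the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (overall - expected) / (1 - expected)

    users = {labels[i]: diagonal[i] / row_totals[i] for i in range(k)}
    producers = {labels[i]: diagonal[i] / col_totals[i] for i in range(k)}
    return overall, kappa, users, producers


# Hypothetical three-class confusion matrix (not data from this study).
labels = ["class_a", "class_b", "class_c"]
matrix = [
    [30, 0, 0],
    [0, 25, 5],
    [0, 5, 25],
]

overall, kappa, users, producers = accuracy_metrics(matrix, labels)
print(f"overall = {overall:.3f}, kappa = {kappa:.3f}")
```

User's accuracy divides correct classifications by the number of samples assigned to a class, whereas producer's accuracy divides by the number of reference samples in that class. As a rough consistency check, if each of the four classes has 30 reference samples, chance agreement is 0.25, so an overall accuracy of 109/120 (90.8%) corresponds to a kappa of about 0.878.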
3. Results
Identical accuracy results were obtained for the vegetation indices and the multispectral data classifications for the June 30, 2014 dataset (Table 3). Overall accuracy and the kappa coefficient were 90.8% and 0.878, respectively. User’s and producer’s accuracies ranged from 76.7% to 100%. The highest user’s and producer’s accuracies were obtained for the soybean class. The lowest user’s and producer’s accuracies were observed for the redroot pigweed and Palmer amaranth classes, respectively.
For the September 17, 2014 dataset, the vegetation indices achieved higher classification accuracies than the multispectral dataset, with the differences ranging from 0.7% to 3.4% for user’s, producer’s, and overall accuracies (Table 3). Similar trends were observed in the classification accuracies of the vegetation indices and multispectral data. The best classification accuracies were achieved for the soybean class, and velvetleaf tied with soybean for the highest producer’s accuracy. The lowest user’s and producer’s accuracies were observed for the redroot pigweed and Palmer amaranth classes, respectively.
Variable importance rankings indicated that eight and nine of the twelve indices were important to the classification of the June and September vegetation indices datasets, respectively (Figure 1). SIWSI2 was ranked as the most important vegetation index used by the classification models. Its variable importance score was approximately 1.5 times greater than the score of the second-ranked vegetation index.
Table 3. Accuracy results for the cforest classifications for each dataset.
The variable importance rankings for the multispectral datasets are summarized in Figure 1. Shortwave infrared bands had strong to moderate variable importance rankings for the June dataset. NIR2, G, RE, and NIR1 bands had moderate to low variable importance scores. All of the spectral bands were important to the September 17, 2014 multispectral dataset classification model because their variable importance scores were distinguishable from the zero line. The highest and lowest rankings were assigned to the G and C bands, respectively. Finally, more trees were needed to obtain stable variable importance rankings for the multispectral datasets than for the vegetation indices datasets (Table 4). This could be attributed to the multispectral datasets having more variables than the vegetation indices datasets, thus requiring more trees for the rankings to stabilize.
Figure 1. Variable importance rankings for vegetation indices and multispectral data used as input into cforest for classification.
Table 4. Parameters used for cforest classifications.
a mtry = number of predictors to use when splitting a node; ntree = number of trees grown.
4. Discussion
Using vegetation indices as input variables for soybean and weed discrimination, cforest achieved classification accuracies that were equivalent to or slightly better than classification accuracies obtained with multispectral bands as input variables (Table 3). Kappa coefficients for the vegetation indices classifications indicated almost perfect agreement (values within the range of 0.81-1.00 [49]) between reference and predicted data. For the multispectral datasets, agreement between reference and predicted data was almost perfect for June 30, 2014 and substantial (values within the range of 0.61-0.80 [49]) for September 17, 2014. Errors for the soybean and velvetleaf classes were attributed to the former being misclassified as the latter and vice versa. The cforest algorithm using vegetation indices or multispectral bands as input had problems distinguishing between Palmer amaranth and redroot pigweed, suggesting that the two species be combined into one class.
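The kappa interpretation used above can be written as a simple lookup. The two upper categories and their ranges come from the text; the lower categories are filled in from the widely used Landis and Koch agreement scale, on the assumption that this is the scale cited as [49]. The function name is hypothetical.

```python
def kappa_agreement(kappa: float) -> str:
    """Map a kappa coefficient to a qualitative agreement category."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0.0:
        return "poor"
    # Upper bound of each category, in ascending order.
    categories = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in categories:
        if kappa <= upper:
            return label
    return "almost perfect"  # not reached; kept as a safe fallback


# 0.878 is the kappa coefficient reported for the June 30, 2014 dataset.
print(kappa_agreement(0.878))
```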
Consistently, indices derived with the SWIR and NIR bands, the G and NIR bands, and the G and R bands were considered important in separating the plant species. SIWSI1 and SIWSI2 were ranked in the top three vegetation indices for both datasets (Figure 1). Those indices are sensitive to moisture content in plant leaves [50] [51], suggesting leaf succulence was important in crop and weed separation and in weed to weed separation. The SWIR band serves as a measure of water content, leaf internal structure, and dry matter, whereas the NIR band serves as a measure of only leaf internal structure and dry matter [50]. When combined into an index, SIWSI provides a measurement of leaf water content expressed as equivalent water thickness.
Also, GRNDVI was ranked within the top three indices for each date, supporting the role that plant pigments played in separating the plant species. Other pigment indices having relevant variable importance values on both dates were GNDVI and NPCI. NDVI, the most widely used vegetation index throughout the literature [52] [53] [54], was ranked eighth and eleventh according to the variable importance tabulated for the June 30, 2014 and September 17, 2014 datasets, respectively, indicating its weak relevance for discriminating between the plant species.
The vegetation indices were derived using the multispectral bands (Table 1). Thus, the variable importance rankings of the former can be understood by evaluating the variable importance rankings of the latter. SWIR bands 2-8 were essential for differentiating between the plant species. Therefore, vegetation indices derived with one of those bands would be ranked higher on the variable importance scale than a vegetation index excluding them. The findings of this study are similar to those of [5], which indicated shortwave infrared data were important for soybean and weed discrimination. That study focused on discriminating the following weeds from soybean: hemp sesbania [Sesbania exaltata (Raf.) Rydb. ex A.W. Hill], palm leaf morning glory (Ipomoea wrightii Gray), pitted morning glory (Ipomoea lacunosa L.), prickly sida (Sida spinosa L.), sicklepod [Senna obtusifolia (L.) H.S. Irwin and Barneby], and small flower morning glory [Jacquemontia tamnifolia (L.) Griseb.].
Overall, this study demonstrated that vegetation indices derived from leaf reflectance data can be used as input into cforest for differentiating soybean, velvetleaf, and two pigweeds. It is important to note that the data were collected using pure leaf spectra and not canopy spectra. For canopy spectra, some differences will occur due to in-canopy shadowing, leaf orientation, and differences in leaf area. One of the strengths of vegetation indices is that they are influenced less by shadowing and sensor viewing angle than multispectral bands, indicating the benefit of using them at the canopy level.
5. Conclusion
Vegetation indices derived from spectral data simulating WorldView-3 bands showed promise as input variables into cforest for discriminating soybean and three weed species. Indices that included shortwave infrared bands were important for plant species differentiation. Also, the analyst should consider combining redroot pigweed and Palmer amaranth into one class, pigweed. If different herbicides were going to be used to treat the pigweeds and the velvetleaf plants, the results indicate that the cforest algorithm could use vegetation indices derived from leaf light reflectance data to accurately identify pigweeds, velvetleaf, and soybean plants. This research further supports using vegetation indices and machine learning algorithms such as cforest as decision support tools for weed and crop differentiation.
Acknowledgements
The author is grateful to Dr. Vijay Nandula for supplying the velvetleaf seed, Dr. Neal Teaster for providing the pigweed seeds, and Mr. Milton Gaston Jr., Mr. Arrington Smith, Ms. Keysha Hamilton, Ms. Kenyattia Andersen, Mr. David Fisher, Ms. Raven Thompson, and Ms. Keyanna Nealon for their assistance in data collection. Mention of trade names or commercial products in this report is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.