Decision Trees as a Tool to Select Sugarcane Families

New strategies are required in the sugarcane selection process to optimize the genetic gains in breeding programs. Conventional selection strategies have the disadvantage of requiring the weighing of all the plants in a plot or a sample of stalks and the counting of the number of stalks in all the experimental plots, which cannot always be performed because more than 200,000 genotypes routinely comprise the first test phase (T1) of most sugarcane breeding programs. One way to circumvent this problem is to use decision trees to rank the yield components (the stalk height, the stalk diameter and the number of stalks) and to subsequently use this categorization to select the best families for a specific trait. The objective of this study was to evaluate the categorization of yield components using the classification and regression tree (CART) algorithm as a family selection strategy by comparing the performance of CART with those of conventional methods that require the weighing of stalks, such as the best linear unbiased prediction (BLUP) with sequential (BLUPS) or individual simulated (BLUPIS) procedures. Data from five experiments performed in May 2007 in a randomized block design were analyzed. Each experiment consisted of five blocks, 22 families and two controls (commercial varieties). CART effectively defined the classes of the yield components and selected the best families with an accuracy of 74% compared to BLUPS and BLUPIS. Families with at least 11 stalks per linear meter of furrow resulted in productivities that were above the average productivity of the commercial varieties used in this study and are, therefore, recommended for selection.


Introduction
Genetic breeding programs are central to the sugarcane agribusiness. The use of novel cultivars can increase the average productivity of the Brazilian sugar and alcohol sector and improve the quality of the raw materials used in the production of sugar and ethanol [1].
Sugarcane genetic breeding programs usually consist of three test phases (T1, T2 and T3), an experimental phase (EP) and a multiplication phase [2]. Briefly, the first plant selections are performed in the T1 phase. A clone selected in this phase is cultivated in the subsequent phases through vegetative propagation.
The clones are planted in experimental designs with replicates to identify potentially superior clones. After 8 to 10 years of evaluation, the best clones are used in final evaluation experiments (EP) in different locations, wherein the clones are evaluated for 3 to 5 harvests.
Although individual visual selection is routinely applied in the early phases of breeding programs [1] [3], this type of selection has been criticized as inefficient because of the absence of replicates, plant competition and confounding environmental effects [4]. The aforementioned authors have advocated the use of family selection followed by individual selection to produce greater gains than those obtained via mass selection, especially for low-heritability traits.
Along this line of thought, some breeding programs have prioritized family selection followed by individual selection to find superior clones [3] [4] [5]. This strategy is motivated by the higher likelihood of finding individuals with favorable traits in families with high genotypic values [5].
Reference [6] has shown that predicting genotypic values using the best linear unbiased predictor (BLUP) at the individual level (BLUPI) is the optimal sugarcane selection strategy. This procedure simultaneously uses information from families and individuals within families for selection. However, this procedure is seldom used in breeding programs because of the difficulty of collecting data on an individual level. Some strategies to overcome this practical problem have been reported in the literature. Reference [3] developed what we shall call sequential BLUP (BLUPS).
Families are ranked according to the trait being evaluated (usually tons of stalks per hectare, TSH), and the selection is performed for 40% of the families. The families comprising the 40% with the highest mean TSH are split into four groups. In the group of families with the highest means, 40% of the individuals from each family are selected, and in the other three groups, 30%, 20% and 10% of individuals are selected from each family. Reference [7] proposed the selection of families with genotypic values greater than the overall mean, followed by the simulation of the number of individuals to be selected in each family according to the ratio between the genotypic values of the families and the number of individuals to be selected in the best family. The latter procedure is termed BLUP individual simulated (BLUPIS).
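The BLUPS allocation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `blups_allocation`, the use of a plain dictionary of family means, and the rounding of the number of selected families are all assumptions made for the example.

```python
def blups_allocation(family_means, family_rate=0.40,
                     within_rates=(0.40, 0.30, 0.20, 0.10)):
    """Map each selected family to its within-family selection rate.

    family_means: dict of family label -> mean TSH (hypothetical input format).
    The top `family_rate` share of families is kept and split, in rank order,
    into len(within_rates) groups of (nearly) equal size.
    """
    ranked = sorted(family_means, key=family_means.get, reverse=True)
    n_selected = round(family_rate * len(ranked))  # rounding is an assumption
    selected = ranked[:n_selected]
    n_groups = len(within_rates)
    group_size = -(-n_selected // n_groups)  # ceiling division
    allocation = {}
    for position, family in enumerate(selected):
        allocation[family] = within_rates[position // group_size]
    return allocation


# Toy data: 110 families with strictly decreasing means (illustrative only).
toy_means = {f"F{i}": 200.0 - i for i in range(110)}
toy_allocation = blups_allocation(toy_means)
```

With 110 families, this sketch keeps 44 families and forms four groups of 11, which matches the group size reported for this study.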
The difficulties encountered in using BLUPS [3] and BLUPIS [7] in inter-family and intra-family selection are related to the large volume of data that must be collected and the logistics that are required for timely data collection and processing to perform the selection because the data are collected at the end of the crop cycle. At least one representative sample of stalks from each experimental plot must be weighed to use these methods. The difficulties in finding skilled labor and operating costs often restrict the number of families that can be evaluated in the field.
Alternative data collection methods have been sought to streamline the family selection process by circumventing having to weigh plants from all the plots. Thus, defining classes (categorization) for the crop yield components (the number of stalks, the stalk diameter and the stalk height) could significantly reduce the time expended on data collection, provided that the classes are properly defined and experimentally validated. Decision trees can be used to categorize the yield components, specifically by using the classification and regression trees (CART) algorithm [4], which is a statistical method potentially useful for identifying families with the highest yield potential by combining classes of variables.
CART involves non-parametric statistical methods that are used in data partitioning through specific rules performed by binary divisions [8]. The objective of this technique is to describe the variability in the dependent variable as a function of the independent variables through binary divisions [9]. Reference [10] has argued that the advantage offered by this technique is that the algorithm evaluates all the possible predictors and divisions. Furthermore, the algorithm may be applied to other data sets that include the same variables used in designing the decision tree.
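To make the idea of a binary division concrete, the sketch below searches for the single threshold on one predictor that minimizes the within-node sum of squared errors, which is the splitting criterion commonly used for regression trees. It covers one predictor and one split only, with no pruning; the function name `best_split` and the toy NS/TSH values are hypothetical and chosen purely for illustration.

```python
def best_split(x, y):
    """Return (threshold, sse) for the split `x < threshold` with minimal SSE.

    Candidate thresholds are midpoints between consecutive distinct values
    of the predictor x; y is the response.
    """
    def sse(values):
        # Sum of squared deviations from the node mean.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # equal predictor values cannot be separated
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2.0
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Toy data (not from the study): stalks per plot and yield in TSH.
toy_ns = [95, 102, 108, 113, 120, 131]
toy_tsh = [120, 126, 131, 150, 158, 166]
toy_threshold, toy_sse = best_split(toy_ns, toy_tsh)
```

A full CART implementation applies this search recursively over all predictors and then prunes the resulting tree.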
The objective of this study was to examine the efficiency of categorizing sugarcane yield components using the CART algorithm for sugarcane family selection to further the development of alternative data collection methods and reduce costs in the initial phase (T1) of sugarcane breeding programs. The efficiency of the algorithm was measured by comparison with the selection performed using the conventional procedures, i.e., BLUPS and BLUPIS.
Each plot consisted of 20 plants, which were distributed in two 5-m-long furrows, 1.40 m apart, totaling 12,000 plants. Each family was thus represented by 100 genotypes, which is considered to be a sufficient number for selection within the best families [11]. Agronomic practices, including weed control and soil fertilization, were those usually adopted for this crop at the experimental station. The field was not irrigated.

Data Collection
In 2009, the mean stalk height (SH) and stalk diameter (SD) of all plots of the five experiments were assessed. Stalk height (SH) was measured in meters for one stalk from each clump in the plot from the base to the first visible dewlap.
Stalk diameter (SD) was measured in centimeters using a digital caliper in the third internode from the stalk base to the apex of one stalk per clump in the plot.
In addition to the stalk height and diameter, the total number of stalks per plot (NS) was also counted.
The total plot mass (TPM), in kg, was determined by weighing all the stalks using a dynamometer. The stalk productivity, in tons of sugarcane per hectare (TSH), was estimated using the equation $\mathrm{TSH} = 10 \times \mathrm{TPM} / \mathrm{PA}$, where TPM is the total mass of the plot in kg, and PA is the plot area in m². In the present study, PA = 14 m².
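The factor of 10 in the conversion follows from the units: 1 kg per m² equals 10,000 kg per hectare, that is, 10 t per hectare. A minimal sketch of the conversion is given below; the function name `tsh_from_plot` and the 210 kg plot mass are hypothetical.

```python
def tsh_from_plot(tpm_kg, plot_area_m2=14.0):
    """Convert total plot mass (kg) to tons of stalks per hectare (TSH).

    1 kg/m^2 = 10 t/ha, hence the factor of 10.
    The default plot area of 14 m^2 is the value reported in this study.
    """
    return 10.0 * tpm_kg / plot_area_m2


# Hypothetical plot weighing 210 kg on a 14 m^2 plot.
example_tsh = tsh_from_plot(210.0)  # 150.0 t/ha
```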

Selection Using CART
In this study, regression trees were used to create classes for the three yield component variables. Only the SH, SD, NS and TSH data of the controls that were tested in the experiments, totaling 50 observations, were used in designing the regression trees. However, since regression trees may be incorrectly generated or, in an extreme case, not generated at all if the number of observations is too small, we decided to also simulate control data prior to using the CART algorithm, a procedure known as "data synthesis". The use of synthetic data to increase the amount of data available for comparing statistical methods or techniques has been reported in other research works [12] [13] [14].
The simulation was performed based on the covariance matrix $\Sigma$ (4 × 4, positive definite) of the variables TSH, NS, SH and SD of two of the controls that were used in all five experiments. In the simulation algorithm, the Cholesky decomposition of the covariance matrix $\Sigma$ was used to generate $\Sigma = CC'$, where $C$ is an $m \times m$ lower triangular matrix known as the Cholesky factor. A multivariate normal vector $X = CZ + \mu$ was simulated, where $\mu$ is the mean vector of the controls, $C$ is the Cholesky factor derived from the covariance matrix $\Sigma$, and $Z$ is a vector of independent and identically distributed (IID) random variables with a standard normal distribution. This procedure was used to generate 1000 row vectors of the type $[X_{i1}\ X_{i2}\ X_{i3}\ X_{i4}]$, wherein $X_{ij}$ (i = 1 to 1000, and j = 1 to 4) represents the simulated value of the variable j (TSH, NS, SD or SH) for individual i. The algorithm presented ensured that these four variables had a covariance matrix $\Sigma$ and a mean vector $\mu$ [15] [16] [17].
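The simulation step can be sketched with the standard library alone. The code below computes the lower triangular Cholesky factor and draws vectors X = CZ + μ. The TSH and NS means are the control means reported in this paper; the remaining means and the entire covariance matrix are hypothetical placeholders, since the estimated values are not given in the text. The names `cholesky`, `simulate`, `MU` and `SIGMA` are likewise assumptions.

```python
import math
import random


def cholesky(sigma):
    """Return the lower triangular factor C such that sigma = C C'."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c


def simulate(mu, sigma, n_draws, seed=0):
    """Draw n_draws vectors X = C Z + mu with Z standard normal and IID."""
    rng = random.Random(seed)
    c = cholesky(sigma)
    m = len(mu)
    draws = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        x = [mu[i] + sum(c[i][k] * z[k] for k in range(i + 1))
             for i in range(m)]
        draws.append(x)
    return draws


# Variable order: TSH, NS, SH, SD.
# Means for TSH and NS come from the paper; SH and SD means are hypothetical.
MU = [145.81, 111.74, 2.8, 2.6]
# Hypothetical positive definite covariance matrix (illustration only).
SIGMA = [
    [400.0, 150.0, 0.5, 1.0],
    [150.0, 225.0, -0.3, 0.2],
    [0.5, -0.3, 0.04, 0.005],
    [1.0, 0.2, 0.005, 0.09],
]
synthetic = simulate(MU, SIGMA, 1000)
```

Because C C' reproduces Σ, the simulated vectors have mean μ and covariance Σ in expectation; any finite sample only approximates them.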
The generated data were subsequently subjected to the standard CART algorithm procedure. The NS distribution is discrete (Poisson) and is characterized by a parameter λ = mean number of stalks per plot; however, this distribution can be approximated by a normal distribution [18] because λ is relatively large (mean = 111.74). Thus, the simulated NS value was rounded to the nearest integer. Tree pruning was performed according to the 1-SE rule [8] and 10-fold cross-validation [19] methods to generate more accurate estimates, to reduce overfitting and to facilitate the interpretation of results. In summary, regression trees were obtained using simulated data based on the control observations (1000 observation vectors), and pruned (according to the 10-fold cross-validation and the 1-SE rule methods) and unpruned regression trees were obtained using non-simulated data (50 observation vectors).
Combinations of variables that could produce TSH levels higher than the mean productivity of both controls were located in the generated trees to obtain a clone selection cutoff point. The intra-family selection procedure was subsequently defined as follows: the selected families were split into three classes to define the number of individuals to be selected in each family, as indicated by the CART algorithm. The classes were defined based on the number of replicates (plots) in which the family was selected by the algorithm. The family was selected in each plot (replicate) when the combination of variables used in the classifier met the selection criterion defined by the designed regression tree. The first class consisted of the families selected by CART in all five plots (or replicates) of the family. The second class consisted of the families selected in four replicates. Finally, the third class consisted of the families selected in three replicates. Thus, for the intra-family selection, 30% of the individuals from each family were selected in the best class, followed by 20% and 10% of the individuals from each family in the second and third classes, respectively. Note that other ratios could have been chosen, which could modify the results presented here.
The choice reported herein was based on the aforementioned BLUPS procedure.
In future studies, we will analyze the best selection ratio within our proposed use of CART.
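The replicate-based classes can be expressed as a small lookup. The sketch below assumes that a plot counts as selected when its total number of stalks exceeds the cutoff of 110.5 stalks per plot reported for the tree built on non-simulated data; the function names and the example plot counts are hypothetical.

```python
# Within-family selection rate by number of replicates in which CART
# selected the family (five, four or three of the five plots).
RATE_BY_REPLICATES = {5: 0.30, 4: 0.20, 3: 0.10}


def replicates_selected(ns_by_plot, cutoff=110.5):
    """Count the plots whose total number of stalks exceeds the cutoff."""
    return sum(1 for ns in ns_by_plot if ns > cutoff)


def within_family_rate(ns_by_plot, cutoff=110.5):
    """Return the share of individuals to select, or 0.0 if not selected."""
    return RATE_BY_REPLICATES.get(replicates_selected(ns_by_plot, cutoff), 0.0)


# Hypothetical family: stalk counts in its five plots.
example_rate = within_family_rate([112, 115, 120, 111, 100])  # four plots pass
```

With 100 genotypes per family, a rate of 0.20 would correspond to 20 individuals.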

Selection Using BLUPS and BLUPIS
The TSH data were analyzed using restricted maximum likelihood (REML)/BLUP mixed models and a statistical model associated with genotype assessment in an incomplete block design at the plot-means level by considering the matrix equation [6] $y = Xr + Zg + Wb + e$. The variance components $\sigma_g^2$, $\sigma_b^2$ and $\sigma_e^2$ correspond to the genotypic variance, the block variance and the residual variance, respectively.
The selection in the BLUPS procedure was performed following the strategy used by the Australian breeding program [3] to select 40% of the families tested. In the BLUPIS procedure, the families with TSH means higher than the overall mean were selected [7]. The number of individuals selected from each family k (k = 1 to 52) was calculated using $n_k = (\hat{g}_k / \hat{g}_j)\, n_j$, wherein $\hat{g}_k$ refers to the estimated genotypic value of family k, $\hat{g}_j$ refers to the estimated genotypic value of the best family, and $n_j$ is the number of individuals selected from the best family. In the present study, $n_j$ = 27 individuals were selected from the best family. A mixed models analysis was performed using the SELEGEN-REML/BLUP software [20].
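The BLUPIS allocation scales the number of individuals taken from each family by the ratio of its estimated genotypic value to that of the best family. The sketch below makes two assumptions that the text does not state: the genotypic values are treated as effects expressed as deviations from the overall mean (so that selected families have positive values), and the result is rounded to the nearest integer. The function name and the example values are hypothetical.

```python
def blupis_counts(genotypic_effects, n_best=27):
    """Number of individuals to select per family under BLUPIS.

    genotypic_effects: dict of family -> estimated genotypic effect,
    assumed to be a deviation from the overall mean.
    n_best: individuals selected from the best family (27 in this study).
    """
    selected = {f: g for f, g in genotypic_effects.items() if g > 0}
    if not selected:
        return {}
    g_best = max(selected.values())
    # Rounding to the nearest integer is an assumption of this sketch.
    return {f: round(n_best * g / g_best) for f, g in selected.items()}


# Hypothetical genotypic effects in TSH units.
example_counts = blupis_counts({"A": 30.0, "B": 20.0, "C": 10.0, "D": -4.0})
```

In this toy example, family D is below the overall mean and is therefore not selected.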

Comparison between BLUPS, BLUPIS and CART
Confusion matrices were generated for each tree to facilitate the visualization of the similarities and differences among the selection methods BLUPS and BLUPIS (which were considered as conventional methods and, therefore, considered correct and were subsequently used for comparison purposes) and CART (the method being tested) (Figure 1).
This confusion matrix was used to calculate four useful statistical parameters to assess the applicability of the selection method: 1) the choice accuracy (CAc), where CAc = (A + D)/$T_{ABCD}$ and $T_{ABCD}$ = A + B + C + D; 2) the apparent error rate (AER), where AER = 1 − CAc; 3) the selection precision (SeP), where SeP = A/$T_{AC}$ and $T_{AC}$ = A + C; and 4) the error of omission (EOm), where EOm = 1 − SeP.
Figure 1. Schematic of a confusion matrix showing frequencies of occurrence (A, B, C and D) for combinations of classes (Selects or Fails to select): the "conventional method" corresponds to the method used in practice, which is considered to be "ideal" or "true"; the "Tested method" corresponds to the novel method that was developed in this study.
In the context of sugarcane family selection, the higher the choice accuracy (CAc) and the smaller the error of omission (EOm), the better the CART performance. In a breeding program, the error of omission (EOm) is more serious than the error of improperly selecting additional families, that is, the error corresponding to B/$T_{AB}$. The genotypes that pass to the next phase, coming from the families improperly selected by CART, would be subjected to new selection cycles within the breeding program, where these genotypes could then be excluded, if necessary. That is, the performance of CART improves with a higher number of correct predictions of selected and non-selected families (higher CAc), as indicated by the other procedures (BLUPS or BLUPIS), and with a smaller number of families selected using BLUPS and BLUPIS and discarded by CART (smaller EOm).
Using non-simulated data, CART identified 38 of 52 families selected by BLUPIS (SeP = 0.731), that is, 73% of families with high TSH values (Table 2). CART failed to select 14 families selected by BLUPIS, resulting in an EOm = 0.269. Coincidentally, 14 other families not selected by BLUPIS were selected by CART. This error corresponds to another type of selection error, which is less serious than the EOm because the genotypes selected in the respective families are assessed in the subsequent stages of the breeding program, where these genotypes may eventually be excluded from the breeding population, as previously mentioned. Similar reasoning applies when comparing the apparent error from CART selection relative to BLUPS selection (EOm = 0.227, Table 2).
The CART choice accuracy values were similar to those obtained using BLUPIS and BLUPS (CAc = 0.745). In practical terms, this result indicates that CART successfully predicted 74.5% of the families selected or non-selected by BLUPS or BLUPIS, even when only using the number of stalks in the plot. This accuracy ratio is significantly greater than 0.5 (p-value = 1.26e−07), the value that would be expected by chance if selection using CART had no relationship whatsoever with the other methods.
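The four measures can be recomputed from the cell counts of the confusion matrix. In the sketch below, A is read as the number of families selected by both methods, B as selected only by the tested method, C as selected only by the conventional method, and D as selected by neither; this reading is inferred from the formulas for SeP and B/T_AB. The counts for the CART versus BLUPIS comparison are likewise inferred from the figures quoted in the text (38 of 52, 14 extra families, 110 families in total) rather than copied from Table 2.

```python
def selection_metrics(a, b, c, d):
    """Choice accuracy, apparent error rate, selection precision and
    error of omission from the confusion-matrix counts A, B, C and D."""
    total = a + b + c + d
    cac = (a + d) / total
    sep = a / (a + c)
    return {"CAc": cac, "AER": 1.0 - cac, "SeP": sep, "EOm": 1.0 - sep}


# CART versus BLUPIS, non-simulated data (counts inferred from the text).
cart_vs_blupis = selection_metrics(a=38, b=14, c=14, d=44)
```

Under these inferred counts the sketch gives CAc of about 0.745, SeP of about 0.731 and EOm of about 0.269.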
CART, based only on NS, indicated the selection of 52 families, 14 (26.9%) of which would not have been selected by BLUPIS and 18 of which would not have been selected by BLUPS (Table 2). When considering only potentially superior families, that is, those families that should be selected, CART exhibited significant selection precision compared to BLUPIS (SeP = 0.731, p-value = 0.0005976, H₀: π = 0.5) or BLUPS (SeP = 0.773, p-value = 0.0001941, H₀: π = 0.5). These percentages were relatively low but ensured that a reasonable number of potentially superior families remained in the subsequent stages of the breeding program at a rather reduced operating cost because only NS data were required.
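The significance of a selection precision against H₀: π = 0.5 can be checked with an exact one-sided binomial test. The text does not name the test that was used, so the choice of an exact upper-tail test is an assumption, as are the function name and the counts for BLUPS (34 of 44, inferred from 52 CART-selected families minus 18, and from 40% of 110 families).

```python
from math import comb


def binomial_upper_tail(successes, trials, p0=0.5):
    """Exact P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0 ** k * (1.0 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Families selected by the conventional method that CART also selected.
p_blupis = binomial_upper_tail(38, 52)  # 38 of 52 BLUPIS-selected families
p_blups = binomial_upper_tail(34, 44)   # 34 of 44 BLUPS-selected families
```

Under these assumptions the tail probabilities are approximately 5.98e-04 and 1.94e-04, in line with the p-values quoted above.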
According to [26], approximately 60% of the best genotypes are concentrated in 10% of the best families, and little can be gained by selecting more than 20% of the families. Therefore, the use of the CART algorithm and the selection rate from the BLUPIS and BLUPS methods should ensure the selection of 10% to 20% of the best families, and the best individuals would consequently be assessed in the second test phase (T2).
When considering only simulated data, the CART choice accuracy values were also similar to those obtained using BLUPIS or BLUPS, with CAc = 0.736. The results obtained using simulated data (synthetic data) were very similar to those obtained using non-simulated data, most likely because of the relatively large number of control data (a total of 25 plots per control, which contributed data for the CART algorithm). Using simulated data prior to the CART procedure has the potential advantage of enabling the means for ideotypes (ideal families) to be simulated at the researcher's discretion, which can be used to define which families to select from those present in a specific experiment. The results in Table 2 show the relevance of the simulation procedure because the measures of the choice accuracy and the selection precision of the simulated and non-simulated data were rather similar, indicating sustained algorithm performance. Furthermore, the simulation can offset limited control data in a specific experiment. In the extreme case of the absence of controls, data could be simulated if the researcher is able to define a mean vector and a covariance matrix for the variables of interest according to the study population and considering the environment in which the selection is performed. This information could be retrieved from historical records from other experiments that have been conducted at the same location, for example, or from other studies reporting the information.
The use of tree pruning (1-SE rule or 10-fold cross-validation methods) to generate more accurate estimates resulted in no changes in the trees obtained, both for the simulated and non-simulated data. There was no change in the trees for which the pruning procedure was used because the algorithm could reach the optimal tree without using a fit to the model, which may have resulted from the good volume and quality of the data that were used in the analyses.
Figure 2 shows the regression tree with the non-simulated data generated by CART. The mean productivity of the controls was 145.81 TSH. Productivities higher than this value were generated by families with NS values higher than 110.5. That is, the NS was ranked into two classes, of which the first consisted of families with total NS values per plot below 110.5, and the second consisted of families with corresponding values above 110.5. This cutoff point between the classes corresponded to at least 11 stalks per linear meter of furrow because the plots consisted of two five-meter furrows.
Figure 3 shows the regression tree with the simulated data. The productivities generated for this tree were higher than 145.81 TSH when the total NS per plot was higher than 113.4, that is, at least 11 stalks per linear meter, which corroborated the result found using only the original data. However, the increase in the volume of data via simulation enabled additional classes of predicted values for TSH to be defined according to the total NS per plot of the family, which may be advantageous within the family selection process. Thus, it would be sufficient to select families with 13 to 15 stalks per linear meter if the breeder aims to select families with predicted TSH values ranging from 157 to 180 t·ha⁻¹. An NS per linear meter above 15 and below 18 would indicate families with predicted TSH values ranging from approximately 180 to 200 t·ha⁻¹. Families with more than 18 stalks per linear meter would be associated with predicted TSH values above 230 t·ha⁻¹. Although the yield components SD and SH are not included in the regression tree generated by CART, the breeder should assess these traits and others, including the disease resistance, the lateral bud outgrowth, the internode length, the growth habits and other agronomic aspects of plants, for selection in families with higher productivity potential.
In 2006, 110 full-sib families were assessed from biparental crosses performed at the Serra do Ouro Experimental Station of the Federal University of Alagoas, located in the municipality of Murici, Alagoas, Brazil. Following acclimatization, the seedlings resulting from the crosses were used in experiments on families in the experimental field of the Sugarcane Research and Breeding Center at the Federal University of Viçosa, located in the municipality of Oratórios, Minas Gerais, Brazil, at a latitude of 20˚25'S, a longitude of 42˚48'W and a 494-m altitude in a LVe soil. Oratórios has a climate classified as Aw according to Köppen and Geiger. The annual average temperature and rainfall are 21.6˚C and 1162 mm, respectively. Five experiments were performed in May 2007 using a randomized complete block design. Each experiment consisted of five blocks, 22 families and two controls (commercial varieties). The same controls were used in all the experiments.

In the model $y = Xr + Zg + Wb + e$, $y$ represents the data vector, with $y \sim N(Xr, V)$; $r$ is the presumed fixed effects vector; $g$ is the genotypic effects vector (presumed to be random); $b$ is the environmental effects vector of the incomplete blocks (presumed to be random); $e$ is the vector of errors or residuals (random); and $Z$ and $W$ are the incidence matrices for the said effects.
In the BLUPS procedure, the selected families were split into four classes based on the TSH means. Each class consisted of 11 families, and 40% of the individuals within each family of the first class and 30%, 20% and 10% of individuals in each family in classes 2, 3 and 4 were selected, respectively.

Figure 2. Regression trees generated using the CART algorithm for control data, wherein NS represents the total number of stalks per plot (two 5-m-long furrows), and the terminal nodes represent the predicted yield in tons of stalks per hectare (TSH); non-simulated data.

Table 1. Genotypic TSH means (u + g) of families selected using BLUPS, BLUPIS and CART using data with and without simulation, number of replicates (plots) wherein each family was selected using CART (Rep) and number of individuals selected within each family ($n_k$).
*Families not selected using CART because they failed to exhibit satisfactory results (≥11 stalks/m) in at least 50% of plots are shown in bold.

Table 2. Confusion matrices between the family selection strategies using CART, BLUPIS and BLUPS, together with measures of choice accuracy (CAc), apparent error rate (AER), selection precision (SeP) and error of omission (EOm) for the original data (without simulation) and following simulation (with simulation).