Revealing GE Interactions from Trial Data without Replications

Detecting genotype-by-environment (GE) interaction effects or yield stability is one of the most important components of crop trial data analysis, especially for historical crop trial data. However, it is statistically challenging to detect GE interaction effects because many published data sets report only entry means under each environment rather than replicated field plot data. In this study, we propose a new methodology that imputes replicated trial data sets from the reported entry means so that the GE interactions in the original data can be revealed. As a demonstration, we used a data set that includes 28 potato genotypes and six environments with three replications to numerically evaluate the properties of this new imputation method. We compared the phenotypic means and predicted random effects from the imputed data with the results from the original data. The results from the imputed data were highly consistent with those from the original data set, indicating that data imputed with the proposed method can be used to reveal information harbored in the original data, including GE interaction effects. Therefore, this study could pave the way for detecting GE interactions and other related information from historical crop trial reports when replications are not available.

and multi-year trial data sets to be reported by various institutions and made available for public users. However, most published crop trial data include only entry means under each environment rather than the original replicated field plot data. Without replications, it is statistically challenging to detect GE interaction effects, which are closely related to yield stability, from crop trial reports. Therefore, a method that can effectively detect GE interactions when the original replicated field trial data are not available would be a valuable addition.
Because the entire original data set can be treated as missing data, recovering the potential genetic information requires an imputation method that generates "alternative replicated field data" from which the information in the original data can be recovered. There are two major categories of imputation: single imputation (SI) and multiple imputation (MI). With SI, missing values are filled in with some type of predicted value, for example by mean imputation, regression imputation, and/or matching methods [3] [4] [5]. Although SI has been widely used, one shortcoming is that it does not reflect the full uncertainty created by missing data and almost always underestimates the variance. For example, the regression imputation method uses an estimated regression model to predict or impute the missing values. This could cause relationships to be over-identified and suggest greater precision in the imputed values than is warranted. To account for the uncertainty introduced by imputation, MI, which repeats the imputation several times and thus produces multiple imputed data sets, is recommended, especially when data are missing at random (MAR) [4].
With MI, the imputation uncertainty is accounted for by creating these multiple data sets. MI follows three basic steps: imputation, analysis, and pooling [6] [7]. With MI, bias can be reduced and estimates are more precise. MI has several desirable features. First, introducing appropriate random error into the imputation process makes it possible to obtain approximately unbiased estimates of all parameters. Second, repeated imputation allows researchers to obtain better estimates of the standard errors. Third, MI can be used with any kind of data and analysis.
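The three MI steps can be illustrated with a short Python sketch. It is not the procedure developed in this study: the toy yield values, the function names, the simple normal-draw imputation model, and the use of Rubin's rules in the pooling step are all assumptions made for illustration only.

```python
import random
import statistics


def impute_once(values, rng):
    """Imputation step: replace each missing value (None) with a random draw
    from a normal distribution fitted to the observed values.
    This is an assumed, deliberately simple imputation model."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    return [v if v is not None else rng.gauss(mu, sd) for v in values]


def analyze(completed):
    """Analysis step: estimate the mean and its sampling variance."""
    n = len(completed)
    return statistics.mean(completed), statistics.variance(completed) / n


def pool(results):
    """Pooling step (Rubin's rules): average the m estimates and combine the
    within-imputation and between-imputation variances."""
    m = len(results)
    estimates = [est for est, _ in results]
    within = statistics.mean([var for _, var in results])
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return statistics.mean(estimates), total


rng = random.Random(1)
# Hypothetical plot yields with two missing observations.
yields = [21.3, None, 18.9, 25.1, None, 22.4, 19.8, 23.0]
results = [analyze(impute_once(yields, rng)) for _ in range(20)]
pooled_mean, total_variance = pool(results)
```

Because each imputed data set receives different random draws, the between-imputation variance in `pool` carries the imputation uncertainty that a single imputed data set would hide.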
Unlike typical missing-data problems, in which only some values are absent, here none of the original field measurements are available; only the entry means under each environment are reported. Therefore, an important step is to propose a probability density function for each entry/genotype, based on the published results, that can be used to impute the entire "original data" so that the genetic information harbored in the original data, including GE interactions, can be detected. In the present study, our objectives were 1) to propose a new procedure that generates a data set with repeated measurements from given entry means and 2) to numerically validate the new method with a data set containing 28 potato genotypes and three replications in each of six locations [8]. The purpose of this study is to provide an alternative method and computer tool to improve data analysis and statistical tests and thus to reveal more information harbored

Linear Model for GE Analysis
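The linear model for an observation $y_{hij}$, which represents environment h, genotype i, and block j nested within environment h, can be expressed as follows:
$$y_{hij} = \mu + E_h + G_i + GE_{hi} + B_{j(h)} + e_{hij} \qquad (1)$$
where $\mu$ is the population mean, $E_h$ is the environment effect, $G_i$ is the genotypic effect, $GE_{hi}$ is the GE interaction effect, $B_{j(h)}$ is the effect of block j within environment h, and $e_{hij}$ is the random error.
In order to detect GE interaction effects, replication within each environment is required. Without replication, the GE interaction effects and the random error are confounded and cannot be separated, so the GE interaction and block effects must be omitted from model (1).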

Model Used for Data Imputation
The linear model for an observation under a single environment can be described as model (2)
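As a rough illustration of how replicated data might be generated from entry means, the following Python sketch draws each replicate from a normal distribution centered on the reported entry mean, with variance equal to the mean square error (MSE) of that environment. This is a minimal sketch under assumptions, not the author's implementation: the function name `impute_replicates` and the MSE values are hypothetical, block effects are ignored, and only the four entry means are taken from the results reported in this paper.

```python
import random


def impute_replicates(entry_means, mse, n_reps, rng):
    """Return {(genotype, environment): [y_1, ..., y_r]}.

    Each imputed plot value is drawn from N(entry mean, MSE of the
    environment). Block effects are not modeled (an assumption)."""
    imputed = {}
    for (genotype, env), mean in entry_means.items():
        sd = mse[env] ** 0.5
        imputed[(genotype, env)] = [rng.gauss(mean, sd) for _ in range(n_reps)]
    return imputed


# Entry means for two genotypes in two environments, as quoted in the text.
entry_means = {
    ("Canchan", "Hyo02"): 47.78,
    ("Desiree", "Hyo02"): 8.89,
    ("Canchan", "SR03"): 2.42,
    ("Desiree", "SR03"): 11.42,
}
# Hypothetical MSE values; real values would come from the trial report.
mse = {"Hyo02": 9.0, "SR03": 4.0}

rng = random.Random(2024)
# Ten independently imputed data sets with three replications each.
data_sets = [impute_replicates(entry_means, mse, 3, rng) for _ in range(10)]
```

Each element of `data_sets` has the same layout as a replicated trial and could be analyzed with a GE model in the same way as original plot data.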

Data Source
The data set (plrv) used for our imputation analysis, as a demonstration, is currently available in the R package agricolae [8]. The data set contains six environments, 28 potato genotypes, and three replications in each environment.
There were three agronomic traits in the data, but only yield was used for this study. The major reason for using this data set as a demonstration is that it is publicly available [8], so interested parties can reproduce the results with the code developed by the author of this study.  [11] can be used to analyze each imputed data set subject to model (1) mentioned above. Variance components for genotypic effects and GE interaction effects were estimated by the MINQUE approach [12]. Genotypic effects and GE interaction effects were predicted using the adjusted unbiased prediction (AUP) method [13]. The mean and its 95% confidence interval (CI) were calculated for each parameter. All data analyses were conducted in the R environment [9]. The MINQUE package [14], with the MINQUE approach for variance component estimation and the AUP approach [13] for random effect prediction, was used to analyze the imputed data. The R scripts for data imputation and other related data analyses were developed by the first author of this study and will be available upon request.

Original and Imputed Phenotypic Means for 28 Potato Genotypes
The phenotypic means for 28 potato genotypes under six environments calculated from the original data set are provided in Table 1. Generally, the ranges among the six environments were wider than the ranges among genotypes within each environment (Table 1), indicating that environmental effects played a more important role in yield than genotypes. Some genotypes were observed to be more adapted to specific environments. For example, Canchan was more adapted to Hyo02 (47.78), whereas Desiree was less adapted to the same environment (8.89). On the other hand, Desiree was more adapted to the environment SR03 (11.42) than Canchan was (2.42), indicating that genotype-by-environment (GE) interactions also played an important role in potato yield. Therefore, it is of interest to investigate GE interaction effects in the yield trial analysis.
Phenotypic means and their 95% confidence intervals (the range between the 2.5% and 97.5% percentiles) for 28 entries under each environment over 50 imputed data sets are provided in Table 2. Comparing Table 2 with Table 1, the imputed means and the original means were close to each other, with a maximum difference of 1.35 and a mean difference of 0.28. The correlation coefficient between the original phenotypic means and the imputed phenotypic means was almost 1.0.
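The comparison between original and imputed means can be sketched in Python as follows. The helper names are hypothetical, the differences are taken as absolute differences (an assumption), and the pooled imputed means are made-up values used only to show the calculation; the four original means are those quoted above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_means(original, imputed):
    """Return the maximum absolute difference, the mean absolute difference,
    and the correlation between original and imputed entry means."""
    diffs = [abs(o - i) for o, i in zip(original, imputed)]
    return max(diffs), sum(diffs) / len(diffs), pearson(original, imputed)


# Original entry means quoted in the text (Canchan and Desiree, Hyo02 and SR03).
original = [47.78, 8.89, 2.42, 11.42]
# Hypothetical pooled means from imputed data sets, for illustration only.
pooled = [47.10, 9.40, 2.90, 11.05]

max_diff, mean_diff, r = compare_means(original, pooled)
```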

Imputed Entry Means for 28 Potato Genotypes
Correlation coefficients between phenotypic means from the original data and five sets of imputed means were obtained and are presented in Figure 1. The correlation coefficients between original phenotypic means and five sets of imputed phenotypic means were around 0.98 while coefficients among five sets of imputed phenotypic means were around 0.96. The results showed that phenotypic means obtained from each individual imputed data set were also highly consistent, implying that the imputed phenotypic means well represented the original phenotypic mean data.

Pooled Results
Because a single imputed data set carries some degree of uncertainty, multiple imputed data sets were used to reduce the bias potentially caused by a single imputation. The question is how many imputed data sets are sufficient to adjust for the bias. In this study, we generated 10, 20, 50, 100, 200, and 500 imputed data sets, which were used to obtain the pooled phenotypic means for 28 genotypes under six environments; the mean variance components for environment effects, genotypic effects, GE interaction effects, and random errors; and the mean predicted environment effects, genotypic effects, and GE interaction effects.
However, because of the large amount of results, only summarized results are provided. Figure 2 showed that phenotypic means from the original data set were highly correlated and consistent with pooled phenotypic means from multiple imputed data sets (correlation coefficients were close to 1). Figure 3 showed that predicted environmental effects from the original data were highly consistent with the pooled predicted environmental effects from different imputed data sets (correlation coefficients among these predicted environmental effects were close to 1). The same conclusions can be drawn for predicted genotypic effects (Figure 4) and predicted GE interaction effects (Figure 5). These results suggested that 10 repeated imputed data sets were sufficient to obtain unbiased phenotypic means and predicted environmental effects, genotypic effects, and GE interaction effects.
Figure 1. Correlations among individual phenotypic means from the original data set and five imputed data sets. OM = individual means from the original data set. I1 to I5 = individual means from the 1st five imputed data sets.
Figure 4. Correlations among predicted genotypic effects from the original data set and mean genotypic effects from different multi-imputed data sets. OG = genotypic effects from the original data set. IG10, IG20, IG50, IG100, IG200, and IG500 = pooled genotypic effects from 10, 20, 50, 100, 200, and 500 imputed data sets.
Figure 5. Correlations among predicted GE interaction effects from the original data set and mean GE interaction effects from different multi-imputed data sets. OG = GE interaction effects from the original data set. IG10, IG20, IG50, IG100, IG200, and IG500 = mean GE interaction effects from 10, 20, 50, 100, 200, and 500 imputed data sets.
In summary, the results from the imputed data were highly consistent with those from the original data set, which includes replication. The results compared were phenotypic means, environmental effects, genotypic effects, and GE interaction effects. In addition, the pooled results from 10 repeated imputed data sets were almost identical to the results from the original data set with replications.

Discussion
Crop trial data can provide important information to researchers. Revisiting historical data can help researchers uncover additional genetic information from different perspectives. As mentioned above, however, due to the uncertainty of a single imputed data set, multiple imputed data sets were applied in this study to reduce the potential bias in each parameter.
The question is how many independent imputed data sets are sufficient to represent the results from the original data set. For the demonstration data set, which is available in the R package agricolae, the correlation coefficient was around 0.98 between the phenotypic means from each of five individual imputed data sets and the phenotypic means from the original data, while the correlation coefficients were around 0.96 among the phenotypic means of the five individual imputed data sets (Figure 1), showing that each imputed data set could be used as a substitute for the original data with replications. Our results also showed that 10 imputed data sets could sufficiently adjust the bias for this demonstration data set. However, it is likely that more imputed data sets would be required for a large MSE. When MSE values are not available in trial reports, one possible solution is to impute multiple data sets over a wide range of MSE values. Such a finding is important when individual MSEs in different environments are not available.
Though the method proposed in this study could help determine GE interactions from imputed data sets and increase the opportunity for statistical tests and result validation, the imputation methods are based on the assumption that the original data are normally distributed, the same assumption on which ANOVA and mean comparisons are based. Many original field trials included repeated field blocks; however, this information is often unavailable in the report. Our study showed that results from the imputed data without block effects were highly consistent with the results from the original data including blocks. Therefore, for simplicity, block effects could be omitted when imputing trial data. In addition, data were imputed based on the MSE under each environment; however, imputed data based on the individual MSE under each environment and on the MSE over environments yielded almost identical results, suggesting that either the individual MSEs or the pooled MSE over environments could be used to impute trial data.
Statistical tests for each parameter of interest could follow several approaches. The first possible approach is a jackknife-based technique [15]. The second possible approach is a confidence interval test: with a large number of imputed data sets, we could construct a 95% or 99% confidence interval (CI) and use it as a statistical test. This second approach requires a large number of imputed data sets to provide reliable CI tests for the parameters of interest, so it could be computationally intensive if the original data set is large. However, with high-power servers and/or parallel algorithms, the time needed to generate and analyze a large number of imputed data sets could be trivial.
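The confidence interval test could be sketched in Python as follows. This is an illustrative sketch rather than the code used in this study: the function names are hypothetical, the percentile is computed by linear interpolation (one of several common conventions), and the GE effect estimates are simulated values standing in for the predictions obtained from 500 imputed data sets.

```python
import random


def percentile(sorted_values, p):
    """Percentile (p in [0, 100]) of a sorted list by linear interpolation."""
    k = (len(sorted_values) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def ci_test(estimates, level=0.95):
    """Return (lower, upper, significant) for one parameter.

    The effect is declared significant when the percentile interval
    built from the imputed-data estimates excludes zero."""
    ordered = sorted(estimates)
    tail = (1 - level) / 2 * 100
    lower = percentile(ordered, tail)
    upper = percentile(ordered, 100 - tail)
    return lower, upper, (lower > 0 or upper < 0)


rng = random.Random(7)
# Hypothetical predicted GE effect of one genotype-environment combination,
# one value per imputed data set.
ge_effect = [rng.gauss(3.0, 0.8) for _ in range(500)]
lower, upper, significant = ci_test(ge_effect)
```

Passing `level=0.99` gives the 99% interval; the more extreme the percentiles, the more imputed data sets are needed for the interval limits to be stable.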