RGxE : An R Program for Genotype x Environment Interaction Analysis

Genotype x environmental interaction (GxE) can lead to differences in performance of genotypes over environments. GxE analysis can be used to analyze the stability of genotypes and the value of test locations. We developed an Rlanguage program (RGxE) that computes univariate stability statistics, descriptive statistics, pooled ANOVA, genotype F ratio across location and environment, cluster analysis for location, and location correlation with average location performance. Univariate stability statistics calculated are regression slope (bi), deviation from regression (Sd), Shukla’s variance (σi), S square Wricke’s ecovalence (Wi), and Kang’s yield stability (YSi). RGxE is free and intended for use by scientists studying performance of polygenic or quantitative traits over multiple environments. In the present paper we provide the RGxE program and its components along with an example input data and outputs. Additionally, the RGxE program along with associated files is also available on GitHub at https://github.com/mahendra1/RGxE, http://cucurbitbreeding.com/todd-wehner/publications/software-sas-r-project/ and http://cuke.hort.ncsu.edu/cucurbit/wehner/software.html.


Introduction
Genotype x environmental interaction (GxE) refers to the modification of genetic factors by environmental factors, and to the role of genetic factors in determining the performance of genotypes in different environments.GxE can occur for quantitative traits of economic importance and is often studied in plant and animal breeding, genetic epidemiology, pharmacogenomics and con-servational biology research.The traits include reproductive fitness, longevity, height, weight, yield, and disease resistance.
Selection of superior genotypes in target environments is an important objective of plant breeding programs.A target environment is a production environment used by growers [1] [2] [3] [4] [5].In order to identify superior genotypes across multiple environments, plant breeders conduct trials across locations and years, especially during the final stages of cultivar development.GxE is said to exist when genotype performance differs over environments.Performance of genotype can vary greatly across environment because of the effect of environment on trait expression.Cultivars with high and stable performance are difficult to identify, but are of great value [6] [7].
Since it is impossible to test genotypes in all target environments, plant breeders do indirect selection using their own multiple-environment trials, or test environments.GxE reduces the predictability of the performance of genotypes in target environments based on genotype performance in test environments [8].An important factor in plant breeding is the selection of suitable test locations, since it accounts for GxE and maximizes gain from selection [9].An efficient test location is discriminating, and is representative of the target environments for the cultivars to be released.Discriminating locations can detect differences among genotypes with few replications.Representative locations make it likely that genotypes selected will perform well in target environments [9].
The analysis of variance (ANOVA) is useful in determining the existence, size and significance of GxE.In order to determine GxE for a group of elite cultivars, genotypes are often considered to be fixed effects and environments random.
However, for the purpose of estimating breeding values using best linear unbiased prediction (BLUP), genotypes are considered to be random and environments fixed.Some statisticians consider genotypes random effect, provided that the objective is to select the best ones [10].If GxE is significant, additional stability statistics can be calculated.
Several statistical methods have been proposed for stability analysis.These methods are based on univariate and multivariate models.The present paper focuses on univariate models for the analysis of stability measured using R programming, so a brief description of each stability measure is provided below.
The most widely used methods are univariate stability models based on regression and variance estimates.According to the regression model, stability is expressed in terms of the trait mean (M), the slope of regression line (b i ) and the sum of squares for deviation from regression ( ) High mean of a genotype performance is a precondition of stability.The slope (b i ) of regression indicates the response of genotype to the environmental index, which is derived from the average performance of all genotypes in each environment.If b i is not significantly different from unity, the genotype is adapted in all environments.A b i greater than unity describes genotypes with higher sensitivity to environmental change (below average stability), and greater specificity of adaptability to high yielding environments.A b i less than unity provides a measure of greater resistance to environmental change (above average stability), and therefore increasing specificity of adaptability to low yielding environments.

W
of a genotype is its contribution to the GxE squared and summed across all environments.Since the value of W is expressed as a sum of squares, a test of significance for W i 2 is not available.[12] proposed an unbiased estimate ( ) σ of the variance of GxE plus an error term associated with genotype.Shukla's stability variance ( ) W . Shukla's stability statistic measures the con- tribution of a genotype to the GxE and error term, therefore a genotype with low σ i 2 is regarded as stable.According to [13], W i 2 and σ i 2 are equivalent in ranking genotypes for stability.
The [14] stability statistic (YS i ) is a nonparametric stability procedure in which both the mean (M) and [12] stability variance ( )

Overview of the RGxE Program
RGxE is a user friendly and annotated R program that will allow user to analyze genotype stability and evaluate test location value of balanced mult-location replicated trial data.This program generates output (.csv or .txt)into the same folder from where it reads input dataset and Console window of helper application "R studio" [18] of R statistical software [19].A schematic representation of RGxE is presented in Figure 1.Below are the key components of RGxE program which user can independently run.
str(tempa) 'data.frame':400 obs. of 5 variables: $ YR : Factor w/ 2 levels "2009","2010": 1 1 1 1 1 1 1 1 1 1 ... $ LC : Factor w/ 5 levels "CI","FL","KN",..: 3 3 3 3 3 3 3 3 3 3 ... $ RP : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1  where m = grand mean.Depending on the objectives of the analysis, the genotype, location and year are defined as random or fixed effect, which gives five different ANOVA models (Table 1).The genotype is random when the aim is to estimate variance components, genetic parameters, genetic gains expected from selection or different breeding strategies etc. Conversely, genotype is fixed factor when aim is to make comparison of test material for selection or recommendation.Similarly, location is considered as random when the main interest is to estimate variance components for sites that are representative of the relevant population within target region.Location is fixed when interest is to make explicit comparison of one level another and each location represents a well-defined area with relative to crop management.The year and replication are usually treated as random factor.
Different combinations of random and fixed effects in ANOVA model have implications for the expectations of mean square (MS) values with the possible modification of the error term to be adopted in the F test.Therefore, sometimes the F test is not as straightforward as the ratio between two mean squares.
RGxE computes five different cases of ANOVA:  User friendly output is generated in "data.frame"class using dplyr and tidyr packages.
#Generate unique id for replication for anova tempa$RPid<-as.factor(paste(tempa$YR,tempa$LC, tempa$RP, sep="."))Fitness of ANOVA model for case 1 can be plotted using command print(anovacase1), where x-axis is model predicted value and y-axis is residual value (Figure 2).The uniform distribution of fitted residuals on both side of the reference line (value = 0) confirms the goodness of fit.
The best linear unbiased predictor (BLUP) of random effects can be extracted using ranef() function of lme4 package.BLUPs are estimates of random effects.They allow us to account environmental factors in our model and missing data; and can be used for making selection.BLUP tend to "shrunk" towards the population mean relative to their fixed effects estimates.RGxE computes BLUP of individual genotypes and generate user friendly output in "data.frame"class using dplyr and tidyr packages (see below code).Where BLUP = best linear unbiased predictor Similarly, remaining four cases of ANOVA model can be independently computed using code presented in Supplemental Material.

Descriptive Statistics
A new additional variable environment (ENV) is created and quality check of missing value is performed in dataset "tempa2" using dplyr package.Environment is location-year combination, which is highlighted is below code.RGxE validates structure of dataset "tempa2" as it serves input data for descriptive and other statistics (Figure 1).
The level of significance at 0.05, 0.01 and 0.001 is represented by "*", "**", "***"; respectively.Environmental index is average performance of all genotypes in each environment.Stability.par()function of agricolae package is used to compute Shukla's sigma (σ i 2 ), ssquares, Wricke's ecovalence (W i ) and Kang's statistics (YS i ).For selection of stable genotype, user can independently compute univariate stability statistics while feeding required input data (tempa2) in below code.Top 6 rows of input data "tempa2" is presented in section 4.2 descriptive statistics.

Location Value Statistics
Input data "tempa2" is used to calculate genotype F ratio across location and environment; correlation of location with average location performance; and location cluster analysis.

Genotype F Ratio across Location and Environment;
and Correlation among Location and Average Location RGxE computes analysis of variance by location using lm() function to get the genotype F values across location.When the mean of all genotypes are equal then the F ratio will be close to 1.The high genotype F value indicates high discriminating ability for that location.Similarly, Pearson's test of correlation of locations with average location is computed using cor.test()function of R built in stats package [19].Function cor.test() provide level of significance of correlation and the level of significance at 0.05, 0.01 and 0.001 is represented by "*", "**", "***"; respectively.RGxE generates user friendly output for genotype F ratio across location and environment; and correlation of location with average location performance using dplyr and tidyr packages as shown in below code.

Figure 1 .
Figure 1.Overview of overall process of RGxE program for genotype stability and location value.

#Figure 2 .
Figure 2. Residual plot for Case 1 of ANOVA model where genotype, location, year and replication are treated as random effect.

Table 1 .
ANOVA models including the factors genotype (CLT), location (LC), year (YR), and replication (RP) for multi-location replicated trials across years in a randomized complete block design., the associated degrees of freedom and p-value.According to Wilk's theorem, the negative two times the log likelihood ratio of two models approaches a Chi-Square distribution with k degrees of freedom, where k is number of random effects tested.RGxE create user defined anova_lrt() function to compute likelihood ratio test and it is stored in ANOVA model Case I code.
5 2009 KN 1GeorgiaRattlesnake 64.794 2009.KN.1 KN-2009 6 2009 KN 1 FiestaF1 70.907 2009.KN.1 KN-2009 Descriptive statistics including count, minimum (min), maximum (max), mean, sum, median, variance (var), standard deviation (sd), and coefficient of variation (cv) are computed using dplyr package.Using tidyr package results of descriptive statistics are transposed in user friendly layout so that researchers can interpret them easily (see Supplemental Material for descriptive statistics outputs).RGxE generates following descriptive statistics.• Trait mean over genotype and environment • Trait mean and sd over genotype and year • Trait mean, cv, sd and sum over genotype and location • Trait mean, sd, and sum over genotype, location and year • Trait mean over genotype, location and replication • Trait mean over location and year • Trait mean over location and replication • Trait count, min, max, mean, sum, median, var, and sd over location • Trait count, min, max, mean, sum, median, var, and sd over year • Trait count, min, max, mean, sum, median, var, and sd over genotype • Trait count, min, max, mean, sum, median, var, and sd over environment Where SLOPE = regression slope, DEVREG = deviation from regression, SIGMA = Shukla's sigma, SIGMA_SQUARE = ssquares, Ecovalence = Wricke's ecovalence, YS_Kang = Kang's statistics, ns = non-significant, + = indicate stable genotype according to Kang's stability statistics