Evaluation of Moving Grid Adjustment (MGA) Method in Field Variation Control ()
1. Introduction
Field variation in a large field trial could cause a significant issue in statistical analysis and conclusions. Therefore, spatial variation control has become a crucial step to screen large-scale plant genotypes in a large field trial. Conventional blocking designs (like randomized complete block, split plots, Latin square, and incomplete block designs, etc.) [1] have been commonly used for field variation control since Fisher proposed his “The Arrangement of Field Experiments” [2]. Fisher’s experimental design principles (1926) provide some measure of protection against spatial variation, many statisticians agree that additional layers of spatial control are beneficial [3]-[5]. Interestingly, though the spatial analysis was proposed by Papadakis just 11 years after Fisher’s blocking design [6], its applications to field trial analysis have been unpopular compared to Fisher’s blocking designs. A possible reason is the difficulty of integrating spatial analysis into the conventional field trial data process. Therefore, investigating spatial variation control methods featured with efficiency and integrability will be helpful to enhance field trial analysis.
Fisher’s blocking design (1926) is ANOVA (analysis of variance) based thus simplicity is one of the major features. Under the principle of a linear model, Fisher’s blocking design can be extended to multi-dimension or hierarchical blocking designs. Examples include Latin square and alpha-lattice designs [1] [7]. An efficient blocking design should be assumed that within-block variability is minimized while between-block variability being maximized [8]. Because uniformity within a large block is unlikely in most field trials, conventional blocking designs would be rarely satisfactory. On the other hand, it is more likely that plot-to-plot variation within a block may be affected by competition between genotypes with the trial [9]-[11] by soil variation and fertility and previous land use [12] [13].
Removing spatial variation from a field trial has been a crucial goal for many statisticians [14]. Over 80 years ago Papadakis [6] believed that traditional blocking in field experiments might not adequately represent soil fertility patterns and he instead proposed adjusting the yield of each plot by the performance of the neighboring plots. According to Bartlett [15], the adjustment method suggested by Papadakis (1937) was approximately valid and sometimes useful in experiments where a block size is large. Unfortunately, Papadakis’s method did not receive as serious attention as Fisher’s blocking designs until the 1970’s [16]-[18]. The nearest neighbor adjustment (NNA) method was claimed to be a relatively simple yet effective method for accounting for within-block variation [18] [19]. Later, a random field approach [20] was proposed as an improvement on the NNA methods. The key point of using the NNA methods is to develop covariates to adjust the original data [21]. Stroup et al. (1994) compared RCB analysis with the NNA methods in a multi-environment winter-wheat trial data and concluded that the NNA approach was better as evidenced by a lower coefficient of variation and greater ability to distinguish the differences among cultivars. A recent study showed that modified Papadakis covariates incorporating spatial positions could enhance the adjustment efficiency [21]. However, such a modified covariate may require more information that many data lack and thus it could reduce the application of this method.
Above mentioned covariates are based on the residuals obtained from a particular experimental design (equivalent to a particular statistical model). In other words, the calculation of these covariates is model-dependent. Recently, a moving grid adjustment (MGA) method was proposed by Technow [22] and a covariate is also required to adjust field spatial patterns. Compared to Papadakis’s covariate, the MGA covariate has two noticeable differences. The first one is that the MGA covariate is based on phenotypic data rather than residuals, implying that the MGA adjustment can stand alone. The second one is that each MGA covariate has the flexibility to catch different directions of spatial patterns. In addition, the corresponding R package for the MGA method mvngGrAd is publicly available [22]. Numerical evaluation regarding the adjustment efficiency via Monte Carlo simulations will reveal more statistical properties under the conventional RCB design.
Therefore, there were two objectives to be targeted in this study. Our first objective was to numerically validate the efficacy of the MGA methods via simulated data generated from a commonly used RCB design with one-way and two-way spatial patterns. Correlation coefficients between adjusted and control data were used to determine the efficiency of the MGA method. The second objective was to apply the MGA method to analyze a cotton trial data set with emphasis on cotton yield (sensitive to field spatial pattern) and lint percentage (non-sensitive to field spatial pattern). The main purpose of this study is to provide a useful method to enhance breeding data analysis when spatial patterns are significant.
2. Materials and Methods
2.1. Moving Grid Adjustment (MGA) Method
Assuming that a field trial is arranged in a (nearly) rectangular format, each plot (called grid in this study) is associated with its corresponding row and column indexes. With the MGA method [22], the covariate denoted as
is calculated as follows
(1)
The grid of the plot with row
and column
is denoted by
and I(.) is an indicator function that takes the value 1 if a plot is included in
or 0 if not. The observed phenotypic values from all plots which could be included in
are denoted by
. The covariate
is taken as a measure of the growing conditions for the plot with row
and column
and will be used as a covariate to calculate an adjusted phenotypic value
according to the following formula,
(2)
where
is the mean of all
and b is the regression coefficient, which can be estimated from the following simple linear regression model:
(3)
Since each adjusted data
from Equation (2) shares the identical information as the original data
does, it is very convenient to analyze the adjusted data
subject to different statistical models.
2.2. Simulation Data
A field layout with 25 rows and 40 columns (1,000 plots in total) was used for our Monte Carlo simulation study. With these 1000 plots, an RCB design with four reps where 250 entries/treatments were randomly assigned. However, interested readers can use different sizes of field layouts (different row and column numbers) and/or different treatments for additional investigation. One hundred (100) simulation data sets were generated for each of the four settings provided in Table 1.
Table 1. Four variance components settings for simulation.
Parameter† |
Setting 1‡ |
Setting 2 |
Setting 3 |
Setting 4 |
|
50 |
50 |
50 |
50 |
|
0 |
25 |
50 |
0 |
|
0 |
25 |
0 |
50 |
|
25 |
25 |
25 |
25 |
†:
= variance component for genotype,
= variance component for row direction,
= variance component for column direction, and
= variance component for random error. ‡: Setting 1: no field spatial pattern; Setting 2: spatial pattern exists in both directions; Setting 3: spatial pattern in row direction only; and Setting 4: spatial pattern in column direction only.
In our simulation study, we used the following procedures to generate simulation data:
Step 1: Using the variance components of setting 1 to generate a simulated data vector without spatial patterns (Table 1) as control data, namely
.
Step 2: Using each set of variance components of
(variance for row direction) and
(variance for column direction) from each of the Settings 2 to 4 to generate row and column effects. Sort the simulated row and column effects to generate spatial effects in both row and/or column directions, namely sp.
Step 3: Using
from Step 1 and sp from Step 2 to make pseudo-phenotypic data with spatial patterns, namely,
=
+ sp.
2.3. Actual Data Background
The data used in this study as demonstration/application were from our previous study [23]. The data included 188 cotton recombinant inbred (RI) (F8) lines developed by the single-hill procedure [24] from the cross of HS 46 (P1) MARCABUCAG8US-1-8 (P2) [25]. Originally, these 188 RI lines with their two parental lines (P1 and P2) and a check, ‘Stoneville 474’ (ST474) were planted at the Plant Science Research Center, Mississippi State, MS in 1999. The experiment design was an RCB design with four replicates, where each replicate was in the column direction. Each block consisted of 205 two-row plots with 12 m in length, 0.97 m of between-row spacing, and approximately 10 cm of plant spacing. Within each block, two parental lines (P1 and P2) were repeated twice and the check ST474 was repeated 13 times. The entire field was almost a rectangular arrangement with 17 rows and 52 columns. Rows were numbered from south to north direction while columns were numbered from west to east direction. Data collected from each plot included the following traits: lint yield per hectare (LY, kg) and lint percentage (LP, %).
2.4. Data Processing
For our simulation study, both
(data without spatial pattern, control) and
(data with spatial pattern) were adjusted by the MGA method and the adjusted data are named by
and
accordingly. The following correlation coefficients were calculated:
and
,
and
, and
and
. For simplification, we named these three correlation coefficients as
,
, and
, respectively. The correlation
indicates the impact of spatial patterns on the phenotype data. The correlation coefficient
indicates the adjustment efficiency of the MGA method on the data with a spatial pattern. The correlation coefficient
indicates similarity the adjusted data from both data with and without a spatial pattern.
For application, both lint yield and lint percentage were adjusted by the MGA method. Both original data and adjusted data were analyzed subject to CR (completely randomized) and RCB (randomized complete block) designs. A minimum norm quadratic unbiased estimation (MINQUE) method [26] [27] was used to estimate variance components subject to these two designs. In addition to variance component estimation, we estimated the residuals for both original and adjusted data subject to these two designs. The R library minque [28] was applied to estimate variance components and residuals.
Both simulated data and actual cotton data were adjusted by MGA with the R package mvngGrAd [22]. Heatmaps were generated from the R package desplot [29]. All R scripts including integrating the R functions from the cited packages were developed and customized by the senior author of this study. All data processing and graphics generating were conducted at the RStudio platform [30] [31].
3. Simulation Results
As addressed in Section 2, we define
representing phenotypic data without a spatial pattern thus
can be considered as the control group;
representing phenotypic data with a spatial pattern for each of the four settings (Table 1). Both
and
are the adjusted data for
and
by the MGA method, respectively. For simplification, we only report three correlation coefficients as
,
, and
, where,
between
and
indicates the impact of spatial patterns on phenotypic data;
is between
and
indicating the adjustment efficiency of the MGA method on the data with spatial patterns;
is between
and
indicating the similarity between adjusted data from data with and without a spatial pattern. The correlation coefficients are summarized (minimum, maximum, mean values) from 100 simulations for each of the four settings and are provided in Tables 2-5, separately.
The high correlation coefficients,
,
, and
, showed that the data for both
and
and
and
are identical or nearly identical when there
Table 2. Correlation coefficient among simulated non-adjusted and adjusted data† for setting 1.
|
Min |
Max |
Mean |
‡ |
1.0000 |
1.0000 |
1.0000 |
|
1.0000 |
1.0000 |
1.0000 |
|
0.9929 |
1.0000 |
0.9979 |
†:
= simulated data without spatial pattern (control);
= simulated data with spatial pattern;
= adjusted data for
; and
= adjusted data for
. ‡:
= correlation between
and
;
= correlation between
and
;
= correlation between
and
.
Table 3. Correlation coefficient among simulated non-adjusted and adjusted data† for setting 2.
|
Min |
Max |
Mean |
‡ |
0.7488 |
0.8048 |
0.7793 |
|
0.9522 |
0.9824 |
0.9694 |
|
0.9676 |
0.9820 |
0.9753 |
†:
= simulated data without spatial pattern (control);
= simulated data with spatial pattern;
= adjusted data for
; and
= adjusted data for
. ‡:
= correlation between
and
;
= correlation between
and
;
= correlation between
and
.
Table 4. Correlation coefficient among simulated non-adjusted and adjusted data† for setting 3.
|
Min |
Max |
Mean |
‡ |
0.7506 |
0.8026 |
0.7786 |
|
0.9581 |
0.9946 |
0.9785 |
|
0.9716 |
0.9849 |
0.9801 |
†:
= simulated data without spatial pattern (control);
= simulated data with spatial pattern;
= adjusted data for
; and
= adjusted data for
. ‡:
= correlation between
and
;
= correlation between
and
;
= correlation between
and
.
Table 5. Correlation coefficient among simulated non-adjusted and adjusted data† for setting 4.
|
Min |
Max |
Mean |
‡ |
0.7412 |
0.8012 |
0.7692 |
|
0.9414 |
0.9865 |
0.9708 |
|
0.9595 |
0.9796 |
0.9727 |
†:
= simulated data without spatial pattern (control);
= simulated data with spatial pattern;
= adjusted data for
; and
= adjusted data for
. ‡:
= correlation between
and
;
= correlation between
and
;
= correlation between
and
.
is no spatial pattern (sp = 0) (setting 1, Table 2). When a spatial pattern exists in both directions (setting 2), the mean correlation coefficient (
) between
and
was 0.779 while the mean correlation coefficients between
and
(
) or
and
(
) were 0.969 and 0.975, respectively (Table 3). These two correlation coefficients were significantly higher than that between
and
. Similar results can be observed for settings 3 and 4 (Table 4 and Table 5).
Figures 1-3 showed the heatmaps for one set of simulated data of
, sp,
,
Figure 1. Heat maps for simulated data without spatial pattern (a), simulated spatial pattern in two directions (b), and simulated data with spatial pattern ((c) = (a) + (b)), and adjusted data with MGA method (d).
Figure 2. Heat maps for simulated data without spatial pattern (a), simulated spatial pattern in row direction (b), and simulated data with spatial pattern ((c) = (a) + (b)), and adjusted data with MGA method (d).
and
for settings 2 to 4, respectively. These three figures showed that the heatmap for the adjusted data
is very similar to that for the control data
. The heatmaps from these three different settings are consistent with the correlation analysis. Therefore, according to the simulation results (Tables 2-5 and Figures 1-3), we can conclude that 1) the MGA method can effectively remove field spatial variation if it does exist and 2) the MGA method will not change phenotype results if field spatial variation does not exist.
4. Application
4.1. Heatmaps for Observed Cotton Lint Yield and Lint Percentage
The heatmaps for observed lint yield and lint percentage data are provided in Figure 4(a) and Figure 5(a) (Figure 4 and Figure 5). Lint yield showed that spatial patterns were associated with soil conditions in both column (west-east) and row (north-south) directions (Figure 4(a)). The heatmap showed that lint yield gradually increased from west to east as indicated in column direction and gradually increased from south to north as indicated in the row direction (Figure 4(a)). In other words, more cotton yielded in the NE (north-east) area compared to the SW (south-west) area. Such a spatial pattern likely was associated with the
Figure 3. Heat maps for simulated data without spatial pattern (a), simulated spatial pattern in column direction (b), and simulated data with spatial pattern ((c) = (a) + (b)), and adjusted data with MGA method (d).
gradual changes in soil conditions including different soil types, moisture levels, and nutrient components. The heatmap for the residuals for the original yield data being analyzed subject to CR (completely randomized) design also showed a similar pattern as the phenotypic data (Figure 4(b)). It suggested that lint yield was sensitive to the field condition as shown in the heatmaps on both phenotypic data and estimated residuals. On the other hand, regarding lint percentage, no visible field spatial variation trend was observed for the phenotypic data and estimated residuals subject to a CR design analysis (Figure 5(a) and Figure 5(b)), suggesting that this trait was not sensitive to field conditions and adjustment is not needed.
4.2. Heatmaps for Adjusted Cotton Lint Yield and Lint Percentage
Both lint yield and lint percentage data were adjusted by the MGA method. Similarly, the heatmaps for adjusted lint yield and adjusted lint percentage data are provided in Figure 4 and Figure 5. Adjusted data for lint yield showed that spatial patterns were removed and the heatmap appeared more uniform in both column (west-east) and row (north-south) directions (Figure 4(c)). The heatmap for the estimated residuals from the adjusted lint yield data subject to CR design analysis also appeared much more uniform compared to that from the original lint yield data (Figure 4(d)). As expected, the heatmaps for both the adjusted lint
Figure 4. Heat maps for the original lint yield data (a), residuals subject to CR (completely randomized) design from the original data (b), and adjusted lint yield data (c), and residuals subject to CR (completely randomized) design from the adjusted data (d).
percentage and the corresponding estimated residuals were similar to those for the original lint percentage data and the corresponding residuals subject to CR design analysis (Figure 5(c) and Figure 5(d)). It suggested that the MGA method would not change the phenotypic data if spatial patterns do not exist.
4.3. Variance Components Estimations Subject to CR and RCB Designs
The heatmaps for the estimated residuals subject to CR design for both agronomic traits were presented above. It is also important to numerically compare the results from the original and adjusted data subject to an RCB (randomized complete block) design. Variance components subject to two experimental designs for both original and adjusted data were estimated by linear mixed model approach [27] with minque package (Wu, 2014). The estimated variance components are provided in Table 6 and Table 7 for lint yield and lint percentage, respectively.
Results in Table 6 showed that genotypic effects for lint yield were not significant subject to a CR design analysis because of the dominant variance due to the residuals. However, significant genotypic effects were detected subject to RCB design analysis. The results indicated that the RCB design used for the trial was able to remove the variation due to the column direction. The results agreed with the field spatial pattern detected in this study (Figure 4(b)). With the adjusted lint yield data, the estimated genotypic and residual variance components were
Figure 5. Heat maps for the original lint percentage data (a), residuals subject to CR (completely randomized) design from the original data (b), and adjusted lint percentage data (c), and residuals subject to CR (completely randomized design from the adjusted data (d).
Table 6. Variance components for original and adjusted lint yield subject to two experimental designs†.
Variance component‡ |
CR design |
|
|
|
|
Original data |
|
Adjusted data |
|
|
−3.05 |
NS |
15,426.61 |
** |
|
117,378.91 |
** |
21,458.93 |
** |
|
RCB design |
|
|
|
|
Original data |
|
Adjusted data |
|
|
11,885.25 |
** |
15,420.95 |
** |
|
61,987.00 |
NS |
−0.02 |
NS |
|
56,801.03 |
** |
21,481.87 |
** |
†: CR design = completely randomized design; RCB design = randomized complete block design. ‡:
= variance component for genotype effects;
= variance component for block effects; and
= variance component for residuals.
Table 7. Variance components for original and adjusted lint percentage subject to two experimental designs†.
Variance component‡ |
CR design |
|
|
|
|
Original data |
|
Adjusted data |
|
|
1.92 |
** |
1.93 |
** |
|
0.77 |
** |
0.75 |
** |
|
RCB design |
|
|
|
|
Original data |
|
Adjusted data |
|
|
1.95 |
** |
1.96 |
** |
|
0.11 |
NS |
0.10 |
NS |
|
0.66 |
** |
0.66 |
** |
†: CR design = completely randomized design; RCB design = randomized complete block design. ‡:
= variance component for genotype effects;
= variance component for block effects; and
= variance component for residuals.
nearly identical between CR and RCB designs. In addition, block effects which were significant in the original data for lint yield but no significant block effects were detected in the adjusted data (Table 6). Block effects were trivial and non-significant for lint percentage. As expected, both genotypic variances and residual variances were comparable between CR and RCB experimental designs. In addition, these estimated variance components from adjusted data were comparable to those from the original data.
Based on the results from lint yield and lint percentage, we can conclude that 1) the spatial pattern would be visible if it does exist; 2) spatial pattern including block effects can be removed with the MGA method; and 3) results would remain the same if a spatial pattern doesn’t exist.
5. Discussion
It has been a common practice that large-scare entries are evaluated for either early screening of breeding lines or phenotyping for association mapping study. However, field spatial variation is a common phenomenon in a field trial, especially for a large field trial. Such field spatial variation could greatly impact the conclusions such as inflation/deflation of differences among entries being evaluated. Field spatial variation for a large trial normally results in inflated residual variance and also can be visualized via heatmap. Effectively removing field variation is highly needed and can be indicated by reduced variance and/or depatterned heatmap for residuals.
There are two major ways to control field spatial variation. One major way is to apply various experimental designs such as conventional block designs and augmented experimental designs [1] [32]-[34]. If field conditions are known prior to a field trial being executed, appropriate blocking designs would help control field variation well for a small block size. However, the assumption of within-block uniformity is easily violated when block size is large due to many uncontrollable factors [8], including the row-column incomplete block designs Among various blocking designs, row-column designs (sometimes called rectangular design) [35]-[37] were more efficient than the conventional RCB designs for different trials and experiment sizes [38]. With our limited number of field trial data processing, we also noticed that the rectangular provided equally well or better field variation control compared to the other blocking designs [39]-[41]. Without spatial analysis, the rectangular design is a recommended option over the conventional RCB designs. The other major approach is to apply spatial statistical analysis which is NNA (near-neighbor adjustment) based. Though NNA based spatial analysis was initially proposed in 1937 (Papadis 1937), the application of NNA based method has been very few compared to Fisher’s based blocking designs.
The efficiency of the MGA method was numerically evaluated by Monte Carlo simulation. The simulation results showed that the MGA method can be used to effectively remove a field spatial pattern in one-way or two-way direction as evidenced by the heatmaps (Figures 1-3) and high coefficient of correlation between control data and adjusted data (Tables 2-5). The application of MGA method to an actual cotton trial data also showed the similar conclusion (Figure 4 & Figure 5; Table 6 & Table 7). Our results also suggested that the MGA method can effectively remove block effects (Table 6). In one of our previous studies, we applied several augmented designs to analyze the same lint yield and lint percentage data. We found that the rectangular blocking design (row-column) was better than other blocking designs for lint yield because both row and column variations were significantly captured. Similar results were found among different experimental designs for lint percentage because no significant row or column effects were detected [39].
The estimated genotype and residual variance components for lint yield from the adjusted data by the MGA method in this study (Table 6) were compared with the estimated variance components subject to the rectangular (row-column) model reported in one of our previous studies [39] (Table 4). Interestingly, the results were comparable between two studies (15,421 vs 16,400 and 21,482 vs 20,400, equivalent to 6.0% and 5.3% in difference for genotype and residual variance components, respectively). The results suggest that both the MGA method and the row-column augmented blocking design can effectively remove the two-way field spatial variation and thus they both function equally well regarding this data. However, it will be interesting to compare these two methods with more complicated field spatial patterns.
Data adjustment by the MGA can stand alone. Therefore, this adjustment method has a great flexibility to integrate with other statistical analyses including various genetic model analyses and association mapping study [28] [42]-[44]. Many crop trials are repeated over locations or years. It could be statistically challenging to analyze multi-environment data with a GGE (genotype and genotype × environment interaction) model simultaneously when field spatial patterns vary among environments. However, spatial analysis model is usually fitted for each environment [45]-[48] to produce a spatially adjusted genotype means for each environment. With application of the MGA method and other spatial analysis methods, the spatially adjusted data are then combined for the second stage for an across-environment analysis [49] [50].
6. Conclusion
In conclusion, both our simulation study and actual data analysis showed that the MGA method has a desirable adjustment efficiency when a field spatial pattern exists in single or both directions. In addition, because the R package for the MGA method has been publicly available [22], it will be flexible to integrate with other statistical methods under the RStudio platform [30] for more complicated statistical analysis. Therefore, this method can be a great addition to enhance field trial data analyses with potential field variation.
Disclaimer
Mention of trade names or commercial products in this publication is solely for the purposes of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.