Using PopVar

Introduction

To make progress in breeding, populations should have a favorable mean and high genetic variance (Bernardo 2010). These two parameters can be combined into a single measure called the usefulness criterion (Schnell and Utz 1975), visualized in Figure 1.

Figure 1. Visualization of the mean, genetic variance, and superior progeny mean of a single population.
Figure 1. Visualization of the mean, genetic variance, and superior progeny mean of a single population.

Ideally, breeders would identify the set of parent combinations that, when realized in a cross, would give rise to populations meeting these requirements. PopVar is a package that uses phenotypic and genomewide marker data on a set of candidate parents to predict the mean, genetic variance, and superior progeny mean in bi-parental or multi-parental populations. Thre package also contains functionality for performing cross-validation to determine the suitability of different statistical models. More details are available in Mohammadi, Tiede, and Smith (2015). A dataset think_barley is included for reference and examples.

Installation

You can install the released version of PopVar from CRAN with:

install.packages("PopVar")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("UMN-BarleyOatSilphium/PopVar")

Functions

Below is a description of the functions provided in PopVar:

Function Description
pop.predict Uses simulations to make predictions in recombinant inbred line populations; can internally perform cross-validation for model selections; can be quite slow.
pop.predict2 Uses deterministic equations to make predictions in populations of complete or partial selfing and with or without the induction of doubled haploids; is much faster than pop.predict; does not perform cross-validation or model selection internally.
pop_predict2 Has the same functionality as pop.predict2, but accepts genomewide marker data in a simpler matrix format.
x.val Performs cross-validation to estimate model performance.
mppop.predict Uses deterministic equations to make predictions in 2- or 4-way populations of complete or partial selfing and with or without the induction of doubled haploids; does not perform cross-validation or model selection internally.
mpop_predict2 Has the same functionality as mppop.predict, but accepts genomewide marker data in a simpler matrix format.

Examples

Below are some example uses of the functions in PopVar:

# Load the package
library(PopVar)

# Load the example data
data("think_barley", package = "PopVar")

Predictions using simulated populations

The code below simulates a single population of 1000 individuals for each of 150 crosses. For the sake of speed, the marker effects are predicted using RR-BLUP and no cross-validation is performed.

out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex,
                   crossing.table = cross.tab_ex,
                   nInd = 1000, nSim = 1, 
                   nCV.iter = 1, models = "rrBLUP")
#> Warning in read.cross.csv(dir, file, na.strings, genotypes, estimate.map, : There is no genotype data!
#> Warning in summary.cross(cross): Some markers at the same position on chr
#> 1,2,3,4,5,6,7; use jittermap().
#> Warning in summary.cross(cross): Strange genotype pattern on chr 7.

The function returns a list, one element of which is called predictions. This element is itself a list of matrices containing the predictions for each trait. They can be combined as such:

predictions1 <- lapply(X = out$predictions, FUN = function(x) {
  x1 <- as.data.frame(apply(X = x, MARGIN = 2, FUN = unlist), stringsAsFactors = FALSE)
  cbind(x1[,c("Par1", "Par2")], sapply(X = x1[,-1:-2], as.numeric)) 
})

# Display the first few lines of the predictions for grain yield
knitr::kable(head(predictions1$Yield_param.df))
Par1 Par2 midPar.Pheno midPar.GEBV pred.mu pred.mu_sd pred.varG pred.varG_sd mu.sp_low mu.sp_high low.resp_FHB low.resp_DON low.resp_Height high.resp_FHB high.resp_DON high.resp_Height cor_w/_FHB cor_w/_DON cor_w/_Height
FEG66-08 MN97-125 99.200 100.65607 100.67108 NA 7.744983 NA 95.72413 105.3825 23.11769 23.73826 78.78646 25.07849 25.25430 75.72681 0.3267277 0.3136146 -0.3675142
MN99-71 FEG90-31 NaN 99.17635 99.23144 NA 5.116231 NA 95.32773 103.0642 21.56813 23.55814 77.73700 24.62120 24.88625 75.77018 0.4914562 0.1890384 -0.1914478
MN96-141 FEG183-52 NaN 101.28761 101.27173 NA 5.377596 NA 97.39391 105.3030 23.82710 23.95326 79.50837 27.37068 28.58543 75.97826 0.7375570 0.7399023 -0.5544720
MN99-02 FEG183-52 NaN 100.78451 100.79905 NA 10.194490 NA 95.31410 106.3870 25.85846 24.11886 74.63583 26.22223 26.75520 73.64793 0.0948712 0.4017003 -0.1020523
FEG99-10 FEG148-56 NaN 98.68505 98.66757 NA 4.042377 NA 95.19892 102.1283 19.27629 20.13234 85.97129 23.09968 24.33009 80.42155 0.5805314 0.6164489 -0.5455833
MN99-62 MN01-46 105.775 102.29724 102.32428 NA 4.978463 NA 98.65287 106.0425 26.00114 28.34806 73.02783 26.42711 28.42103 73.98770 0.2021334 -0.1216202 0.3623222

Predictions using deterministic equations

Generating predictions via simulated populations can become computationally burdensome when many thousands or hundreds of thousands of crosses are possible. Fortunately, deterministic equations are available to generate equivalent predictions in a fraction of the time. These equations are provided in the pop.predict2 and pop_predict2 functions.

The pop.predict2 function takes arguments in the same format as pop.predict. We have eliminated the arguments for marker filtering and imputation and cross-validation, as the pop.predict2 function does not support these steps. (You may continue to conduct cross-validation using the x.val function.) Therefore, the genotype data input for pop.predict2 must not contain any missing data. Further, these predictions assume fully inbred parents, so marker genotypes must only be coded as -1 or 1. The data G.in_ex_imputed contains genotype data that is formatted properly.

Below is an example of using the pop.predict2 function:

out2 <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex,
                     crossing.table = cross.tab_ex, models = "rrBLUP")

knitr::kable(head(subset(out2, trait == "Yield")))
parent1 parent2 trait pred_mu pred_varG pred_musp_low pred_musp_high cor_w_FHB cor_w_DON cor_w_Yield cor_w_Height pred_cor_musp_low_FHB pred_cor_musp_low_DON pred_cor_musp_low_Yield pred_cor_musp_low_Height pred_cor_musp_high_FHB pred_cor_musp_high_DON pred_cor_musp_high_Yield pred_cor_musp_high_Height
3 FEG66-08 MN97-125 Yield 100.41166 7.768362 95.52021 105.3031 0.3412831 0.3303378 NA -0.4004154 22.27035 23.76433 NA 78.84012 24.96750 25.91641 NA 75.11838
7 MN99-71 FEG90-31 Yield 98.98546 5.298105 94.94591 103.0250 0.4959745 0.2776666 NA -0.1612311 21.86563 23.41697 NA 77.04669 24.85938 25.18159 NA 75.75153
11 MN96-141 FEG183-52 Yield 101.07137 5.571655 96.92884 105.2139 0.6132612 0.6961326 NA -0.4603399 24.36629 23.67733 NA 79.59345 27.88028 28.40806 NA 75.16009
15 MN99-02 FEG183-52 Yield 100.82967 9.499737 95.42053 106.2388 0.0249962 0.4211875 NA -0.1929275 26.15174 23.73807 NA 74.76672 26.29180 26.60344 NA 72.86270
19 FEG99-10 FEG148-56 Yield 98.01593 4.121803 94.45292 101.5789 0.5959335 0.6401323 NA -0.5080571 19.67552 19.58928 NA 85.81854 23.38585 24.23917 NA 80.69997
23 MN99-62 MN01-46 Yield 102.51094 4.673197 98.71709 106.3048 0.0755600 -0.0827658 NA 0.4010907 26.07342 28.37124 NA 72.67125 26.20580 28.16102 NA 74.51340

Note that the output of pop.predict2 is no longer a list, but a data frame containing the combined predictions for all traits.

The formatting requirements of G.in for pop.predict and pop.predict2 are admittedly confusing. Marker genotype data is commonly stored as a n x p matrix, where n is the number of entries and p the number of markers. The function pop_predict2 accommodates this general marker data storage. Here is an example:

out3 <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex,
                     crossing.table = cross.tab_ex, models = "rrBLUP")

knitr::kable(head(subset(out2, trait == "Yield")))
parent1 parent2 trait pred_mu pred_varG pred_musp_low pred_musp_high cor_w_FHB cor_w_DON cor_w_Yield cor_w_Height pred_cor_musp_low_FHB pred_cor_musp_low_DON pred_cor_musp_low_Yield pred_cor_musp_low_Height pred_cor_musp_high_FHB pred_cor_musp_high_DON pred_cor_musp_high_Yield pred_cor_musp_high_Height
3 FEG66-08 MN97-125 Yield 100.41166 7.768362 95.52021 105.3031 0.3412831 0.3303378 NA -0.4004154 22.27035 23.76433 NA 78.84012 24.96750 25.91641 NA 75.11838
7 MN99-71 FEG90-31 Yield 98.98546 5.298105 94.94591 103.0250 0.4959745 0.2776666 NA -0.1612311 21.86563 23.41697 NA 77.04669 24.85938 25.18159 NA 75.75153
11 MN96-141 FEG183-52 Yield 101.07137 5.571655 96.92884 105.2139 0.6132612 0.6961326 NA -0.4603399 24.36629 23.67733 NA 79.59345 27.88028 28.40806 NA 75.16009
15 MN99-02 FEG183-52 Yield 100.82967 9.499737 95.42053 106.2388 0.0249962 0.4211875 NA -0.1929275 26.15174 23.73807 NA 74.76672 26.29180 26.60344 NA 72.86270
19 FEG99-10 FEG148-56 Yield 98.01593 4.121803 94.45292 101.5789 0.5959335 0.6401323 NA -0.5080571 19.67552 19.58928 NA 85.81854 23.38585 24.23917 NA 80.69997
23 MN99-62 MN01-46 Yield 102.51094 4.673197 98.71709 106.3048 0.0755600 -0.0827658 NA 0.4010907 26.07342 28.37124 NA 72.67125 26.20580 28.16102 NA 74.51340

Benchmarking and comparisons

The code below compares the functions pop.predict and pop.predict2 with respect to computation time and results:

time1 <- system.time({
  capture.output(pop.predict.out <- pop.predict(
    G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex,
    nInd = 1000, nSim = 1, nCV.iter = 1, models = "rrBLUP"))
})
#> Warning in read.cross.csv(dir, file, na.strings, genotypes, estimate.map, : There is no genotype data!
#> Warning in summary.cross(cross): Some markers at the same position on chr
#> 1,2,3,4,5,6,7; use jittermap().
#> Warning in summary.cross(cross): Strange genotype pattern on chr 7.

time2 <- system.time({pop.predict2.out <- pop.predict2(
  G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex,
  crossing.table = cross.tab_ex,model = "rrBLUP")})

# Print the time (seconds) required for each function.
c(pop.predict = time1[[3]], pop.predict2 = time2[[3]])
#>  pop.predict pop.predict2 
#>        29.28         0.75

Plot results

predictions1 <- lapply(X = pop.predict.out$predictions, FUN = function(x) {
  x1 <- as.data.frame(apply(X = x, MARGIN = 2, FUN = unlist), stringsAsFactors = FALSE)
  cbind(x1[,c("Par1", "Par2")], sapply(X = x1[,-1:-2], as.numeric))
})

pop.predict.out1 <- predictions1$Yield_param.df[,c("Par1", "Par2", "pred.varG")]
pop.predict2.out1 <- subset(pop.predict2.out, trait == "Yield", c(parent1, parent2, pred_varG))

toplot <- merge(pop.predict.out1, pop.predict2.out1, by.x = c("Par1", "Par2"),
                by.y = c("parent1", "parent2"))

plot(pred.varG ~ pred_varG, toplot,
     xlab = "pop.predict2", ylab = "pop.predict",
     main = "Comparsion of predicted genetic variance")

Multi-parent populations

PopVar also includes functions for predicting the mean, genetic variance, and superior progeny mean of multi-parent populations. The mppop.predict function takes the same inputs as pop.predict or pop.predict2, and the mppop_predict2 function takes the same inputs as pop_predict2. Both require the additional argument n.parents, which will determine whether the populations are formed by 2- or 4-way matings (support for 8-way populations is forthcoming.)

Below is an example of using the mppop.predict function:

# Generate predictions for all possible 4-way crosses of 10 sample parents
sample_parents <- sample(unique(unlist(cross.tab_ex)), 10)

mp_out <- mppop.predict(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex,
                        parents = sample_parents, n.parents = 4, models = "rrBLUP")

knitr::kable(head(subset(mp_out, trait == "Yield")))
parent1 parent2 parent3 parent4 trait pred_mu pred_varG pred_musp_low pred_musp_high FHB DON Yield Height pred_cor_musp_low_FHB pred_cor_musp_low_DON pred_cor_musp_low_Yield pred_cor_musp_low_Height pred_cor_musp_high_FHB pred_cor_musp_high_DON pred_cor_musp_high_Yield pred_cor_musp_high_Height
3 FEG148-26 FEG183-25 FEG184-24 FEG31-68 Yield 97.21566 7.935479 92.27187 102.1594 0.3399113 0.5015924 NA -0.3115662 20.28264 17.06047 NA 83.76323 22.82428 20.84008 NA 80.82095
7 FEG148-26 FEG183-25 FEG184-24 FEG55-14 Yield 100.50647 7.196051 95.79865 105.2143 0.2058013 0.5038860 NA -0.2986956 20.48292 17.29009 NA 83.84982 21.92910 21.10897 NA 81.11356
11 FEG148-26 FEG183-25 FEG184-24 FEG67-12 Yield 99.48691 5.657875 95.31246 103.6614 0.2427504 0.5186568 NA -0.3527085 21.17820 17.98368 NA 83.39269 22.72225 21.49834 NA 80.38271
15 FEG148-26 FEG183-25 FEG184-24 FEG96-55 Yield 99.80053 7.170087 95.10121 104.4998 0.2661716 0.5068394 NA -0.3997340 21.40929 17.38364 NA 83.10275 23.24221 21.20973 NA 79.38459
19 FEG148-26 FEG183-25 FEG184-24 MN02-38 Yield 98.78104 7.709443 93.90817 103.6539 0.2206565 0.4494046 NA -0.2479670 21.89197 19.34685 NA 81.11197 23.52702 22.91285 NA 78.53044
23 FEG148-26 FEG183-25 FEG184-24 MN03-74 Yield 99.81935 8.335248 94.75257 104.8861 0.1281087 0.4244475 NA -0.1876705 22.24945 19.34953 NA 79.97713 23.20921 22.69514 NA 77.93463

Below is an example of using the mppop_predict2 function:

# Generate predictions for all possible 4-way crosses of 10 sample parents
mp_out2 <- mppop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex,
                          parents = sample_parents, n.parents = 4, models = "rrBLUP")

knitr::kable(head(subset(mp_out2, trait == "Yield")))
parent1 parent2 parent3 parent4 trait pred_mu pred_varG pred_musp_low pred_musp_high FHB DON Yield Height pred_cor_musp_low_FHB pred_cor_musp_low_DON pred_cor_musp_low_Yield pred_cor_musp_low_Height pred_cor_musp_high_FHB pred_cor_musp_high_DON pred_cor_musp_high_Yield pred_cor_musp_high_Height
3 FEG148-26 FEG183-25 FEG184-24 FEG31-68 Yield 97.21566 7.935479 92.27187 102.1594 0.3399113 0.5015924 NA -0.3115662 20.28264 17.06047 NA 83.76323 22.82428 20.84008 NA 80.82095
7 FEG148-26 FEG183-25 FEG184-24 FEG55-14 Yield 100.50647 7.196051 95.79865 105.2143 0.2058013 0.5038860 NA -0.2986956 20.48292 17.29009 NA 83.84982 21.92910 21.10897 NA 81.11356
11 FEG148-26 FEG183-25 FEG184-24 FEG67-12 Yield 99.48691 5.657875 95.31246 103.6614 0.2427504 0.5186568 NA -0.3527085 21.17820 17.98368 NA 83.39269 22.72225 21.49834 NA 80.38271
15 FEG148-26 FEG183-25 FEG184-24 FEG96-55 Yield 99.80053 7.170087 95.10121 104.4998 0.2661716 0.5068394 NA -0.3997340 21.40929 17.38364 NA 83.10275 23.24221 21.20973 NA 79.38459
19 FEG148-26 FEG183-25 FEG184-24 MN02-38 Yield 98.78104 7.709443 93.90817 103.6539 0.2206565 0.4494046 NA -0.2479670 21.89197 19.34685 NA 81.11197 23.52702 22.91285 NA 78.53044
23 FEG148-26 FEG183-25 FEG184-24 MN03-74 Yield 99.81935 8.335248 94.75257 104.8861 0.1281087 0.4244475 NA -0.1876705 22.24945 19.34953 NA 79.97713 23.20921 22.69514 NA 77.93463

References

Bernardo, Rex. 2010. Breeding for Quantitative Traits in Plants. 2nd ed. Woodbury, Minnesota: Stemma Press.

Mohammadi, Mohsen, Tyler Tiede, and Kevin P. Smith. 2015. “PopVar: A Genome-Wide Procedure for Predicting Genetic Variance and Correlated Response in Biparental Breeding Populations.” Crop Science 55 (5): 2068–77. https://doi.org/10.2135/cropsci2015.01.0030.

Schnell, F. W., and H. F. Utz. 1975. “F1-leistung und elternwahl euphyder züchtung von selbstbefruchtern.” In Bericht über Die Arbeitstagung Der Vereinigung Österreichischer Pflanzenzüchter, 243–48. Gumpenstein, Austria: BAL Gumpenstein.