Department of Botany and Plant Science, University of California, Riverside, California, 92521, USA

Soybean Research Institute (Chinese Education Ministry's Key Laboratory of Soybean Biology), Northeast Agricultural University, 150030 Harbin, PR China

Dept. of Electrical & Computer Engineering, University of Miami 1251 Memorial Drive, Coral Gables, FL 33146, USA

Abstract

Background

Most quantitative traits are controlled by multiple quantitative trait loci (QTL). The contribution of each locus may be negligible but the collective contribution of all loci is usually significant. Genome selection that uses markers of the entire genome to predict the genomic values of individual plants or animals can be more efficient than selection on phenotypic values and pedigree information alone for genetic improvement. When a quantitative trait is contributed by epistatic effects, using all markers (main effects) and marker pairs (epistatic effects) to predict the genomic values of plants can achieve the maximum efficiency for genetic improvement.

Results

In this study, we created 126 recombinant inbred lines of soybean and genotyped 80 makers across the genome. We applied the genome selection technique to predict the genomic value of somatic embryo number (a quantitative trait) for each line. Cross validation analysis showed that the squared correlation coefficient between the observed and predicted embryo numbers was 0.33 when only main (additive) effects were used for prediction. When the interaction (epistatic) effects were also included in the model, the squared correlation coefficient reached 0.78.

Conclusions

This study provided an excellent example for the application of genome selection to plant breeding.

Background

Genome selection refers to a method for genomic value prediction using markers of the entire genome

The importance of epistasis in genetic determination may vary across different species. In agricultural crops, most quantitative traits in barley do not have a strong basis of epistatic effects

The number of somatic embryos is an important trait for consideration in soybean breeding program because it is directly related to the plant regeneration system that is essential for effective gene transfer. The capacity of plant regeneration through immature embryo culture of soybean is genetically determined, reflected by significant variation across different lines (from 0% to 100% of regeneration). The genetic knowledge of the regeneration trait based on immature embryo culture and the discovery of molecular markers associated with regeneration will offer a great opportunity to develop efficient elite inbred lines with increased regeneration capacity. However, studies on the genetic basis of embryogenesis are lacking. There is no information available about the role of epistasis. In this study, we used advanced statistical methods to investigate not only the main effects but also pair-wise interaction (epistatic) effects for soybean somatic embryogenesis.

Statistical methods for QTL mapping are available but mainly for individual marker (main effect) analysis and individual marker pair (epistatic effect) analysis

Result

Main effect model

The numerical codes (marker IDs) and names of the 80 markers are given in Table

Names, positions (cM) and linkage groups (LG) of the 80 markers (M1-M80) presented in Tables 2 and 3, Figures 1, 2, 3 and 5.

**Marker ID**

**Marker name**

**cM**

**LG**

**Marker ID**

**Marker name**

**cM**

**LG**

M1

Satt005

0.00

1

M11

Satt123

0.00

10

M74

Satt579

149.39

1

M73

Satt576

70.86

10

M37

Satt290

241.33

1

M9

Satt094

138.02

10

M70

Satt537

586.72

1

M6

Satt052

0.00

11

M4

Satt032

0.00

2

M44

Satt353

89.85

11

M58

Satt436

196.10

2

M61

Satt469

169.72

11

M79

Satt605

272.50

2

M23

Satt181

169.72

11

M69

Satt532

388.23

2

M57

Satt434

293.93

11

M75

Satt584

0.00

3

M41

Satt317

465.41

11

M31

Satt234

152.25

3

M72

Satt568

670.93

11

M49

Satt387

244.95

3

M15

Satt150

0.00

12

M3

Satt022

304.72

3

M63

Satt494

345.39

12

M32

Satt247

0.00

4

M26

Satt201

455.66

12

M43

Satt337

345.39

4

M22

Satt175

538.75

12

M13

Satt137

439.85

4

M71

Satt567

588.30

12

M62

Satt475

545.69

4

M56

Satt427

0.00

13

M47

Satt375

545.69

4

M12

Satt130

91.38

13

M33

Satt264

609.52

4

M25

Satt199

190.84

13

M5

Satt046

670.69

4

M65

Satt505

259.01

13

M34

Satt268

0.00

5

M76

Satt594

604.40

13

M28

Satt213

345.39

5

M24

Satt195

0.00

14

M78

Satt602

434.04

5

M19

Satt161

68.96

14

M52

Satt411

528.24

5

M38

Satt294

170.89

14

M48

Satt384

600.86

5

M51

Satt399

255.15

14

M30

Satt231

674.74

5

M68

Satt529

0.00

15

M77

Satt598

743.69

5

M29

Satt215

72.32

15

M40

Satt307

0.00

6

M67

Satt528

417.71

15

M36

Satt286

41.97

6

M17

Satt158

0.00

16

M53

Satt422

88.51

6

M27

Satt206

345.39

16

M35

Satt281

164.63

6

M14

Satt146

0.00

17

M66

Satt520

252.18

6

M18

Satt160

89.85

17

M80

GMA

300.27

6

M54

Satt425

197.04

17

M7

Satt082

0.00

7

M46

Satt374

273.57

17

M50

Satt397

345.39

7

M16

Satt155

0.00

18

M21

Satt168

0.00

8

M39

Satt300

76.49

18

M60

Satt467

67.12

8

M42

Satt330

0.00

19

M10

Satt122

247.35

8

M59

Satt451

98.09

19

M45

Satt373

0.00

9

M2

Satt008

N/A

N/A

M64

Satt495

345.39

9

M8

Satt085

N/A

N/A

M20

Satt166

462.25

9

M55

Satt426

N/A

N/A

The empirical Bayesian estimates of the top 27 marker (main) effects.

**Marker**

**Variance**

**Effect**

**StdErr**

**LOD**

**p- value**

**H**

M56

1.0181

1.0016

0.1228

14.44

0.0000

0.1303

M8

0.7492

0.8581

0.1118

12.77

0.0000

0.0980

M44

0.4744

-0.6783

0.1193

7.01

0.0000

0.0562

M39

0.4328

-0.6466

0.1217

6.13

0.0000

0.0525

M12

0.3084

-0.5437

0.1134

4.98

0.0000

0.0384

M22

0.3108

0.5455

0.1143

4.94

0.0000

0.0399

M51

0.3114

0.5444

0.1228

4.26

0.0000

0.0375

M65

0.2437

0.4797

0.1163

3.69

0.0000

0.0304

M69

0.2126

-0.4477

0.1109

3.53

0.0001

0.0271

M53

0.2095

0.4439

0.1111

3.46

0.0001

0.0241

M15

0.2217

-0.4543

0.1240

2.91

0.0003

0.0270

M72

0.1711

-0.3989

0.1095

2.88

0.0003

0.0210

M26

0.1781

-0.4054

0.1167

2.62

0.0005

0.0203

M23

0.1909

-0.4192

0.1241

2.47

0.0007

0.0222

M42

0.1915

0.4188

0.1267

2.37

0.0010

0.0232

M60

0.1662

-0.3900

0.1185

2.35

0.0010

0.0191

M54

0.1506

-0.3712

0.1133

2.33

0.0011

0.0173

M25

0.1437

0.3625

0.1111

2.31

0.0011

0.0181

M10

0.1336

0.3492

0.1075

2.29

0.0012

0.0166

M36

0.1433

-0.3605

0.1150

2.13

0.0017

0.0167

M4

0.1363

0.3513

0.1138

2.07

0.0020

0.0158

M16

0.1171

-0.3233

0.1121

1.80

0.0039

0.0135

M34

0.1141

0.3187

0.1120

1.76

0.0044

0.0133

M61

0.0982

-0.2951

0.1055

1.70

0.0052

0.0118

M45

0.1027

-0.2997

0.1136

1.51

0.0083

0.0116

M5

0.0961

-0.2890

0.1121

1.44

0.0099

0.0106

M13

0.0802

-0.2619

0.1081

1.27

0.0154

0.0093

The last column (H) gives the proportion of phenotypic variance contributed by each marker. The markers are sorted based on their LOD scores.

LOD scores for the 80 markers (main effects) obtained from the empirical Bayesian analysis

**LOD scores for the 80 markers (main effects) obtained from the empirical Bayesian analysis**.

We now examine the result of cross validation. All the 80 markers were ordered from the largest to the smallest according to the LOD scores. We then used the cross validation analysis to calculate the squared correlation coefficient between the observed phenotypic values and the predicted genomic values. For the prediction, we examined the change of the r-square with the number of markers included in the model for prediction. The result is plotted in Figure

The r-square between the phenotypic values of somatic embryogenesis of soybean and their predictions by the leave-one-out cross validation analysis

**The r-square between the phenotypic values of somatic embryogenesis of soybean and their predictions by the leave-one-out cross validation analysis**. The black curve represents the change of squared correlation coefficient by increasing the number of sorted markers (from the strongest to the weakest) in the model. The blue curve shows the change of squared correlation coefficient by increasing the number of randomly selected markers in the model. The squared correlation coefficient reaches its maximum at 0.33 when 27 markers with the largest LOD scores are included in the model.

Epistatic effect model

The epistatic effect model included 3240 (80 main + 3160 epistatic) effects. The LOD scores of the 3160 epistatic effects are given in Figure

The empirical Bayesian estimates of the top 66 marker (main) effects and marker pair (epistatic) effects.

**Marker 1**

**Marker 2**

**Variance**

**Effect**

**StdErr**

**LOD**

**p- value**

**H**

M3

M39

0.5519

0.7409

0.0476

52.60

0.0000

0.0648

M1

M26

0.4070

0.6365

0.0479

38.29

0.0000

0.0469

M7

M35

0.2960

-0.5417

0.0460

30.06

0.0000

0.0324

M1

M56

0.2342

-0.4832

0.0453

24.66

0.0000

0.0286

M1

M11

0.2163

0.4633

0.0446

23.42

0.0000

0.0263

M37

M59

0.1847

0.4280

0.0414

23.23

0.0000

0.0230

M7

M50

0.2234

-0.4710

0.0456

23.11

0.0000

0.0262

M1

M8

0.1722

-0.4127

0.0412

21.78

0.0000

0.0214

M4

M55

0.2026

-0.4468

0.0454

20.99

0.0000

0.0231

M8

M65

0.1768

0.4174

0.0436

19.87

0.0000

0.0223

M4

M22

0.1574

0.3950

0.0413

19.85

0.0000

0.0194

M11

M24

0.1484

0.3826

0.0414

18.51

0.0000

0.0189

M12

M42

0.1273

-0.3549

0.0385

18.39

0.0000

0.0157

M20

M24

0.1551

-0.3913

0.0426

18.32

0.0000

0.0194

M18

M80

0.1854

-0.4284

0.0475

17.67

0.0000

0.0215

M13

M34

0.1437

0.3772

0.0422

17.34

0.0000

0.0185

M13

M68

0.1175

0.3407

0.0387

16.82

0.0000

0.0156

M3

M70

0.1521

0.3866

0.0451

15.94

0.0000

0.0183

M5

M22

0.1252

0.3515

0.0418

15.37

0.0000

0.0153

M4

M79

0.1377

0.3681

0.0445

14.82

0.0000

0.0167

M4

M36

0.1016

0.3166

0.0395

13.91

0.0000

0.0120

M1

M16

0.1602

0.3975

0.0499

13.74

0.0000

0.0192

M8

M33

0.0941

-0.3046

0.0409

12.04

0.0000

0.0121

M2

M9

0.0920

0.3002

0.0434

10.37

0.0000

0.0105

M12

M24

0.0799

0.2789

0.0413

9.91

0.0000

0.0101

M27

M36

0.0748

0.2701

0.0420

8.97

0.0000

0.0094

M1

M45

0.0741

0.2695

0.0420

8.93

0.0000

0.0090

M33

M50

0.0682

0.2575

0.0435

7.61

0.0000

0.0087

M9

M19

0.0696

-0.2599

0.0443

7.46

0.0000

0.0089

M12

M71

0.0563

0.2338

0.0404

7.28

0.0000

0.0066

M8

M56

0.0630

0.2465

0.0434

7.01

0.0000

0.0077

M11

M26

0.0696

0.2596

0.0483

6.27

0.0000

0.0077

M7

M27

0.0702

-0.2605

0.0499

5.92

0.0000

0.0081

M1

M52

0.0558

-0.2322

0.0451

5.76

0.0000

0.0061

M28

M63

0.0500

0.2187

0.0437

5.43

0.0000

0.0061

M34

M38

0.0506

0.2203

0.0449

5.22

0.0000

0.0059

M53

M79

0.0520

-0.2232

0.0461

5.08

0.0000

0.0059

M39

M45

0.0495

0.2178

0.0451

5.05

0.0000

0.0055

M9

M80

0.0380

-0.1907

0.0407

4.76

0.0000

0.0043

M1

M17

0.0446

-0.2069

0.0442

4.75

0.0000

0.0051

M25

M80

0.0367

-0.1869

0.0411

4.48

0.0000

0.0042

M15

M21

0.0408

-0.1971

0.0449

4.18

0.0000

0.0051

M1

M22

0.0417

-0.1987

0.0471

3.86

0.0000

0.0050

M2

M10

0.0288

0.1643

0.0394

3.78

0.0000

0.0032

M5

M27

0.0330

-0.1769

0.0436

3.57

0.0001

0.0040

M14

M26

0.0347

-0.1810

0.0447

3.55

0.0001

0.0040

M5

M38

0.0232

-0.1454

0.0442

2.35

0.0010

0.0025

M11

M18

0.0217

-0.1408

0.0446

2.16

0.0016

0.0025

M2

M37

0.0240

-0.1480

0.0469

2.16

0.0016

0.0024

M13

M58

0.0193

0.1321

0.0428

2.07

0.0020

0.0023

M25

M53

0.0176

-0.1259

0.0421

1.94

0.0028

0.0019

M44

M80

0.0193

0.1305

0.0474

1.65

0.0059

0.0018

M22

M27

0.0132

-0.1074

0.0404

1.53

0.0079

0.0015

M10

M22

0.0136

0.1086

0.0411

1.51

0.0083

0.0016

M22

M24

0.0098

0.0921

0.0363

1.39

0.0112

0.0011

M1

M77

0.0135

-0.1061

0.0453

1.19

0.0193

0.0013

M71

M76

0.0105

-0.0930

0.0425

1.04

0.0288

0.0010

M29

M48

0.0071

0.0738

0.0415

0.68

0.0759

0.0007

M6

M67

0.0049

-0.0604

0.0364

0.60

0.0976

0.0005

M20

M45

0.0057

0.0639

0.0401

0.55

0.1114

0.0005

M1

M6

0.0037

-0.0509

0.0344

0.48

0.1385

0.0003

M22

M66

0.0035

0.0475

0.0359

0.38

0.1858

0.0003

M19

M47

0.0029

0.0419

0.0337

0.33

0.2141

0.0002

M50

M76

0.0027

0.0396

0.0338

0.30

0.2410

0.0002

M62

M68

0.0021

-0.0338

0.0316

0.25

0.2847

0.0002

M22

M23

0.0027

0.0375

0.0359

0.24

0.2964

0.0002

The last column (H) gives the proportion of phenotypic variance contributed by each effect. The effects are sorted based on their LOD scores.

LOD scores for all the marker pairs (epistatic effects) obtained from the empirical Bayesian analysis

**LOD scores for all the marker pairs (epistatic effects) obtained from the empirical Bayesian analysis**.

The 66 effects were selected from the leave-one-out cross validation analysis. Figure

The r-square between the phenotypic values of somatic embryogenesis and their predicted values

**The r-square between the phenotypic values of somatic embryogenesis and their predicted values**. The black curve represents the change of squared correlation coefficient by increasing the number of sorted (from the strongest to the weakest) effects (markers and marker pairs) in the model. The blue curve shows the change of squared correlation coefficient by increasing the number of randomly selected effects (markers and marker pairs) in the model. The squared correlation coefficient reaches its maximum at 0.78 when 66 effects with the largest LOD scores are included in the model.

Figure

Epistatic interaction network for the 80 markers investigated in the soybean genome selection mapping project

**Epistatic interaction network for the 80 markers investigated in the soybean genome selection mapping project**. The total number of interaction effects is 66. The corresponding names and chromosome positions of the 80 markers can be found in Table 1. The degree of darkness of the lines (epistatic effects) represents the strength of the effects, i.e., darker lines represent stronger epistatic effects.

The conclusion from the epistatic model analysis was that selecting the top 66 effects for prediction, the accuracy (r-square) can reach 78%. If all effects are included, the accuracy decreased slightly to 0.73. Therefore, genome selection using the epistatic model can be very effective in soybean breeding for the embryogenesis trait.

Discussion and conclusion

We evaluated two models of genome selection for soybean embryogenesis, the main effect model and the epistatic effect model. The main effect model is already very efficient with a 0.33 r-square between the observed phenotypes and the predicted genomic values. With the epistatic effect model, the squared correlation was 0.78, more than twice the efficiency of the main effect model. This discovery provides surprisingly good news to soybean breeders. The squared correlation coefficient between the observed phenotypes and the predicted genomic values is different from the ^{2 }value in multiple regression analysis. The r-square here is the squared correlation between the observed phenotypes and the estimated genetic values in which the individual lines predicted contribute to the parameter estimation. The r-square of our epistatic effect model was very high (0.78). In the cross validation generated correlation, the lines predicted do not contribute to the parameter estimation, and thus the squared correlation is a good indication of the predictability. If a new line occurs from the same population but has no phenotypic value, using the marker effects estimated from the existing lines to predict the genetic value of the new line will be as accurate as 0.78. This can save tremendous resources for plant breeding because we can use marker information alone to select plants in several cycles without field evaluation of the phenotypes.

Cross validation analysis showed that the main effect model found 27 important markers, but none of the 27 markers appeared as main effects in the epistatic effect model. This means that epistasis plays a more important role in determining the variation of soybean embryogenesis. These 27 markers did not disappear completely; parts of their effects have been absorbed by the epistatic effects due to correlation of variables in the design matrix. The epistatic model provides better prediction of the genomic values, but it does not disqualify the main effect model. When epistatic effects are excluded from the model, parts of their effects have been absorbed by the main effects due to complicated correlation among the design matrices of the effects. If a breeder decides to use the additive effect model alone for genome selection, he/she still has an accuracy of 0.33.

We used the cross validation analysis to select markers and marker pairs for genome selection. We progressively increased the number of effects in the model to monitor the change in the accuracy of prediction. The optimal number of effects included is the one that maximizes the accuracy of prediction or minimizes the prediction error. The pattern of the change depends on the models used and the traits analyzed. For the main effect model, including more effects than necessary is detrimental to genome selection. For the epistatic effect model, the decrease of the prediction accuracy does not seem to be significant after the optimal number of effects is reached. The reason for the difference in the two models (additive and epistatic) is due to the fact that the large number of effects in the epistatic model caused strong shrinkage of the estimated regression coefficients. Once the prediction accuracy reached the peak value, additional effects included all have extremely small values (virtually equal to zero). Including these zero effects does not cause much damage to the prediction. These patterns of the correlation profiles cannot be generalized to other situations. Therefore, cross validation must be performed for different traits in different experiments.

Unfortunately, the marker density in this experiment was very low in the experiment, which prevented us from using the interval mapping and composite interval mapping approaches for fine mapping. The average interval for the 77 mapped markers was 40 cM. For such a low marker density, we were more concerned about not detecting any QTL. To our surprise, we found more QTL than we originally anticipated. Our cross validation analysis confirmed 27 main effects and 66 epistatic effects. Therefore, we conclude that the soybean embryogenesis is a polygenic trait. The current analysis provides a guideline for further study in fine mapping of QTL for embryogenesis of soybean. The next step is to develop more markers to saturate the entire genome. With a high density marker map, more efficient genome selection can be conducted.

The number of somatic embryos is a discrete trait, which can be modeled by the Poisson distribution. In this particular experiment, each original data point was the average of 10 plants. Therefore, treating the average value as a continuous trait is more reasonable than fitting the Poisson model. For curiosity, we did analyze the data using the Poisson distribution. We also performed a data transformation using the square root of the data

The main purpose of this study was not to develop new statistical methods for genome selection; rather, we used an existing method

In plant breeding, it is common to select superior cultivars from a random pool of existing accessions. The empirical Bayesian method used in this study works equally well for such randomly selected populations. It is also common to select superior RILs from an RIL crossing experiment derived from two inbred lines

This study demonstrates the importance of epistasis in determination of the embryogenesis of soybean. This finding is clearly in contrast to what Xu and Jia

Although there were hundreds of studies in the past decades claiming discovery of major QTL in plants, only a few of them have successfully cloned QTL

The entire data analysis was conducted using the QTL procedure in SAS, called PROC QTL

Methods

Experimental material

The mapping population consisted of 126 F_{5:6 }recombinant inbred lines (RILs) that were advanced by single seed descent from the cross of Peking (higher primary and secondary embryogenesis) and Keburi (lower primary and secondary embryogenesis) parents. This population was evaluated for primary embryogenesis capacity from immature embryo cultures by measuring the somatic embryo number per explant. A total of 80 simple sequence repeat markers were available and used for the QTL study. Among the 80 markers, 77 of them were mapped to 19 linkage groups

Empirical Bayesian analysis

Main effect model

Genotypes of the Peking cultivar (high embryogenesis) and the Keburi cultivar (low embryogenesis) are denoted by _{1}
_{1 }and _{2}
_{2}, respectively. The genotype of each marker locus was numerically coded and represented by a variable _{
jk
}for the _{
jk
}is shown below,

The linear model for the plot-mean-adjusted phenotypic value is

where _{
k
}is the effect of the _{
j
}~ _{2}) is the residual error. The marker effect _{k }

Epistatic effect model

Let

where _{kk' }

Prior distribution

Both the main effect model and the epistatic effect model were analyzed using the empirical Bayesian method _{k }
_{kk'}

where the variance

The hyper parameters (

LOD score calculation

We first calculate the Wald-test statistic. For the main effect analysis, the Wald test statistic was

For the epistatic effect between loci

The

where ^{2 }distribution with one degree of freedom evaluated at

which was presented in the final report and also used to select markers and marker pairs in the cross validation analyses.

Cross validation

We used the leave-one-out cross validation approach _{j }

where _{k }
_{jk }

is the squared correlation coefficient. Note that this statistic is not the Pearson correlation; rather, it represents the proportion of the phenotypic variance explained by the markers.

Authors' contributions

ZQ wrote the computer program and performed data analyses. YL, XS and YH provided the experimental material, carried out the molecular genetic studies and participated in the field investigation. XC improved the computer program and helped drafting the manuscript. SX participated in the design of this project, developed the statistical method and drafted the manuscript. WL conceived the project, designed the experiment, supervised students and postdocs involved in the project and revised the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

This project was supported by the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture under the Plant Genome, Genetics and Breeding Program 2007-35300-18285 to SX and by The Chinese National 973 Program (2009CB118400) to WL.