Weather Characterization for Optimizing Genomic Prediction in Miscanthus sacchariflorus

Shaik, A., Sacks, E., Leakey, A. D. B., Zhao, H., Kjeldsen, J. B., Jorgensen, U., Ghimire, B. K., Lipka, A. E., Njuguna, J. N., Yu, C. Y., Seong, E. S., Yoo, J. H., Nagano, H., Anzoua, K. G., Yamada, T., Chebukin, P., Jin, X., Clark, L. V., Petersen, K. K., Peng, J., Sabitov, A., Dzyubenko, E., Dzyubenko, N., Glowacka, K., Nascimento, M., Campana Nascimento, A. C., Dwiyanti, M. S., Bagment, L., Proma, S., Garcia-Abadillo, J., Jarquin, D.

Published 2026-03-20

Preprint on bioRxiv.

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: Breeding Super-Plants Without Breaking the Bank

Imagine you are a plant breeder trying to create the ultimate "super-grass" (specifically a type called Miscanthus) that can be used to make clean biofuel. You have hundreds of different genetic "recipes" (genotypes) for this grass. Your goal is to figure out which recipes will produce the most fuel in different parts of the world.

The Problem: Testing these grasses in the real world is expensive and slow. You can't plant them in every single field on Earth. You have to pick a few test sites (like Denmark, Japan, the US, Korea, and China), grow them there for a few years, and then try to guess how they will do in a new place you haven't tested yet.

The Solution: The authors of this paper used a high-tech tool called Genomic Selection. Think of this like a "crystal ball" that uses the plant's DNA to predict its future performance. But a crystal ball isn't perfect: it needs to be "trained" on real data. The big question they asked was: "Do we need to train our crystal ball using data from all our test sites, or can we get away with just a few?"

The Analogy: The Weather Forecasting School

To understand their findings, let's use an analogy of a Weather Forecasting School.

Imagine you want to train a student to predict the weather in a new city (let's call it "City X"). You have data from five other cities:

City A (Denmark): Cold and windy.
City B (Japan): Cool and rainy.
City C (USA): Hot summers, cold winters.
City D (Korea): Similar to City C.
City E (China): Very hot and humid (subtropical).

The Old Way: You make the student study the weather data from all five cities before trying to predict City X. This takes a lot of time and resources.

The New Way (This Paper's Discovery): The researchers realized that some cities are "twins" in terms of weather.

City C and City D are very similar.
City A and City B are somewhat similar.
City E is the odd one out (very hot).

They found that if you want to predict the weather for City C, you don't need to study all four of the other cities. You just need to study City D (its twin), or perhaps one well-chosen pair of cities. Studying cities whose weather is too different can actually add "noise" or confusion.

What They Actually Did

The researchers took 516 different clones of the grass and grew them in those five locations over three years. They collected two types of data:

DNA Data: The genetic code of every plant.
Weather Data: Temperature, wind, rain, and humidity for every day.

They ran a cross-validation experiment on the computer: they "hid" the data from one location (the test site) and tried to predict it using data from the other locations (the training sites). They tested three different models:

Model 1: Just looked at the plant's past performance (Phenotype).
Model 2: Looked at the DNA + Environment.
Model 3: Looked at the DNA + Environment + How the DNA reacts to specific environments (Interaction).

The "Aha!" Moments

Here is what they discovered, broken down simply:

1. Quality Over Quantity
You don't need a massive dataset from everywhere to get a good prediction. In fact, using too many different locations can sometimes make the prediction worse.

Analogy: If you are trying to learn how to cook Italian food, studying a French chef and a Japanese chef might confuse you. It's better to study one or two Italian chefs who cook in a similar style to what you want to achieve.
Result: Often, data from just one or two similar locations was enough to predict the results for a new location better than using data from all four other locations combined. This saves a huge amount of money and time.

2. The "Weather Twin" Rule
The most accurate predictions happened when the "Training City" and the "Test City" had similar weather patterns.

If you wanted to predict how the grass would grow in Urbana, USA, using data from Korea (which has similar weather) worked great.
If you tried to use data from China (which is much hotter and more humid) to predict the US results, the prediction was far less reliable.
Key Takeaway: Match your training data to the weather of the place you are trying to predict.

3. The "Odd One Out" Problem
One location (Zhuji, China) was very different from the rest (it was subtropical). Predicting how the grass would grow there was harder because none of the other locations were "twins" to it. However, even here, they found that a specific combination of two other locations worked better than using all of them.

Why This Matters for You

This study could change how plant breeders plan their field trials and, eventually, affect the energy we use.

Save Money: Breeders don't need to set up expensive test fields in 10 different countries. They can pick 2 or 3 "representative" locations that cover the weather patterns they care about.
Faster Results: By using less data to train their models, they can identify the best grass varieties faster.
Better Biofuels: Since Miscanthus is a top candidate for sustainable biofuel, getting the breeding process right means we can get clean energy crops to market sooner.

The Bottom Line

The paper suggests that smart selection beats brute force. You don't need to throw everything at the wall to see what sticks. By understanding the "personality" of the weather in different places, breeders can pick a few well-matched test sites to train their prediction models. In this study, that was often enough to predict how a plant would perform in a new location as well as, or better than, training on every available site, saving time, money, and resources while helping us grow better crops for a greener future.

1. Problem Statement

Plant breeding programs face significant challenges in optimizing resource allocation for Multi-Environment Trials (METs), particularly for perennial crops like Miscanthus sacchariflorus (Msa). Genomic Selection (GS) is widely used to accelerate breeding cycles by predicting Genomic Estimated Breeding Values (GEBVs), but its accuracy in unobserved environments depends heavily on the composition of the training set.

The Core Issue: Breeders often lack a systematic method to determine which specific locations (environments) should be included in a training set to best predict performance in a new, unobserved target location.
The Gap: While weather data influences crop performance, few studies have utilized environmental characterization to optimize training set composition specifically for perennial crops, where selection cycles are long and phenotyping costs are high. The study asks: Can identifying environmentally correlated locations based on weather patterns allow breeders to reduce the number of training sites (and thus costs) without sacrificing prediction accuracy?

2. Methodology

Data Sources

Genotypes: 516 unique clonal genotypes of M. sacchariflorus, drawn from a panel of 722 and characterized with 34,605 SNPs after filtering.
Phenotypes: 7,740 dry biomass yield records collected over three years (2016–2018).
Locations: Five distinct sites across three continents:
- L1 (AU): Foulum, Denmark (Temperate, cool).
- L2 (HU): Sapporo, Japan (Temperate, cool).
- L3 (UI): Urbana, Illinois, USA (Temperate).
- L4 (KNU): Chuncheon, South Korea (Temperate).
- L5 (ZJU): Zhuji, China (Subtropical, warm/humid).
Weather Data: Nine daily covariates (mean/max/min temperature, dew point, wind speed, relative humidity, precipitation, longwave/shortwave irradiance) obtained from NASA Earth Exchange (NEX) via the EnvRtype R package.

Statistical Models

Three linear mixed models were implemented using the BGLR package to predict biomass yield:

M1 (S+T+L): Baseline phenotypic model (Site + Time + Line). No genomic data.
M2 (S+T+G): Genomic Baseline (STGBLUP). Includes Genomic main effects (G) via a genomic relationship matrix ($G$).
M3 (S+T+G+G×S): Genomic + Genotype-by-Environment Interaction. Includes the interaction between genomic effects and sites ($G \times S$) using a Hadamard product of covariance structures.
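
To make the "Hadamard product of covariance structures" in M3 concrete, here is a minimal NumPy sketch of how such an interaction covariance is commonly built: $(Z_g G Z_g^\top) \circ (Z_s Z_s^\top)$, where $Z_g$ and $Z_s$ map records to lines and sites. This is an illustration under stated assumptions, not the authors' BGLR code: the marker matrix is random, markers are coded 0/1/2 as if diploid, the VanRaden-style formula for $G$ is one common choice, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (the study used 516 genotypes and 5 sites).
n_lines, n_snps, n_sites = 6, 50, 3

# Hypothetical marker matrix coded 0/1/2 (assumed diploid-style coding).
M = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)

# VanRaden-style genomic relationship matrix: G = ZZ' / (2 * sum p(1 - p)).
p = M.mean(axis=0) / 2.0
Z = M - 2.0 * p
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# One record per line-by-site combination (balanced toy layout).
line_idx = np.repeat(np.arange(n_lines), n_sites)
site_idx = np.tile(np.arange(n_sites), n_lines)

Z_g = np.eye(n_lines)[line_idx]  # records x lines incidence matrix
Z_s = np.eye(n_sites)[site_idx]  # records x sites incidence matrix

K_g = Z_g @ G @ Z_g.T            # genomic main-effect covariance between records
K_s = Z_s @ Z_s.T                # 1 where two records share a site, else 0
K_gs = K_g * K_s                 # Hadamard (element-wise) product for G x S
```

The resulting `K_gs` keeps the genomic covariance only between records from the same site and zeroes it out elsewhere, which is what lets the interaction term model site-specific genomic effects.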

Cross-Validation (CV) Schemes

Two realistic breeding scenarios were tested using a "Leave-One-Location-Out" approach:

CV00: Predicting untested genotypes in an unobserved environment. (Training set: 2/3 of genotypes; Testing set: 1/3 of genotypes, all unobserved in the target location).
CV0: Predicting tested genotypes in an unobserved environment. (Training set: All genotypes observed in other locations; Testing set: Genotypes observed elsewhere but missing in the target location).
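
The two schemes can be sketched as a data-splitting function. This is a hedged illustration using pandas, assuming a long-format table with hypothetical `genotype` and `site` columns; the function name, the random seed, and the way the 1/3 hold-out is drawn are assumptions, not details taken from the preprint.

```python
import numpy as np
import pandas as pd

def leave_one_location_out(df, target, train_sites, scheme, seed=0, test_frac=1/3):
    """Split records into (train, test) for a held-out target site.

    CV0:  test genotypes also have records in the training sites.
    CV00: about test_frac of genotypes are held out everywhere, so the
          test genotypes are unseen in training.
    """
    if target in train_sites:
        raise ValueError("target site must not be part of the training set")
    train = df[df["site"].isin(train_sites)]
    test = df[df["site"] == target]
    if scheme == "CV0":
        test = test[test["genotype"].isin(train["genotype"])]
    elif scheme == "CV00":
        rng = np.random.default_rng(seed)
        genotypes = sorted(set(df["genotype"]))
        n_test = int(round(len(genotypes) * test_frac))
        held_out = rng.choice(genotypes, size=n_test, replace=False)
        test = test[test["genotype"].isin(held_out)]
        train = train[~train["genotype"].isin(held_out)]
    else:
        raise ValueError("scheme must be 'CV0' or 'CV00'")
    return train, test

# Toy example: 9 genotypes observed at every site L1..L5.
toy = pd.DataFrame(
    [(f"g{i}", f"L{s}") for i in range(9) for s in range(1, 6)],
    columns=["genotype", "site"],
)
train00, test00 = leave_one_location_out(toy, "L3", ["L4"], "CV00")
train0, test0 = leave_one_location_out(toy, "L3", ["L4"], "CV0")
```

In the CV00 split no genotype appears in both `train00` and `test00`, while in the CV0 split every test genotype has training records from the other site.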

Training Set Optimization Strategy

Instead of using all available locations for training, the authors tested 15 unique combinations of training sets for each target location. These combinations varied in size from 1 to 4 locations (e.g., predicting L1 using only L2, or L2+L3, or L2+L3+L4, etc.).
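
The count of 15 follows from enumerating every non-empty subset of the four remaining locations (4 + 6 + 4 + 1 = 15). A short sketch, using the site labels from this summary and a hypothetical helper name:

```python
from itertools import combinations

SITES = ["L1", "L2", "L3", "L4", "L5"]

def training_sets(target, sites=SITES):
    """All non-empty combinations of the sites other than the target."""
    others = [s for s in sites if s != target]
    return [
        combo
        for k in range(1, len(others) + 1)
        for combo in combinations(others, k)
    ]

sets_for_L1 = training_sets("L1")
print(len(sets_for_L1))  # number of candidate training sets for target L1
```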

Weather Characterization: A Principal Component Analysis (PCA) was performed on the environmental kinship matrix ($\Omega$) derived from the weather covariates to identify clusters of similar locations.
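
A minimal sketch of this step, assuming a linear environmental kinship of the form $\Omega = WW^\top / q$, where $W$ is a standardized sites-by-covariates matrix with $q$ columns. That form is one common choice and may differ from the authors' exact settings. The weather values below are random placeholders, and the 27 columns are an arbitrary stand-in for covariate summaries.

```python
import numpy as np

rng = np.random.default_rng(1)
sites = ["L1", "L2", "L3", "L4", "L5"]

# Placeholder covariate matrix: one row per site, one column per covariate
# summary. Random numbers stand in for real weather data.
W = rng.normal(size=(len(sites), 27))

# Center and scale each covariate column across sites.
W_std = (W - W.mean(axis=0)) / W.std(axis=0, ddof=1)

# Linear environmental kinship matrix (assumed form).
Omega = W_std @ W_std.T / W_std.shape[1]

# PCA via eigendecomposition of the symmetric matrix Omega.
eigvals, eigvecs = np.linalg.eigh(Omega)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()  # share of variance per component
scores = eigvecs[:, :2] * np.sqrt(np.clip(eigvals[:2], 0.0, None))  # site coordinates on PC1, PC2
```

Plotting `scores` with one point per site is how clusters of similar locations would show up. With real weather data, `explained[:2].sum()` would correspond to the share of environmental variance captured by the first two components.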

3. Key Contributions

Perennial Crop Focus: This is one of the first studies to apply weather-based training set optimization to a perennial bioenergy crop (Miscanthus), where long selection cycles make resource optimization critical.
Training Set Parsimony: The study demonstrates that reducing the training set size (using only 1 or 2 correlated locations) often yields equal or superior predictive ability compared to using all available locations.
Methodological Approach: Rather than directly feeding weather covariates into the prediction model (which was limited by the small number of environments), the authors used weather data to characterize environmental similarity and strategically select training sets. This avoids overfitting in small datasets while leveraging environmental correlations.
Resource Allocation Framework: Provides a decision-making framework for breeders to select specific trial sites based on weather correlation to the target environment, potentially reducing phenotyping costs by up to 75%.
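
The "up to 75%" figure is consistent with simple arithmetic, under the assumption that phenotyping cost scales with the number of training locations: each target has four candidate training sites, so keeping $k$ of them gives

```latex
\text{reduction}(k) = 1 - \frac{k}{4}, \qquad
\text{reduction}(1) = 0.75, \qquad
\text{reduction}(2) = 0.50
```

A single well-matched training site therefore corresponds to the 75% upper bound, and a pair of sites to a 50% reduction.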

4. Key Results

Environmental Characterization (PCA)

The first two principal components explained ~86% of the environmental variance.
Clustering: Locations L1 (Denmark) and L2 (Japan) were cooler; L3 (USA) and L4 (Korea) were correlated and intermediate; L5 (China) was distinct (subtropical/warm) and uncorrelated with the others.
Implication: L5 is an outlier; predicting it requires specific training sets, while L1–L4 share more similar weather patterns.

Prediction Accuracy (CV00 - Untested Genotypes)

Overall Range: Predictive ability (Pearson's correlation) varied between 0.45 and 0.90.
Model Performance: M2 (Genomic main effects) and M3 (with G×E) performed similarly. M3 did not significantly outperform M2, likely due to the small number of environments preventing robust estimation of complex interaction terms.
Training Set Size:
- L3 (UI): Achieved the highest accuracy (~0.90–0.93). Using a single location (L4) or two locations (L1+L2) was sufficient; adding more locations did not improve results.
- L5 (ZJU): Best predicted using L2 and L4 together, even though no training site closely matched its subtropical weather. Using L1 (cold) resulted in poor accuracy.
- General Finding: In most cases, one or two correlated locations provided the best results, often outperforming models trained on all four remaining locations.
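
Predictive ability as reported here is the Pearson correlation between observed and predicted values for the held-out records. A small sketch, with a hypothetical function name and made-up numbers used only to exercise the code:

```python
import numpy as np

def predictive_ability(observed, predicted):
    """Pearson correlation between observed and predicted values.

    Pairs with a missing value are dropped; returns NaN if fewer than
    three complete pairs remain.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    keep = ~(np.isnan(observed) | np.isnan(predicted))
    if keep.sum() < 3:
        return float("nan")
    return float(np.corrcoef(observed[keep], predicted[keep])[0, 1])

# Made-up yields and predictions, not data from the study.
obs = [10.2, 12.5, 9.8, 14.1, np.nan, 11.0]
pred = [10.0, 12.0, 10.5, 13.5, 12.2, 11.4]
r = predictive_ability(obs, pred)
```

Because the correlation ignores scale and offset, a model can rank genotypes well (high predictive ability) even if its predicted yields are biased for the new site.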

Prediction Accuracy (CV0 - Tested Genotypes)

Overall Range: Similar to CV00, with high accuracy for L3 (~0.94) and lower accuracy for L1 and L4.
Model Performance: M1 (phenotypic only) performed surprisingly well when multiple locations were used, but M2/M3 were superior for single-location training sets.
Key Insight: For L3 (UI), using a single location (L4) yielded results comparable to using all other locations. For L4 (KNU), prediction accuracy declined significantly in Year 3 due to atypical weather, highlighting the risk of relying on single-year data.

Weather Correlation Impact

High predictive ability was consistently observed when the training and testing locations shared similar weather patterns (e.g., L2 predicting L1, or L4 predicting L3).
Low or negative correlations occurred when training sets included environmentally distinct sites (e.g., using L5 to predict L1, or L1 to predict L5).
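
One way to turn this observation into a site-selection step is to rank candidate training sites by their environmental correlation with the target. The sketch below assumes an environmental kinship matrix is already available. The matrix values are invented to mimic the qualitative pattern described in this summary and are not results from the study; the function name is hypothetical.

```python
import numpy as np

def rank_training_sites(Omega, sites, target):
    """Rank non-target sites by environmental correlation with the target."""
    Omega = np.asarray(Omega, dtype=float)
    d = np.sqrt(np.diag(Omega))
    corr = Omega / np.outer(d, d)  # rescale kinship to a correlation matrix
    t = sites.index(target)
    pairs = [(sites[j], float(corr[t, j])) for j in range(len(sites)) if j != t]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

sites = ["L1", "L2", "L3", "L4", "L5"]

# Invented illustrative values only (not taken from the preprint).
Omega_toy = np.array([
    [ 1.0,  0.6,  0.2,  0.1, -0.5],
    [ 0.6,  1.0,  0.3,  0.2, -0.4],
    [ 0.2,  0.3,  1.0,  0.8, -0.1],
    [ 0.1,  0.2,  0.8,  1.0,  0.0],
    [-0.5, -0.4, -0.1,  0.0,  1.0],
])

ranked_for_L3 = rank_training_sites(Omega_toy, sites, "L3")
ranked_for_L1 = rank_training_sites(Omega_toy, sites, "L1")
```

A breeder could then try the top one or two sites in the ranking as the training set and compare their predictive ability against the full four-site set.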

5. Significance and Conclusion

The study concludes that environmental characterization is a powerful tool for optimizing genomic prediction in perennial crops.

Cost Efficiency: Breeders do not need to phenotype in all available locations to achieve high prediction accuracy. By selecting 1–2 locations with weather patterns correlated to the target environment, they can reduce training set sizes by up to 75% while maintaining or improving predictive ability.
Strategic Trial Design: The results suggest that future breeding trials should be designed based on weather similarity rather than geographic proximity alone.
Model Selection: For datasets with a limited number of environments (like this study with 5 sites), a standard genomic model (M2) is often sufficient. Adding complex G×E interaction terms (M3) did not yield significant gains, suggesting that the benefit comes from selecting the right training environments rather than increasing model complexity.

This work provides a practical roadmap for breeding programs to maximize genetic gain while minimizing the high costs associated with multi-environment field trials for perennial bioenergy crops.