Improving Full Waveform Inversion in Large Model Era

Imagine you are trying to figure out what's inside a giant, opaque rock without breaking it open. You can only tap on the surface and listen to the echoes. This is essentially what Full Waveform Inversion (FWI) does for geologists: it tries to map the hidden underground world (like oil reservoirs or fault lines) by analyzing sound waves recorded on the Earth's surface.

For a long time, this has been like trying to solve a massive, broken jigsaw puzzle where many pieces are missing, and the picture keeps changing. Traditional methods are slow, expensive, and often get stuck guessing the "average" shape of the underground, missing the sharp, interesting details like salt domes or oil pockets.

Recently, scientists started using AI to solve this puzzle faster. But there was a catch: the AI models were like small, over-caffeinated students. They memorized the few practice tests they were given (simple, computer-generated data) but failed miserably when faced with a real, messy exam (complex real-world geology). They tended to "overfit," meaning they just drew a blurry, safe guess instead of the real picture.

This paper introduces a new approach called BigFWI. Think of it as upgrading from that small student to a genius super-learner with a billion "brain cells" (parameters). Here is how they made this giant brain work without it getting confused:

1. The "Library of Imagination" (Data Augmentation)

The biggest problem was that geologists didn't have enough real-world data to train a giant AI. It's like trying to teach a chef to cook every dish in the world while giving them only about 400,000 recipes.

The Fix: They used a "dream machine" (a Diffusion Model) to invent millions of new, fake geological maps.
The Analogy: Imagine a chef who practices on about 400,000 existing recipes, but then uses an AI to generate 5 million new variations of those recipes (adding weird spices, changing textures). The chef trains on all of them. Even though the new recipes are made up, they follow the laws of physics. When the chef finally faces a real customer, they are so well-practiced that they can cook the dish well, even if they've never seen that exact recipe before.

2. The "All-Seeing Eye" (Non-Causal Modeling)

Older AI models wrote out the underground map like a sentence, one piece at a time (left to right). Each new piece could only look back at what had already been written, never ahead, so early guesses were made without seeing the whole picture.

The Fix: The new model looks at the entire picture at once.
The Analogy: Instead of reading a mystery novel one page at a time and guessing the ending, this model is like a detective who can see the whole crime scene simultaneously. It connects the dots between the sound waves and the underground layers instantly, understanding the "big picture" context rather than just local details.

3. The "High-Definition Camera" (ViT-VQGAN Tokenizer)

To teach the AI, they have to turn the underground maps into a language the computer understands (tokens). Old methods squashed the image down, losing the fine details, like taking a 4K photo and shrinking it to a tiny, blurry thumbnail.

The Fix: They built a new "translator" that keeps the image huge and sharp.
The Analogy: Instead of describing a forest as "a bunch of green trees," this new translator describes every single leaf, branch, and shadow. It preserves the tiny, critical details (like the sharp edge of a salt dome) that older models would blur out.

4. The "Coach and the Referee" (Reinforcement Learning & Physics)

Even with a giant brain and good data, the AI sometimes makes small, weird mistakes that look okay but break the laws of physics.

The Fix: They added two final steps:
- The Coach (RL): After the AI makes a guess, a "coach" checks if the whole map looks geologically sensible. If the AI draws a weird, disconnected island of rock, the coach says, "No, that doesn't make sense," and nudges the AI to try again.
- The Referee (Latent Gradient Descent): Finally, they run a quick physics check. If the sound waves predicted by the AI's map don't match the actual recorded waves, they tweak the map slightly to ensure it obeys the laws of sound.
The Analogy: It's like a student taking a test. First, they write the answers (the AI). Then, a coach reviews the logic to make sure the story makes sense (RL). Finally, a referee checks the math to ensure no calculation errors (Physics).

The Result

When they tested this "BigFWI" system on realistic geological benchmarks (like the famous Marmousi or Salt models) that the AI had never seen before, it didn't just guess the average shape. It drew sharp, clear boundaries and found complex structures that other methods missed.

In short: By combining a massive AI brain, a library of millions of "dreamed-up" practice maps, and a strict adherence to the laws of physics, the researchers turned a blurry, unreliable guess into a high-definition, accurate map of the Earth's hidden depths. It suggests that if you train a giant model correctly, even on simple data, it can learn to handle the complex, messy real world.

1. Problem Statement

Full Waveform Inversion (FWI) is a critical technique in geophysics used to reconstruct subsurface velocity maps from surface-recorded seismic waveforms. It is governed by the acoustic wave equation but is inherently a highly nonlinear and ill-posed inverse problem.
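For reference, the constant-density acoustic wave equation and the least-squares misfit that FWI commonly minimizes can be written as below. The notation is chosen here for illustration and is not taken from the paper: $p$ is the pressure field, $v$ the velocity map, $s$ the source term, $F$ the forward-modeling operator that maps a velocity map to simulated surface recordings, and $d_{\text{obs}}$ the recorded data.

$$
\nabla^2 p(\mathbf{x}, t) - \frac{1}{v(\mathbf{x})^2}\,\frac{\partial^2 p(\mathbf{x}, t)}{\partial t^2} = s(\mathbf{x}, t),
\qquad
\min_{v}\; J(v) = \frac{1}{2}\,\bigl\| F(v) - d_{\text{obs}} \bigr\|_2^2
$$

Because $F$ depends nonlinearly on $v$, $J$ is non-convex, and many different velocity maps can explain band-limited surface data almost equally well.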

Current Limitations: Traditional iterative solvers (gradient descent, adjoint-state methods) are computationally expensive, sensitive to initialization, and prone to local minima.
Data-Driven Challenges: Existing deep learning approaches for FWI typically rely on small-scale models due to the limited volume and diversity of available geological datasets. These models suffer from overfitting and fail to generalize to realistic, complex geological structures (e.g., salt bodies, strong heterogeneity) that differ from their training data.
The Gap: While large models (like LLMs) have revolutionized NLP, applying them to scientific inverse problems like FWI is difficult due to data scarcity and the need for strict physical consistency.

2. Methodology

The authors propose a "working recipe" to tame a billion-parameter model for FWI by coordinating scaling across three axes: model capacity, data diversity, and training strategy. This summary calls the resulting framework BigFWI, although the Results section also uses the names BigFWI and BigFWI-M for comparison baselines. It consists of the following components:

A. Architecture: Non-Causal Transformer & ViT-VQGAN

Backbone: A 1-billion-parameter Transformer is used as the backbone. Unlike standard autoregressive models that generate tokens sequentially (causal), this model employs non-causal parallel decoding. All velocity tokens are generated simultaneously using full self-attention, allowing for global contextual modeling between seismic and velocity data, which significantly improves efficiency and accuracy.
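The difference between causal and non-causal decoding can be illustrated with attention masks. The following is a minimal sketch, not the paper's code: the helper names are hypothetical, and it assumes that parallel decoding needs a single forward pass while autoregressive decoding needs one pass per token.

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[(j <= i) or not causal for j in range(n_tokens)]
            for i in range(n_tokens)]


def decoding_passes(n_tokens, causal):
    """Forward passes needed to emit all tokens (assumption: one pass per
    token when causal, one pass in total when decoding in parallel)."""
    return n_tokens if causal else 1


causal_mask = attention_mask(4, causal=True)   # lower-triangular pattern
full_mask = attention_mask(4, causal=False)    # every token sees every token

for row in causal_mask:
    print(["x" if allowed else "." for allowed in row])
print(decoding_passes(16, causal=True), decoding_passes(16, causal=False))
```

With the causal mask, the first token attends only to itself. With the full mask, every token attends to every position, which is what the text calls global contextual modeling.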
Tokenization (ViT-VQGAN): To discretize velocity maps without losing fine geological details, the authors replace standard CNN-based VQGANs with a ViT-VQGAN tokenizer.
- It removes the compression bottleneck by interpolating input/output to a higher resolution (5x larger).
- It uses a larger latent grid ($25 \times 25 \times 196$) to preserve high-frequency geological structures.
- Rotary Positional Embeddings (RoPE) are used to maintain spatial coherence.
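The core of any VQGAN-style tokenizer is a nearest-neighbour lookup that replaces each continuous latent vector with the index of its closest codebook entry. The sketch below shows only that quantization step, on a toy 2x2 latent grid with a four-entry codebook. The grid size, codebook, and function names are illustrative assumptions, and the ViT encoder/decoder and RoPE are omitted.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def tokenize(latents, codebook):
    """Map a grid of latent vectors to a grid of integer tokens."""
    return [[nearest_code(v, codebook) for v in row] for row in latents]


def detokenize(tokens, codebook):
    """Map integer tokens back to their (quantized) latent vectors."""
    return [[codebook[t] for t in row] for row in tokens]


# Toy values, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[[0.1, -0.2], [0.9, 0.2]],
           [[0.2, 0.8], [1.1, 0.9]]]

tokens = tokenize(latents, codebook)
print(tokens)                      # [[0, 1], [2, 3]]
print(detokenize(tokens, codebook))
```

The gap between `latents` and the detokenized grid is the quantization error. A larger latent grid gives each token a smaller patch of the velocity map to represent, which is one way to read the claim about preserving fine detail.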

B. Data Strategy: Diffusion-Driven Augmentation

To address the scarcity of real-world geological data:

- A Latent Diffusion Model is trained on the existing OPENFWI dataset to synthesize diverse subsurface velocity maps.
- For every synthesized velocity map, an acoustic forward simulator generates the corresponding seismic data, which ensures physical consistency between the two modalities.
- Result: The training corpus is expanded from 408k to over 5 million velocity-seismic pairs, introducing hybrid geological structures that mix features from different sub-datasets.
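The pairing step can be sketched with a toy forward simulator. The code below is a 1D explicit finite-difference solver for the acoustic wave equation, far simpler than the 2D simulation used for OPENFWI-style data. The two hand-written velocity profiles stand in for diffusion samples, and all function names and parameter values are assumptions made for illustration.

```python
import math


def ricker(t, f0=15.0, t0=0.1):
    """Ricker wavelet with peak frequency f0 (Hz), delayed by t0 (s)."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def forward_1d(velocity, dx=10.0, dt=0.001, n_steps=600, src=1, rec=1):
    """Second-order finite differences for p_tt = v^2 p_xx + s in 1D.

    velocity: wave speed per grid cell (m/s); returns the trace recorded
    at cell `rec`. End cells are held at zero (reflecting boundaries).
    """
    n = len(velocity)
    assert max(velocity) * dt / dx <= 1.0, "CFL stability condition violated"
    prev, curr = [0.0] * n, [0.0] * n
    trace = []
    for step in range(n_steps):
        nxt = [0.0] * n
        for i in range(1, n - 1):
            c2 = (velocity[i] * dt / dx) ** 2
            nxt[i] = (2.0 * curr[i] - prev[i]
                      + c2 * (curr[i + 1] - 2.0 * curr[i] + curr[i - 1]))
        nxt[src] += dt * dt * ricker(step * dt)  # inject the source
        prev, curr = curr, nxt
        trace.append(curr[rec])
    return trace


# Stand-ins for generated velocity profiles (hypothetical values).
uniform = [2000.0] * 100
layered = [2000.0] * 30 + [3500.0] * 70

# Each training pair is (velocity model, simulated recording).
pairs = [(v, forward_1d(v)) for v in (uniform, layered)]
```

Because the seismic trace is always computed from the velocity model, every pair is physically consistent by construction, no matter how unusual the generated model is.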

C. Training Strategy: Two-Stage Pipeline

Supervised Pre-training: The model learns token-wise mappings from seismic-conditioned inputs to velocity representations using the expanded dataset.
Reinforcement Learning (RL) Post-training:
- The model is fine-tuned using a policy optimization approach (GRPO-style).
- Instead of token-level cross-entropy, the model optimizes for map-level rewards that encourage geological continuity and physical plausibility.
- This bridges the gap between purely supervised learning and structural priors.
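A GRPO-style update scores a group of sampled outputs for the same input and standardizes the rewards within that group, so each sample is judged relative to its siblings. The sketch below shows only that advantage computation. The continuity reward is a hypothetical stand-in (the summary does not give the paper's reward definition), and the policy-gradient update itself is omitted.

```python
import statistics


def continuity_reward(velocity_map):
    """Hypothetical map-level reward: negative mean absolute jump between
    horizontally and vertically adjacent cells (smoother maps score higher)."""
    jumps = []
    for r, row in enumerate(velocity_map):
        for c, value in enumerate(row):
            if c + 1 < len(row):
                jumps.append(abs(value - row[c + 1]))
            if r + 1 < len(velocity_map):
                jumps.append(abs(value - velocity_map[r + 1][c]))
    return -sum(jumps) / len(jumps)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three candidate maps sampled for the same seismic input (toy values).
group = [
    [[1.5, 1.5], [2.0, 2.0]],   # clean layered map
    [[1.5, 3.0], [2.0, 1.0]],   # noisy, disconnected map
    [[1.5, 1.6], [2.0, 2.1]],   # nearly layered map
]
rewards = [continuity_reward(m) for m in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Samples with a positive advantage would have their token probabilities pushed up and the rest pushed down. The reward is computed on the whole map, unlike a per-token cross-entropy loss.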

D. Post-Processing: Latent-Space Gradient Refinement

After token prediction, a physics-based Gradient Descent (GD) refinement is performed directly in the continuous latent space of the VQGAN decoder.
This optimizes the latent embeddings to minimize residuals against the forward-modeled seismic data, enforcing consistency with the wave equation.
Unlike traditional FWI which updates the velocity map directly (requiring heavy regularization), this latent-space refinement preserves high-frequency details while correcting physical inconsistencies.
Ensemble Aggregation: Stochastic sampling is used to generate multiple reconstructions, which are averaged to reduce predictive uncertainty.
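The refinement step amounts to minimizing a data misfit over the latent $z$ instead of over the velocity map, roughly $\min_z \tfrac{1}{2}\|F(D(z)) - d_{\text{obs}}\|^2$ with decoder $D$ and forward operator $F$ (notation assumed here). The sketch below is a toy instance only. The "decoder" maps two latent values to three layer velocities, the "forward operator" computes two-way travel times rather than solving the wave equation, and the gradient comes from finite differences rather than the adjoint-state method.

```python
import math


def decode(z):
    """Toy decoder: two latent values -> three positive velocities (km/s)."""
    return [2.0 * math.exp(z[0]),
            2.0 * math.exp(0.5 * (z[0] + z[1])),
            2.0 * math.exp(z[1])]


def forward(velocity, thickness=0.5):
    """Toy forward operator: two-way travel time (s) to each layer base."""
    times, total = [], 0.0
    for v in velocity:
        total += 2.0 * thickness / v
        times.append(total)
    return times


def misfit(z, observed):
    predicted = forward(decode(z))
    return 0.5 * sum((p - o) ** 2 for p, o in zip(predicted, observed))


def gradient(z, observed, h=1e-6):
    """Central finite-difference gradient of the misfit w.r.t. the latent."""
    grad = []
    for k in range(len(z)):
        zp, zm = list(z), list(z)
        zp[k] += h
        zm[k] -= h
        grad.append((misfit(zp, observed) - misfit(zm, observed)) / (2.0 * h))
    return grad


def refine(z, observed, n_iters=200, step=1.0):
    """Gradient descent with backtracking: only accept steps that reduce misfit."""
    z = list(z)
    for _ in range(n_iters):
        g = gradient(z, observed)
        current = misfit(z, observed)
        s = step
        while s > 1e-12:
            trial = [zi - s * gi for zi, gi in zip(z, g)]
            if misfit(trial, observed) < current:
                z = trial
                break
            s *= 0.5
        else:
            break  # no step size improved the misfit; stop early
    return z


observed = forward(decode([0.0, 0.4]))   # synthetic "recorded" data
z_init = [0.2, 0.1]                      # stand-in for the network's prediction
z_refined = refine(z_init, observed)
print(misfit(z_init, observed), misfit(z_refined, observed))
```

Updating `z` instead of the velocities keeps every candidate inside the decoder's output range. This is one way to read the claim that latent-space refinement needs less explicit regularization than updating the velocity map directly.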

3. Key Contributions

Large-Scale Scaling for FWI: Demonstrates that a billion-parameter model trained entirely on simple synthetic data can generalize remarkably well to complex, unseen geological benchmarks.
Novel Architecture: Introduces a non-causal, parallel-decoding transformer combined with a high-fidelity ViT-VQGAN tokenizer, overcoming the limitations of sequential decoding and information compression.
Data Synthesis Pipeline: Develops a diffusion-based augmentation strategy that creates 5M+ physically consistent training pairs, solving the data scarcity bottleneck for large models.
RL & Physics Alignment: Integrates Reinforcement Learning for structural fidelity and latent-space gradient refinement for physical consistency, creating a robust end-to-end pipeline.

4. Results

The method was evaluated on the OPENFWI benchmark and six challenging, unseen geophysical benchmarks (Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP, Sigsbee, and SEAM Phase I).

Performance on OPENFWI:
- Achieved a new state-of-the-art MAE of 0.0136 (compared to 0.0437 for BigFWI-M and 0.0770 for the causal baseline).
- The ablation study showed that each component (Data Scaling, Non-Causal modeling, ViT-VQGAN, RL, Latent GD) contributed incrementally to performance.
Zero-Shot Generalization:
- On realistic benchmarks containing complex salt bodies and faults (absent in training), the method achieved an SSIM of 0.7669, a significant improvement over the baseline BigFWI (0.5844).
- Qualitative Improvement: Unlike baseline methods that collapse into "over-smoothed, mean-shaped" solutions missing key interfaces, the proposed model recovers sharp boundaries, coherent stratigraphy, and geologically meaningful high-velocity regions (salt bodies).
Efficiency: The non-causal parallel decoding significantly improves inference speed compared to autoregressive baselines.
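The two metrics quoted above can be sketched as follows. MAE is the mean absolute difference over all cells. The SSIM here is a simplified single-window (global) form of the usual formula, whereas standard implementations average it over local sliding windows. The constants `k1` and `k2` follow the common defaults, and the normalization of velocities to a unit range is an assumption of this example.

```python
def _flatten(grid):
    return [x for row in grid for x in row]


def mae(pred, target):
    """Mean absolute error over all cells."""
    p, t = _flatten(pred), _flatten(target)
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


def global_ssim(pred, target, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed once over the whole map (no sliding window)."""
    x, y = _flatten(pred), _flatten(target)
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    return (((2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2))
            / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))


# Toy normalized maps: a sharp two-layer target and a "mean-shaped" guess.
target = [[0.0, 0.0], [1.0, 1.0]]
mean_shaped = [[0.5, 0.5], [0.5, 0.5]]

print(mae(mean_shaped, target))          # 0.5
print(global_ssim(mean_shaped, target))  # close to 0: structure is lost
print(global_ssim(target, target))       # 1.0 up to rounding
```

The flat guess matches the target's mean exactly but has no structure in common with it, so its SSIM is near zero. This is why SSIM is a useful complement to MAE when judging whether sharp interfaces are recovered.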

5. Significance

This work argues for a shift in how data-driven geophysical inversion is approached. It provides evidence that scaling (increasing model size, data diversity, and training complexity) pays off for scientific inverse problems, even when the training data is simple and synthetic.

Generalization: It narrows the long-standing generalization gap between synthetic training data and realistic geological structures.
Physical Consistency: By integrating RL and physics-guided latent refinement, the model produces results that are not just statistically accurate but geologically and physically plausible.
Future Impact: The framework provides a blueprint for applying large foundation models to other scientific domains governed by physical laws, suggesting a path toward automated, high-resolution subsurface imaging without the need for massive, expensive real-world labeled datasets.
