Imagine you are trying to predict how a complex, swirling cloud of gas will move inside a high-tech engine. Traditionally, scientists run "digital twin" simulations of these engines on supercomputers. These simulations are incredibly accurate, but they are also like trying to solve a million-piece jigsaw puzzle while running a marathon: they take forever, cost a fortune in electricity, and are usually too slow to be useful for real-time design.
This paper introduces a new, faster way to do this using a type of Artificial Intelligence called a Multimodal Vision Transformer. Here is the breakdown in simple terms:
1. The Problem: The "Slow Motion" Bottleneck
Think of traditional fluid simulations as a high-definition, slow-motion camera that records every tiny parcel of gas. It's perfect, but it takes hours to record just one second of action. Engineers need to see the movie in real time to design better engines, but the computer is too slow.
2. The Solution: The "Super-Intelligent Movie Editor"
The researchers built a new AI model that acts like a super-intelligent movie editor. Instead of calculating every tiny parcel of gas from scratch, this AI has "watched" thousands of different simulations (movies) of gas moving. It has learned the rules of how gas behaves, so it can predict the next frame of the movie almost instantly.
They call this a Multimodal Vision Transformer. Let's break down the fancy name:
- Vision Transformer: Think of this as a brain that looks at an image (the gas flow) and understands how different parts of the image relate to each other, even if they are far apart. It's like looking at a crowd and instantly knowing how the people on the left are reacting to the people on the right.
- Multimodal: This is the cool part. Usually, AI needs data in one specific format (like just a photo). This AI is like a multilingual translator. It can understand data from different "angles" or "senses."
- It can look at a slice of the gas (like cutting a loaf of bread to see the inside).
- It can look at a projection (like an X-ray or a shadow cast on a wall).
- It can switch between them. If you give it an X-ray, it can guess what the inside slice looks like, and vice versa.
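To make "slice" and "projection" concrete, here is a minimal sketch; the grid size and axis choices are illustrative assumptions, not values from the paper:

```python
import numpy as np

# a toy 3D "gas density" field on a 16x16x16 grid (illustrative only)
rng = np.random.default_rng(0)
field = rng.random((16, 16, 16))

# slice: cut the volume along one plane, like cutting a loaf of bread
mid_slice = field[:, :, 8]        # shape (16, 16)

# projection: integrate along one axis, like an X-ray or a shadow on a wall
projection = field.sum(axis=2)    # shape (16, 16)
```

Both views are 2D images of the same 3D field, which is what lets one model treat them as two "languages" describing the same flow.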
3. How It Was Trained: The "Gym for AI"
To teach this AI, the researchers didn't just show it one type of gas flow. They created a massive "gym" of training data using a specific scenario: shooting a jet of argon gas into a chamber full of nitrogen.
They varied the conditions to make the AI tough and smart:
- Different Grids: Some simulations were low-resolution (blurry), some were high-resolution (crisp).
- Different Physics: They used different mathematical rules for how the gas behaves (some simple, some very complex).
- Different Angles: They showed the AI the gas from the side, from the top, and as a shadow.
By training on this messy, diverse mix, the AI learned the universal laws of the gas, not just the specific details of one simulation. It learned to generalize, meaning it can handle new situations it hasn't seen before.
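The "messy, diverse mix" can be pictured as a grid of training configurations. A minimal sketch, where the labels are made-up placeholders rather than the paper's actual settings:

```python
from itertools import product

# placeholder labels for the three axes of variation described above
resolutions = ["low-res", "high-res"]          # blurry vs crisp grids
physics = ["simple model", "complex model"]    # different governing equations
views = ["side slice", "top slice", "shadow"]  # different viewing angles

# every combination becomes one family of training "movies"
configs = [
    {"grid": g, "physics": p, "view": v}
    for g, p, v in product(resolutions, physics, views)
]
```

Mixing all the combinations in one training set is what pushes the model to learn the shared behavior of the gas rather than the quirks of any single setup.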
4. What Can It Do? (The Two Superpowers)
The paper tests the AI on two main tasks:
Task A: Predicting the Future (Time Travel)
- The Analogy: You show the AI a photo of a gas cloud at 1:00 PM. It predicts what the cloud will look like at 1:01 PM, 1:02 PM, and so on.
- The Result: It's very good at predicting the big picture (where the cloud is moving). It's a bit "blurry" on the tiny, chaotic swirls inside the cloud, but it captures the main movement faithfully and does it thousands of times faster than a supercomputer.
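The "time travel" task works autoregressively: each predicted frame is fed back in as the input for the next step. A minimal sketch, with a toy neighbor-averaging function standing in for the trained transformer (the real model is a learned network, not this blur):

```python
import numpy as np

def toy_model(frame):
    # stand-in for the trained AI: average each cell with its neighbors
    return 0.25 * (np.roll(frame, 1, axis=0) + np.roll(frame, -1, axis=0)
                   + np.roll(frame, 1, axis=1) + np.roll(frame, -1, axis=1))

def rollout(model, frame, n_steps):
    movie = [frame]
    for _ in range(n_steps):
        frame = model(frame)      # each prediction becomes the next input
        movie.append(frame)
    return movie

movie = rollout(toy_model, np.random.rand(32, 32), n_steps=5)
```

This feedback loop is also why fine swirls come out slightly "blurry": each step inherits the small imperfections of the previous prediction.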
Task B: Filling in the Blanks (X-Ray Vision)
- The Analogy: Imagine you only have a shadow of a person on a wall. Can you guess what their face looks like?
- The Result: The AI can take a "shadow" (a projected view) of the gas and reconstruct the "face" (the actual 3D slice). It can also take a slice and turn it into a shadow. It's not perfect (it smooths out the tiny details), but it gets the general shape and structure right.
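The shadow analogy also explains why the reconstruction smooths out tiny details: many different volumes can cast the same shadow, so going from projection back to slice is inherently ambiguous. A tiny numeric illustration with toy arrays (not the paper's data):

```python
import numpy as np

# two different 3D fields...
field_a = np.zeros((4, 4, 4)); field_a[0, 0, 0] = 1.0
field_b = np.zeros((4, 4, 4)); field_b[0, 0, 3] = 1.0

# ...that cast exactly the same "shadow" along the projection axis
same_shadow = np.allclose(field_a.sum(axis=2), field_b.sum(axis=2))  # True

# yet their internal slices differ, so the model must learn which
# reconstruction is physically plausible rather than compute it exactly
slices_differ = not np.allclose(field_a[:, :, 0], field_b[:, :, 0])  # True
```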
5. Why Does This Matter?
Currently, designing engines or energy systems is slow because we have to wait for the "slow-motion" simulations to finish.
This new framework is like giving engineers a crystal ball. Instead of waiting days for a simulation, they can get a "good enough" prediction in seconds. This allows them to test hundreds of designs quickly, find the best one, and then use the slow, expensive supercomputer just for the final check.
In summary: The researchers built an AI that learned to "read" fluid dynamics like a movie. It can predict the future of gas flows and reconstruct one view of the flow from another, all by learning from a diverse library of simulations. It's a giant leap toward making energy systems faster, cheaper, and more efficient to design.