An Ocean Model Ported by a Large Language Model:… — Plain-Language Explanation

Original authors: Nikolay V. Koldunov, Suvarchal K. Cheedela, Sergey Danilov, Dmitry Sidorenko, Sebastian Beyer, Thomas Jung

Published 2026-06-11

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you have a massive, complex, and highly successful recipe for a five-star dish. The recipe is written in an old, specialized language (let's call it "Fortran") that only a few master chefs understand. It has been tested for decades, and everyone trusts it. However, the kitchen is changing: the new ovens (modern supercomputers with powerful GPUs) are much easier to use with a different language, "C++."

The problem? Translating this 74,000-line recipe from the old language to the new one is like trying to translate a novel while simultaneously rebuilding the house it's written in. If you make even one tiny mistake in the math, the dish could turn into poison, or the kitchen could catch fire. Usually, this takes a team of human experts years to do.

This paper describes a new experiment: Can an AI (a Large Language Model) do this translation job for us, and can it do it without ruining the recipe?

Here is how they did it, using simple analogies:

1. The Two-Step Translation Strategy

Instead of asking the AI to jump straight from "Old Language" to "New High-Speed Language," the team forced it to take a detour.

Step 1: The "Clean Copy" (Fortran → C): First, they asked the AI to translate the recipe into a simpler, middle-ground language called "C."
- The Rule: The AI was strictly forbidden from "improving" the recipe. It couldn't swap ingredients to make them "better" or change the cooking times to be more efficient. It had to be a literal, word-for-word copy.
- The Goal: To make sure the flavor (the physics) stayed the same. They ran this new "C" version for five years of simulated time. It tasted nearly identical to the original "Fortran" version, with differences so tiny they were like a grain of salt in an ocean.
Step 2: The "Speed Upgrade" (C → C++/Kokkos): Once the "C" version had been validated against the original, they asked the AI to translate it into the modern "C++" language, using a library called Kokkos that lets the same code run on super-fast GPU ovens.
- The Safety Net: Because the "C" version was already a trusted reference, the AI could now focus on speed. They checked every single step of the cooking process to ensure the new "C++" version produced exactly the same numbers as the "C" version on standard CPUs.
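Even a literal translation has to change how arrays are addressed, because Fortran counts from 1 and stores arrays column by column, while C counts from 0 and stores them row by row. The sketch below illustrates that bookkeeping in Python. The function names and the tiny 2×3 array are made up for illustration and are not taken from the FESOM2 port.

```python
def fortran_offset(i, j, n_rows):
    """Flat offset of element (i, j) in a column-major array, 1-based indices."""
    return (i - 1) + (j - 1) * n_rows


def c_offset(i0, j0, n_cols):
    """Flat offset of element (i0, j0) in a row-major array, 0-based indices."""
    return i0 * n_cols + j0


def column_major_to_row_major(flat_f, n_rows, n_cols):
    """Copy a flat column-major buffer into a flat row-major buffer."""
    flat_c = [0.0] * (n_rows * n_cols)
    for i in range(1, n_rows + 1):          # Fortran-style loop bounds: 1..n
        for j in range(1, n_cols + 1):
            src = fortran_offset(i, j, n_rows)
            dst = c_offset(i - 1, j - 1, n_cols)  # shift to 0-based for C
            flat_c[dst] = flat_f[src]
    return flat_c


# Hypothetical 2x3 array [[1, 2, 3], [4, 5, 6]] as Fortran lays it out in memory.
flat_f = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
flat_c = column_major_to_row_major(flat_f, n_rows=2, n_cols=3)
print(flat_c)
```

An off-by-one slip in either offset function does not crash anything; it quietly reads a neighbouring value, which is one reason every step is compared against a reference.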

2. The "Twin" Check System

How did they know the AI didn't sneak in a mistake? They used a system of "Twins."

Imagine you have a master chef (the original code) and a student chef (the new code). Every time the student chef chops an onion, they have to show the master chef the result immediately.

The "Twin" Test: For every single cooking step, the computer runs the new code and the old code side-by-side. If the numbers differ by even a tiny fraction, the system screams "Stop!" and tells the AI, "You messed up this specific step."
The "Stale Halo" Trap: One common mistake the AI made was forgetting to update the edges of the data (like forgetting to wash the cutting board between cuts). The team built a special "probe" that checks the edges specifically to catch these invisible errors.
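The "Twin" test can be pictured as a loop that compares the two runs one step at a time and stops at the first step where the numbers stop matching. The following is a minimal sketch of that idea, assuming each run has been recorded as a list of named steps. The step names and values are hypothetical, and the paper's actual harness is not reproduced here.

```python
import struct


def bits(x):
    """Raw 64-bit pattern of a float, so the comparison is truly bit-for-bit."""
    return struct.pack("<d", x)


def first_mismatch(reference, candidate):
    """Index of the first value whose bits differ, or None if all are identical."""
    if len(reference) != len(candidate):
        raise ValueError("arrays have different lengths")
    for idx, (r, c) in enumerate(zip(reference, candidate)):
        if bits(r) != bits(c):
            return idx
    return None


def twin_check(reference_steps, candidate_steps):
    """Compare two recorded runs step by step.

    Each run is a list of (step_name, values) pairs in execution order.
    Returns None if every step matches, else (step_name, index, ref, cand)
    for the first divergence, so the failure points at one specific step.
    """
    if len(reference_steps) != len(candidate_steps):
        raise ValueError("runs have different numbers of steps")
    for (name_r, vals_r), (name_c, vals_c) in zip(reference_steps, candidate_steps):
        if name_r != name_c:
            raise ValueError(f"step order differs: {name_r!r} vs {name_c!r}")
        idx = first_mismatch(vals_r, vals_c)
        if idx is not None:
            return (name_r, idx, vals_r[idx], vals_c[idx])
    return None


# Hypothetical recorded steps; the candidate computes 0.1 + 0.2 instead of 0.3.
reference = [("advection", [1.0, 2.0]), ("diffusion", [0.3, 4.0])]
candidate = [("advection", [1.0, 2.0]), ("diffusion", [0.1 + 0.2, 4.0])]
result = twin_check(reference, candidate)
print(result)
```

Stopping at the first divergent step matters because later steps inherit the error, so only the earliest mismatch identifies the code that actually went wrong.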

3. The Results: Speed and Accuracy

The experiment was a success. Here is what happened:

Accuracy: The new code is scientifically trustworthy. Over five years of simulation, the new version's ocean temperatures and salinity were almost indistinguishable from the original. On the new super-fast GPUs, the results were "statistically close"—meaning the tiny differences were just due to how the computer does math, not because the physics was wrong.
Speed: The new code runs on modern GPUs (like the NVIDIA A100). On large meshes, one A100 GPU node ran 1.6 to 3.7 times faster than one CPU node.
Portability: The best part? They wrote the code once, and it runs on CPUs and on GPUs from different vendors (NVIDIA and AMD) without the physics code needing to be rewritten. It's like a universal adapter that fits any outlet.
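The remark about "how the computer does math" refers to a general property of floating-point arithmetic: adding the same numbers in a different order can change the last digits of the result. GPUs add things up in a different order than a single CPU thread does, so exact equality is not expected there. A small Python illustration of the effect (the numbers are arbitrary and unrelated to the ocean model):

```python
values = [0.1, 0.2, 0.3]

# Same three numbers, grouped two different ways.
sequential = (values[0] + values[1]) + values[2]
regrouped = values[0] + (values[1] + values[2])

print(sequential)               # 0.6000000000000001
print(regrouped)                # 0.6
print(sequential == regrouped)  # False
```

Neither answer is wrong. Both are valid roundings of the same sum, which is why GPU results are judged by statistical closeness rather than by exact equality.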

4. What Went Wrong (and How They Fixed It)

The AI isn't perfect. It tried to "help" by simplifying things, which almost broke the physics.

The "Simplification" Trap: The AI wanted to round off numbers or change a constant value because it looked "cleaner." The team had to strictly forbid this. They told the AI: "If the original says 0.1, you write 0.1. Do not guess."
The "Comment" Trap: The AI sometimes read a comment in the code that said "The value is 5" but the actual code said "The value is 10." The AI trusted the comment. The team fixed this by forcing the AI to check the actual code line every time.

The Bottom Line

This paper shows that with the right rules and a strict "safety ladder" of checks, an AI can translate a massive, complex scientific model from an old language to a new, super-fast one in a matter of weeks.

It didn't just copy the code; it preserved the science. The ported ocean model still behaves like the original, trusted model, but now it runs on GPUs, fast enough to support climate simulations on the world's most powerful computers. The key wasn't just the AI; it was the discipline of the humans guiding it: strict rules, literal translation, and constant checking.

Technical Summary: An Ocean Model Ported by a Large Language Model

Problem Statement
Climate projections increasingly require kilometer-scale ocean resolution, which means established, large-scale Fortran ocean general-circulation models (GCMs) must be migrated to modern hardware, particularly GPUs. However, these models, often developed over decades for distributed-memory CPU clusters, face significant barriers to porting: few people combine domain knowledge with porting and performance-tuning expertise, and scientific fidelity is difficult to maintain during translation. While Large Language Models (LLMs) have demonstrated success in translating smaller code segments or individual functions, it had not been established whether an LLM could port a complete, production-grade geophysical model to a different language and framework (specifically for GPU acceleration) without degrading its physics or numerical accuracy.

Methodology
The authors ported FESOM2, an unstructured-mesh finite-volume ocean–sea-ice model (approximately 74,000 lines of core Fortran), using an agentic LLM coding assistant (Claude Code with the Opus 4.7 model) under the direction of domain experts. The porting process was structured around three critical practices to ensure reliability:

Two-Stage Translation: The translation was split into two distinct phases to separate numerical correctness from parallelism.
- Stage 1 (Fortran to C): The model was translated into a clean, single-threaded C reference. This stage collapsed the highly configurable Fortran code into the specific configuration used for the run, resolving ambiguities regarding active compile-time options and runtime defaults. The translation was strictly literal, prohibiting the LLM from "improving" or simplifying the code.
- Stage 2 (C to C++/Kokkos): The C reference was then wrapped in C++ using the Kokkos performance-portability layer to target both CPUs and GPUs. This stage focused on parallelization while preserving the arithmetic of the C reference.
Strict Literal Translation: The LLM was instructed to perform line-by-line translation, converting 1-based to 0-based indexing, adapting column-major to row-major storage, and converting global USE variables to struct passing. No semantic changes were permitted. This ensured that any divergence from the reference was a porting bug rather than a physics modification.
Tiered Validation Ladder: A rigorous validation framework was applied at each stage:
- Fortran to C: Validated via long-term statistical agreement (5-year integrations) rather than bit-for-bit equality, as language and compiler differences preclude exact byte-level matching.
- C to Kokkos (CPU): Validated via bit-for-bit identity against the C reference on deterministic back-ends (Serial/OpenMP).
- Kokkos (GPU): Validated via statistical closeness on GPUs (where floating-point reduction orders differ) and strict "gates" (e.g., 20-step runs with active sea ice) to detect real errors versus expected numerical divergence.
- Debugging Tools: Custom tools, such as per-substep reference dumps, identical-input operator diffs, and stale-halo probes, were developed to isolate failures to specific kernels or subsystems.
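The summary does not describe how the stale-halo probes are implemented. As a rough illustration of the idea only, the sketch below assumes a domain decomposition in which each rank owns some mesh vertices and keeps halo copies of vertices owned by neighbours. A halo copy is "stale" if it no longer matches the owner's current value. All names and data structures here are hypothetical, and the in-memory dictionaries stand in for what would be MPI ranks in a real run.

```python
def find_stale_halos(owner_of, fields):
    """List every halo copy that disagrees with its owner's current value.

    owner_of: dict vertex -> rank that owns (computes) that vertex.
    fields:   dict rank -> {vertex: value}, holding owned vertices plus halos.
    Returns a list of (rank, vertex, halo_value, owner_value) tuples.
    """
    stale = []
    for rank, local in sorted(fields.items()):
        for vertex, value in sorted(local.items()):
            owner = owner_of[vertex]
            if owner == rank:
                continue  # owned values are the source of truth
            owner_value = fields[owner][vertex]
            if value != owner_value:
                stale.append((rank, vertex, value, owner_value))
    return stale


def exchange_halos(owner_of, fields):
    """Refresh every halo copy from its owner (stand-in for a halo exchange)."""
    for rank, local in fields.items():
        for vertex in local:
            owner = owner_of[vertex]
            if owner != rank:
                local[vertex] = fields[owner][vertex]


# Two hypothetical ranks: rank 0 owns vertices 0-1, rank 1 owns vertices 2-3.
owner_of = {0: 0, 1: 0, 2: 1, 3: 1}
fields = {
    0: {0: 10.0, 1: 11.0, 2: 12.0},  # vertex 2 is a halo copy on rank 0
    1: {1: 11.0, 2: 12.0, 3: 13.0},  # vertex 1 is a halo copy on rank 1
}

# A kernel updates owned values, but the exchange afterwards is forgotten.
fields[0][1] = 21.0
fields[1][2] = 22.0
stale_before = find_stale_halos(owner_of, fields)

exchange_halos(owner_of, fields)
stale_after = find_stale_halos(owner_of, fields)
print(stale_before, stale_after)
```

A check of this kind is useful because a stale halo leaves the interior of each subdomain plausible, so the error only shows up at partition boundaries unless those entries are inspected directly.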

Key Results

Fidelity:
- The C port reproduced the original Fortran model over a five-year integration with a global sea-surface temperature (SST) root-mean-square difference of 0.006 °C and salinity difference of 0.002 PSU. Deep ocean differences were statistically indistinguishable from zero below 700 m.
- The Kokkos CPU builds were bit-for-bit identical to the C reference over a full simulated year.
- The Kokkos GPU builds remained statistically close to the C reference, with SST correlations of 1.0 and biases of $+10^{-4}$ °C. The GPU-induced divergence was approximately three orders of magnitude smaller than the uncertainty introduced in the Fortran-to-C translation.
Performance:
- On high-resolution meshes (up to 7.4 million surface vertices), a single NVIDIA A100 GPU node ran 1.6–3.7× faster than a CPU node.
- The model achieved the production target of 1–2 simulated years per day (SYPD) on multi-million vertex meshes across all tested hardware.
- On the NVIDIA GH200 system, throughput reached up to 3.5 SYPD.
Portability:
- A single Kokkos source codebase successfully compiled and ran on diverse hardware without rewriting physics code: NVIDIA A100, H100, and GH200 (via CUDA) and AMD MI250X (via HIP). Porting to the AMD system required less than one day of work, primarily involving a minor change to a preprocessor guard.
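The fidelity figures above are expressed as root-mean-square differences, biases, and correlations between two model outputs. A minimal sketch of how such metrics are computed is given below. It uses unweighted statistics over a handful of made-up values; the numbers are not from the paper, and a real global comparison would typically weight each grid point by the area it represents, which this sketch omits.

```python
import math


def rms_difference(a, b):
    """Root-mean-square of the pointwise differences a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def mean_bias(a, b):
    """Mean of the pointwise differences a - b."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


def pearson_correlation(a, b):
    """Pearson correlation coefficient between a and b."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Illustrative sea-surface temperatures in degrees C (invented values).
reference_sst = [2.0, 10.0, 18.0, 26.0]
ported_sst = [2.01, 9.99, 18.01, 25.99]

rms = rms_difference(ported_sst, reference_sst)
bias = mean_bias(ported_sst, reference_sst)
corr = pearson_correlation(ported_sst, reference_sst)
print(rms, bias, corr)
```

In this toy case the positive and negative differences cancel in the bias but not in the RMS difference, which is why the two numbers are reported separately.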

Significance and Claims
The paper claims to be the first demonstration that an LLM-assisted port can carry a full production ocean–sea-ice model to a GPU-capable implementation while retaining scientific fidelity and reaching production-relevant performance. The authors emphasize that the success was not due to the LLM's autonomous capability alone, but rather a disciplined workflow combining:

Agentic assistance for tireless translation and harness construction.
Human domain expertise for strategy, plan review, and catching subtle physics errors.
A tiered validation procedure that converts silent physics errors into immediate, localized failures.

The work establishes that LLMs can move established Fortran models into modern, performance-portable languages (C++/Kokkos) in a matter of weeks, provided the translation is constrained by strict rules and validated against appropriate acceptance criteria. The authors present this not as a final optimization of the model, but as a validated, competitive starting point that preserves the physics of the original model while enabling execution on modern accelerators.

An Ocean Model Ported by a Large Language Model: Experience and Lessons from FESOM2 (Fortran to C to C++/Kokkos)