This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you walk into a gym and ask a super-smart, infinitely knowledgeable robot coach to create a workout plan for you. You ask it, "What should I do?" and it gives you a perfect plan. But then, you ask the exact same question again, word for word, and it gives you a slightly different plan. Then you ask a third time, and it's different again.
Would you trust that robot with your health?
This paper, titled "Consistency of AI-Generated Exercise Prescriptions," is basically a "reality check" for that robot coach. The author, Kihyuk Lee, wanted to see if an AI (specifically Google's Gemini 2.5 Flash) could give the same answer every time you asked it the same question, or if it was just guessing randomly like a dice roll.
Here is the breakdown of what they found, using some everyday analogies:
1. The Experiment: The "20 Times" Test
The author didn't just ask the AI once. They created 6 different patient profiles (ranging from a healthy 30-year-old wanting to get buff to a 70-year-old with knee pain and a history of falls).
For each profile, they asked the AI to create an exercise plan 20 times in a row, using the exact same words. That made 120 workout plans in total (6 profiles × 20 runs). They then compared the plans to see how much the AI "wobbled" in its answers.
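The sampling protocol above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the profile labels and the `generate_plan` stub are hypothetical stand-ins for the real prompts and the actual Gemini 2.5 Flash API call.

```python
# Sketch of the repeated-sampling protocol: 6 profiles x 20 identical prompts.
# `generate_plan` is a placeholder; the study sent each prompt to Gemini 2.5 Flash.

PROFILES = [f"profile_{i}" for i in range(1, 7)]  # 6 patient profiles in the study
N_REPEATS = 20  # the same prompt, sent 20 times per profile

def generate_plan(prompt: str) -> str:
    """Placeholder for the model call; a real version would hit the Gemini API."""
    return f"Exercise plan for: {prompt}"

def collect_plans(profiles, n_repeats=N_REPEATS):
    plans = {}
    for profile in profiles:
        prompt = f"Create an exercise prescription for: {profile}"
        # Identical wording every run -- any variation in output comes from the model.
        plans[profile] = [generate_plan(prompt) for _ in range(n_repeats)]
    return plans

plans = collect_plans(PROFILES)
total_plans = sum(len(runs) for runs in plans.values())  # 6 x 20 = 120 plans
```

With a real model behind `generate_plan`, the 20 outputs per profile would differ from run to run; comparing them is what the rest of the study does.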
2. The Three Things They Checked
They looked at the AI's answers through three different lenses:
- The "Vibe" Check (Semantic Consistency):
- The Metaphor: Imagine asking a friend to describe a movie. If you ask them 20 times, will they tell you the same story?
- The Result: Yes, mostly. The AI was very good at telling the same "story." The words and general tone were almost identical every time (90% similar). It didn't get confused about who the patient was.
- The "Recipe" Check (Structural Consistency):
- The Metaphor: Imagine a recipe for chocolate cake. If you ask a chef 20 times, they should always say "2 cups of flour, 3 eggs." If one time they say "2 cups" and the next time they say "a handful," the cake might fail.
- The Result: Here's the problem. While the AI knew what to do, it couldn't agree on the numbers.
- Frequency: It was pretty good at saying "do this 3 times a week."
- Intensity (The Big Issue): This was the messiest part. For resistance training (lifting weights), the AI couldn't decide on the weight. In 10% to 25% of the plans, it gave vague answers like "lift heavy" without saying how heavy, or it gave numbers that didn't make sense. It was like a chef saying, "Add a pinch of salt" one time and "Add a cup of salt" the next.
- The "Safety Net" Check (Safety Consistency):
- The Metaphor: Does the robot always remind you to wear a helmet?
- The Result: Yes, but with a twist. The AI always included safety warnings (100% of the time). However, the amount of warning varied wildly. For a healthy young person, it gave a short safety note. For a sick, older patient, it wrote a whole novel of warnings. This is actually good! It shows the AI knows that sicker people need more caution.
3. The Big Takeaway: "Strict Rules = Better Answers"
The study found something interesting about constraints.
- When the patient had a very specific, strict medical condition (like "I have knee pain and can't walk far"), the AI gave very consistent answers. It was like a student taking a strict exam with only one right answer.
- When the patient was healthy and just wanted to "get strong," the AI had more freedom to guess. It gave different answers every time because there were many "right" ways to get strong.
4. Why This Matters (The "So What?")
The author concludes that AI is great at writing the story of a workout, but it's still shaky at doing the math.
If you are a doctor or a trainer using AI to help patients:
- Don't trust the numbers blindly. The AI might say "run at 70% speed" today and "run at 60% speed" tomorrow for the same person.
- The AI is a Draftsman, not the Architect. It can generate a great-looking plan, but a human expert needs to double-check the specific numbers (intensity, weight, time) to make sure they are safe and consistent.
In a nutshell: The AI is a very polite, well-read assistant who remembers the rules of exercise perfectly. But if you ask it to do the math on how heavy a weight should be, it might give you a different answer every time you blink. Until we fix that, we need a human to hold the clipboard and double-check the work.