Structured Exploration vs. Generative Flexibility: A Field Study Comparing Bandit and LLM Architectures for Personalised Health Behaviour Interventions

Imagine you are trying to get in shape, and you have a digital coach on your phone. Every day, this coach sends you a message to keep you motivated. But here's the big question: How should that coach be built?

Should it be a strict librarian who picks the perfect book for you based on a complex algorithm? Or should it be a creative writer who just talks to you naturally, making up stories on the fly?

This paper reports a 4-week experiment that pitted five different types of "digital coaches" against each other to see which one people actually found helpful.

The Five Coaches (The Contenders)

The researchers tested five different ways to generate these daily messages:

The Randomizer (RCT): A coin flip. It picks one of four techniques at random and sends a pre-written, generic message. (Like a vending machine that just gives you whatever is left.)
The Strict Librarian (cMAB_only): A smart algorithm that uses your questionnaire profile (for example, how confident you feel about exercising) to pick the pre-written message most likely to help. It's very logical but uses the same old scripts.
The Creative Writer (LLM_only): An AI that reads what you wrote that day and writes a brand new, unique message from scratch. It's flexible and conversational.
The Creative Writer with Memory (LLM_tracing): Same as above, but it also remembers your full history of earlier messages and ratings to keep the conversation flowing.
The Hybrid Coach (cMAB+LLM): The best of both worlds! The "Strict Librarian" picks the topic (e.g., "Today we focus on tracking steps"), and the "Creative Writer" fleshes it out with a unique message.

The Big Surprise: The "Smart" Librarian Didn't Win

The researchers expected the Hybrid Coach (the one combining the smart algorithm with the creative writer) to be the champion. They thought, "If we use math to pick the perfect topic and an AI to write the message, it will be perfect!"

They were wrong.

Here is what actually happened:

The Creative Writers (LLMs) crushed it. People rated the messages written by the AI as much more helpful than the pre-written scripts (about 3.8 versus 2.6 on a 5-point scale).
The "Smart Librarian" added nothing. Whether the AI picked the topic itself or the algorithm picked it for the AI, users reported about the same level of helpfulness.
The pre-written scripts were rated least helpful, with or without the "Strict Librarian" choosing them. Even though the algorithm was mathematically "optimizing" the choice, people found the robotic, pre-written messages unhelpful.

The Secret Sauce: "I Heard You"

Why did the Creative Writers win? It wasn't about picking the right psychological trick (like "gain-framing" vs. "loss-framing"). It was about acknowledgement.

Think of it like talking to a friend:

The Robotic Coach: You tell them, "I'm so sad because my dog passed away." The coach replies, "Great job walking! Remember to track your steps."
- Your reaction: "This person isn't listening. I feel ignored."
The Creative Coach: You tell them, "I'm so sad because my dog passed away." The coach replies, "I'm so sorry to hear about your dog. That's really hard. Maybe a short, gentle walk today could help clear your head, but don't push yourself."
- Your reaction: "They actually heard me. This feels helpful."

The study found that users didn't care if the AI was using a complex math formula to pick a topic. They only cared if the AI acknowledged what they just said. If the AI ignored their input, even the "perfect" psychological technique felt useless.

The "Journaling" Effect

Another fascinating finding was how people viewed the app. They didn't see it as a "chatbot" they were having a conversation with. They saw it as a digital diary.

Because it felt like a diary (a tool) and not a person, people felt safer sharing deep, sad, or embarrassing secrets.
They said things like, "I wouldn't tell my friends about my anxiety, but I told the computer."
However, because it felt like a diary, they didn't expect a two-way conversation. They just wanted their "entry" to be acknowledged with a thoughtful note back.

The "Discovery" Bonus

There was one small win for the "Strict Librarian" (the algorithm).

The Creative Writers tended to get stuck in a rut. They loved "Gain-Framing" (telling you the good things about exercise) and used it 60–70% of the time.
The Algorithms forced variety. They made sure to try "Loss-Framing" (telling you the bad things about not exercising) and "Social Comparison" (comparing you to others) just as often.

Users actually liked this variety! They told the researchers, "I didn't know I needed to try that specific type of motivation, but the algorithm forced me to try it, and it worked!" It was like a music playlist that forces you to listen to a genre you usually skip, only to discover you love it.

The "Reveal" Twist

At the end of the study, the researchers told the interviewed participants how each message had actually been produced: "Hey, that message you liked? Here is the method behind it." Participants were then asked to re-rank the messages.

The result? People's opinions changed instantly.

When they thought a message was "smart AI," they judged it harshly if it wasn't perfect.
When they thought a message was "simple," they were more forgiving.
Lesson: How you frame the technology changes how people feel about it, even if the message is exactly the same.

The Takeaway for Designers

If you are building an AI health app, here is the recipe for success:

Don't obsess over the math: You don't need a super-complex algorithm to pick the "perfect" psychological trick.
Focus on the conversation: Make sure the AI actually reads what the user wrote and responds to it. Acknowledgement is more important than optimization.
Be a tool, not a friend: Position the AI as a helpful journaling tool. This makes people feel safe enough to be honest without the pressure of a fake "human" relationship.
Force some variety: Let the AI try different approaches so users can discover what works for them, rather than just repeating the same "good vibes" message every day.

In short: A robot that listens and responds to your feelings is better than a super-smart robot that ignores you to give you a mathematically perfect lecture.

Technical Summary of "Structured Exploration vs. Generative Flexibility: A Field Study Comparing Bandit and LLM Architectures for Personalised Health Behaviour Interventions"

1. Problem Statement

Digital health interventions often suffer from intervention fatigue, where standardized, template-based messages fail to maintain user engagement over time. While Contextual Multi-Armed Bandits (cMABs) offer a statistically rigorous method for optimizing the selection of Behavior Change Techniques (BCTs) based on user context, they often rely on rigid templates that lack nuance. Conversely, Large Language Models (LLMs) provide flexible, context-sensitive message generation but operate as "black boxes," making their decision-making processes opaque and difficult to regulate.

The core research gap lies in understanding how these two architectural paradigms compare in real-world settings:

Does combining algorithmic optimization (cMAB) with generative flexibility (LLM) yield superior user perception?
How do users perceive the trade-off between structured exploration (ensuring diverse BCT usage) and generative autonomy?
What specific mechanisms (e.g., contextual acknowledgement) drive perceived helpfulness in reflective health systems?

2. Methodology

The authors conducted a 4-week within-subjects field study involving N=54 participants focused on physical activity motivation.

Experimental Design

Participants received daily motivational messages via a mobile app. The study employed a randomized design where each participant interacted with five distinct message-generation architectures over 28 days:

RCT (Random Control): Randomly selects one of four BCTs; delivers fixed, psychologist-written templates.
cMAB_only: Uses Contextual Thompson Sampling to select the BCT based on psychometric features (self-efficacy, regulatory focus, etc.); delivers fixed templates.
LLM_only: The LLM selects the BCT and generates a personalized message based on current inputs (no interaction history).
LLM_tracing: The LLM selects the BCT and generates a message using the full interaction history (previous messages and ratings) to maintain longitudinal consistency.
cMAB+LLM (Hybrid): cMAB selects the BCT; the LLM generates a personalized message constrained to that specific BCT.
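
As a rough illustration of the selection step in cMAB_only and cMAB+LLM, the sketch below implements a simplified, bucketed form of Thompson Sampling in Python. It is not the authors' implementation: the class name, the `bucket` helper, the single self-efficacy cutoff, and the rule that a rating of 4 or more counts as a success are all assumptions made for the example. The study's Contextual Thompson Sampling presumably works on the psychometric features directly rather than on coarse buckets.

```python
import random
from collections import defaultdict

# The four BCTs named in the study; the identifiers are illustrative.
BCTS = ["gain", "loss", "monitoring", "social_comparison"]


class BucketedThompsonSampler:
    """Beta-Bernoulli Thompson Sampling with one posterior per (context, BCT)."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        self.rng = random.Random(seed)
        # Beta(1, 1) prior stored as [alpha, beta] for every (context, arm) pair.
        self.posterior = defaultdict(lambda: [1.0, 1.0])

    def select(self, context):
        # Draw one plausible success rate per arm, then play the highest draw.
        draws = {}
        for arm in self.arms:
            alpha, beta = self.posterior[(context, arm)]
            draws[arm] = self.rng.betavariate(alpha, beta)
        return max(draws, key=draws.get)

    def update(self, context, arm, rating, threshold=4):
        # Assumption: a 1-5 helpfulness rating >= threshold counts as a success.
        params = self.posterior[(context, arm)]
        if rating >= threshold:
            params[0] += 1.0
        else:
            params[1] += 1.0


def bucket(self_efficacy, cutoff=3.0):
    """Hypothetical discretisation of one psychometric score into a context."""
    return "high_efficacy" if self_efficacy >= cutoff else "low_efficacy"


sampler = BucketedThompsonSampler(BCTS, seed=0)
ctx = bucket(self_efficacy=2.4)
chosen = sampler.select(ctx)           # BCT to deliver today
sampler.update(ctx, chosen, rating=5)  # feed back the user's daily rating
```

Because each choice comes from a random posterior draw rather than from the current best estimate, rarely tried techniques keep being sampled, which is the kind of built-in exploration the study attributes to the cMAB conditions.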

Data Collection

Quantitative: Daily 5-point Likert scale ratings for "helpfulness." Pre- and post-study psychometric assessments (BREQ-3, Big Five).
Qualitative: Post-study semi-structured interviews (N=9) involving a "reveal" phase where participants were informed of the underlying methodologies and asked to re-rank messages.
Analysis: Linear mixed-effects models for quantitative data; inductive thematic analysis for qualitative data.
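
The summary does not give the model specification. One plausible form, stated here purely as an assumption, is a random-intercept model in which the helpfulness rating $y_{ij}$ of participant $i$ on day $j$ depends on the architecture that produced that day's message:

$$
y_{ij} = \beta_0 + \sum_{k=2}^{5} \beta_k \, x^{(k)}_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $x^{(k)}_{ij}$ is an indicator that equals 1 when architecture $k$ generated the message (one architecture serves as the reference level), and the participant-level intercept $u_i$ accounts for the repeated ratings each person contributes in a within-subjects design. Treating the 5-point Likert rating as continuous is part of the assumption.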

3. Key Contributions

The paper makes three primary contributions to the HCI and Digital Health literature:

Empirical Insight: It demonstrates that optimizing BCT selection alone does not increase perceived helpfulness if the message generation lacks contextual responsiveness. Hybrid systems (cMAB+LLM) performed no better than LLM-only systems.
Conceptual Mechanism: It identifies "Contextual Acknowledgement" and "I/O Proportionality" as the primary drivers of perceived helpfulness. Users valued the system's ability to reflect their specific free-text inputs more than the statistical optimization of the intervention technique.
Architectural Trade-off: It articulates a "Structured Exploration vs. Generative Autonomy" trade-off. While LLMs naturally converged on a single effective technique (Gain-Framing), cMAB systems enforced systematic exploration across all techniques, facilitating user discovery of new strategies.

4. Key Results

Quantitative Findings

LLM Superiority: LLM-based approaches (LLM_only, LLM_tracing, cMAB+LLM) were rated significantly more helpful ($M \approx 3.8$) than template-based approaches (RCT, cMAB_only; $M \approx 2.6$).
No Hybrid Advantage: There was no significant difference in helpfulness ratings between the Hybrid (cMAB+LLM) and LLM-only conditions. The algorithmic optimization of BCT selection added no incremental value when the generation method was identical.
BCT Distribution:
- LLM-only systems heavily converged on Gain-Framing (60–70% of selections), likely due to safety alignment or training data bias.
- cMAB-based systems distributed selections evenly across all four techniques (Gain, Loss, Monitoring, Social Comparison) due to the inherent exploration of Thompson Sampling.
BCT Effectiveness: Gain-framing was rated most helpful overall; Loss-framing was rated least helpful.
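
One way to quantify the contrast between convergence and even exploration is to compute each technique's share of selections and the normalised entropy of that distribution (1.0 means perfectly even; values near 0 mean one technique dominates). The Python sketch below does this on made-up selection logs. The function names and the toy data are illustrative assumptions, not the study's data or analysis.

```python
import math
from collections import Counter


def selection_shares(selections):
    """Return the fraction of selections that went to each BCT."""
    counts = Counter(selections)
    total = len(selections)
    return {bct: n / total for bct, n in counts.items()}


def normalised_entropy(selections, n_arms):
    """Shannon entropy of the selection shares, scaled to the range [0, 1]."""
    shares = selection_shares(selections).values()
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(n_arms)


# Toy logs, invented for illustration only.
llm_log = ["gain"] * 7 + ["loss", "monitoring", "social_comparison"]
cmab_log = ["gain", "loss", "monitoring", "social_comparison"] * 3

print(selection_shares(llm_log)["gain"])
print(normalised_entropy(llm_log, n_arms=4))
print(normalised_entropy(cmab_log, n_arms=4))
```

On these toy logs the gain-heavy sequence scores lower than the evenly rotated one, mirroring the qualitative pattern described above.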

Qualitative Findings (Themes)

Interactive Diary Mode: Users viewed the system as a journaling tool rather than a conversational agent. This framing reduced social pressure.
Increased Disclosure: Participants shared sensitive personal information (e.g., grief, anxiety) with the AI that they would not share with humans, citing the lack of judgment.
I/O Proportionality: Users expected a proportional response to their input depth. Generic templates following detailed emotional reflections were perceived as "being ignored," violating an implicit contract.
Discovery vs. Preference: Users appreciated the algorithmic exploration of different BCTs, noting they would not have voluntarily tried techniques like "Social Comparison" or "Loss-Framing" but found them effective in specific contexts.
Expectation Disconfirmation: After the methodology was revealed, user preferences shifted. Participants retrospectively downgraded LLM-only messages (perceiving them as less sophisticated) and upgraded cMAB-only messages, despite having rated them lower during the study.

5. Significance and Implications

Design Implications

Context over Optimization: For reflective health systems, the priority should be contextual acknowledgement (responding to user input) rather than complex algorithmic selection of BCTs. If a system cannot acknowledge context, algorithmic optimization is moot.
Tool Positioning: Explicitly positioning the AI as a "tool" or "diary" rather than a "friend" may facilitate greater self-disclosure and reduce social anxiety. This contrasts with traditional relational-agent design, which often aims for anthropomorphism.
Managing Expectations: Transparency about AI methodology is ethically necessary but can trigger expectation disconfirmation. Designers must carefully frame system capabilities to avoid creating gaps between perceived sophistication and actual performance.

Theoretical Implications

Preference Discovery: The study challenges the "preference-matching" paradigm. Instead of assuming user preferences are fixed and static, health interventions should facilitate exploration, allowing users to discover effective techniques they were previously unaware of.
Hybrid Architecture Limits: Simply stacking a cMAB on top of an LLM does not guarantee better user experience. The constraints imposed by the bandit (forcing a specific BCT) may limit the LLM's generative flexibility without adding perceived value if the user's primary need is contextual responsiveness.
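
The difference between LLM_only and cMAB+LLM can be pictured as a difference in the prompt: in the hybrid, the technique is fixed before the LLM writes anything. The Python sketch below is hypothetical. The prompt wording, `BCT_INSTRUCTIONS`, and `build_prompt` are assumptions for illustration, and no real LLM API is called.

```python
# Illustrative one-line descriptions of the four BCTs (wording is assumed).
BCT_INSTRUCTIONS = {
    "gain": "Emphasise the benefits of being physically active.",
    "loss": "Emphasise the downsides of staying inactive.",
    "monitoring": "Encourage the user to track their activity.",
    "social_comparison": "Relate the user's activity to that of others.",
}


def build_prompt(user_entry, bct=None):
    """Build a prompt; bct=None mimics LLM_only, a given bct mimics cMAB+LLM."""
    lines = [
        "Write one short motivational message about physical activity.",
        "Acknowledge what the user wrote before giving any advice.",
    ]
    if bct is None:
        # LLM_only: the model is free to pick the technique itself.
        options = ", ".join(sorted(BCT_INSTRUCTIONS))
        lines.append("Choose one technique from: " + options + ".")
    else:
        # Hybrid: the bandit's choice constrains the generation.
        lines.append("Use only this technique: " + BCT_INSTRUCTIONS[bct])
    lines.append("User entry: " + user_entry)
    return "\n".join(lines)


entry = "I skipped my walk today because I felt low."
free_prompt = build_prompt(entry)
hybrid_prompt = build_prompt(entry, bct="loss")
```

Both branches carry the same instruction to acknowledge the user's entry, which is consistent with the finding that helpfulness did not differ once the generation method was identical.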

Future Directions

The authors suggest exploring Neural Contextual Bandits for larger BCT taxonomies, investigating longitudinal effects (beyond 4 weeks) on habituation, and studying how disclosure timing affects user trust and satisfaction in clinical settings.

Conclusion

The paper concludes that while LLMs significantly outperform templates in perceived helpfulness due to their ability to acknowledge context, adding algorithmic optimization (cMAB) for BCT selection does not further enhance user perception. However, the structural constraint of cMABs provides a unique value: systematic exploration that prevents the system from converging too quickly on a single strategy, thereby enabling users to discover diverse behavior change techniques. The optimal design lies in balancing structured exploration with generative autonomy, prioritizing contextual responsiveness over algorithmic complexity.