Tailored Behavior-Change Messaging for Physical Activity: Integrating Contextual Bandits and Large Language Models

Imagine you are trying to get a friend to start walking more every day. You know that what works for them on a sunny Monday morning might be totally different from what works on a rainy, stressful Friday afternoon.

This paper is about building a "Super-Coach" that knows exactly what to say to keep people moving, without getting tired or confused. The researchers combined two powerful tools to create this coach: a Smart Decision-Maker and a Creative Writer.

Here is the breakdown of their experiment in simple terms:

1. The Two Tools in the Toolbox

The Smart Decision-Maker (Contextual Bandit): Think of this as a traffic controller. It looks at the "traffic" (the user's confidence, how likely they are to join others, and whether they are focused on gains or on avoiding losses) and decides which type of message to send. It has four pre-set strategies:
1. Self-Monitoring: "Hey, let's track your steps."
2. Gain-Framing: "Walking will make you feel energetic and happy!" (Focus on the good stuff).
3. Loss-Framing: "If you don't walk, you might feel sluggish and gain weight." (Focus on avoiding the bad stuff).
4. Social Comparison: "Everyone else in your group is walking; join them!"
- The Problem: This controller is great at picking the right strategy, but it's stuck using the same old, pre-written scripts. It can't change the tone or wording to fit the person's specific mood that day.
The Creative Writer (Large Language Model / LLM): Think of this as a talented speechwriter. It can write a unique, personalized message for any situation. It can sound empathetic, funny, or serious depending on what the user just wrote.
- The Problem: If you let the speechwriter choose the strategy and write the message, it can get confused, make inconsistent choices, or become too expensive to run (like hiring a writer for every single sentence).

2. The "Hybrid" Solution: The Best of Both Worlds

The researchers created a Hybrid Coach (cMABxLLM) that splits the job:

The Traffic Controller looks at the data and says, "Today, this person needs a Gain-Framed message."
The Speechwriter takes that instruction and writes a beautiful, unique message: "Since you're feeling a bit tired from work, remember that a quick 15-minute walk could be the perfect reset button to boost your energy for the evening!"

This way, the decision is logical and transparent (we know why they chose that strategy), but the words are fresh and personal.

3. The Experiment: A 30-Day Challenge

The team ran the study for 30 days; 93 people enrolled and 54 actively completed it. Each day, a participant was assigned to one of five conditions (labeled Groups A to E here) to see which "Coach" worked best:

Group A (Random): Got one of the four strategies picked at random, delivered as a pre-written template.
Group B (Traffic Controller Only): Got a strategy picked by the Traffic Controller, but with boring, pre-written templates.
Group C (Speechwriter Only): The AI picked the strategy and wrote the message.
Group D (Speechwriter + History): The AI picked the strategy and wrote the message, while also seeing its last 10 interactions with the user so it could stay consistent over time.
Group E (The Hybrid): The Traffic Controller picked the strategy, and the Speechwriter wrote the message.

4. What Did They Find?

Personalization Wins: People loved the messages written by the Speechwriter (Groups C, D, and E) much more than the boring, pre-written templates. They felt the messages were "for them."
The Hybrid is the Sweet Spot: The Hybrid Coach (Group E) was just as popular as the pure Speechwriter groups, but it had two huge advantages:
1. It was cheaper: It didn't need to ask the AI to "think" about which strategy to pick, saving money and computing power.
2. It was clearer: Because the Traffic Controller made the choice, the researchers knew exactly why a specific message was sent. This makes it easier to trust and improve the system later.
The "Good News" vs. "Bad News" Rule: Interestingly, people generally liked messages that focused on the benefits of walking (Gain-Framing) more than messages that focused on the costs of not walking (Loss-Framing). Even the AI couldn't make the "scary" messages feel as good as the "encouraging" ones.

5. The Takeaway

The study shows that you don't have to choose between a logical robot and a creative human-like writer. By letting a logical algorithm decide what to say and a creative AI decide how to say it, you get a system that is:

Personalized: It feels like a real friend talking to you.
Efficient: It doesn't waste money or energy.
Trustworthy: We can understand the logic behind the decisions.

In short: They built a coach that knows when to be tough, when to be gentle, and how to say it in a way that makes you actually want to go for a walk.

1. Problem Statement

The paper addresses the challenge of designing effective Just-In-Time Adaptive Interventions (JITAIs) for physical activity (PA). While digital health interventions are promising, they face two primary limitations:

Contextual Multi-Armed Bandits (cMABs): These algorithms excel at selecting the type of intervention based on user context (e.g., mood, self-efficacy) to maximize engagement. However, they typically rely on a finite set of pre-defined, fixed message templates. This rigidity limits their ability to adapt the tone, wording, and framing of a message to the user's specific daily narrative, potentially reducing perceived relevance.
Large Language Models (LLMs): LLMs offer superior linguistic personalization and can generate nuanced, context-aware messages. However, an LLM used alone for decision-making is often a "black box" that lacks the interpretability, reproducibility, and theoretical grounding of bandit algorithms. Furthermore, LLMs can be computationally expensive (high token usage) and may lack consistent decision logic.

Core Research Gap: There is a lack of frameworks that combine the transparent, adaptive decision-making of cMABs with the flexible, generative personalization of LLMs in a scalable, real-world deployment.

2. Methodology

The authors propose and evaluate a hybrid cMABxLLM approach in a 30-day physical activity intervention study involving 93 participants (54 active completers).

A. Experimental Design

The study utilized a micro-randomized trial design where participants were assigned daily to one of five experimental models:

RCT (Randomized Controlled Trial): Uniform random assignment to one of four intervention types; fixed templates.
cMAB-only: Uses Contextual Thompson Sampling to select the intervention type based on context ($X_t$); delivers fixed templates.
LLM-only: The LLM selects the intervention type and generates the message content based on context ($X_t$) and user free-text ($L_t$).
LLM-tracing: Similar to LLM-only but includes a history trace ($H_t$) of the last 10 interactions to ensure longitudinal consistency.
cMABxLLM (Hybrid): The cMAB selects the intervention type (using Thompson Sampling on context $X_t$), and the LLM generates the message content conditioned on that specific type, context, and user text.
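
The division of labor across the five models can be written down as a small lookup table. The following is a minimal sketch, not the authors' implementation: the dictionary layout, the function names, and the uniform daily assignment probabilities are assumptions made for illustration.

```python
import random

# Which component picks the intervention type and which one writes the text.
# Only LLM-tracing is described as receiving the history trace H_t.
MODELS = {
    "RCT":         {"selector": "uniform_random",    "writer": "fixed_template", "uses_history": False},
    "cMAB-only":   {"selector": "thompson_sampling", "writer": "fixed_template", "uses_history": False},
    "LLM-only":    {"selector": "llm",               "writer": "llm",            "uses_history": False},
    "LLM-tracing": {"selector": "llm",               "writer": "llm",            "uses_history": True},
    "cMABxLLM":    {"selector": "thompson_sampling", "writer": "llm",            "uses_history": False},
}


def assign_daily_model(rng):
    """Micro-randomization step: draw one model for a participant-day.

    Uniform probabilities are an assumption; the summary does not state them.
    """
    return rng.choice(list(MODELS))


def llm_tasks(model_name):
    """List what the LLM is asked to do under a given model."""
    spec = MODELS[model_name]
    tasks = []
    if spec["selector"] == "llm":
        tasks.append("choose intervention type")
    if spec["writer"] == "llm":
        tasks.append("generate message")
    return tasks


rng = random.Random(0)
schedule = [assign_daily_model(rng) for _ in range(30)]  # one participant, 30 days
```

Under this layout the hybrid model differs from LLM-only in a single field: the selector. That is the decoupling the study evaluates.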

B. Intervention Types

Four behavioral change strategies were used as the "arms" of the bandit:

Behavioral Self-Monitoring: Prompts reflection on activity tracking.
Gain-Framing: Emphasizes benefits of exercise.
Loss-Framing: Emphasizes costs of inactivity.
Social Comparison: Leverages normative behavior of peers.

C. Contextual Inputs

The models utilized specific contextual variables ($X_t$) collected via daily Ecological Momentary Assessments (EMAs):

Self-Efficacy: Confidence in meeting PA goals.
Social Influence: Likelihood of joining others.
Regulatory Focus: Prevention (loss-avoidance) vs. Promotion (growth) orientation.
Free-text Narrative ($L_t$): User's daily reflection on events/mood, passed to the LLM-based models.
(Note: Mood and stress levels were collected but excluded from real-time assignment to serve as post-hoc confounders.)
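
To make the inputs concrete, here is a hedged sketch of how one day's EMA responses might be stored and turned into a numeric vector for the bandit. The field names, the 1–5 response scales, and the encoding are assumptions; the summary does not describe the paper's actual encoding.

```python
from dataclasses import dataclass


@dataclass
class DailyContext:
    """One day's EMA responses (hypothetical field names and 1-5 scales)."""
    self_efficacy: int     # confidence in meeting the PA goal
    social_influence: int  # likelihood of joining others
    regulatory_focus: str  # "prevention" or "promotion"
    free_text: str         # daily narrative; kept as text for the LLM prompt

    def to_features(self):
        """Numeric vector for the bandit: bias term plus scaled inputs."""
        def scale(v):
            return (v - 1) / 4.0  # map 1..5 onto 0..1

        promotion = 1.0 if self.regulatory_focus == "promotion" else 0.0
        return [1.0, scale(self.self_efficacy), scale(self.social_influence), promotion]


ctx = DailyContext(4, 2, "promotion", "Long day of lectures, feeling drained.")
x_t = ctx.to_features()
```

In this sketch the free-text narrative is not encoded numerically; it stays as text and would only be passed to the LLM.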

D. Reward Signal

The primary reward ($R_t$) for the bandit algorithm was the Message Acceptance Rating (1–5 Likert scale), reported by users immediately after reading the message.
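
A minimal sketch of how a contextual Thompson Sampling bandit could consume this reward. It assumes a linear-Gaussian reward model per arm with unit noise variance and a standard-normal prior, which the summary does not confirm as the paper's model. The class and function names, the example feature vector, and the rescaling of the 1–5 rating to [0, 1] are all illustrative.

```python
import math
import random

ARMS = ["self_monitoring", "gain_framing", "loss_framing", "social_comparison"]


def cholesky(m):
    """Lower-triangular L with L L^T == m, for symmetric positive-definite m."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


class LinearThompsonArm:
    """Bayesian linear regression for one arm: reward ~ theta . x + noise."""

    def __init__(self, dim):
        # Posterior covariance starts at the identity (standard-normal prior).
        self.cov = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def mean(self):
        """Posterior mean of theta, equal to cov @ b."""
        return [sum(c * bj for c, bj in zip(row, self.b)) for row in self.cov]

    def sample(self, rng):
        """Draw theta from the Gaussian posterior N(mean, cov)."""
        low = cholesky(self.cov)
        z = [rng.gauss(0.0, 1.0) for _ in self.b]
        mu = self.mean()
        return [mu[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(mu))]

    def update(self, x, reward):
        """Rank-one posterior update via the Sherman-Morrison identity."""
        cx = [sum(c * xj for c, xj in zip(row, x)) for row in self.cov]
        denom = 1.0 + sum(xi * ci for xi, ci in zip(x, cx))
        n = len(x)
        self.cov = [[self.cov[i][j] - cx[i] * cx[j] / denom for j in range(n)] for i in range(n)]
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def likert_to_reward(rating):
    """Rescale a 1-5 acceptance rating to [0, 1] (an assumed convention)."""
    return (rating - 1) / 4.0


def select_arm(arms, x, rng):
    """Thompson step: sample theta per arm, pick the highest sampled reward."""
    scores = {}
    for name, arm in arms.items():
        theta = arm.sample(rng)
        scores[name] = sum(t * xi for t, xi in zip(theta, x))
    return max(scores, key=scores.get)


rng = random.Random(42)
arms = {name: LinearThompsonArm(dim=4) for name in ARMS}
x_t = [1.0, 0.75, 0.25, 1.0]           # hypothetical encoded context
chosen = select_arm(arms, x_t, rng)     # intervention type for today
arms[chosen].update(x_t, likert_to_reward(4))  # user rated the message 4 of 5
```

Because each arm keeps an explicit posterior mean and covariance, the reason a given intervention type was chosen on a given day can be inspected after the fact. This is the interpretability argument made for the bandit.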

3. Key Contributions

Hybrid Architecture (cMABxLLM): The paper introduces a novel architecture that decouples decision-making from generation. The cMAB handles the strategic selection of which intervention type to deliver (ensuring interpretability and adaptive learning), while the LLM handles how to express it (ensuring linguistic personalization).
Scalable Personalization: The hybrid approach achieves acceptance ratings comparable to the pure LLM models while reducing token usage (by removing the need for the LLM to choose the strategy) and providing explicit decision rules for intervention selection.
Empirical Validation in Real-World Setting: Unlike many prior studies relying on simulations or short-term probes, this study deployed the system over 30 days with real users, collecting longitudinal data on motivation and message utility.
Statistical Framework: The authors provide a rigorous statistical analysis (Linear Mixed-Effects Models) accounting for repeated measures, time trends, and potential confounders to evaluate the causal pathways from model type to message acceptance and long-term motivation.
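
For readers unfamiliar with linear mixed-effects models, one plausible form of the acceptance model is a random-intercept specification. The exact terms are an assumption here, since the summary does not give the fitted formula:

$$
y_{it} = \beta_0 + \beta_{m_{it}} + \beta_{\text{day}}\, t + \boldsymbol{\gamma}^{\top} \mathbf{z}_{it} + u_i + \varepsilon_{it}, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{it} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
$$

Here $y_{it}$ is the acceptance rating of participant $i$ on day $t$, $m_{it}$ is the model assigned that day (with one level as the reference), $\mathbf{z}_{it}$ collects confounders such as mood and stress, and $u_i$ is a per-participant random intercept that absorbs the correlation among repeated ratings from the same person.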

4. Results

The study analyzed 941 rated messages across 54 participants.

A. Message Acceptance (RQ1)

LLM Superiority: All LLM-based models (LLM-only, LLM-tracing, cMABxLLM) significantly outperformed non-personalized baselines (RCT and cMAB-only) in message acceptance.
- Means: LLM models (~3.79–3.89) vs. non-LLM models (~2.62–2.76).
- Statistical Significance: The difference between LLM and non-LLM baselines was highly significant ($p < 0.001$).
Hybrid Performance: The cMABxLLM model achieved acceptance ratings statistically indistinguishable from the pure LLM models, suggesting that separating the selection and generation steps does not degrade user experience.
Intervention Type Effects:
- Gain-framed messages received the highest average acceptance (3.74).
- Loss-framed messages received the lowest (2.93).
- This suggests that even with LLM personalization, the choice of intervention type remains a critical factor.

B. Long-Term Motivation (RQ2)

Limited Evidence: The study found limited evidence of significant changes in long-term motivation (measured via BREQ-3 pre/post surveys).
Reasons: The sample size for matched pre/post data was small ($n=28$), and the study period coincided with a high-stress exam period, introducing noise. The results suggest that 30 days may be insufficient to shift stable motivational traits, or that daily step counts/motivation are better short-term metrics.

C. Efficiency

The cMABxLLM approach was more efficient than LLM-only. By pre-selecting the intervention type, the system prompt did not need to list all four options for the LLM to choose from, reducing token consumption and generation latency.
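
The prompt-size argument can be illustrated with two hypothetical prompt builders. The prompt wording, the function names, and the use of whitespace-separated word counts as a stand-in for tokens are all assumptions; the paper's actual prompts are not reproduced in this summary.

```python
INTERVENTIONS = {
    "self_monitoring": "prompt reflection on activity tracking",
    "gain_framing": "emphasize the benefits of exercise",
    "loss_framing": "emphasize the costs of inactivity",
    "social_comparison": "refer to the normative behavior of peers",
}


def context_block(context, free_text):
    """Shared part of both prompts: EMA context plus the user's note."""
    lines = [f"{key}: {value}" for key, value in context.items()]
    return "User context:\n" + "\n".join(lines) + f"\nUser note: {free_text}\n"


def llm_only_prompt(context, free_text):
    """LLM-only: the model must choose a strategy and write the message."""
    options = "\n".join(f"- {name}: {desc}" for name, desc in INTERVENTIONS.items())
    return (
        context_block(context, free_text)
        + "Choose ONE of these strategies, then write the message:\n"
        + options
        + "\n"
    )


def hybrid_prompt(context, free_text, chosen):
    """Hybrid: the bandit already chose the strategy; the LLM only writes."""
    return (
        context_block(context, free_text)
        + f"Write a message that uses this strategy: {chosen} "
        + f"({INTERVENTIONS[chosen]}).\n"
    )


ctx = {"self_efficacy": 4, "social_influence": 2, "regulatory_focus": "promotion"}
note = "Tired after work today."
full = llm_only_prompt(ctx, note)
short = hybrid_prompt(ctx, note, "gain_framing")

# Rough word counts only; real token counts depend on the tokenizer.
print(len(full.split()), len(short.split()))
```

The saving in this sketch comes only from dropping the option list on the input side. Any reduction in output length or latency would depend on the LLM and is not modeled here.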

5. Significance and Implications

Bridging the Gap: This work demonstrates a practical pathway to combine Bayesian adaptive experimentation (for interpretability and efficient learning) with Generative AI (for human-centric communication).
Deployability: The hybrid model offers a "best of both worlds" solution for digital health: it retains the reproducibility and auditability required for clinical or public health deployment (via the bandit's decision rule) while delivering the empathy and relevance of generative text.
Future Directions: The authors highlight the need for:
- Longer study durations to capture stable behavioral changes.
- More frequent behavioral outcome measures (e.g., daily steps) rather than relying solely on pre/post surveys.
- Refining evaluation metrics for "continuous" action spaces created by LLMs, as traditional bandit regret analysis assumes a finite set of arms.
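
To see why the finite-arm assumption matters, recall the standard textbook definition of cumulative regret for a contextual bandit (stated here for illustration, not taken from the paper):

$$
\mathrm{Regret}(T) = \sum_{t=1}^{T} \Big( \max_{a \in \mathcal{A}} \mathbb{E}[R_t \mid X_t, a] - \mathbb{E}[R_t \mid X_t, a_t] \Big), \qquad |\mathcal{A}| = 4
$$

With four intervention types the maximum is taken over a small, fixed set $\mathcal{A}$. Once an LLM writes the message, every distinct text is effectively its own action, so the best achievable reward in the first term is no longer defined over a fixed set of arms.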

In conclusion, the paper presents evidence that cMABxLLM is an efficient and well-accepted framework for delivering personalized health interventions, balancing the need for adaptive learning with the nuances of human language, while leaving effects on long-term motivation open.