Elder-Sim: A Psychometrically Validated Platform for Personality-Stable Elderly Digital Twins

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: Building a "Digital Grandparent" That Doesn't Forget Who They Are

Imagine you are trying to teach a very smart robot how to act like a specific elderly person—let's call him "Grandpa Wang." You want this robot to be able to talk to doctors, family members, and social workers to help them understand how Grandpa Wang thinks, feels, and reacts to stress.

The goal is to create a "Digital Twin": a computer version of a real person that can be used to test medical treatments or family advice before trying them on the real person.

The Problem:
Current AI chatbots are like actors who forget their lines. If you ask them the same question twice, they might give two completely different answers. One day they are grumpy and suspicious; the next day, they are cheerful and trusting. In the real world, people's personalities are stable. If Grandpa Wang is usually anxious about his health, he shouldn't suddenly become a carefree optimist just because the AI "forgot" its script. This inconsistency is called "Personality Drift," and it makes the AI useless for serious medical planning.

The Solution (Elder-Sim):
The researchers built a new system called Elder-Sim. They didn't just tell the AI, "Be Grandpa Wang." They built a complex "brain" for the AI to ensure it stays consistent over time.

How They Fixed the "Drift" (The Three Magic Ingredients)

The researchers tested four different versions of the AI to see what made it most stable. Think of it like coaching an actor, adding one tool at a time:

1. The Baseline (Just a Prompt)

The Analogy: This is like giving an actor a single line of dialogue: "You are a 72-year-old man named Wang who is worried about his health."
The Result: The actor tries, but after a few minutes of conversation, they start to drift. They might forget they are worried or suddenly act like a different character.
Performance: Okay, but not reliable enough for doctors.

2. Adding Memory (The Notebook)

The Analogy: Now, we give the actor a notebook. Every time they talk, they write down what happened. "I talked to my son today. He was rude."
The Result: The actor remembers the facts better. They don't forget that they have high blood pressure.
The Catch: Remembering facts isn't the same as having a personality. The actor might remember the event but still react to it differently every time.
Performance: Slightly better, but still drifts a bit.

3. Adding the "Cognitive Map" (The Internal Compass)

The Analogy: This is the big breakthrough. Instead of just a notebook, we give the actor a rulebook for how their brain works.
- The Rule: "If someone criticizes your medicine, you feel scared because you believe you are a burden."
- The Process: Event happens $\rightarrow$ Brain checks rulebook $\rightarrow$ Generates emotion $\rightarrow$ Generates action.
The Result: This is based on a real psychological method called Cognitive Behavioral Therapy (CBT). It forces the AI to think through why it feels a certain way before it speaks. It's like giving the actor a deep understanding of their own soul, not just a script.
Performance: Huge improvement. The AI became far more consistent, reacting to the same problem in nearly the same way each time, much as a real person would.
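
The event, rulebook, emotion, action chain described above can be pictured as a small lookup. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Rule` class, the `appraise` function, and the keyword-based trigger are hypothetical, and the `action` text is a placeholder because the example rule in the text names only the belief and the emotion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One hypothetical rulebook entry: trigger -> belief -> emotion -> action."""
    trigger: str  # keyword that activates the rule (illustrative matching only)
    belief: str   # core belief the event touches
    emotion: str  # emotion the belief produces
    action: str   # behavior that follows (placeholder, not from the source)


# The single rule from the example in the text.
RULEBOOK = [
    Rule(
        trigger="medicine",
        belief="I am a burden",
        emotion="scared",
        action="goes quiet and changes the subject",
    ),
]


def appraise(event: str, rulebook: list[Rule]) -> Optional[Rule]:
    """Return the first rule whose trigger appears in the event, else None."""
    lowered = event.lower()
    for rule in rulebook:
        if rule.trigger in lowered:
            return rule
    return None


if __name__ == "__main__":
    matched = appraise("Your son criticizes your medicine.", RULEBOOK)
    if matched is not None:
        print(f"{matched.belief} -> {matched.emotion} -> {matched.action}")
```

Because the same event always resolves to the same belief and emotion, whatever the AI says next is conditioned on a fixed appraisal instead of being improvised from scratch.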

4. Adding "Domain Training" (The Specialized School)

The Analogy: Finally, we send the actor to a special school where they only study "Elderly Care." They read thousands of books and listen to thousands of real conversations between old people and doctors.
The Result: The actor doesn't just know how to be Grandpa Wang; they know how old people actually talk. They use the right slang, the right worries, and the right tone.
Performance: The Best. This combination of the "Internal Compass" (Step 3) and the "Specialized School" (Step 4) created a digital twin that was almost perfect at staying in character.

The Results: Did it Work?

The researchers ran a "driving test" for these AI characters. They presented each character with the same 10 difficult scenarios (like "Your son is angry at you" or "You can't afford your medicine") five times each.

The "Drift" Test: They measured how much the AI's personality changed between answers.
- Without the special tools: The AI was all over the place.
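- With the "Internal Compass" and "Specialized School": The AI was rock solid. Its answers could be matched to the correct character 97.2% of the time, and its repeat-test stability score (ICC) reached 0.958, where 1.0 means perfectly repeatable.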

The Key Takeaway:
The most surprising finding was that just having a memory (a notebook) wasn't enough. You can remember everything, but if you don't have a consistent way of processing those memories (the Cognitive Map), you will still behave erratically.

Why Does This Matter?

Imagine a doctor wants to try a new therapy for a lonely elderly patient. Instead of risking the patient's feelings by trying it on them first, they can try it on the Digital Twin.

If the Digital Twin says, "This therapy makes me feel angry," the doctor knows, "Okay, this won't work for the real patient."
If the Digital Twin says, "This makes me feel hopeful," the doctor knows, "Great, let's try this with the real patient."

In short: This preprint suggests we can build AI that doesn't just "chat," but simulates a human mind with a stable personality. If the results hold up, this could open the door to safer, smarter, and more personalized healthcare for the elderly.

1. Problem Statement

The paper addresses a critical barrier in developing clinical-grade elderly digital twins: personality drift.

Context: Large Language Models (LLMs) offer a pathway to creating conversational agents that simulate older adults' lived experiences and behavioral responses. However, current LLM-based agents often exhibit inconsistent trait expression across repeated, longitudinal interactions.
The Issue: Even when a specific persona is defined, agents tend to shift behavioral tendencies (drift) over time. This undermines the reliability of generated trajectories and the validity of simulating intervention responses in geriatric care.
Gap: Existing evaluations focus on task performance or safety and rarely apply psychometric rigor (e.g., internal consistency, test-retest reliability) to verify whether an agent maintains a stable personality over time. Furthermore, general-purpose LLMs often lack the domain-specific behavioral distributions required for elderly care.

2. Methodology

The authors developed ELDER-SIM, a modular, multi-role platform designed to construct and validate personality-stable elderly digital twins.

A. Platform Architecture

Built using n8n for workflow orchestration and local LLM inference (via Ollama/vLLM with Qwen2.5 models), the system integrates five functional layers:

Workflow Orchestration: Supports dual-agent dialogues, multi-agent social simulations, structured intervention protocols (e.g., CBT), and standardized assessment patterns.
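LLM Inference: Runs with fixed sampling parameters (Temperature=0.7, Top-p=0.9) so that decoding settings stay constant across runs. Sampling at this temperature is still stochastic, so repeated administrations can differ.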
Agent Management: Uses structured JSON profiles to define demographics, health status, and personality parameters.
Memory Systems: A 3-tier architecture:
- Short-term: Sliding window for immediate context.
- Long-term: MySQL database storing episodic events, semantic facts, belief updates, and dialogue summaries.
- Cognitive Conceptualization Diagram (CCD): A module based on Beck's Cognitive Behavioral Therapy (CBT) framework. It links background history $\rightarrow$ belief systems (core/intermediate beliefs) $\rightarrow$ triggered automatic thoughts/emotions $\rightarrow$ behaviors.
Evaluation Modules: Automated psychometric scoring.
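
As a rough illustration of the first two memory tiers, the sketch below pairs a sliding window with a keyed long-term store. It is a toy under stated assumptions: `AgentMemory`, `window_size`, and the method names are hypothetical, and a plain dictionary stands in for the MySQL database named above. The four category names mirror the long-term memory types in the list.

```python
from collections import deque
from dataclasses import dataclass, field

# Categories mirror the long-term memory types listed in the text.
LONG_TERM_KINDS = ("episodic", "semantic", "belief_update", "summary")


@dataclass
class AgentMemory:
    """Hypothetical two-tier memory: sliding window plus categorized store."""
    window_size: int = 3                     # illustrative value, not from the source
    short_term: deque = field(init=False)    # most recent utterances only
    long_term: dict = field(init=False)      # stand-in for the MySQL store

    def __post_init__(self) -> None:
        self.short_term = deque(maxlen=self.window_size)
        self.long_term = {kind: [] for kind in LONG_TERM_KINDS}

    def observe(self, utterance: str) -> None:
        """Add an utterance; the oldest one falls out when the window is full."""
        self.short_term.append(utterance)

    def store(self, kind: str, item: str) -> None:
        """Persist an item under one of the long-term categories."""
        if kind not in self.long_term:
            raise ValueError(f"unknown memory kind: {kind}")
        self.long_term[kind].append(item)

    def context(self) -> dict:
        """Assemble what a prompt builder might read: recent turns plus facts."""
        return {
            "recent": list(self.short_term),
            "facts": list(self.long_term["semantic"]),
        }


if __name__ == "__main__":
    memory = AgentMemory(window_size=2)
    for turn in ["Hello.", "How is your blood pressure?", "Did you take your pills?"]:
        memory.observe(turn)
    memory.store("semantic", "has high blood pressure")
    print(memory.context())
```

The third tier, the CCD, is deliberately left out here: it constrains how stored material is interpreted, which is a different job from storing it.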

B. Agent Design

Personality Model: Agents are constrained by the Big Five (OCEAN) personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) on a 1–5 scale.
Domain Adaptation: A LoRA (Low-Rank Adaptation) fine-tuning module was implemented using 19,717 instruction pairs derived from the China Health and Retirement Longitudinal Study (CHARLS) to align the model with elderly-care discourse.
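
A structured profile with Big Five scores on the 1–5 scale might look like the following. This is a sketch only: the field names (`name`, `age`, `health`, `ocean`), the trait values, and the `validate_profile` helper are hypothetical, since the summary says only that profiles are JSON documents holding demographics, health status, and personality parameters.

```python
import json

OCEAN_TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)

# Hypothetical profile; field names and values are illustrative only.
EXAMPLE_PROFILE_JSON = """
{
  "name": "Grandpa Wang",
  "age": 72,
  "health": ["high blood pressure"],
  "ocean": {
    "openness": 2.5,
    "conscientiousness": 3.5,
    "extraversion": 2.0,
    "agreeableness": 3.0,
    "neuroticism": 4.5
  }
}
"""


def validate_profile(profile: dict) -> dict:
    """Check that all five traits are present and lie on the 1-5 scale."""
    ocean = profile.get("ocean", {})
    missing = [t for t in OCEAN_TRAITS if t not in ocean]
    if missing:
        raise ValueError(f"missing traits: {missing}")
    for trait in OCEAN_TRAITS:
        if not 1 <= ocean[trait] <= 5:
            raise ValueError(f"{trait} must be within 1-5, got {ocean[trait]}")
    return profile


profile = validate_profile(json.loads(EXAMPLE_PROFILE_JSON))
print(profile["name"], profile["ocean"]["neuroticism"])
```

Validating the range at load time keeps an out-of-scale trait from silently reaching the prompt or the scoring step.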

C. Experimental Design

The authors ran a systematic ablation across four cumulative conditions to isolate the impact of each component on personality stability:

Baseline: Prompt-only personality description.
+Memory: Baseline + Short-term and Long-term memory.
+CCD: +Memory + Cognitive Conceptualization Diagram (CBT framework).
+LoRA: +CCD + Domain-specific fine-tuning.
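
Because each condition adds exactly one component to the previous one, the design can be written as cumulative feature flags. The sketch below is illustrative; the `Condition` class and `added_components` helper are hypothetical names, and only the four condition labels come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One ablation condition expressed as on/off switches (hypothetical)."""
    name: str
    memory: bool
    ccd: bool
    lora: bool


CONDITIONS = [
    Condition("Baseline", memory=False, ccd=False, lora=False),
    Condition("+Memory", memory=True, ccd=False, lora=False),
    Condition("+CCD", memory=True, ccd=True, lora=False),
    Condition("+LoRA", memory=True, ccd=True, lora=True),
]


def added_components(prev: Condition, curr: Condition) -> list[str]:
    """List the components switched on in curr that were off in prev."""
    flags = ("memory", "ccd", "lora")
    return [f for f in flags if getattr(curr, f) and not getattr(prev, f)]


for prev, curr in zip(CONDITIONS, CONDITIONS[1:]):
    print(curr.name, "adds", added_components(prev, curr))
```

Changing one flag per step is what lets a difference between adjacent conditions be attributed to a single component.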

Evaluation Metrics:

Internal Consistency: Cronbach's $\alpha$ (measuring coherence of trait expression within a scale).
Test-Retest Reliability: Intraclass Correlation Coefficient (ICC) (measuring stability across repeated administrations).
Role Discrimination: Classification accuracy, used to check that distinct agent profiles do not collapse into generic responses.
Scenarios: 10 standardized elderly-care scenarios (e.g., medication adherence, family conflict, ageism) administered 5 times per agent.
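
The two reliability metrics can be computed from a table of scores with the standard library alone. The sketch below is a hedged illustration, not the paper's scoring code: the summary does not say which ICC form was used, so the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), is an assumption here, and the function names and toy data are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha; rows are respondents/administrations, columns are items."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def icc_2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1); rows are targets, columns are repeated administrations."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-target mean square
    msc = ss_cols / (k - 1)               # between-administration mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    toy = [[1, 2], [2, 1], [3, 4], [4, 3]]  # made-up scores, for illustration
    print(round(cronbach_alpha(toy), 3))  # 0.75
    print(round(icc_2_1(toy), 3))         # 0.667
```

Both functions assume complete data and non-zero variance in the totals. A real analysis would more likely use a statistics package that also reports confidence intervals.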

3. Key Results

The study generated 1,200 total responses across 6 agent configurations.
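
The total of 1,200 is consistent with one plausible breakdown of the design figures given in this summary: 6 agents, 4 conditions, 10 scenarios, and 5 administrations each. The summary does not spell out this multiplication, so treat the breakdown as an inference.

```python
# Design figures taken from the summary; their combination is an assumption.
agents = 6
conditions = 4
scenarios = 10
repetitions = 5

total_responses = agents * conditions * scenarios * repetitions
print(total_responses)  # 1200
```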

A. Personality Consistency (Reliability)

Baseline: Achieved acceptable internal consistency ($\alpha \approx 0.70$) and good test-retest reliability (ICC $\approx 0.86$).
+Memory: Provided negligible improvement ($\alpha \approx 0.705$), indicating memory alone does not stabilize personality.
+CCD: Produced the largest gain in internal consistency, raising mean $\alpha$ from 0.702 to 0.892 ($p < 0.001$). ICC improved to 0.924.
+LoRA: Achieved excellent internal consistency ($\alpha = 0.940$) and the highest test-retest reliability (ICC = 0.958).
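
The step-by-step gains in $\alpha$ can be checked directly from the reported values. This is simple arithmetic on the figures above; the variable names are hypothetical, and the baseline is taken as 0.702, the more precise of the two baseline figures quoted.

```python
# Reported mean Cronbach's alpha per condition, in ablation order.
ALPHA = {"Baseline": 0.702, "+Memory": 0.705, "+CCD": 0.892, "+LoRA": 0.940}

order = list(ALPHA)
# Gain of each condition over the one immediately before it.
gains = {
    curr: round(ALPHA[curr] - ALPHA[prev], 3)
    for prev, curr in zip(order, order[1:])
}
largest = max(gains, key=gains.get)

print(gains)    # {'+Memory': 0.003, '+CCD': 0.187, '+LoRA': 0.048}
print(largest)  # +CCD
```

The CCD step accounts for most of the total rise from 0.702 to 0.940, which matches the claim that it is the largest single contributor.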

B. Role Discrimination

Accuracy in distinguishing between different agent profiles improved stepwise:

Baseline: 83.3%
+Memory: 88.9%
+CCD: 94.4%
+LoRA: 97.2%
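
The summary does not describe how responses were classified, so the following is only one way such an accuracy could be computed. It assumes each response has already been scored into a five-value trait vector and assigns it to the nearest profile by Euclidean distance. The profile names, the vectors, and both function names are made up for illustration.

```python
import math


def nearest_profile(vector: tuple, profiles: dict) -> str:
    """Return the name of the profile closest to the scored response."""
    return min(profiles, key=lambda name: math.dist(vector, profiles[name]))


def discrimination_accuracy(samples: list, profiles: dict) -> float:
    """Fraction of responses assigned to the agent that produced them."""
    correct = sum(nearest_profile(vec, profiles) == label for label, vec in samples)
    return correct / len(samples)


# Toy Big Five vectors (O, C, E, A, N); not data from the paper.
PROFILES = {
    "anxious": (2.0, 3.0, 2.0, 3.0, 4.5),
    "sociable": (4.0, 3.0, 4.5, 4.0, 2.0),
}
SAMPLES = [
    ("anxious", (2.2, 3.1, 1.8, 3.0, 4.4)),
    ("anxious", (3.8, 3.0, 4.2, 3.9, 2.3)),   # drifted response, gets misassigned
    ("sociable", (4.1, 2.9, 4.4, 4.2, 1.9)),
    ("sociable", (3.9, 3.2, 4.6, 3.8, 2.1)),
]

print(discrimination_accuracy(SAMPLES, PROFILES))  # 0.75
```

In this framing, a drifting agent produces responses that land nearer to someone else's profile, which is what pulls the accuracy down.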

C. Component Contributions

CCD was identified as the primary driver of stability, suggesting that structured cognitive modeling (linking beliefs to behaviors) is essential for preventing drift.
LoRA acted as a refinement layer, improving the naturalness and domain alignment of the responses once the cognitive scaffold was in place.
Memory alone was insufficient for stability; without cognitive constraints, memory could even introduce variability.

4. Key Contributions

ELDER-SIM Platform: A reproducible, open-source (code available) framework for building multi-role elderly care simulations with integrated memory and cognitive modeling.
Psychometric Validation Framework: Applies rigorous psychometric metrics (Cronbach's $\alpha$, ICC) to quantify personality stability in LLM-based digital twins, presented as the first application of its kind, moving beyond simple task accuracy.
Architectural Insight: Frames personality drift as a tractable engineering problem. The ablation results indicate that structured cognitive scaffolding (CCD) is more critical for stability than memory persistence alone.
Domain Adaptation Strategy: Validated that combining cognitive modeling with domain-specific LoRA fine-tuning yields the highest fidelity for geriatric care simulations.

5. Significance and Implications

Clinical Credibility: The findings suggest that LLM agents can achieve "clinically meaningful" reliability for longitudinal simulations, a prerequisite for using digital twins in in silico testing of interventions (e.g., CBT protocols, adherence strategies) before real-world deployment.
Design Principle: For stable digital twins, identity must be anchored in a cognitive architecture (belief systems and appraisal mechanisms) rather than just prompt engineering or memory retrieval.
Geriatric Care: The platform enables the simulation of complex psychosocial dynamics (loneliness, ageism, family conflict) with stable personality traits, allowing researchers to test how specific interventions affect different personality types over time.
Future Directions: While the study shows high reliability in simulation, future work must validate these agents against real-world longitudinal data and diverse cultural contexts to ensure ecological validity.

Conclusion: The paper makes the case that personality-stable elderly digital twins are technically achievable by integrating Big Five constraints, CBT-based cognitive modeling, and domain adaptation, with stability assessed through psychometric testing in simulation.