Imagine you are trying to teach a brilliant, super-fast robot chef how to cook a very delicate, traditional dish. In this analogy, the "dish" is acupuncture.
The problem is that this robot (a Large Language Model, or LLM) is like a genius who has read every cookbook in the world but has never actually cooked. It can describe a recipe perfectly, but it might accidentally tell you to put a live fish in a cake because it "sounds good" in the sentence structure, even though it's dangerous. In medicine, especially acupuncture, a mistake isn't just a bad meal; it can hurt a patient.
The paper introduces CORE-Acu, a new system designed to fix this robot chef. It does this using three clever tricks, which we can think of as a Three-Layer Safety Kitchen.
1. The "Step-by-Step" Recipe Book (Structured Reasoning)
The Problem: Normal AI models often jump straight from "Patient has a headache" to "Here are the needles to use." They skip the thinking part. It's like a chef saying, "I'm hungry, so I'll make lasagna," without explaining why or how. This is a "black box"—we can't see the logic, so we can't trust it.
The CORE-Acu Solution:
The researchers forced the AI to write out its full thought process before giving the answer. They created a special "Recipe Book" (called S-CoT) that demands the AI follow a strict chain of logic:
- Diagnosis: What is the problem? (e.g., "Liver Fire")
- Pathology: Why is it happening? (e.g., "The fire is rising to the head")
- Principle: What is the strategy? (e.g., "Cool the fire")
- Selection: Which needles fit this strategy?
The Analogy: Instead of just handing you the finished lasagna, the AI now has to show you its shopping list, its cooking steps, and its tasting notes first. If the logic doesn't make sense (e.g., "I'm cooling the fire" but "I'm adding spicy peppers"), the system catches it immediately.
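The forced reasoning chain above can be sketched as a simple data structure plus a completeness check. This is a minimal illustration, not the paper's actual implementation: the class name `SCoTRecord`, the field names, and the example values are assumptions based on the four steps described.

```python
from dataclasses import dataclass

# Hypothetical sketch of one S-CoT record: the model must fill in every
# step, in order, before its acupoint selection is accepted.
@dataclass
class SCoTRecord:
    diagnosis: str        # What is the problem? e.g. "Liver Fire"
    pathology: str        # Why is it happening? e.g. "Fire rising to the head"
    principle: str        # What is the strategy? e.g. "Cool the fire"
    selection: list[str]  # Which acupoints fit this strategy?

def is_complete(record: SCoTRecord) -> bool:
    """Reject any answer that skips a reasoning step."""
    return all([record.diagnosis, record.pathology,
                record.principle, record.selection])

record = SCoTRecord("Liver Fire", "Fire rising to the head",
                    "Cool the fire", ["Taichong"])
print(is_complete(record))  # True: every step is filled in
```

The point of the structure is exactly the "show your shopping list first" rule: an answer that jumps straight to `selection` fails `is_complete` and is never shown to the user.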
2. The "Strict Safety Inspector" (Knowledge Graph Veto)
The Problem: Even if the AI thinks logically, it might still hallucinate (make things up). For example, it might suggest a needle that is strictly forbidden for pregnant women because it could induce labor. A normal AI might just guess the wrong needle because it's statistically common in its training data.
The CORE-Acu Solution:
The team built a digital "Safety Rulebook" (a Knowledge Graph) containing thousands of hard rules, like "Do not use Needle X on Pregnant Patients" or "Needle A and Needle B cannot be used together."
They added a Safety Inspector (a Symbolic Veto Mechanism) that sits between the AI and the patient.
- The Process: The AI generates a prescription → the Inspector checks it against the Rulebook → if the AI breaks a rule, the Inspector slams the brakes.
- The "Do-Over": The Inspector doesn't just say "No." It sends the AI back to the kitchen with a note: "You tried to use Needle X on a pregnant patient. That's forbidden. Try again." The AI rewrites the recipe until it passes the inspection.
The Analogy: Imagine a bouncer at a club. Even if you have a great outfit (a fluent sentence), if you don't have a valid ID (you broke a safety rule), you don't get in. If you try to sneak in, the bouncer checks your ID against a database and kicks you out until you fix it.
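The generate-inspect-retry loop can be sketched in a few lines. Everything here is illustrative: the rule tables, the feedback wording, and the retry limit are assumptions, and the real system's Knowledge Graph holds thousands of rules rather than two small sets. (Hegu and Sanyinjiao are classically cited as points to avoid during pregnancy, which is why they appear in the example.)

```python
# Hedged sketch of the symbolic veto mechanism: hard rules are checked
# outside the model, and failures are sent back as feedback.
FORBIDDEN_IN_PREGNANCY = {"Hegu", "Sanyinjiao"}          # example hard rules
INCOMPATIBLE_PAIRS = {frozenset({"PointA", "PointB"})}   # "A and B never together"

def inspect(prescription, patient):
    """Return a list of violated rules; an empty list means it passes."""
    violations = []
    if patient.get("pregnant"):
        for point in prescription:
            if point in FORBIDDEN_IN_PREGNANCY:
                violations.append(f"{point} is forbidden for pregnant patients")
    for pair in INCOMPATIBLE_PAIRS:
        if pair <= set(prescription):
            violations.append(f"{' and '.join(sorted(pair))} cannot be used together")
    return violations

def generate_safely(model, patient, max_retries=3):
    """The 'do-over' loop: regenerate until the Inspector is satisfied."""
    feedback = None
    for _ in range(max_retries):
        prescription = model(patient, feedback)   # the LLM call (stubbed below)
        violations = inspect(prescription, patient)
        if not violations:
            return prescription                   # passed inspection
        feedback = "; ".join(violations) + ". Try again."
    raise RuntimeError("No safe prescription found within retry budget")

# Stub model: first attempt breaks a rule, second attempt passes.
attempts = [["Hegu", "Taichong"], ["Taichong", "Fengchi"]]
def stub_model(patient, feedback):
    return attempts.pop(0)

print(generate_safely(stub_model, {"pregnant": True}))  # ['Taichong', 'Fengchi']
```

Note that the veto is symbolic, not statistical: the bouncer's database check is a plain set lookup, so a fluent but forbidden prescription can never slip through on style alone.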
3. The "Highlighter Pen" (Reweighted Loss)
The Problem: When an AI learns, it treats every word the same. It cares just as much about learning the word "the" or "and" as it does about learning the name of a specific acupoint like "Hegu." But in acupuncture, getting the acupoint name wrong is a disaster, while getting the word "the" wrong is harmless. This is called the Frequency-Importance Mismatch.
The CORE-Acu Solution:
The researchers invented a special training method called LMERL. Think of this as a Highlighter Pen for the AI's brain.
- When the AI is learning, the system highlights the dangerous, critical words (like acupoint names and safety rules).
- If the AI gets a critical word wrong, it gets a "super penalty" (a huge shock to its brain).
- If it gets a common word wrong, it gets a tiny nudge.
The Analogy: Imagine a student taking a test. If they misspell "the," they get a small red mark. But if they misspell the name of a life-saving drug, the teacher slams the desk and says, "This is the most important part! You must get this right!" This forces the AI to pay extra attention to the dangerous stuff.
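A token-reweighted cross-entropy loss captures the highlighter-pen idea. This is a toy sketch in the spirit of LMERL, not its actual formula: the critical-token set, the weight of 10, and the example probabilities are all invented for illustration.

```python
import math

# Illustrative set of "highlighted" tokens: acupoint names, safety terms.
CRITICAL_TOKENS = {"Hegu", "Taichong"}

def reweighted_loss(tokens, probs, critical_weight=10.0):
    """Cross-entropy where critical tokens carry a much larger penalty."""
    total = 0.0
    for token, p in zip(tokens, probs):
        weight = critical_weight if token in CRITICAL_TOKENS else 1.0
        total += weight * -math.log(p)   # standard CE term, scaled by weight
    return total / len(tokens)

# The model is equally unsure (p = 0.5) in both cases, but the mistake on
# the acupoint name costs ten times as much as the mistake on "the":
print(round(reweighted_loss(["the"], [0.5]), 3))   # small nudge
print(round(reweighted_loss(["Hegu"], [0.5]), 3))  # super penalty
```

During training, the gradient from the highlighted terms dominates, so the model spends its capacity getting the dangerous words right rather than polishing filler words.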
The Results: Why This Matters
The researchers tested this system on 1,000 real-world cases.
- Other AI models (like GPT-4o): Made safety mistakes in about 8.5% of cases. They were fluent but dangerous.
- CORE-Acu: Made zero safety mistakes. It caught every single safety violation and fixed it before showing the result.
Summary
CORE-Acu is like taking a brilliant but reckless robot doctor and giving it:
- A checklist to force it to think before speaking.
- A strict safety inspector to catch and fix dangerous errors.
- A highlighter pen to make sure it never forgets the most critical details.
This turns a "black box" AI into a transparent, safe, and trustworthy assistant for doctors, ensuring that when it suggests acupuncture, it's not just guessing—it's reasoning, verifying, and prioritizing safety above all else.