Imagine you are sitting in a long, complex conversation with a friend. You start by agreeing that "coffee is hot." Two turns later, your friend says, "Coffee is cold," and then five turns after that, they claim, "Coffee is a solid rock."

A standard AI evaluator might look at each sentence in isolation. "Coffee is cold" sounds like a normal sentence. "Coffee is a solid rock" is grammatically correct. The evaluator might give your friend a high score for being polite and fluent, completely missing the fact that they are contradicting themselves and losing their mind.

This is the problem SKG-Eval solves. It is a new way to grade AI conversations that acts less like a spell-checker and more like a detective with a giant, evolving whiteboard.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Amnesiac" Judge

Current AI judges (for example, one strong AI model asked to grade another) usually look at one sentence at a time. They are like a judge who forgets everything that happened five minutes ago.

The Flaw: If an AI says "I love cats" in Turn 1, and then "I hate cats" in Turn 10, a standard judge might miss it because it's too busy looking at the grammar of Turn 10.
The Result: AI systems can drift off-topic, forget rules, or contradict themselves without getting penalized.

2. The Solution: The "Living Whiteboard" (Semantic Knowledge Graph)

SKG-Eval doesn't just read the text; it builds a map of the conversation as it happens. Think of this map as a giant, living whiteboard in a classroom.

The Nodes (Sticky Notes): Every time the AI mentions a person, object, or fact (like "coffee," "metabolism," or "skipping breakfast"), it writes it on a sticky note and puts it on the board.
The Edges (String): It ties these notes together with string to show how they relate (e.g., "Coffee" $\rightarrow$ is hot $\rightarrow$ "Liquid").
The Update: As the conversation continues, the AI doesn't start a new page; it adds to the same board. If the AI tries to say "Coffee is cold," the system sees the string connecting "Coffee" to "Hot" and immediately spots the conflict.
The Brain Connection: This way of building the map, adding notes and retying strings as the talk goes on, loosely resembles how people follow a conversation. Instead of starting over, we strengthen or reroute connections between ideas turn by turn. This is only an analogy: the whiteboard itself is an ordinary graph data structure, not a model of the brain.

3. The Three-Part Scorecard

Instead of giving one vague grade, SKG-Eval checks three specific things for every new sentence the AI says:

A. Did you answer the question? (Local Relevance)
- Analogy: Did you actually listen to what I just asked?
- It checks if the new sentence matches the current prompt. If you asked "What's the weather?" and the AI says "I like pizza," this score drops.
B. Are you remembering the past? (Historical Consistency)
- Analogy: Are you still talking about the same topic, or did you wander off?
- It checks if the new "sticky notes" connect to the old ones on the whiteboard. If the conversation was about "coffee" and suddenly the AI starts talking about "space rockets" without a bridge, the score drops.
C. Are you contradicting yourself? (Logical Coherence)
- Analogy: The "Gotcha!" moment.
- This is the superpower. It uses a Geometric Contradiction Engine. Imagine a robot that measures the "shape" of the facts. If the shape of "Coffee is hot" clashes with the shape of "Coffee is cold," the robot flags it.
- Crucial Detail: It knows the difference between a mistake and a correction. If you say, "Change the coffee to tea," the system understands you intentionally updated the board. It doesn't punish the AI for following your order to change the facts.

4. The "Recent Memory" Bonus

The system knows that conversations change over time. It uses a Recency-Weighted Trend.

Analogy: Think of a student's report card. If they get an A on Monday, a B on Tuesday, and an F on Friday, the teacher cares most about the F, because it is the latest grade and it shows the student is getting worse.
SKG-Eval calculates the final score by weighing the most recent turns more heavily, so it can tell if a conversation is getting better or slowly falling apart.
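
For readers who like to see the arithmetic, here is a minimal sketch of the report-card idea. It is an illustration only, not the formula SKG-Eval uses: the function name `recency_weighted_mean`, the `decay` value, and the 4-point grade scale are all assumptions made for this example.

```python
def recency_weighted_mean(scores, decay=0.5):
    """Average the scores, giving later entries larger weights.

    `decay` is a hypothetical knob: each step back in time multiplies
    the weight by `decay`, so the latest score has weight 1.
    """
    n = len(scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# A on Monday, B on Tuesday, F on Friday, on an assumed 4-point scale.
grades = [4.0, 3.0, 0.0]
plain = sum(grades) / len(grades)
recent = recency_weighted_mean(grades)
print(f"plain average: {plain:.2f}, recency-weighted: {recent:.2f}")
```

The recency-weighted number comes out lower than the plain average because the Friday F counts the most, which is the behavior the analogy describes.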

5. Why This Matters (The "Certificate")

When a standard AI judge says "This is bad," it's often a black box. You don't know why.
SKG-Eval gives you a Contradiction Certificate.

Analogy: Instead of just saying "You failed," it hands you a piece of paper that says: "You failed because in Turn 4, you said 'X is Y', but in Turn 1, you already established 'X is Z'. Here is the exact string on the whiteboard that proves it."

Summary

SKG-Eval is a tool that stops AI evaluators from being "amnesiacs." By turning conversations into a structured, visual map of facts and relationships, it can catch:

Contradictions (Saying opposite things).
Drift (Changing the subject without warning).
Forgetting (Ignoring rules set earlier).

It does this without asking a "magic black box" AI to guess the verdict. An AI model is still used to pull the facts out of the text, but the judging itself is done by a clear, step-by-step logic system that produces a score you can actually trust and audit. It's the difference between a teacher who just glances at your homework and one who checks your work against your notes from the beginning of the semester.

Technical Summary: SKG-Eval

Problem Statement

Evaluating multi-turn dialogue systems presents a fundamental challenge: response quality is intrinsically stateful and temporal. A response may appear locally fluent and relevant but fail globally by contradicting prior commitments, drifting from the user's original intent, or silently forgetting established constraints. Existing automatic evaluation paradigms, including LLM-as-a-judge protocols and embedding-based metrics, largely operate on flat or turn-isolated representations. Consequently, they struggle to reliably detect cross-turn failure modes such as contradiction, topic drift, and entity inconsistency, particularly as conversations grow beyond a few turns. Furthermore, LLM judges suffer from non-determinism, unreliable attention patterns over long histories, and poor recall for paraphrased or numerical conflicts.

Methodology: SKG-Eval

The authors propose SKG-Eval, a quasi-deterministic and interpretable evaluation framework that models dialogue as an evolving Semantic Knowledge Graph (SKG). Instead of scoring a response against a flat text prefix, SKG-Eval incrementally updates a structured graph of entities, relations, and conversational commitments at each turn. The framework computes three complementary signals which are fused and aggregated to produce a session-level score.

1. Incremental Semantic Knowledge Graph (SKG)

The core state representation is a directed multigraph $G_t = (V_t, E_t)$ updated at every turn $t$.

Nodes: Represent entities with attributes including normalized labels, entity types (e.g., PERSON, OBJECT), embeddings, and importance scores.
Edges: Represent factual claims with typed metadata (relation, attribute, intent, property type).
Update Mechanism: New triples are extracted via a deterministic LLM call. The graph performs cross-turn deduplication (merging nodes with high embedding similarity) and adds semantic edges between new and existing nodes based on embedding proximity.
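
The update step can be pictured with a small sketch. This is a toy under stated assumptions, not the authors' implementation: the class name `SKG`, the `merge_threshold` value, and the hand-written three-dimensional embeddings are all hypothetical, and triple extraction is assumed to have already happened. The sketch covers only cross-turn deduplication and leaves out the semantic edges.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Edge:
    src: int
    relation: str
    dst: int
    turn: int


@dataclass
class SKG:
    merge_threshold: float = 0.9  # hypothetical value, not from the paper
    labels: list = field(default_factory=list)
    embeddings: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # a list permits parallel edges

    def add_node(self, label, embedding):
        """Return an existing node id if a near-duplicate exists, else add one."""
        for node_id, existing in enumerate(self.embeddings):
            if cosine(existing, embedding) >= self.merge_threshold:
                return node_id
        self.labels.append(label)
        self.embeddings.append(embedding)
        return len(self.labels) - 1

    def add_triple(self, subj, relation, obj, turn):
        """Insert one (subject, relation, object) claim stamped with its turn."""
        edge = Edge(self.add_node(*subj), relation, self.add_node(*obj), turn)
        self.edges.append(edge)
        return edge


g = SKG()
g.add_triple(("coffee", [1.0, 0.0, 0.0]), "is", ("hot", [0.0, 1.0, 0.0]), turn=1)
# "the coffee" is close to "coffee" in embedding space, so it reuses node 0.
g.add_triple(("the coffee", [0.98, 0.05, 0.0]), "served_in", ("mug", [0.0, 0.0, 1.0]), turn=3)
print(g.labels, len(g.edges))
```

Because the graph persists across turns, the turn-3 claim attaches to the same "coffee" node created in turn 1 instead of starting a disconnected copy.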

2. Three-Component Scoring

At each turn, three scores are computed:

Local Relevance ($S^{\text{loc}}_t$): Measures alignment with the current prompt and optional reference. It uses a "Semantic Triangle" approach, calculating the maximum cosine similarity between the response sentences and the prompt/reference, with adaptive handling for short responses or missing references.
Historical Consistency ($S^{\text{cons}}_t$): Quantifies how new information connects to the prior state. It combines:
- Graph Anchor Score: Weighted by node importance, measuring if new nodes connect via factual edges (strongest), semantic edges, or are drifted (isolated).
- Session Anchor: A fallback mechanism using the similarity of the current response to the first turn's embedding to capture thematic continuity in Q&A sessions where graph disconnection is structurally expected.
Logical Coherence ($S^{\text{log}}_t$): The primary innovation, computed by a Geometric Contradiction Engine. This engine detects inconsistencies without relying on NLI models or LLM judges for reasoning. It compares current edges against historical edges using a prioritized cascade of detectors:
- Symbolic Detectors: High-precision checks for negation flips, antonymic relations, and numeric mismatches.
- Geometric Detectors: Checks for exclusive-object conflicts and semantic drift using embedding similarities.
- Revision-Aware Filtering: Explicitly identifies user-authorized revisions (e.g., "change that to...") and excludes them from contradiction checks to avoid penalizing legitimate updates.
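
The detector cascade for Logical Coherence can be sketched as follows. This is a minimal illustration under assumptions, not the engine described in the paper: the `Claim` and `Certificate` types, the tiny `ANTONYMS` lexicon, and the confidence values are hypothetical, and the embedding-based geometric detectors are omitted. Only the three symbolic checks and the revision filter are shown.

```python
from dataclasses import dataclass
from typing import Optional

# Toy antonym lexicon (assumption); a real system would need a larger resource.
ANTONYMS = {frozenset({"hot", "cold"}), frozenset({"love", "hate"})}


@dataclass(frozen=True)
class Claim:
    turn: int
    subject: str
    relation: str
    obj: str
    negated: bool = False
    is_revision: bool = False  # True when the user explicitly asked for the change


@dataclass(frozen=True)
class Certificate:
    detector: str
    earlier: Claim
    later: Claim
    confidence: float  # illustrative numbers only


def _as_number(text):
    try:
        return float(text)
    except ValueError:
        return None


def _antonyms(a, b):
    return frozenset({a, b}) in ANTONYMS


def detect(earlier: Claim, later: Claim) -> Optional[Certificate]:
    """Run the symbolic checks in priority order; return the first hit."""
    if later.is_revision or earlier.subject != later.subject:
        return None  # authorized updates and unrelated subjects are skipped
    same_rel = earlier.relation == later.relation
    if same_rel and earlier.obj == later.obj and earlier.negated != later.negated:
        return Certificate("negation_flip", earlier, later, 1.0)
    if not earlier.negated and not later.negated:
        if (same_rel and _antonyms(earlier.obj, later.obj)) or (
            earlier.obj == later.obj and _antonyms(earlier.relation, later.relation)
        ):
            return Certificate("antonym", earlier, later, 0.9)
        a, b = _as_number(earlier.obj), _as_number(later.obj)
        if same_rel and a is not None and b is not None and a != b:
            return Certificate("numeric_mismatch", earlier, later, 0.9)
    return None


def check(history, new_claim):
    """Compare a new claim against every stored claim and collect certificates."""
    return [c for c in (detect(old, new_claim) for old in history) if c is not None]


history = [Claim(1, "coffee", "is", "hot"), Claim(2, "order", "quantity", "2")]
flagged = check(history, Claim(3, "coffee", "is", "cold"))
revised = check(history, Claim(4, "order", "quantity", "3", is_revision=True))
print(flagged[0].detector, len(revised))
```

Each certificate keeps both conflicting claims with their turn numbers, which is what makes the result auditable: a reader can see which earlier statement was contradicted and by which detector.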

3. Fusion and Aggregation

Regime-Adaptive Fusion: The three scores are combined via a weighted sum where weights depend on the response regime (Short, Q&A, or General). Hard logic gates ensure that confirmed contradictions cannot be masked by high relevance scores.
Session-Level Aggregation: The final session score $S(D)$ is derived via a recency-weighted regression. This captures both the current quality level (weighted average) and the temporal trend (slope), ensuring the score reflects whether the conversation is degrading or improving over time, independent of session length.
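
A rough sketch of the two steps above follows. Everything numeric here is an assumption made for illustration: the weight table `REGIME_WEIGHTS`, the cap `CONTRADICTION_CAP`, the `decay` factor, and the choice to rescale turn positions to the interval from 0 to 1 so the slope does not depend on session length. The summary does not say how level and slope are combined into $S(D)$, so the sketch returns them separately.

```python
# Hypothetical weights per regime: (local, consistency, logic). Each row sums to 1.
REGIME_WEIGHTS = {
    "short": (0.2, 0.3, 0.5),
    "qa": (0.5, 0.2, 0.3),
    "general": (0.34, 0.33, 0.33),
}
CONTRADICTION_CAP = 0.3  # hypothetical ceiling applied by the hard logic gate


def fuse(s_loc, s_cons, s_log, regime, contradiction_confirmed):
    """Weighted sum of the three scores, capped when a contradiction is confirmed."""
    w_loc, w_cons, w_log = REGIME_WEIGHTS[regime]
    score = w_loc * s_loc + w_cons * s_cons + w_log * s_log
    if contradiction_confirmed:
        score = min(score, CONTRADICTION_CAP)  # relevance cannot mask the conflict
    return score


def recency_weighted_trend(scores, decay=0.8):
    """Weighted least-squares line through the per-turn scores.

    Returns (level, slope): the recency-weighted mean and the fitted slope
    over turn positions rescaled to [0, 1].
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one turn score")
    xs = [i / (n - 1) if n > 1 else 0.0 for i in range(n)]
    ws = [decay ** (n - 1 - i) for i in range(n)]  # latest turn has weight 1
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    level = sum(w * s for w, s in zip(ws, scores)) / w_sum
    var_x = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    if var_x == 0:
        return level, 0.0
    slope = sum(w * (x - x_bar) * (s - level) for w, x, s in zip(ws, xs, scores)) / var_x
    return level, slope


gated = fuse(0.9, 0.9, 0.9, "qa", contradiction_confirmed=True)
level, slope = recency_weighted_trend([0.9, 0.8, 0.4])
print(f"gated turn score: {gated}, level: {level:.3f}, slope: {slope:.3f}")
```

A negative slope signals a conversation that is degrading, while the level summarizes where the quality currently stands with recent turns counted more heavily.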

Key Contributions

Stateful Dialogue Evaluation via Explicit Semantic Memory: Formulates evaluation as reasoning over an evolving SKG, enabling structured analysis of cross-turn dependencies and long-range consistency.
Geometric Contradiction Engine: A deterministic, revision-aware framework for detecting inconsistencies through structured comparison of relations and objects, producing interpretable contradiction certificates without NLI models.
Graph-Anchored Historical Consistency: Introduces a metric that evaluates semantic connectivity to prior states, augmented by a session-anchor mechanism for thematic continuity.
Robust Local Relevance: A triangulated metric that jointly considers prompt alignment and reference coverage with adaptive fallbacks.
Regime-Adaptive Fusion and Trend Analysis: A dynamic weighting strategy and a recency-weighted regression aggregator that captures quality trends across long conversations.
Interpretability and Quasi-Determinism: Provides explicit audit trails (contradiction certificates, semantic anchors) and deterministic scores given fixed inputs, contrasting with the non-determinism of LLM judges.

Experimental Results

The authors evaluated SKG-Eval on MT-Bench (short-horizon) and MultiChallenge (long-horizon), comparing it against baselines including ECoh, LLM-Eval, DeepEval, and various GPT-4o Judge configurations.

Alignment with Human Judgments: SKG-Eval achieved the highest correlation with human ratings on both benchmarks. The gains were most significant on MultiChallenge, where SKG-Eval outperformed the best history-aware LLM judge baseline by +0.13 in Spearman correlation for session-level scores.
Contradiction Detection: On a controlled diagnostic benchmark (SKG-PROBE) targeting specific failure modes (negation, antonyms, numeric mismatch, drift), SKG-Eval achieved a mean F1 of 79.8%, significantly outperforming LLM-based judges (60.4%) and other baselines. It demonstrated superior recall in detecting numeric substitutions and antonymic contradictions.
Length Invariance: While baseline evaluators degraded as session length increased, SKG-Eval maintained stable performance across all length bins due to its graph-indexed retrieval of historical claims.
Computational Efficiency: SKG-Eval is significantly cheaper than LLM-as-a-judge approaches (approx. \$0.71 vs. \$27.1 per 1,000 turns) and is reproducible given fixed inputs, whereas LLM judges exhibit variance across decoding seeds.

Significance and Claims

The paper argues that externalized state tracking via structured representations is a principled alternative to the implicit reasoning used in LLM-based evaluators for long-horizon dialogue systems.

Addressing the Gap: SKG-Eval fills the gap of an evaluator that maintains an explicit, time-stamped state of factual commitments, detects cross-turn contradictions deterministically and interpretably, and aggregates quality in a length-invariant way.
Interpretability: Unlike "black-box" judges, SKG-Eval produces contradiction certificates that explicitly identify the conflicting edges, the detector type, and the confidence, enabling auditable evaluation and dataset curation.
Scalability: By decoupling state tracking from the scoring mechanism, the framework scales to long conversations where repeated LLM prompting becomes computationally prohibitive and prone to context-window limitations.
Limitations: The authors acknowledge that the framework relies on the quality of the upstream semantic triple extraction and is primarily optimized for explicit semantic inconsistency rather than deep pragmatic contradictions requiring external world knowledge.

In conclusion, the authors posit that SKG-Eval offers a scalable, reproducible, and interpretable method for evaluating the consistency and coherence of multi-turn dialogue systems, particularly in scenarios where long-range logical consistency is critical.

SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs