FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation

This paper introduces FINEST, a fine-grained evaluation taxonomy that categorizes errors in LLM responses to sensitive topics into Content, Logic, and Appropriateness. Using this framework to guide score-based revision significantly improves both the safety and helpfulness of model outputs.

Juhyun Oh, Nayeon Lee, Chani Jung, Jiho Jin, Junho Myung, Jongwon Lee, Taeui Song, Alice Oh

Published 2026-03-05

Imagine you have a very smart, well-meaning robot assistant. You ask it a tricky question, like, "Is it okay for someone with a terminal illness to choose when they want to die?"

The robot, terrified of saying the wrong thing and getting in trouble, gives you a very safe, boring answer. It says, "Euthanasia is a complex topic with many opinions. Some people think X, others think Y," and then it lists definitions. It's safe, but it's also useless. It didn't actually answer your specific question; it just gave you a textbook summary to avoid taking a stance.

This paper introduces a new system called FINEST to fix this problem. Think of FINEST as a high-tech "Editor-in-Chief" for AI responses.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Safe but Boring" Trap

Currently, AI models are trained to be "harmless." When they hit a sensitive topic (like politics, religion, or ethics), they often get so scared of offending anyone that they become vague. They sacrifice being helpful just to stay safe.

Existing ways to check if an AI is doing a good job are like a teacher giving a student a grade of "C" with no comments. The student knows they did okay, but they have no idea how to get an "A."

2. The Solution: FINEST (The "Fine-Grained" Checklist)

The authors created a new evaluation system called FINEST. Instead of just giving a grade, FINEST acts like a forensic editor that dissects the AI's answer sentence by sentence.

Imagine the AI's answer is a cake. A normal editor might just say, "This cake is dry." FINEST says:

  • The Content (The Ingredients): "You used too much sugar (biased opinion) and forgot to mention the gluten-free option (not inclusive)."
  • The Logic (The Recipe): "You forgot to preheat the oven (missing step) and the instructions jump around (incoherent)."
  • The Appropriateness (The Presentation): "You served a wedding cake to a toddler (off-topic) and didn't answer the question about the flavor (unresponsive)."

The system breaks the answer down into three main buckets:

  1. Content: Is it harmful, biased, or predicting the future too confidently?
  2. Logic: Does the argument make sense, or is it just a list of random facts?
  3. Appropriateness: Did it actually answer the specific question asked, or did it just talk around it?
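The three buckets can be pictured as a small data model for sentence-level error reports. The class and field names below are illustrative, not the paper's exact label set; a minimal sketch might look like:

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    CONTENT = "content"                   # harmful, biased, overconfident
    LOGIC = "logic"                       # missing steps, incoherent argument
    APPROPRIATENESS = "appropriateness"   # off-topic, unresponsive

@dataclass
class ErrorAnnotation:
    sentence_index: int   # which sentence in the response
    category: Category    # which of the three buckets it falls into
    note: str             # free-text explanation the model can act on

# Hypothetical report for one answer: two flagged sentences
report = [
    ErrorAnnotation(3, Category.CONTENT, "states one side's view as fact"),
    ErrorAnnotation(7, Category.LOGIC, "conclusion does not follow"),
]

def summarize(report):
    """Count errors per category to form a simple scorecard."""
    counts = {c: 0 for c in Category}
    for ann in report:
        counts[ann.category] += 1
    return counts

print(summarize(report))
```

A per-sentence report like this is what makes the feedback actionable: instead of a single grade, the model sees exactly which sentence failed and why.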

3. The Process: The "Coach and Player" Loop

The paper describes a pipeline (a step-by-step process) to improve the AI:

  1. The Player (The AI): The AI answers a sensitive question.
  2. The Coach (The Evaluator): FINEST reads the answer and gives a detailed report card. It can do this in two ways:
    • The "Error Report": "Sentence 3 is wrong because it's too biased. Sentence 7 is missing a logical step."
    • The "Scorecard": "You got a 4/7 on Logic and a 3/7 on Appropriateness. Here is why..."
  3. The Improvement: The AI reads the Coach's feedback and rewrites its answer to fix the specific mistakes.
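The coach-and-player loop above is essentially an evaluate-then-rewrite cycle. In the sketch below, `generate`, `evaluate`, and `revise` are stand-ins for LLM calls (stubbed so the control flow is runnable); the 1–7 scale mirrors the scorecard example, but the threshold and round limit are assumptions:

```python
def generate(question: str) -> str:
    """Stub for the 'player' LLM producing an initial answer."""
    return f"Draft answer to: {question}"

def evaluate(answer: str) -> dict:
    """Stub for the 'coach': per-aspect scores (1-7) plus feedback.
    A real evaluator would be an LLM prompted with the FINEST taxonomy."""
    return {"logic": 4, "appropriateness": 3,
            "feedback": "Address the specific context of the question."}

def revise(answer: str, report: dict) -> str:
    """Stub for the player rewriting its answer using the feedback."""
    return answer + f" [revised: {report['feedback']}]"

def improve(question: str, threshold: int = 5, max_rounds: int = 3) -> str:
    """Loop: answer -> report card -> rewrite, until scores clear the bar
    or the round budget runs out."""
    answer = generate(question)
    for _ in range(max_rounds):
        report = evaluate(answer)
        if min(report["logic"], report["appropriateness"]) >= threshold:
            break
        answer = revise(answer, report)
    return answer

print(improve("Is the military draft fair?"))
```

The key design point is that the stopping condition lives in the evaluator's scores, not in the generator: the player keeps rewriting until the coach's scorecard says the answer is good enough.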

4. The Results: From "Vague" to "Valuable"

The researchers tested this on 19,000 sensitive questions in Korean (like "Should same-sex marriage be legal?" or "Is the military draft fair?").

  • Before FINEST: The AI gave vague, evasive answers.
  • After FINEST: The AI gave answers that were still safe (didn't say anything hateful) but were much more helpful. They actually addressed the specific context of the question.

The "Scorecard" method (giving a number and a reason) worked the best. It reduced the number of "bad sentences" in the answers by about 33%. When humans looked at the before-and-after versions, they preferred the improved version 88% of the time.

The Big Picture Metaphor

Think of the AI as a newly hired diplomat.

  • Without FINEST: The diplomat is so afraid of saying something that causes an international incident that they just say, "We value peace and dialogue," and walk away. It's safe, but it solves nothing.
  • With FINEST: The diplomat has a smart advisor whispering in their ear. The advisor says, "Don't just say 'peace.' Acknowledge that Group A feels hurt, explain why Group B is worried, and then offer a specific compromise."

The diplomat still stays safe (no one gets offended), but now they are actually useful and helpful.

Why This Matters

This paper shows that we don't have to choose between "Safe AI" and "Helpful AI." By using a detailed, structured way to critique the AI's answers, we can teach it to be both. It turns a robot that just "plays it safe" into a robot that can navigate difficult conversations with nuance and clarity.