PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

This paper introduces PEEM, a unified framework that employs a structured 9-axis rubric and LLM-based evaluators to provide interpretable, joint assessments of prompts and responses. This joint view enables systematic diagnosis of where things go wrong and significantly improves downstream accuracy through zero-shot prompt optimization.

Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim

Published Thu, 12 Ma

Imagine you are trying to teach a very smart, but sometimes literal-minded, robot how to do a task. You give it a set of instructions (a prompt), and it gives you an answer (a response).

For a long time, people only cared if the robot's final answer was right or wrong. It was like grading a student's math test by only looking at the final number: "15." If the answer was 15, they got an A. If it was 14, they got an F. But this didn't tell you how the student got there, if they understood the concept, or if your instructions were confusing.

This is where PEEM comes in. The authors of this paper built a new "report card" system called PEEM (Prompt Engineering Evaluation Metrics).

Here is how PEEM works, explained through a simple analogy:

The "Double-Check" Inspector

Imagine a restaurant.

  • The Chef is the AI model.
  • The Recipe is the Prompt (what you tell the AI to do).
  • The Dish is the Response (what the AI actually makes).

In the old days, the food critic (the evaluator) would only taste the dish. If it tasted good, the chef got a 5-star rating. If it tasted bad, they got 1 star. They didn't care if the recipe was written in a language the chef couldn't understand, or if the recipe was missing ingredients.

PEEM is like a super-inspector who checks BOTH the Recipe and the Dish.

Part 1: Grading the Recipe (The Prompt)

Before the chef even starts cooking, PEEM looks at your instructions. It asks three questions:

  1. Is it clear? (Did you say "Add salt" or just "Make it tasty"?)
  2. Is it well-written? (Are there typos or confusing grammar?)
  3. Is it fair? (Does the recipe assume everyone likes spicy food, or is it inclusive?)

If your recipe is messy, PEEM gives it a low score, even before the food is cooked. This helps you realize, "Oh, I didn't explain the steps clearly enough!"
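To make the three prompt checks concrete, here is a minimal sketch of PEEM-style prompt grading. The axis names follow the paper's three prompt criteria, but the keyword heuristics are purely illustrative stand-ins for the LLM judge the authors actually use:

```python
def grade_prompt(prompt: str) -> dict:
    """Return a 1-5 score per prompt axis (hypothetical heuristics)."""
    scores = {}
    text = prompt.lower()
    # Clarity: concrete instructions tend to contain imperative verbs.
    scores["clarity"] = 5 if any(w in text for w in ("add", "list", "write", "explain")) else 2
    # Writing quality: penalize an obvious issue like doubled spaces.
    scores["writing_quality"] = 2 if "  " in prompt else 5
    # Fairness: flag wording that assumes one preference for everyone.
    scores["fairness"] = 2 if "everyone likes" in text else 5
    return scores

print(grade_prompt("Make it tasty"))              # vague instruction
print(grade_prompt("Add salt, then list steps"))  # concrete instruction
```

In the real system each axis score would come from an LLM evaluator reading the prompt against the rubric, not from string matching, but the output shape is the same: one score per axis, computed before any response is generated.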

Part 2: Grading the Dish (The Response)

Once the chef cooks the meal, PEEM tastes it, but not just for "good or bad." It uses a 6-point checklist:

  1. Accuracy: Is the food actually edible and correct?
  2. Coherence: Does the meal make sense? (e.g., Did they put ice cream on a steak?)
  3. Relevance: Did the chef make what you asked for, or did they make a cake when you wanted soup?
  4. Objectivity: Is the chef being neutral, or are they adding their own weird opinions?
  5. Clarity: Is the presentation easy to understand?
  6. Conciseness: Did they serve a tiny bite or a mountain of food when you only wanted a snack?
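The six-point checklist above can be sketched as an evaluation record where each response axis gets both a score and a written rationale. The axis names match the checklist; the record structure and example values are assumptions for illustration, not the paper's exact schema:

```python
from dataclasses import dataclass

@dataclass
class AxisResult:
    axis: str
    score: int      # e.g. on a 1-5 scale
    rationale: str  # the "why" behind the score

RESPONSE_AXES = ["accuracy", "coherence", "relevance",
                 "objectivity", "clarity", "conciseness"]

def summarize(results: list[AxisResult]) -> float:
    """Average the per-axis scores into one headline number."""
    return sum(r.score for r in results) / len(results)

report = [AxisResult(a, 4, "example rationale") for a in RESPONSE_AXES]
print(summarize(report))  # -> 4.0
```

The key design point is that the headline number is derived from the per-axis results, so a low overall score can always be traced back to the specific axis (and rationale) responsible.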

The Magic Ingredient: The "Why"

The coolest part of PEEM isn't the score itself (like a 4 out of 5). It's the commentary that comes with it.

Instead of just saying "You got a 2," PEEM says: "You got a 2 because your recipe was confusing, and the chef forgot to turn on the oven."

This is like having a teacher who writes a detailed note on your homework instead of just circling the wrong answers. It tells you exactly what to fix.

Why is this a big deal?

The paper shows that PEEM is incredibly useful for three reasons:

  1. It's Honest: When PEEM says a model is doing well, it usually matches the old "right/wrong" scores. So, we know it's reliable.
  2. It's Robust: If you rewrite your instructions using different words but keep the same meaning, PEEM gives you the same score. But if you try to trick the AI with confusing or malicious instructions, PEEM catches it and lowers the score.
  3. It's a Self-Improvement Tool: The authors used PEEM to create a "feedback loop." They let PEEM grade the AI's instructions, then fed those comments back to the AI to rewrite its own instructions.
    • Result: The AI got significantly smarter just by listening to PEEM's advice, without needing any human teachers or expensive retraining. It improved its accuracy by up to 11.7 points just by following the "report card."
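The feedback loop described above can be sketched as a simple refinement procedure: grade the prompt, feed the critique back to the model to rewrite it, and keep the rewrite only if the score improves. Here `judge` and `rewriter` are hypothetical stand-ins for the LLM calls the paper uses; the toy versions below exist only so the loop runs end to end:

```python
def refine_prompt(prompt, judge, rewriter, max_rounds=3):
    """Iteratively improve `prompt` using the judge's score and commentary."""
    best_score, commentary = judge(prompt)
    for _ in range(max_rounds):
        candidate = rewriter(prompt, commentary)  # rewrite using the "why"
        score, commentary = judge(candidate)
        if score <= best_score:                   # no improvement: stop early
            break
        prompt, best_score = candidate, score
    return prompt, best_score

# Toy stand-ins: the judge rewards longer prompts, the rewriter adds detail.
toy_judge = lambda p: (len(p), "add more detail")
toy_rewriter = lambda p, c: p + " Be specific."
final, score = refine_prompt("Summarize.", toy_judge, toy_rewriter)
print(final)  # the prompt after three rounds of rewriting
```

No human labels or retraining appear anywhere in this loop: the only training signal is PEEM's own score and commentary, which is what makes the optimization zero-shot.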

The Bottom Line

PEEM changes the game from "Did you get the right answer?" to "How well did you understand the question, and how well did you explain your answer?"

It turns the black box of AI into a transparent process where we can see exactly why an AI succeeds or fails, and gives us a clear map for making it better. It's the difference between a teacher who just hands back a test with a grade, and one who sits down with you to explain how to ace the next one.