Evaluation of LLMs in retrieving food and nutritional context for RAG systems

This paper evaluates four Large Language Models within a Retrieval-Augmented Generation system for food and nutrition data. It finds that while the models effectively translate natural-language queries into structured metadata filters, reducing manual effort, their reliability drops on complex queries whose constraints exceed the representational scope of the underlying metadata.

Maks Požarnik Vavken, Matevž Ogrinc, Tome Eftimov, Barbara Koroušič Seljak

Published Wed, 11 Ma

Imagine you have a massive, dusty library containing every recipe, nutrition fact, and food label in Slovenia. This library is so huge and organized in such a complex way that only a librarian with a PhD in database management could find a specific book. If you asked a normal person, "Show me foods with more protein than cholesterol," they'd be lost.

This paper is about building a super-smart, magical translator to help regular people (like nutritionists or dietitians) talk to this library using plain English, without needing to learn the library's secret code.

Here is the breakdown of how they did it and what they found, using some everyday analogies:

1. The Problem: The "Locked" Library

The researchers started with a goldmine of food data (the Slovenian Food Composition Database). It's like having a library with 32,000 books, but the books are locked in a vault. To get a book, you usually need to know the exact "call number" (a complex code like protein > 12 AND group = 'Cheese').

Nutritionists and doctors don't want to learn these codes. They just want to ask, "What foods are high in protein?"

2. The Solution: The "Magic Translator" (RAG System)

The team built a system called RAG (Retrieval-Augmented Generation). Think of this system as a two-step process:

  • Step A: The Translator (The LLM): You ask the computer a question in plain English. A Large Language Model (like a super-smart AI) acts as a translator. It listens to your question and instantly writes the "secret code" (a database filter) that the library understands.
    • You say: "Show me cheeses with more than 12g of protein."
    • The AI translates to: {"food_group": "Cheeses", "protein": {"$gt": 12}}
  • Step B: The Librarian (The Vector Database): The computer takes that code, runs it through the library, and pulls out the exact books (foods) that match.
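The two steps above can be sketched in a few lines of Python. This is a hedged, minimal illustration, not the paper's implementation: the LLM call in Step A is stubbed out (in practice a model such as GPT or Gemini would emit the filter JSON from a prompt), and the "library" is a toy in-memory list using the MongoDB-style filter syntax from the example.

```python
# A minimal sketch of the two-step flow. Step A is stubbed; Step B
# applies a MongoDB-style filter to a toy in-memory "database".

FOODS = [
    {"name": "Gouda",          "food_group": "Cheeses", "protein": 25.0},
    {"name": "Cottage cheese", "food_group": "Cheeses", "protein": 11.0},
    {"name": "White bread",    "food_group": "Breads",  "protein": 9.0},
]

def translate(question: str) -> dict:
    """Step A (stub): pretend the LLM translated the question."""
    # A real system prompts the model to emit this JSON directly.
    return {"food_group": "Cheeses", "protein": {"$gt": 12}}

def matches(food: dict, filt: dict) -> bool:
    """Step B: check one record against a MongoDB-style filter."""
    for field, cond in filt.items():
        value = food.get(field)
        if isinstance(cond, dict):          # operator form, e.g. {"$gt": 12}
            for op, target in cond.items():
                if op == "$gt" and not value > target:
                    return False
                if op == "$lt" and not value < target:
                    return False
        elif value != cond:                 # exact-match form
            return False
    return True

filt = translate("Show me cheeses with more than 12g of protein.")
hits = [f["name"] for f in FOODS if matches(f, filt)]
print(hits)  # ['Gouda']
```

The point of the sketch is the division of labor: the LLM only writes the filter; the deterministic database code does the actual retrieval.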

3. The Experiment: Testing the Translators

The researchers wanted to see if different AI "translators" (specifically Gemini, GPT, Claude, and Mistral) were good at this job. They tested them with 150 questions of varying difficulty:

  • Easy Questions: "Show me foods with high fat." (Like asking for "Red books.")
  • Medium Questions: "Show me foods with high protein AND low sugar." (Like asking for "Red books that are also heavy.")
  • Hard Questions: "Show me foods where protein is higher than cholesterol." (Like asking for "Books where the cover is heavier than the pages inside.")
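The gap between the medium and hard tiers can be made concrete. In a metadata filter language, every condition compares a field to a constant, so "high protein AND low sugar" fits, but "protein higher than cholesterol" compares two fields and falls outside the language entirely. The sketch below (field names and thresholds are illustrative, not from the paper) shows the difference:

```python
# Medium vs. hard, on a toy dataset. A field-vs-constant filter
# handles the medium case; the hard case needs a field-vs-field
# comparison that a plain metadata filter cannot express.

FOODS = [
    {"name": "Egg",    "protein": 13.0, "sugar": 1.0,  "cholesterol": 0.37},
    {"name": "Yogurt", "protein": 5.0,  "sugar": 12.0, "cholesterol": 0.01},
]

# Medium: expressible as field-vs-constant conditions.
medium = {"protein": {"$gt": 10}, "sugar": {"$lt": 5}}
medium_hits = [f["name"] for f in FOODS
               if f["protein"] > medium["protein"]["$gt"]
               and f["sugar"] < medium["sugar"]["$lt"]]
print(medium_hits)  # ['Egg']

# Hard: there is no {"protein": {"$gt": "cholesterol"}} in the filter
# language, so the comparison has to happen outside it, in code.
hard_hits = [f["name"] for f in FOODS if f["protein"] > f["cholesterol"]]
print(hard_hits)  # ['Egg', 'Yogurt']
```

This is why the hard questions trip up the translator: no matter how well the LLM understands the English, the target language has no word for "compare these two columns."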

4. The Results: The "Traffic Light" System

🟢 Green Light (Easy & Medium Questions)

The AI was amazing.
For simple and moderately complex questions, all the AI models got a near-perfect score (99%+ accuracy). They translated the English perfectly into the database code.

  • Analogy: If you asked for "Red books," the translator wrote the code perfectly, and the librarian found exactly what you wanted. Even the open-source model (Mistral) did just as well as the expensive, famous ones.

🟡 Yellow Light (Hard Questions)

The AI got confused, but had a backup plan.
When the questions got tricky (requiring math or comparisons like "more than cholesterol"), the AI sometimes messed up the code. It couldn't write the perfect filter.

  • The Fallback: When the AI failed to write the perfect code, the system didn't just give up. It switched to a "fuzzy search": instead of looking for an exact match, it looked for foods whose descriptions meant something similar (a semantic, vector-based search).
  • The Result: It wasn't perfect. The system only found about 40-45% of the right answers for these hard questions.
  • Analogy: If you asked for "Books heavier than their pages," the translator got confused and just shouted, "Look for heavy books!" The librarian brought back a pile of heavy books, but only half of them were actually the right ones.

5. The Big Takeaway

The Good News:
We can now let non-technical people (like dietitians) ask complex questions about food data using normal language. The AI is great at turning those questions into database searches for anything that can be clearly defined (like "high protein" or "low sugar").

The Bad News:
The AI still struggles with "brain-teaser" questions that require complex math or comparing two numbers against each other. If the question is too abstract for the database's structure, the AI gets lost.

Summary Analogy

Imagine you are trying to find a specific needle in a haystack.

  • Old Way: You had to know the exact GPS coordinates of the needle.
  • This Paper's Way: You tell a robot, "Find the needle."
    • If the needle is just "in the haystack," the robot finds it instantly (Easy/Medium).
    • If you ask, "Find the needle that is heavier than the straw next to it," the robot gets confused and might bring you a few needles, but miss the right one (Hard).

Conclusion: This technology is a huge step forward for making food data accessible to everyone, but we still need to teach the AI how to handle the really tricky math problems.