Evaluation of LLMs in retrieving food and nutritional context for RAG systems

This paper evaluates four Large Language Models within a Retrieval-Augmented Generation system for food and nutrition data. It finds that while the models effectively translate natural-language queries into structured metadata filters, reducing manual effort, their reliability drops on complex queries whose constraints exceed the representational scope of the underlying metadata.

Maks Požarnik Vavken, Matevž Ogrinc, Tome Eftimov, Barbara Koroušič Seljak

Published Wed, 11 Ma

Imagine you have a massive, dusty library containing every recipe, nutrition fact, and food label in Slovenia. This library is so huge and organized in such a complex way that only a librarian with a PhD in database management could find a specific book. If you asked a normal person, "Show me foods with more protein than cholesterol," they'd be lost.

This paper is about building a super-smart, magical translator to help regular people (like nutritionists or dietitians) talk to this library using plain English, without needing to learn the library's secret code.

Here is the breakdown of how they did it and what they found, using some everyday analogies:

1. The Problem: The "Locked" Library

The researchers started with a goldmine of food data (the Slovenian Food Composition Database). It's like having a library with 32,000 books, but the books are locked in a vault. To get a book, you usually need to know the exact "call number" (a complex code like protein > 12 AND group = 'Cheese').

Nutritionists and doctors don't want to learn these codes. They just want to ask, "What foods are high in protein?"

2. The Solution: The "Magic Translator" (RAG System)

The team built a system called RAG (Retrieval-Augmented Generation). Think of this system as a two-step process:

  • Step A: The Translator (The LLM): You ask the computer a question in plain English. A Large Language Model (like a super-smart AI) acts as a translator. It listens to your question and instantly writes the "secret code" (a database filter) that the library understands.
    • You say: "Show me cheeses with more than 12g of protein."
    • The AI translates to: {"food_group": "Cheeses", "protein": {"$gt": 12}}
  • Step B: The Librarian (The Vector Database): The computer takes that code, runs it through the library, and pulls out the exact books (foods) that match.
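The two steps above can be sketched in a few lines of Python. This is a hedged, minimal illustration, not the paper's implementation: the LLM call in Step A is stubbed out (in practice a model such as GPT or Gemini would emit the filter JSON from a prompt), and the "library" is a toy in-memory list using the MongoDB-style filter syntax from the example.

```python
# A minimal sketch of the two-step flow. Step A is stubbed; Step B
# applies a MongoDB-style filter to a toy in-memory "database".

FOODS = [
    {"name": "Gouda",          "food_group": "Cheeses", "protein": 25.0},
    {"name": "Cottage cheese", "food_group": "Cheeses", "protein": 11.0},
    {"name": "White bread",    "food_group": "Breads",  "protein": 9.0},
]

def translate(question: str) -> dict:
    """Step A (stub): pretend the LLM translated the question."""
    # A real system prompts the model to emit this JSON directly.
    return {"food_group": "Cheeses", "protein": {"$gt": 12}}

def matches(food: dict, filt: dict) -> bool:
    """Step B: check one record against a MongoDB-style filter."""
    for field, cond in filt.items():
        value = food.get(field)
        if isinstance(cond, dict):          # operator form, e.g. {"$gt": 12}
            for op, target in cond.items():
                if op == "$gt" and not value > target:
                    return False
                if op == "$lt" and not value < target:
                    return False
        elif value != cond:                 # exact-match form
            return False
    return True

filt = translate("Show me cheeses with more than 12g of protein.")
hits = [f["name"] for f in FOODS if matches(f, filt)]
print(hits)  # ['Gouda']
```

The point of the sketch is the division of labor: the LLM only writes the filter; the deterministic database code does the actual retrieval.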

3. The Experiment: Testing the Translators

The researchers wanted to see if different AI "translators" (specifically Gemini, GPT, Claude, and Mistral) were good at this job. They tested them with 150 questions of varying difficulty:

  • Easy Questions: "Show me foods with high fat." (Like asking for "Red books.")
  • Medium Questions: "Show me foods with high protein AND low sugar." (Like asking for "Red books that are also heavy.")
  • Hard Questions: "Show me foods where protein is higher than cholesterol." (Like asking for "Books where the cover is heavier than the pages inside.")
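The gap between the medium and hard tiers can be made concrete. In a metadata filter language, every condition compares a field to a constant, so "high protein AND low sugar" fits, but "protein higher than cholesterol" compares two fields and falls outside the language entirely. The sketch below (field names and thresholds are illustrative, not from the paper) shows the difference:

```python
# Medium vs. hard, on a toy dataset. A field-vs-constant filter
# handles the medium case; the hard case needs a field-vs-field
# comparison that a plain metadata filter cannot express.

FOODS = [
    {"name": "Egg",    "protein": 13.0, "sugar": 1.0,  "cholesterol": 0.37},
    {"name": "Yogurt", "protein": 5.0,  "sugar": 12.0, "cholesterol": 0.01},
]

# Medium: expressible as field-vs-constant conditions.
medium = {"protein": {"$gt": 10}, "sugar": {"$lt": 5}}
medium_hits = [f["name"] for f in FOODS
               if f["protein"] > medium["protein"]["$gt"]
               and f["sugar"] < medium["sugar"]["$lt"]]
print(medium_hits)  # ['Egg']

# Hard: there is no {"protein": {"$gt": "cholesterol"}} in the filter
# language, so the comparison has to happen outside it, in code.
hard_hits = [f["name"] for f in FOODS if f["protein"] > f["cholesterol"]]
print(hard_hits)  # ['Egg', 'Yogurt']
```

This is why the hard questions trip up the translator: no matter how well the LLM understands the English, the target language has no word for "compare these two columns."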

4. The Results: The "Traffic Light" System

🟢 Green Light (Easy & Medium Questions)

The AI was amazing.
For simple and moderately complex questions, all the AI models got a near-perfect score (99%+ accuracy). They translated the English perfectly into the database code.

  • Analogy: If you asked for "Red books," the translator wrote the code perfectly, and the librarian found exactly what you wanted. Even the open-source model (Mistral) did just as well as the expensive, famous ones.

🟡 Yellow Light (Hard Questions)

The AI got confused, but had a backup plan.
When the questions got tricky (requiring math or comparisons like "more than cholesterol"), the AI sometimes messed up the code. It couldn't write the perfect filter.

  • The Fallback: When the AI failed to write the perfect code, the system didn't just give up. It switched to a "fuzzy search": instead of looking for an exact match, it looked for foods whose descriptions meant something similar (a semantic, vector-based search).
  • The Result: It wasn't perfect. The system only found about 40-45% of the right answers for these hard questions.
  • Analogy: If you asked for "Books heavier than their pages," the translator got confused and just shouted, "Look for heavy books!" The librarian brought back a pile of heavy books, but only half of them were actually the right ones.

5. The Big Takeaway

The Good News:
We can now let non-technical people (like dietitians) ask complex questions about food data using normal language. The AI is great at turning those questions into database searches for anything that can be clearly defined (like "high protein" or "low sugar").

The Bad News:
The AI still struggles with "brain-teaser" questions that require complex math or comparing two numbers against each other. If the question is too abstract for the database's structure, the AI gets lost.

Summary Analogy

Imagine you are trying to find a specific needle in a haystack.

  • Old Way: You had to know the exact GPS coordinates of the needle.
  • This Paper's Way: You tell a robot, "Find the needle."
    • If the needle is just "in the haystack," the robot finds it instantly (Easy/Medium).
    • If you ask, "Find the needle that is heavier than the straw next to it," the robot gets confused and might bring you a few needles, but miss the right one (Hard).

Conclusion: This technology is a huge step forward for making food data accessible to everyone, but we still need to teach the AI how to handle the really tricky math problems.