DeepEval: How an LLM Evaluates (and Improves) Another LLM

Gaetano Castaldo
18 Mar 2026
ai automation #deepeval #llm #testing #rag #ai-agent

TL;DR - DeepEval is an open source framework that uses a "judge" LLM to evaluate the responses of another LLM, like a professor grading a student's exam. I applied this approach to LidIA, a RAG bot for debt consulting with 12,600 chunks on ChromaDB, moving from 84% to 100% of tests passed in a few iterations. The ideal team: DeepEval identifies failures, Claude Code proposes fixes, I supervise. A true assembly line where bots take care of the growth of other bots.


There's a precise moment when building an AI stops being purely technical and becomes a matter of trust. How do you know your bot answers well? Not "well enough", but truly well, in a measurable, repeatable, verifiable way?

I found a concrete answer while working on LidIA, an AI assistant for debt consulting developed with Tommaso Andrea Smimmo, my client and partner in the Sole24Ore journey. The solution is called DeepEval. The principle is as simple as it is revolutionary: you send an LLM to examine another LLM.


Why Old Unit Tests Aren't Enough for AI

LidIA is a RAG bot (Retrieval-Augmented Generation) built on LlamaIndex with ChromaDB as the vector database. It draws from 4 thematic collections of legal documents: over-indebtedness regulations, case law (judgments and doctrine), structured FAQs, and practical materials like guides and case studies. In total, roughly 12,600 chunks distributed across years of actual cases.

The bot retrieves the 4 most relevant chunks for each collection (16 total) and passes them to gpt-4.1-mini with a structured system prompt that defines 5 conversational phases: context elicitation, knowledge base response, response with disclaimers when the KB is incomplete, management of out-of-scope questions, and integration with the Revenue Agency.
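The fan-out retrieval step can be sketched roughly like this. This is a minimal sketch with stub data in place of the real ChromaDB similarity queries; the collection names and function names are illustrative, not LidIA's actual code:

```python
# Sketch of per-collection retrieval: top 4 chunks from each of the
# 4 thematic collections, 16 chunks of context in total.
# query_collection() is a stand-in for a real ChromaDB query.

COLLECTIONS = ["regulations", "case_law", "faq", "practical_materials"]
TOP_K_PER_COLLECTION = 4

def query_collection(name: str, question: str) -> list[tuple[str, float]]:
    """Stub for a ChromaDB query: returns (chunk, score) pairs,
    already sorted by similarity."""
    return [(f"{name}-chunk-{i}", 1.0 - i * 0.1) for i in range(10)]

def retrieve_context(question: str) -> list[str]:
    context = []
    for name in COLLECTIONS:
        hits = query_collection(name, question)[:TOP_K_PER_COLLECTION]
        context.extend(chunk for chunk, _score in hits)
    return context  # these 16 chunks go into the system prompt

chunks = retrieve_context("When does Art. 67 CCII apply?")
print(len(chunks))  # 16
```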

Before DeepEval, I already had a two-level test system in place:

  1. Static unit tests (with mocks): verify code structure without calling real APIs
  2. Integration tests with keywords: 10 test classes, roughly 60 total cases, verifying if responses contain expected keywords
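A keyword check of this kind can be as simple as the following (an illustrative sketch, not the actual test suite):

```python
# Minimal keyword-based check: the response passes if it mentions
# the expected references. Brittle by design: it says nothing about
# whether the answer is actually correct or grounded in the documents.

def keyword_test(response: str, expected_keywords: list[str]) -> bool:
    return all(kw.lower() in response.lower() for kw in expected_keywords)

response = "The procedure is governed by Art. 67 CCII, which provides..."
print(keyword_test(response, ["Art. 67 CCII"]))   # True
print(keyword_test(response, ["composizione"]))   # False
```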

These tests were useful. But they weren't sufficient. The problem is structural: you can't test an LLM the way you'd test a mathematical function. There's no assertEquals("expected output", bot_response) that makes sense for natural language. The output is nuanced, contextual, variable.

You can verify that the response contains "Art. 67 CCII". You can't verify if that response is actually faithful to the document, relevant to the question, free of hallucinations. For that you need something smarter.


The Assembly Line: DeepEval and Claude Code as a Team

The most important discovery wasn't DeepEval itself. It was understanding how to combine it with Claude Code to create an automatic assembly line for bot improvement.

It works like this:

  1. DeepEval runs the test suite and produces a detailed report: which test cases passed, which failed and why, with metrics for each response
  2. Claude Code reads DeepEval's output and analyzes failure patterns: it understands if the problem is in the prompt, retrieval logic, chunking, or similarity thresholds
  3. Claude Code proposes fixes directly on the bot's code, whether in the system prompt or the RAG pipeline
  4. I supervise and decide which fixes to apply
  5. DeepEval runs the tests again and verifies if the fixes improved results or introduced regressions

It's a continuous cycle. Bot judging bot, bot fixing bot, human maintaining strategic control. An assembly line where each component does what it does best.
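The loop above can be sketched in a few lines. Here `run_suite()` and `propose_fixes()` are hypothetical stand-ins for the DeepEval run and the Claude Code analysis, not real APIs:

```python
# Sketch of the improvement loop: evaluate, collect failures,
# propose fixes, let a human review before applying anything.

def run_suite() -> dict[str, bool]:
    """Stand-in for a DeepEval run: test name -> passed."""
    return {"art67_scope": True, "occ_procedure": False}

def propose_fixes(failures: list[str]) -> list[str]:
    """Stand-in for Claude Code reading the report and suggesting patches."""
    return [f"review prompt/retrieval for: {name}" for name in failures]

report = run_suite()
failures = [name for name, passed in report.items() if not passed]
for fix in propose_fixes(failures):
    print(fix)  # a human reviews each proposed fix before it is applied
```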


How DeepEval Works: The Professor and Student Analogy

DeepEval's mechanism is intuitive. You have a student (your bot, in our case LidIA) and a professor (a second LLM called the judge) who reads each response and evaluates it on precise metrics.

The three main metrics I used for LidIA:

  • AnswerRelevancy (threshold: 0.7): Is the response relevant to the user's question?
  • FaithfulnessMetric (threshold: 0.7): Is the response grounded in the chunks retrieved by the RAG, or is the bot making things up?
  • HallucinationMetric (max threshold: 0.3): Does the bot introduce information not present in the documents?

The judge isn't human. It's gpt-4.1-mini, configured to reason about response quality as an expert in over-indebtedness law would. It evaluates each test case and produces a numeric score with reasoning.

The advantage over keyword-based testing is substantial: a keyword test tells you "the response contains the right word". DeepEval tells you "the response is conceptually correct, doesn't contradict the source documents, and actually answers what the user was asking".


From 84% to 100%: How We Improved LidIA with DeepEval

First evaluation: 84% of test cases passed.

It wasn't a terrible result, but it wasn't enough for a bot helping people in serious financial difficulty. In that context, a wrong or misleading answer has real consequences.

Claude Code analyzed DeepEval's output and identified the main failure patterns: some were about the system prompt (ambiguous instructions on when to cite sources, imprecise handling of out-of-scope cases), others about the RAG system (non-optimal retrieval thresholds, chunking that broke paragraphs at wrong points).

Prompt fixes and RAG pipeline optimization: 96%.

Second iteration with a new DeepEval cycle, Claude Code analysis, more targeted fixes: 100%.

Three numbers that describe a structured process, not a random one. Each iteration was guided by precise data, not intuition.


What Regressions Are and Why They Can Ruin Your AI Bot

There's a specific risk in iterative AI bot development: when you fix one problem, you can create another.

You improve the prompt to better handle questions about debt collection clearances, and inadvertently degrade responses about OCC procedures. DeepEval spots it immediately: in the next report, test cases that previously passed now fail. These are regressions.

Without this control, you optimize blindly. You think you're improving the bot while in some areas you're making it worse. With DeepEval every iteration is verifiable: you know exactly what you gained and what you lost.

It's the same principle as regression testing in classical software, applied to natural language. The difference is the "compiler" that finds errors is an LLM reasoning about semantics, not a syntax parser.
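Spotting regressions between two runs boils down to a simple comparison of pass/fail reports (an illustrative sketch over made-up test names):

```python
# A regression is a test that passed in the previous run
# and fails in the current one.

def find_regressions(previous: dict[str, bool],
                     current: dict[str, bool]) -> list[str]:
    return sorted(
        name for name, passed in previous.items()
        if passed and not current.get(name, False)
    )

before = {"art67_scope": True, "occ_procedure": True, "out_of_scope": False}
after = {"art67_scope": True, "occ_procedure": False, "out_of_scope": True}

print(find_regressions(before, after))  # ['occ_procedure']
```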


Does It Make Sense for Italian SMBs Now?

The technology to do it already exists and is accessible. But it makes sense to contextualize it to your project's development phase.

In the prototyping phase, static unit tests and keyword checks are enough to verify basic structure. Once the bot reaches pre-production or production, answering real users on questions with real impact, DeepEval becomes necessary. Not optional.

Tools are evolving rapidly. In a few months there will probably be something even more accessible for those who don't want to manage testing infrastructure. In the meantime, the smartest choice is to get guidance from someone who does this every day and knows both testing frameworks and the RAG pipelines they apply to.

Don't wait for problems to start testing. If you have a bot in production answering real customers, every wrong answer costs you in trust, relationships, and opportunity.


The Future: Bots Taking Care of the Growth of Other Bots

There's a larger vision behind this approach. We're moving toward a model where AI systems aren't just built by humans, but evaluated, corrected, and improved by other AI systems, with the human maintaining strategic oversight.

This isn't science fiction. It's what I did on LidIA in recent weeks, using tools that are already available, at reasonable cost (the DeepEval judge with gpt-4.1-mini costs roughly 0.01-0.05 USD per test).

It's the shift from line-by-line unit tests to an evaluation suite that reasons about natural language. It's the shift from classical software engineering to AI systems engineering.

If you're building a bot for your company (whether it's a RAG assistant, a customer care chatbot, or an agent for internal processes) the question is no longer "does it work?", but "how do you know?".


Want to understand if your AI is production-ready? Start with a conversation.

Contact us: let's evaluate your AI system together, from prompt to RAG to test suite.

Gaetano Castaldo

Founder & CEO · Castaldo Solutions

Digital transformation consultant with enterprise experience. I help Italian SMBs adopt AI, CRM, and IT architectures with measurable results in 90 days.
