Surpassing 80% quality with continuous AI evaluation

Written by
François De Fitte
Cofounder @Basalt
Published on
August 28, 2025

Introduction: the critical need for rigorous evaluation and monitoring


The rapid advancement of large language models (LLMs) has made rigorous performance evaluation and vigilant monitoring essential once these models are deployed in production environments. Evaluating an AI model is inherently multidimensional, encompassing technical, ethical, and practical criteria. Even a model initially demonstrating strong performance can deteriorate over time if left unsupervised in real-world conditions. To ensure reliability and positive impact, a comprehensive evaluation before deployment must be paired with continuous monitoring afterward.

Model monitoring involves the ongoing observation of inputs processed and outputs generated by a machine learning system post-deployment. This process is crucial for detecting anomalies, drifts, or degradation in quality early on. Effective monitoring reduces the detection time for issues, enabling rapid responses from engineering teams and safeguarding model integrity.
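As a concrete illustration, a minimal sketch of this kind of post-deployment observation might wrap every model call, record the input, output, and latency, and flag simple anomalies such as empty responses or unusually slow calls. The `call_model` callable and the latency threshold below are hypothetical placeholders, not part of any specific SDK.

```python
import time
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-monitor")

# Hypothetical threshold; real values depend on the model and the product.
MAX_LATENCY_SECONDS = 5.0

def monitored_call(call_model: Callable[[str], str], prompt: str) -> str:
    """Run a model call while recording its input, output, and latency."""
    start = time.perf_counter()
    output = call_model(prompt)
    latency = time.perf_counter() - start

    # Log every input/output pair so drifts and regressions can be analyzed later.
    logger.info("prompt_chars=%d output_chars=%d latency=%.2fs",
                len(prompt), len(output), latency)

    # Flag simple anomalies early: empty responses or unusually slow calls.
    if not output.strip():
        logger.warning("empty model response for prompt: %.80s", prompt)
    if latency > MAX_LATENCY_SECONDS:
        logger.warning("slow model call: %.2fs > %.2fs", latency, MAX_LATENCY_SECONDS)

    return output
```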

 

Overcoming the 80% quality barrier in production AI agents


Developers building AI agents with popular frameworks such as Basalt frequently reach quality levels in the 70 to 80 percent range. That level, however, often proves insufficient for delivering reliable client-facing features in production. Surpassing the 80% threshold usually requires deep reverse engineering of the framework, careful prompt engineering, and control-flow optimization; in some cases it demands restarting the design from scratch.

Insights from the "12-factor agents" methodology reveal that many purported AI agents are, in reality, predominantly deterministic software systems with occasional LLM calls sprinkled in to create an illusion of intelligence. Truly effective agents diverge from the simplistic “prompt + tools + loop until goal” paradigm. Instead, they are chiefly composed of robust software modules with tightly controlled AI integration.
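To make that contrast concrete, the sketch below shows the kind of structure the methodology argues for: a mostly deterministic pipeline in which plain code handles parsing and routing, and a single, narrowly scoped LLM call appears only where judgment is genuinely required. The `summarize_with_llm` helper is a hypothetical stand-in for whichever model client a team actually uses.

```python
import re
from dataclasses import dataclass

@dataclass
class Ticket:
    subject: str
    body: str

def classify(ticket: Ticket) -> str:
    """Deterministic routing: plain code, no model call needed."""
    if re.search(r"refund|chargeback", ticket.body, re.IGNORECASE):
        return "billing"
    if re.search(r"error|crash|bug", ticket.body, re.IGNORECASE):
        return "engineering"
    return "general"

def summarize_with_llm(text: str) -> str:
    """Hypothetical, narrowly scoped LLM call; the only non-deterministic step."""
    # In a real system this would call the team's model provider client.
    raise NotImplementedError("plug in your LLM client here")

def handle(ticket: Ticket) -> dict:
    """Mostly deterministic control flow with one tightly controlled AI integration point."""
    queue = classify(ticket)                      # deterministic
    summary = summarize_with_llm(ticket.body)     # controlled LLM call
    return {"queue": queue, "summary": summary}   # deterministic assembly
```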

The fastest route to high-quality AI software delivery involves embedding modular agent construction principles into existing products. These modular concepts are accessible to experienced software engineers, even those lacking deep AI expertise, enabling rapid iteration and improved reliability without rewriting entire systems.

Striving for reliability and high-quality outputs


Technical evaluation aims to verify that models perform their intended tasks with accuracy, consistency, and relevance, especially in sensitive applications. Rigorous testing ensures models meet user expectations for reliability, while also addressing bias, security, and ethical considerations. Evaluating LLMs thus serves as both a performance benchmark and a responsibility safeguard.

The overarching goal of monitoring is to maintain a holistic view of model performance encompassing business metrics, ethical standards, and technical soundness. The “12 factors” framework supports this objective by promoting more reliable, scalable, and maintainable LLM-powered software.

Two critical aspects of agent design are prompt mastery and control-flow management. "Own your prompts" and "own your control flow," as highlighted in the 12-factor principles, underscore the importance of structuring prompts as plain functions and of compacting errors within the context window to optimize performance and robustness.
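A minimal sketch of what owning prompts and control flow can look like in practice: the prompt is an ordinary, testable function of its inputs, and tool failures are compacted into short context entries rather than full stack traces. The names below are illustrative and not taken from any particular framework.

```python
def triage_prompt(issue_title: str, issue_body: str, labels: list[str]) -> str:
    """Own your prompts: the prompt is a plain function, versioned and testable like code."""
    return (
        "You are a triage assistant.\n"
        f"Available labels: {', '.join(labels)}\n"
        f"Issue title: {issue_title}\n"
        f"Issue body: {issue_body}\n"
        "Reply with exactly one label."
    )

def compact_error(step: str, error: Exception, max_chars: int = 200) -> str:
    """Own your control flow: summarize failures before they enter the context window."""
    message = f"{step} failed: {type(error).__name__}: {error}"
    return message[:max_chars]

# Example: only the compacted error, not a full traceback, is appended to the context.
context: list[str] = []
try:
    raise TimeoutError("search API did not respond within 10s")
except Exception as exc:
    context.append(compact_error("web_search", exc))
```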

Continuous evaluation: the cornerstone of sustained quality


Even models that perform well initially can suffer from data drift or concept drift as the real-world environment evolves. Without vigilant monitoring, these shifts can lead to unnoticed degradation and significant errors. Continuous monitoring serves as a real-time thermometer, alerting teams to performance drops, data changes, or misalignments with business and technical objectives.

Evaluation and monitoring form a virtuous cycle: insights from production monitoring feed back into offline evaluation criteria, improving model robustness iteratively. The emerging paradigm of Continuous Integration, Continuous Evaluation, and Continuous Deployment (CI/CE/CD) reflects the necessity of embedding ongoing testing and assessment throughout an LLM product’s lifecycle.
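In practice, the continuous-evaluation stage can start as a simple gate in the existing CI pipeline: run an evaluation suite against a fixed offline dataset and fail the build when the aggregate score drops below an agreed bar. The `run_eval_suite` function, the dataset path, and the 0.8 threshold below are assumptions for illustration, not a prescribed API.

```python
import sys

# Hypothetical evaluation entry point: returns an aggregate score between 0 and 1
# for the current prompt/model version against a fixed offline dataset.
def run_eval_suite(dataset_path: str) -> float:
    raise NotImplementedError("plug in your evaluation harness here")

QUALITY_THRESHOLD = 0.8  # assumed release bar; tune per product

def main() -> int:
    score = run_eval_suite("evals/regression_set.jsonl")
    print(f"aggregate eval score: {score:.3f} (threshold {QUALITY_THRESHOLD})")
    # Fail the CI job if the new version regresses below the agreed bar.
    return 0 if score >= QUALITY_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main())
```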

Comprehensive AI model evaluation includes three complementary dimensions. First, the assessment of knowledge and capabilities measures task performance across cognitive and linguistic benchmarks. Second, alignment evaluation ensures ethical and societal standards are met, examining adherence to moral guidelines, bias mitigation, and toxicity control. Third, security evaluation tests the model’s resilience to adversarial prompts, privacy risks, and error detection.

Offline evaluation—conducted on predefined datasets before deployment—is reproducible and cost-effective but lacks full insight into user experience. Online evaluation, leveraging A/B testing, user feedback, and production logs, captures real-world complexity. A combination of both provides a robust framework for understanding and improving model quality.

Automated metrics such as accuracy, perplexity, BLEU, ROUGE, and latency offer rapid quantitative assessment. However, human judgment remains indispensable for qualitative aspects like narrative coherence and user experience. Emerging approaches using LLMs to judge other models show promise but require caution to avoid reinforcing biases or creating feedback loops.
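As a small illustration of the automated side, the sketch below computes exact-match accuracy and mean latency over a tiny evaluation set; metrics such as BLEU or ROUGE would come from dedicated libraries, and the `model` callable is a hypothetical placeholder.

```python
import time
from typing import Callable

def evaluate(model: Callable[[str], str], examples: list[tuple[str, str]]) -> dict:
    """Compute exact-match accuracy and mean latency over (prompt, expected) pairs."""
    correct = 0
    latencies = []
    for prompt, expected in examples:
        start = time.perf_counter()
        prediction = model(prompt)
        latencies.append(time.perf_counter() - start)
        if prediction.strip().lower() == expected.strip().lower():
            correct += 1
    return {
        "exact_match": correct / len(examples),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

# Usage with a trivial stand-in model:
examples = [("Capital of France?", "Paris"), ("2 + 2 =", "4")]
print(evaluate(lambda p: "Paris" if "France" in p else "4", examples))
```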

Metrics and best practices for production monitoring


Monitoring in production tracks a variety of indicators:

  • Technical performance metrics, aligned with those used in training and testing, including precision, recall, and error rates.
  • Data drift indicators, detecting shifts in input or prediction distributions through statistical tests (see the sketch after this list).
  • Business and usage metrics linked to the model’s objectives, such as conversion rates and user engagement.
  • Confidence and ethical indicators, including prediction confidence scores, refusal rates, toxicity levels, and fairness indices.
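
For the drift indicators in particular, a minimal sketch can compare a reference window of a numeric input feature against the most recent production window with a two-sample Kolmogorov-Smirnov test from `scipy.stats`; the feature, window sizes, and significance level below are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.05  # assumed significance level for the drift alert

def detect_drift(reference: np.ndarray, current: np.ndarray) -> bool:
    """Return True if the current window's distribution differs significantly
    from the reference window (two-sample Kolmogorov-Smirnov test)."""
    statistic, p_value = ks_2samp(reference, current)
    print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")
    return p_value < ALPHA

# Example with synthetic data: the second window is shifted, simulating drift
# in a numeric input feature such as prompt length.
rng = np.random.default_rng(0)
reference_window = rng.normal(loc=120, scale=30, size=1000)  # historical prompt lengths
current_window = rng.normal(loc=160, scale=30, size=1000)    # recent prompt lengths
print("drift detected:", detect_drift(reference_window, current_window))
```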

Effective monitoring implements automated alert thresholds, anomaly detection models, and rolling windows to detect gradual shifts. When degradation is detected, retraining or recalibration with fresh data can restore performance. Documentation and historical tracking of metrics are essential for trend analysis and governance compliance.
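A rolling-window alert of the kind described above can be sketched in a few lines: keep the most recent per-request quality scores and flag the moment their rolling mean falls below a configured threshold. The window size and threshold are illustrative assumptions.

```python
from collections import deque

class RollingQualityMonitor:
    """Track a rolling window of per-request quality scores and flag gradual degradation."""

    def __init__(self, window_size: int = 500, alert_threshold: float = 0.75):
        self.scores = deque(maxlen=window_size)  # only the most recent scores are kept
        self.alert_threshold = alert_threshold

    def record(self, score: float) -> bool:
        """Add a new score (0 to 1); return True once the full window's mean breaches the threshold."""
        self.scores.append(score)
        rolling_mean = sum(self.scores) / len(self.scores)
        return len(self.scores) == self.scores.maxlen and rolling_mean < self.alert_threshold

# Usage: feed scores as requests come in; a True return value would trigger an alert.
monitor = RollingQualityMonitor(window_size=100, alert_threshold=0.8)
for score in [0.9] * 60 + [0.5] * 60:  # simulated gradual degradation
    if monitor.record(score):
        print("alert: rolling quality below threshold")
        break
```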

Conclusion


The deployment of AI agents based on LLMs demands a rigorous, continuous evaluation and monitoring strategy to ensure reliable, high-quality performance beyond the 80% quality threshold. Combining upfront comprehensive testing with real-time production oversight addresses the dynamic nature of real-world data and user interaction. The integration of modular agent principles, continuous evaluation cycles, and multi-dimensional assessment frameworks forms the foundation for building trustworthy AI systems capable of meeting both technical and ethical standards in production environments. This ongoing vigilance is not only a technical necessity but a critical responsibility in the deployment of AI at scale.
