We ran 10,000 evals. Here’s what we learned

Published on October 3, 2025

Deploying large language models at scale: overcoming the challenges of production quality and reliability


Deploying large language models (LLMs) at scale presents significant challenges in maintaining consistently high performance and reliability. While initial testing phases can offer confidence, real-world applications often reveal unforeseen issues that demand ongoing attention. Through more than 10,000 evaluations, we have gathered crucial insights into the realities of LLM deployment. These findings underscore the necessity of rigorous, continuous evaluation, paired with meticulous engineering of prompts and control flows, to meet the stringent quality standards required in production environments.

A key insight concerns what we refer to as the "80% bar" for production quality. Agent frameworks can accelerate reaching a baseline performance level of roughly 70 to 80 percent. However, this level often falls short for critical customer-facing applications. Surpassing this threshold usually requires reverse-engineering the entire framework, including prompts, control flows, and integration with deterministic code, demanding deep expertise in how the LLM interacts with its operational environment.

The continuous, multidimensional nature of LLM evaluation


LLM evaluation is far from a one-off task. It is a continuous, multidimensional process that spans several distinct facets:

  • Technical capabilities such as accuracy, robustness, context understanding, and generation quality.
  • Alignment criteria including adherence to values, avoidance of bias, and prevention of toxic content.
  • Security aspects covering attack resistance, confidentiality, and the reliability of the model’s reasoning.

To properly evaluate these dimensions, a comprehensive set of tests is essential. Established benchmarks like GLUE, SuperGLUE, MMLU, TruthfulQA, and HumanEval provide indispensable tools for reproducible and comparative assessment on well-defined tasks.

Moreover, an effective evaluation framework combines offline evaluation on curated “golden datasets”, which provides consistency and reproducibility, with online methods such as A/B testing, which capture real-world user satisfaction and the complexity of live usage. Automatic metrics underpin both modes, including:

  • Accuracy
  • Perplexity
  • BLEU and ROUGE scores
  • Precision, recall, and F1 score
  • Toxicity measures
  • Latency

These metrics complement human judgment, which remains critical for qualitative aspects such as narrative coherence and contextual relevance. There is growing interest in using AI as an evaluator (LLM-as-a-judge) to accelerate this process, though such approaches require careful calibration to avoid bias.
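
To make this concrete, here is a minimal sketch of an offline evaluation loop over a golden dataset. The dataset, the `call_model` placeholder, and the choice of metrics (exact match, a token-level F1 as a rough stand-in for overlap scores such as ROUGE-1, and latency) are illustrative assumptions, not a prescription:

```python
# Minimal offline evaluation sketch over a "golden dataset".
# `call_model` is a placeholder for whatever LLM client you use.
import time
from collections import Counter

golden_dataset = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
    {"prompt": "2 + 2 =", "reference": "4"},
]

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1, a rough proxy for overlap-based metrics like ROUGE-1."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def run_offline_eval(dataset):
    results = []
    for example in dataset:
        start = time.perf_counter()
        output = call_model(example["prompt"])
        latency = time.perf_counter() - start
        results.append({
            "exact_match": output.strip().lower() == example["reference"].strip().lower(),
            "f1": token_f1(output, example["reference"]),
            "latency_s": latency,
        })
    n = len(results)
    return {
        "accuracy": sum(r["exact_match"] for r in results) / n,
        "mean_f1": sum(r["f1"] for r in results) / n,
        "mean_latency_s": sum(r["latency_s"] for r in results) / n,
    }
```

Because the harness is just code, it can be rerun on every prompt or model change so that regressions surface before deployment rather than in production.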

Mastering prompts and control flows: the foundation for high-quality applications


One of the most important lessons is the need to treat prompts as critical pieces of code. This involves versioning prompts, integrating them into automated pipelines, and continuously refining them based on feedback and monitoring data. Simply relying on the LLM to autonomously decide each step in real-time rarely leads to optimal results.
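
As a rough illustration of what treating prompts as code can look like, the sketch below stores a prompt as a versioned artifact with a regression test that runs in CI. The `PromptVersion` class, the prompt text, and the test are hypothetical, not a specific product API:

```python
# Sketch: prompts as versioned, testable artifacts rather than inline strings.
# The structure and names here are illustrative, not a particular product's API.
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str          # bumped on every change, like a code release
    template: str         # the prompt body, with named placeholders

    def render(self, **variables) -> str:
        return self.template.format(**variables)

SUMMARIZER_V3 = PromptVersion(
    name="ticket-summarizer",
    version="3.0.0",
    template=(
        "Summarize the support ticket below in at most {max_sentences} sentences.\n"
        "Ticket:\n{ticket_text}"
    ),
)

def test_summarizer_prompt_renders_required_fields():
    """Regression test run in CI whenever the prompt changes."""
    rendered = SUMMARIZER_V3.render(max_sentences=2, ticket_text="App crashes on login.")
    assert "App crashes on login." in rendered
    assert "2 sentences" in rendered
```

Versioning the template this way means every prompt change shows up in review, can be rolled back like any other code change, and can be tied to the evaluation and monitoring data that motivated it.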

Successful LLM-powered agents typically rely on deterministic code as their backbone, with LLM calls strategically embedded to enhance functionality. The core design pattern is:

  • The LLM determines the next action.
  • Deterministic code executes that action.
  • The results are appended back into the context.

This approach creates a seamless and reliable user experience.
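
A minimal sketch of that loop is shown below. The `llm_decide` placeholder stands in for a structured-output LLM call, and `lookup_order` is a hypothetical deterministic tool; neither reflects a particular framework:

```python
# Sketch of the pattern: the LLM picks the next action, deterministic code
# executes it, and the result is appended back into the context.
# `llm_decide` is a placeholder for a real structured-output LLM call.
import json

def lookup_order(order_id: str) -> str:
    # Deterministic backbone: ordinary code, fully testable.
    return json.dumps({"order_id": order_id, "status": "shipped"})

TOOLS = {"lookup_order": lookup_order}

def llm_decide(context: list[str]) -> dict:
    """Placeholder: ask the model for the next action as JSON, e.g.
    {"tool": "lookup_order", "args": {"order_id": "A-42"}} or {"tool": "finish", "answer": "..."}."""
    raise NotImplementedError("plug in your LLM client here")

def run_agent(user_request: str, max_steps: int = 5) -> str:
    context = [f"User: {user_request}"]
    for _ in range(max_steps):
        action = llm_decide(context)                      # 1. the LLM determines the next action
        if action["tool"] == "finish":
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])  # 2. deterministic code executes it
        context.append(f"Observation: {result}")          # 3. the result is appended to the context
    return "Stopped after reaching the step limit."
```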

The indispensable role of monitoring in production


Even models that perform well initially can degrade or drift once deployed in real-world environments. Monitoring acts as a vital "thermometer," alerting teams when issues arise. This includes detecting drops in performance and identifying data or concept drift.

Effective monitoring covers various metrics:

  • Technical metrics such as accuracy and F1 scores.
  • Data drift detection through statistical distribution tests.
  • Business and usage indicators, including conversion rates and click-through rates.
  • Trust and ethics measures such as toxicity levels, fairness, and refusal rates.

User feedback, both explicit and implicit, serves as a valuable signal to detect production problems early. Automated alerting systems should be configured to enable rapid response. When significant drift is detected, retraining or recalibrating the model often becomes necessary to restore quality.
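
As one concrete example of drift detection, the sketch below compares a production window against a reference window with a two-sample Kolmogorov–Smirnov test from SciPy and raises an alert when the distributions diverge. The monitored feature (prompt length) and the significance threshold are illustrative assumptions:

```python
# Sketch: compare a production window against a reference window with a
# two-sample Kolmogorov-Smirnov test and alert when the distributions diverge.
# The monitored feature (prompt length) and the p-value threshold are illustrative.
from scipy.stats import ks_2samp

def detect_drift(reference: list[float], production: list[float], alpha: float = 0.01) -> bool:
    statistic, p_value = ks_2samp(reference, production)
    drifted = p_value < alpha
    if drifted:
        # In practice this would page the on-call channel or open an incident.
        print(f"ALERT: drift detected (KS={statistic:.3f}, p={p_value:.4f})")
    return drifted

# Example: prompt lengths (in tokens) from last month vs. the current day.
reference_lengths = [120, 135, 128, 140, 118, 132, 125]
today_lengths = [310, 295, 280, 305, 320, 290, 300]
detect_drift(reference_lengths, today_lengths)
```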

A virtuous cycle of continuous improvement


Insights gained from production monitoring should inform and refine offline evaluation criteria, creating a virtuous cycle. This iterative feedback loop enables continuous strengthening of the model’s robustness by integrating real-world performance data and user experience into subsequent development phases.

Conclusion


The extensive evaluation of LLMs, spanning thousands of tests, demonstrates that consistent performance and reliability are not accidental. Achieving customer-grade quality requires rigorous engineering that combines thorough pre-deployment evaluation with continuous monitoring and iterative optimization of prompts and control flows. Recognizing that LLMs are powerful yet benefit significantly from deterministic scaffolding is fundamental to meeting the high standards expected in production.
