The Evaluator
Your go-to blog for insights on AI observability and evaluation.
LLM-as-a-Judge: Example of How To Build a Custom Evaluator Using a Benchmark Dataset
When To Build Custom Evaluators: Arize-Phoenix ships with pre-built evaluators that are tested against benchmark datasets and tuned for repeatability. They’re a fast way to stand up rigorous evaluation for common scenarios. In practice, though, many teams work in specialized domains (such as medicine, finance, and agriculture) where models depend on proprietary data…
ADB Database: Realtime Ingestion At Scale
We published our first blog post introducing the Arize database (ADB) at the beginning of July; this post dives deeper into ADB's real-time ingestion support. AI data use cases require an interface that supports both large files (like custom datasets) and real-time events from traces. The Arize AX platform was…
New In Arize AX: Prompt Learning, Arize Tracing Assistant, and Multiagent Visualization
July was a big month for Arize AX, with updates to make AI and agent engineering much easier. From prompt learning to new skills for Alyx and OpenInference Java, there is a lot to dive into. Here are some highlights of what we shipped. Alyx Updates: At Arize:Observe, we debuted Alyx: an AI-powered assistant with…
A Watermark for Large Language Models
In our latest live AI research papers community reading, the primary author of the popular paper A Watermark For Large Language Models (John Kirchenbauer of University of Maryland) walked us through the thinking behind the paper, technical approach, and key takeaways. The paper’s proposed watermark can be embedded with negligible impact on text quality, and…
Unlocking Safer AI: Your Two-Part Field Guide
Large language models are reshaping how we build products — and how adversaries try to break them. To help teams stay ahead, Sofia Jakovcevic — AI Solutions Engineer at Arize AI and an alumna of OpenAI — wrote this two-part guide on how jailbreaks really work and how modern guardrails can shut them down. Skim…
LLM Observability for AI Agents and Applications
The era of single-turn LLM calls is behind us. Today’s AI products are powered by increasingly autonomous agents — multi-step systems that plan, reason, use tools, and adapt in real time — or by complex applications built from chained prompts, retrieval pipelines, and external APIs. These systems are dynamic and stateful. Since they handle memory,…
Prompt Learning: Using English Feedback to Optimize LLM Systems
Applications of reinforcement learning (RL) in AI model building have been a growing topic over the past few months. From DeepSeek models incorporating RL mechanics into their training processes to other success stories of RL-based improvement, “AI Twitter” has been ablaze. As more agents get deployed, a question emerges: can reinforcement learning control systems be…
Self-Adapting Language Models: Paper Authors Discuss Implications
In a recent live AI research paper reading, the authors of the new paper Self-Adapting Language Models (SEAL) shared a behind-the-scenes look at their work, motivations, results, and future directions. The paper introduces a novel method for enabling large language models (LLMs) to adapt their own weights using self-generated data and training directives — “self-edits.”…
Meet Alyx: Arize’s Evolving AI Agent
We’re excited to introduce Alyx, the next evolution in Arize’s intelligent assistant. You might remember our first iteration — Copilot — launched last year as a set of tools to help users accomplish small tasks across the platform. Here’s the original blog I wrote about it. Copilot gave us a glimpse into the future, but…
Introducing ADB: Arize’s Proprietary OLAP Database
Earlier this month, we rolled out real‑time ingestion support to every Arize AX workspace—paid and free. With that launch, Arize now ingests terabytes of data every day across hundreds of customers and thousands of users, all at (near) real‑time speeds. How do we keep those speeds consistent across so many workloads? The answer is our…