DRAFT Agentic Design Patterns - Evaluation and Monitoring

ai, agents, monitoring, evaluation, typescript, langchain, langgraph, vercel
By sko X opus 4.1 · 9/21/2025 · 16 min read

Learn how to implement comprehensive evaluation and monitoring systems for AI agents in production. We'll build observable, testable, and optimizable agent systems using TypeScript, LangChain, and LangGraph on the Vercel platform.

Mental Model: The Observatory Pattern

Think of agent evaluation and monitoring like running a space telescope observatory. You need multiple instruments (metrics) observing different wavelengths (aspects of behavior), continuous tracking (monitoring) to catch transient events (errors/anomalies), and calibration systems (evaluation) to ensure accuracy. Just as astronomers combine data from multiple telescopes to understand celestial objects, we combine multiple evaluation and monitoring approaches to understand agent behavior comprehensively.

Basic Example: Agent with Built-in Evaluation

Let's start with a simple customer support agent that includes basic evaluation and monitoring capabilities.

1. Define Evaluation Types and Metrics

// app/lib/evaluation/types.ts
import { z } from 'zod';

export const EvaluationMetricSchema = z.object({
  accuracy: z.number().min(0).max(1),
  relevance: z.number().min(0).max(1),
  coherence: z.number().min(0).max(1),
  latency: z.number(),
  tokenUsage: z.object({
    input: z.number(),
    output: z.number(),
    total: z.number(),
  }),
  cost: z.number(),
  timestamp: z.string().datetime(),
});

export type EvaluationMetric = z.infer<typeof EvaluationMetricSchema>;

export const AgentTraceSchema = z.object({
  traceId: z.string(),
  parentId: z.string().optional(),
  agentName: z.string(),
  input: z.any(),
  output: z.any(),
  metrics: EvaluationMetricSchema,
  timestamp: z.string().datetime().optional(),
  errors: z.array(z.string()).default([]),
  metadata: z.record(z.any()).default({}),
});

export type AgentTrace = z.infer<typeof AgentTraceSchema>;

Types define the shape of our evaluation data with Zod for runtime validation.
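
Because the schema carries runtime validators, it can guard any boundary where trace data enters the system (a webhook, a queue consumer, a log ingester). A minimal sketch; the helper and its file path are illustrative additions, not part of the files above:

// app/lib/evaluation/validate.ts (illustrative helper)
import { AgentTrace, AgentTraceSchema } from './types';

// Validate untrusted trace data before storing or forwarding it.
export function parseTrace(raw: unknown): AgentTrace | null {
  const result = AgentTraceSchema.safeParse(raw);
  if (!result.success) {
    console.warn('Invalid trace payload:', result.error.flatten());
    return null;
  }
  return result.data; // typed as AgentTrace, with defaults applied
}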

2. Create Monitoring Callback Handler

// app/lib/monitoring/callback.ts
import { BaseCallbackHandler } from '@langchain/core/callbacks/base';
import { Serialized } from '@langchain/core/load/serializable';
import { ChainValues } from '@langchain/core/utils/types';
import { LLMResult } from '@langchain/core/outputs';
import { AgentTrace, EvaluationMetric } from '../evaluation/types';
import { v4 as uuidv4 } from 'uuid';

export class MonitoringCallbackHandler extends BaseCallbackHandler {
  name = 'MonitoringCallbackHandler';
  private traces: Map<string, Partial<AgentTrace>> = new Map();
  private startTimes: Map<string, number> = new Map();

  async handleChainStart(
    chain: Serialized,
    inputs: ChainValues,
    runId: string,
  ): Promise<void> {
    const traceId = uuidv4();
    this.startTimes.set(runId, Date.now());
    
    this.traces.set(runId, {
      traceId,
      agentName: chain.id?.[chain.id.length - 1] || 'unknown',
      input: inputs,
      timestamp: new Date().toISOString(),
      errors: [],
      metadata: {},
    });
  }

  async handleChainEnd(
    outputs: ChainValues,
    runId: string,
  ): Promise<void> {
    const trace = this.traces.get(runId);
    const startTime = this.startTimes.get(runId);
    
    if (trace && startTime) {
      const latency = Date.now() - startTime;
      
      // Calculate token usage (simplified - in production, get from LLM response)
      const tokenUsage = {
        input: JSON.stringify(trace.input).length / 4, // Rough estimate
        output: JSON.stringify(outputs).length / 4,
        total: 0,
      };
      tokenUsage.total = tokenUsage.input + tokenUsage.output;
      
      // Estimate cost using illustrative per-1K-token rates; substitute your model's actual pricing
      const cost = (tokenUsage.input * 0.00025 + tokenUsage.output * 0.0005) / 1000;
      
      const metrics: EvaluationMetric = {
        accuracy: 0, // Will be calculated by evaluator
        relevance: 0,
        coherence: 0,
        latency,
        tokenUsage,
        cost,
        timestamp: new Date().toISOString(),
      };
      
      trace.output = outputs;
      trace.metrics = metrics;
      
      // Send to monitoring service
      await this.sendToMonitoring(trace as AgentTrace);
    }
    
    this.traces.delete(runId);
    this.startTimes.delete(runId);
  }

  async handleChainError(
    err: Error,
    runId: string,
  ): Promise<void> {
    const trace = this.traces.get(runId);
    if (trace) {
      trace.errors = [...(trace.errors || []), err.message];
      await this.sendToMonitoring(trace as AgentTrace);
    }
    // Clean up so failed runs do not leak map entries
    this.traces.delete(runId);
    this.startTimes.delete(runId);
  }

  // Direct model calls (model.invoke) surface as LLM runs rather than chain
  // runs, so mirror the same bookkeeping for the LLM callbacks.
  async handleLLMStart(
    llm: Serialized,
    prompts: string[],
    runId: string,
  ): Promise<void> {
    await this.handleChainStart(llm, { prompts }, runId);
  }

  async handleLLMEnd(
    output: LLMResult,
    runId: string,
  ): Promise<void> {
    await this.handleChainEnd({ ...output }, runId);
  }

  async handleLLMError(
    err: Error,
    runId: string,
  ): Promise<void> {
    await this.handleChainError(err, runId);
  }

  private async sendToMonitoring(trace: AgentTrace): Promise<void> {
    // In production, send to your monitoring service
    console.log('Trace:', JSON.stringify(trace, null, 2));
    
    // Example: Send to Langfuse, DataDog, or custom endpoint
    if (process.env.MONITORING_ENDPOINT) {
      await fetch(process.env.MONITORING_ENDPOINT, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(trace),
      });
    }
  }
}

Custom callback handler captures detailed metrics from agent execution.
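
To see the handler in action, pass it in the callbacks array when invoking a chain; callbacks supplied at invoke time propagate to nested runs. A small sketch, with an illustrative prompt/model pipeline:

// Sketch: attaching the handler to a chain so the chain callbacks fire
import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { PromptTemplate } from '@langchain/core/prompts';
import { MonitoringCallbackHandler } from './callback';

const handler = new MonitoringCallbackHandler();

const chain = PromptTemplate.fromTemplate('Answer concisely: {question}').pipe(
  new ChatGoogleGenerativeAI({ model: 'gemini-2.5-pro', apiKey: process.env.GOOGLE_API_KEY })
);

// Callbacks passed at invoke time are propagated to every run in the chain
const answer = await chain.invoke(
  { question: 'What is our refund window?' },
  { callbacks: [handler] }
);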

3. Implement LLM-as-a-Judge Evaluator

// app/lib/evaluation/evaluator.ts
import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { PromptTemplate } from '@langchain/core/prompts';
import { z } from 'zod';
import { StructuredOutputParser } from 'langchain/output_parsers';
import { EvaluationMetric } from './types';
import { memoize } from 'es-toolkit';

const EvaluationResultSchema = z.object({
  accuracy: z.number().min(0).max(1),
  relevance: z.number().min(0).max(1),
  coherence: z.number().min(0).max(1),
  reasoning: z.string(),
});

export class LLMEvaluator {
  private model: ChatGoogleGenerativeAI;
  private parser: StructuredOutputParser<typeof EvaluationResultSchema>;
  
  constructor() {
    this.model = new ChatGoogleGenerativeAI({
      modelName: 'gemini-2.5-pro',
      temperature: 0,
      apiKey: process.env.GOOGLE_API_KEY,
    });
    
    this.parser = StructuredOutputParser.fromZodSchema(EvaluationResultSchema);
  }
  
  // Memoize evaluation for identical inputs to reduce costs.
  // A single object argument keeps the whole input in the cache key
  // (es-toolkit's memoize keys on one argument).
  evaluate = memoize(
    async (params: { input: string; output: string; expectedOutput?: string }) => {
      const { input, output, expectedOutput } = params;
      const formatInstructions = this.parser.getFormatInstructions();
      
      const prompt = PromptTemplate.fromTemplate(`
        Evaluate the following agent response:
        
        Input: {input}
        Agent Output: {output}
        {expectedOutput}
        
        Evaluate based on:
        1. Accuracy: How factually correct is the response? (0-1)
        2. Relevance: How well does it address the input? (0-1)
        3. Coherence: How clear and well-structured is it? (0-1)
        
        {formatInstructions}
      `);
      
      const response = await this.model.invoke(
        await prompt.format({
          input,
          output,
          expectedOutput: expectedOutput 
            ? `Expected Output: ${expectedOutput}` 
            : '',
          formatInstructions,
        })
      );
      
      return this.parser.parse(response.content as string);
    },
    { 
      // Cache key derived from the full argument object
      getCacheKey: (params) => JSON.stringify(params),
    }
  );
  
  async evaluateWithMetrics(
    input: string,
    output: string,
    metrics: Partial<EvaluationMetric>,
    expectedOutput?: string
  ): Promise<EvaluationMetric> {
    const evaluation = await this.evaluate({ input, output, expectedOutput });
    
    return {
      ...metrics,
      accuracy: evaluation.accuracy,
      relevance: evaluation.relevance,
      coherence: evaluation.coherence,
    } as EvaluationMetric;
  }
}

LLM-based evaluator assesses response quality automatically.
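
Here is how the evaluator might be used on its own, outside the agent graph. The question and answer are illustrative; note that evaluate takes a single object argument so the memoization key covers all fields:

// Sketch: scoring a single response offline
import { LLMEvaluator } from './evaluator';

const evaluator = new LLMEvaluator();

const scores = await evaluator.evaluate({
  input: 'How do I reset my password?',
  output: 'Open Settings > Security and choose "Reset password"; a reset link is emailed to you.',
});

console.log(scores.accuracy, scores.relevance, scores.coherence, scores.reasoning);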

4. Create Monitored Agent

// app/lib/agents/monitored-agent.ts
import { StateGraph, Annotation } from '@langchain/langgraph';
import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { HumanMessage, AIMessage } from '@langchain/core/messages';
import { MonitoringCallbackHandler } from '../monitoring/callback';
import { LLMEvaluator } from '../evaluation/evaluator';
import { MemorySaver } from '@langchain/langgraph';

const StateAnnotation = Annotation.Root({
  messages: Annotation<(HumanMessage | AIMessage)[]>({
    reducer: (curr, next) => [...curr, ...next],
    default: () => [],
  }),
  evaluationResults: Annotation<any[]>({
    reducer: (curr, next) => [...curr, ...next],
    default: () => [],
  }),
});

export async function createMonitoredAgent() {
  const model = new ChatGoogleGenerativeAI({
    modelName: 'gemini-2.5-pro',
    temperature: 0.7,
    apiKey: process.env.GOOGLE_API_KEY,
  });
  
  const monitoringHandler = new MonitoringCallbackHandler();
  const evaluator = new LLMEvaluator();
  const memory = new MemorySaver();
  
  const workflow = new StateGraph(StateAnnotation)
    .addNode('process', async (state) => {
      const lastMessage = state.messages[state.messages.length - 1];
      
      const response = await model.invoke(
        state.messages,
        { callbacks: [monitoringHandler] }
      );
      
      // Auto-evaluate the response
      const evaluation = await evaluator.evaluateWithMetrics(
        lastMessage.content as string,
        response.content as string,
        {
          // Placeholders: actual latency, token usage, and cost are recorded
          // by the monitoring callback in its trace rather than merged back here
          latency: 0,
          tokenUsage: { input: 0, output: 0, total: 0 },
          cost: 0,
          timestamp: new Date().toISOString(),
        }
      );
      
      return {
        messages: [response],
        evaluationResults: [evaluation],
      };
    })
    .addEdge('__start__', 'process')
    .addEdge('process', '__end__');
  
  return workflow.compile({
    checkpointer: memory,
  });
}

// Usage function
export async function handleAgentRequest(input: string, sessionId: string) {
  // Note: compiling a fresh agent (and MemorySaver) per request means thread
  // state does not survive across requests; reuse a compiled agent or a
  // persistent checkpointer in production.
  const agent = await createMonitoredAgent();
  
  const result = await agent.invoke(
    {
      messages: [new HumanMessage(input)],
    },
    {
      configurable: { thread_id: sessionId },
    }
  );
  
  // Access evaluation results
  const latestEvaluation = result.evaluationResults[result.evaluationResults.length - 1];
  
  return {
    response: result.messages[result.messages.length - 1].content,
    evaluation: latestEvaluation,
  };
}

Agent with integrated monitoring and evaluation capabilities.

5. Create API Endpoint

// app/api/agent/monitored/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { handleAgentRequest } from '@/lib/agents/monitored-agent';
import { z } from 'zod';

const RequestSchema = z.object({
  message: z.string().min(1),
  sessionId: z.string().default(() => `session-${Date.now()}`),
});

export async function POST(request: NextRequest) {
  try {
    const body = await request.json();
    const { message, sessionId } = RequestSchema.parse(body);
    
    const result = await handleAgentRequest(message, sessionId);
    
    // Log metrics for monitoring
    console.log('Evaluation Metrics:', {
      accuracy: result.evaluation.accuracy,
      relevance: result.evaluation.relevance,
      coherence: result.evaluation.coherence,
      cost: result.evaluation.cost,
    });
    
    return NextResponse.json(result);
  } catch (error) {
    console.error('Agent error:', error);
    return NextResponse.json(
      { error: 'Failed to process request' },
      { status: 500 }
    );
  }
}

API endpoint with built-in metrics logging.
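
A quick way to exercise the endpoint from a script or test; the URL and payload are illustrative:

// Sketch: calling the monitored agent endpoint
const res = await fetch('http://localhost:3000/api/agent/monitored', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    message: 'Where is my order #1234?',
    sessionId: 'demo-session',
  }),
});

const { response, evaluation } = await res.json();
console.log(response, evaluation);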

Advanced Example: Production Monitoring System

Now let's build a comprehensive monitoring system with distributed tracing, semantic caching, and real-time dashboards.

1. Implement Distributed Tracing System

// app/lib/tracing/distributed-tracer.ts
import { trace, context, SpanStatusCode, SpanKind } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { v4 as uuidv4 } from 'uuid';
import { debounce } from 'es-toolkit';

export class DistributedTracer {
  private tracer;
  private provider: NodeTracerProvider;
  
  constructor(serviceName: string = 'agent-system') {
    // Initialize provider
    this.provider = new NodeTracerProvider({
      resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
        [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
      }),
    });
    
    // Configure exporter (in production, use your OTLP endpoint)
    const exporter = new OTLPTraceExporter({
      url: process.env.OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
    });
    
    this.provider.addSpanProcessor(new BatchSpanProcessor(exporter));
    this.provider.register();
    
    this.tracer = trace.getTracer('agent-tracer', '1.0.0');
  }
  
  traceAgent(agentName: string, operation: string) {
    return this.tracer.startSpan(`${agentName}.${operation}`, {
      kind: SpanKind.INTERNAL,
      attributes: {
        'agent.name': agentName,
        'agent.operation': operation,
        'agent.trace_id': uuidv4(),
      },
    });
  }
  
  traceMultiAgent(
    agents: string[],
    parentSpan?: any
  ) {
    const ctx = parentSpan 
      ? trace.setSpan(context.active(), parentSpan)
      : context.active();
      
    return agents.map(agent => 
      this.tracer.startSpan(`multi-agent.${agent}`, {
        kind: SpanKind.INTERNAL,
        attributes: {
          'agent.name': agent,
          'agent.type': 'multi-agent',
        },
      }, ctx)
    );
  }
  
  // Debounced metric recording for high-frequency operations. Note that
  // debounce coalesces bursts: only the most recent call in a burst is
  // written, and it fires after the delay, so record metrics while the span
  // is still open (or set attributes directly for one-off values).
  recordMetrics = debounce(
    (span: any, metrics: Record<string, any>) => {
      Object.entries(metrics).forEach(([key, value]) => {
        span.setAttribute(`metric.${key}`, value);
      });
    },
    100 // Debounce delay in milliseconds
  );
  
  endSpan(span: any, status: 'success' | 'error' = 'success', error?: Error) {
    if (status === 'error' && error) {
      span.recordException(error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
    } else {
      span.setStatus({ code: SpanStatusCode.OK });
    }
    span.end();
  }
}

// Singleton instance
export const tracer = new DistributedTracer();

Distributed tracing for complex multi-agent workflows.
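
One way to use the tracer is to wrap each agent invocation in a span. This sketch assumes the file sits next to distributed-tracer.ts; agentFn stands in for whatever function actually calls your agent:

// Sketch: wrapping an agent call in a span
import { tracer } from './distributed-tracer';

export async function withAgentSpan(
  input: string,
  agentFn: (input: string) => Promise<string>,
): Promise<string> {
  const span = tracer.traceAgent('support-agent', 'answer');
  try {
    const output = await agentFn(input);
    // One-off values are set directly; the debounced recordMetrics() is
    // better suited to repeated updates while the span is still open
    span.setAttribute('metric.input_chars', input.length);
    span.setAttribute('metric.output_chars', output.length);
    tracer.endSpan(span, 'success');
    return output;
  } catch (err) {
    tracer.endSpan(span, 'error', err as Error);
    throw err;
  }
}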

2. Build Semantic Caching Layer

// app/lib/cache/semantic-cache.ts
import { Redis } from '@upstash/redis';
import { GoogleGenerativeAIEmbeddings } from '@langchain/google-genai';
import { LRUCache } from 'lru-cache';
import { z } from 'zod';

// Local cosine-similarity helper (es-toolkit does not ship a vector
// similarity function, so we compute it directly)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

const CacheEntrySchema = z.object({
  key: z.string(),
  embedding: z.array(z.number()),
  response: z.string(),
  metadata: z.object({
    timestamp: z.string(),
    hitCount: z.number(),
    cost: z.number(),
  }),
});

type CacheEntry = z.infer<typeof CacheEntrySchema>;

export class SemanticCache {
  private redis: Redis;
  private embeddings: GoogleGenerativeAIEmbeddings;
  private localCache: LRUCache<string, CacheEntry>;
  private similarityThreshold = 0.95;
  
  constructor() {
    this.redis = new Redis({
      url: process.env.UPSTASH_REDIS_URL!,
      token: process.env.UPSTASH_REDIS_TOKEN!,
    });
    
    this.embeddings = new GoogleGenerativeAIEmbeddings({
      modelName: 'embedding-001',
      apiKey: process.env.GOOGLE_API_KEY,
    });
    
    // Local LRU cache for hot data
    this.localCache = new LRUCache<string, CacheEntry>({
      max: 100,
      ttl: 1000 * 60 * 5, // 5 minutes
    });
  }
  
  async get(query: string): Promise<string | null> {
    // Check local cache first
    const localHit = this.checkLocalCache(query);
    if (localHit) {
      console.log('Local cache hit');
      return localHit;
    }
    
    // Generate embedding for query
    const queryEmbedding = await this.embeddings.embedQuery(query);
    
    // Search in Redis: a linear scan over all cached embeddings (fine for
    // demos; use a vector index or vector database at scale)
    const cacheKeys = await this.redis.keys('cache:*');
    
    for (const key of cacheKeys) {
      const entry = await this.redis.get<CacheEntry>(key);
      if (!entry) continue;
      
      const similarity = cosineSimilarity(queryEmbedding, entry.embedding);
      
      if (similarity >= this.similarityThreshold) {
        console.log(`Semantic cache hit (similarity: ${similarity})`);
        
        // Update hit count
        entry.metadata.hitCount++;
        await this.redis.set(key, entry);
        
        // Add to local cache
        this.localCache.set(query, entry);
        
        return entry.response;
      }
    }
    
    return null;
  }
  
  async set(query: string, response: string, cost: number = 0): Promise<void> {
    const embedding = await this.embeddings.embedQuery(query);
    
    const entry: CacheEntry = {
      key: query,
      embedding,
      response,
      metadata: {
        timestamp: new Date().toISOString(),
        hitCount: 0,
        cost,
      },
    };
    
    // Store in Redis with expiration
    const key = `cache:${Date.now()}`;
    await this.redis.set(key, entry, {
      ex: 60 * 60 * 24, // 24 hours
    });
    
    // Also store in local cache
    this.localCache.set(query, entry);
  }
  
  private checkLocalCache(query: string): string | null {
    const entry = this.localCache.get(query);
    return entry?.response || null;
  }
  
  async getCacheStats(): Promise<{
    totalEntries: number;
    totalHits: number;
    costSaved: number;
  }> {
    const keys = await this.redis.keys('cache:*');
    let totalHits = 0;
    let costSaved = 0;
    
    for (const key of keys) {
      const entry = await this.redis.get<CacheEntry>(key);
      if (entry) {
        totalHits += entry.metadata.hitCount;
        costSaved += entry.metadata.cost * entry.metadata.hitCount;
      }
    }
    
    return {
      totalEntries: keys.length,
      totalHits,
      costSaved,
    };
  }
}

Semantic caching returns stored responses for queries that are similar rather than identical, cutting latency and model spend on repeated questions.
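
A typical cache-aside flow checks the cache first and only calls the model on a miss. A sketch; the model, prompt, and cost estimate are illustrative:

// Sketch: cache-aside lookup around a model call
import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { SemanticCache } from './semantic-cache';

const cache = new SemanticCache();
const model = new ChatGoogleGenerativeAI({
  model: 'gemini-2.5-pro',
  apiKey: process.env.GOOGLE_API_KEY,
});

export async function answerWithCache(question: string): Promise<string> {
  const cached = await cache.get(question);
  if (cached) return cached;

  const result = await model.invoke(question);
  const answer = result.content as string;

  // Store with a rough per-call cost estimate so getCacheStats() can report savings
  await cache.set(question, answer, 0.002);
  return answer;
}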

3. Create Real-time Monitoring Dashboard

// app/lib/monitoring/dashboard.ts
import { EventEmitter } from 'events';
import { groupBy, meanBy, sumBy } from 'es-toolkit';
import { AgentTrace } from '../evaluation/types';

interface DashboardMetrics {
  avgLatency: number;
  avgAccuracy: number;
  totalCost: number;
  errorRate: number;
  throughput: number;
  activeAgents: number;
}

export class MonitoringDashboard extends EventEmitter {
  private traces: AgentTrace[] = [];
  private metricsWindow = 60000; // 1 minute window
  private updateInterval: NodeJS.Timeout;
  
  constructor() {
    super();
    
    // Update metrics every 5 seconds
    this.updateInterval = setInterval(() => {
      this.calculateMetrics();
    }, 5000);
  }
  
  addTrace(trace: AgentTrace) {
    this.traces.push(trace);
    
    // Keep only recent traces
    const cutoff = Date.now() - this.metricsWindow;
    this.traces = this.traces.filter(t => 
      new Date(t.metrics.timestamp).getTime() > cutoff
    );
    
    // Emit real-time update
    this.emit('trace', trace);
  }
  
  private calculateMetrics() {
    if (this.traces.length === 0) return;
    
    const metrics: DashboardMetrics = {
      avgLatency: meanBy(this.traces, t => t.metrics.latency),
      avgAccuracy: meanBy(this.traces, t => t.metrics.accuracy),
      totalCost: sumBy(this.traces, t => t.metrics.cost),
      errorRate: this.traces.filter(t => t.errors.length > 0).length / this.traces.length,
      throughput: this.traces.length / (this.metricsWindow / 1000),
      activeAgents: new Set(this.traces.map(t => t.agentName)).size,
    };
    
    this.emit('metrics', metrics);
  }
  
  getAggregatedMetrics(groupByField: 'agentName' | 'hour' = 'agentName') {
    if (groupByField === 'hour') {
      const grouped = groupBy(this.traces, t => 
        new Date(t.metrics.timestamp).getHours().toString()
      );
      
      return Object.entries(grouped).map(([hour, traces]) => ({
        hour,
        avgLatency: meanBy(traces, t => t.metrics.latency),
        totalRequests: traces.length,
        errorRate: traces.filter(t => t.errors.length > 0).length / traces.length,
      }));
    }
    
    const grouped = groupBy(this.traces, t => t.agentName);
    
    return Object.entries(grouped).map(([agent, traces]) => ({
      agent,
      avgLatency: meanBy(traces, t => t.metrics.latency),
      avgAccuracy: meanBy(traces, t => t.metrics.accuracy),
      totalCost: sumBy(traces, t => t.metrics.cost),
      requestCount: traces.length,
    }));
  }
  
  getAlerts(thresholds: {
    maxLatency?: number;
    minAccuracy?: number;
    maxErrorRate?: number;
  }) {
    const alerts = [];
    
    const avgLatency = meanBy(this.traces, t => t.metrics.latency);
    if (thresholds.maxLatency && avgLatency > thresholds.maxLatency) {
      alerts.push({
        type: 'latency',
        severity: 'warning',
        message: `Average latency (${avgLatency}ms) exceeds threshold (${thresholds.maxLatency}ms)`,
      });
    }
    
    const avgAccuracy = meanBy(this.traces, t => t.metrics.accuracy);
    if (thresholds.minAccuracy && avgAccuracy < thresholds.minAccuracy) {
      alerts.push({
        type: 'accuracy',
        severity: 'critical',
        message: `Average accuracy (${avgAccuracy}) below threshold (${thresholds.minAccuracy})`,
      });
    }
    
    const errorRate = this.traces.filter(t => t.errors.length > 0).length / this.traces.length;
    if (thresholds.maxErrorRate && errorRate > thresholds.maxErrorRate) {
      alerts.push({
        type: 'errors',
        severity: 'critical',
        message: `Error rate (${errorRate * 100}%) exceeds threshold (${thresholds.maxErrorRate * 100}%)`,
      });
    }
    
    return alerts;
  }
  
  destroy() {
    clearInterval(this.updateInterval);
  }
}

// Global dashboard instance
export const dashboard = new MonitoringDashboard();

Real-time dashboard aggregates and monitors agent metrics.
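
To feed the dashboard, push each completed trace into it (for example, from sendToMonitoring in the callback handler) and subscribe to its events. A sketch assuming the file sits alongside dashboard.ts:

// Sketch: wiring traces and metric updates into the dashboard
import { dashboard } from './dashboard';
import { AgentTrace } from '../evaluation/types';

export function recordTrace(trace: AgentTrace) {
  dashboard.addTrace(trace);
}

// Subscribe once at startup, e.g. to log metrics or push them over a WebSocket
dashboard.on('metrics', (metrics) => {
  console.log('Rolling-window metrics:', metrics);
});

dashboard.on('trace', (trace: AgentTrace) => {
  if (trace.errors.length > 0) {
    console.warn(`Agent ${trace.agentName} reported errors:`, trace.errors);
  }
});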

4. Implement A/B Testing Framework

// app/lib/testing/ab-testing.ts
import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { z } from 'zod';

const ABTestConfigSchema = z.object({
  testId: z.string(),
  variants: z.array(z.object({
    id: z.string(),
    model: z.string().optional(),
    temperature: z.number().optional(),
    systemPrompt: z.string().optional(),
    weight: z.number().default(1),
  })),
  metrics: z.array(z.enum(['accuracy', 'latency', 'cost', 'user_satisfaction'])),
  minSampleSize: z.number().default(100),
});

type ABTestConfig = z.infer<typeof ABTestConfigSchema>;

const TestResultSchema = z.object({
  variantId: z.string(),
  metrics: z.record(z.number()),
  sampleSize: z.number(),
  confidence: z.number(),
});

type TestResult = z.infer<typeof TestResultSchema>;

export class ABTestingFramework {
  private activeTests = new Map<string, ABTestConfig>();
  private results = new Map<string, Map<string, TestResult>>();
  
  createTest(config: ABTestConfig) {
    this.activeTests.set(config.testId, config);
    this.results.set(config.testId, new Map());
    
    // Initialize results for each variant
    config.variants.forEach(variant => {
      this.results.get(config.testId)!.set(variant.id, {
        variantId: variant.id,
        metrics: {},
        sampleSize: 0,
        confidence: 0,
      });
    });
  }
  
  selectVariant(testId: string): string | null {
    const test = this.activeTests.get(testId);
    if (!test) return null;
    
    // Weighted random selection
    const weights = test.variants.map(v => v.weight);
    const totalWeight = weights.reduce((a, b) => a + b, 0);
    
    let random = Math.random() * totalWeight;
    for (let i = 0; i < test.variants.length; i++) {
      random -= weights[i];
      if (random <= 0) {
        return test.variants[i].id;
      }
    }
    
    return test.variants[0].id;
  }
  
  async runVariant(testId: string, variantId: string, input: string) {
    const test = this.activeTests.get(testId);
    const variant = test?.variants.find(v => v.id === variantId);
    
    if (!variant) throw new Error('Variant not found');
    
    // Create model with variant configuration
    const model = new ChatGoogleGenerativeAI({
      modelName: variant.model || 'gemini-2.5-pro',
      temperature: variant.temperature ?? 0.7,
      apiKey: process.env.GOOGLE_API_KEY,
    });
    
    const startTime = Date.now();
    
    const messages = variant.systemPrompt 
      ? [
          { role: 'system', content: variant.systemPrompt },
          { role: 'user', content: input },
        ]
      : [{ role: 'user', content: input }];
    
    const response = await model.invoke(messages);
    
    const latency = Date.now() - startTime;
    
    return {
      response: response.content,
      metrics: {
        latency,
        // Other metrics would be calculated based on evaluation
      },
    };
  }
  
  recordResult(
    testId: string,
    variantId: string,
    metrics: Record<string, number>
  ) {
    const results = this.results.get(testId);
    const variantResult = results?.get(variantId);
    
    if (!variantResult) return;
    
    // Update running averages
    const n = variantResult.sampleSize;
    Object.entries(metrics).forEach(([key, value]) => {
      const current = variantResult.metrics[key] || 0;
      variantResult.metrics[key] = (current * n + value) / (n + 1);
    });
    
    variantResult.sampleSize++;
    
    // Calculate confidence (simplified - use proper stats in production)
    variantResult.confidence = Math.min(
      variantResult.sampleSize / 100,
      0.95
    );
  }
  
  getResults(testId: string): TestResult[] {
    const results = this.results.get(testId);
    if (!results) return [];
    
    return Array.from(results.values());
  }
  
  determineWinner(testId: string): string | null {
    const results = this.getResults(testId);
    const test = this.activeTests.get(testId);
    
    if (!test || results.length === 0) return null;
    
    // Check if all variants have minimum sample size
    const allHaveMinSamples = results.every(
      r => r.sampleSize >= test.minSampleSize
    );
    
    if (!allHaveMinSamples) return null;
    
    // Find winner based on primary metric (first in list)
    const primaryMetric = test.metrics[0];
    
    const winner = results.reduce((best, current) => {
      const bestScore = best.metrics[primaryMetric] || 0;
      const currentScore = current.metrics[primaryMetric] || 0;
      
      // For latency and cost, lower is better
      if (primaryMetric === 'latency' || primaryMetric === 'cost') {
        return currentScore < bestScore ? current : best;
      }
      
      return currentScore > bestScore ? current : best;
    });
    
    // Check if confidence is high enough
    if (winner.confidence >= 0.95) {
      return winner.variantId;
    }
    
    return null;
  }
}

A/B testing framework for comparing agent configurations.
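
Outside of an HTTP route, the full lifecycle (create, select, run, record, decide) can be exercised in a few lines. The test id, variants, and questions below are illustrative:

// Sketch: running an A/B test end-to-end
import { ABTestingFramework } from './ab-testing';

const ab = new ABTestingFramework();

export async function runTemperatureTest(questions: string[]) {
  ab.createTest({
    testId: 'temperature-test',
    variants: [
      { id: 'cool', temperature: 0.2, weight: 1 },
      { id: 'warm', temperature: 0.9, weight: 1 },
    ],
    metrics: ['latency'],
    minSampleSize: 20,
  });

  for (const question of questions) {
    const variantId = ab.selectVariant('temperature-test');
    if (!variantId) continue;

    const { metrics } = await ab.runVariant('temperature-test', variantId, question);
    ab.recordResult('temperature-test', variantId, metrics);
  }

  return {
    results: ab.getResults('temperature-test'),
    winner: ab.determineWinner('temperature-test'),
  };
}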

5. Create Production Monitoring API

// app/api/monitoring/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { dashboard } from '@/lib/monitoring/dashboard';
import { SemanticCache } from '@/lib/cache/semantic-cache';
import { ABTestingFramework } from '@/lib/testing/ab-testing';
import { tracer } from '@/lib/tracing/distributed-tracer';

const cache = new SemanticCache();
const abTesting = new ABTestingFramework();

// Initialize A/B test at module load. Note: runVariant instantiates a Gemini
// chat model, so variants must be Gemini model names. State is in-memory, so
// persist results externally if you deploy to serverless.
abTesting.createTest({
  testId: 'model-comparison',
  variants: [
    { id: 'gemini-pro', model: 'gemini-2.5-pro', weight: 1 },
    { id: 'gemini-flash', model: 'gemini-2.5-flash', weight: 1 },
  ],
  metrics: ['accuracy', 'latency', 'cost'],
  minSampleSize: 50,
});

export async function GET(request: NextRequest) {
  const { searchParams } = new URL(request.url);
  const view = searchParams.get('view') || 'metrics';
  
  switch (view) {
    case 'metrics': {
      const metrics = dashboard.getAggregatedMetrics('agentName');
      return NextResponse.json({ metrics });
    }

    case 'alerts': {
      const alerts = dashboard.getAlerts({
        maxLatency: 5000,
        minAccuracy: 0.8,
        maxErrorRate: 0.1,
      });
      return NextResponse.json({ alerts });
    }

    case 'cache': {
      const cacheStats = await cache.getCacheStats();
      return NextResponse.json({ cacheStats });
    }

    case 'ab-test': {
      const results = abTesting.getResults('model-comparison');
      const winner = abTesting.determineWinner('model-comparison');
      return NextResponse.json({ results, winner });
    }

    default:
      return NextResponse.json({ error: 'Invalid view' }, { status: 400 });
  }
}

export async function POST(request: NextRequest) {
  const span = tracer.traceAgent('monitoring-api', 'process-request');
  
  try {
    const { message, testId } = await request.json();
    
    // Check cache first
    const cached = await cache.get(message);
    if (cached) {
      tracer.recordMetrics(span, { cache_hit: 1 });
      tracer.endSpan(span, 'success');
      
      return NextResponse.json({
        response: cached,
        source: 'cache',
      });
    }
    
    // Select A/B test variant
    const variantId = abTesting.selectVariant(testId || 'model-comparison');
    
    if (variantId) {
      const result = await abTesting.runVariant(
        testId || 'model-comparison',
        variantId,
        message
      );
      
      // Record metrics
      abTesting.recordResult(
        testId || 'model-comparison',
        variantId,
        result.metrics
      );
      
      // Cache the response
      await cache.set(message, result.response as string);
      
      tracer.recordMetrics(span, {
        variant: variantId,
        latency: result.metrics.latency,
      });
      tracer.endSpan(span, 'success');
      
      return NextResponse.json({
        response: result.response,
        variant: variantId,
        metrics: result.metrics,
      });
    }
    
    tracer.endSpan(span, 'error', new Error('No variant selected'));
    return NextResponse.json(
      { error: 'Test configuration error' },
      { status: 500 }
    );
    
  } catch (error) {
    tracer.endSpan(span, 'error', error as Error);
    return NextResponse.json(
      { error: 'Processing failed' },
      { status: 500 }
    );
  }
}

Production API with caching, A/B testing, and distributed tracing.

6. Create Frontend Dashboard Component

// app/components/monitoring-dashboard.tsx
'use client';

import { useQuery } from '@tanstack/react-query';
import { useState } from 'react';

interface DashboardData {
  metrics?: any[];
  alerts?: any[];
  cacheStats?: any;
  results?: any[];
  winner?: string;
}

export default function MonitoringDashboard() {
  const [view, setView] = useState<'metrics' | 'alerts' | 'cache' | 'ab-test'>('metrics');
  
  const { data, isLoading } = useQuery<DashboardData>({
    queryKey: ['monitoring', view],
    queryFn: async () => {
      const response = await fetch(`/api/monitoring?view=${view}`);
      return response.json();
    },
    refetchInterval: 5000, // Refresh every 5 seconds
  });
  
  if (isLoading) return <div className="loading loading-spinner" />;
  
  return (
    <div className="p-6">
      <div className="tabs tabs-boxed mb-4">
        <button 
          className={`tab ${view === 'metrics' ? 'tab-active' : ''}`}
          onClick={() => setView('metrics')}
        >
          Metrics
        </button>
        <button 
          className={`tab ${view === 'alerts' ? 'tab-active' : ''}`}
          onClick={() => setView('alerts')}
        >
          Alerts
        </button>
        <button 
          className={`tab ${view === 'cache' ? 'tab-active' : ''}`}
          onClick={() => setView('cache')}
        >
          Cache
        </button>
        <button 
          className={`tab ${view === 'ab-test' ? 'tab-active' : ''}`}
          onClick={() => setView('ab-test')}
        >
          A/B Test
        </button>
      </div>
      
      {view === 'metrics' && data?.metrics && (
        <div className="grid grid-cols-2 gap-4">
          {data.metrics.map((metric: any) => (
            <div key={metric.agent} className="card bg-base-200 p-4">
              <h3 className="text-lg font-bold">{metric.agent}</h3>
              <div className="stats stats-vertical">
                <div className="stat">
                  <div className="stat-title">Avg Latency</div>
                  <div className="stat-value text-2xl">{metric.avgLatency.toFixed(0)}ms</div>
                </div>
                <div className="stat">
                  <div className="stat-title">Accuracy</div>
                  <div className="stat-value text-2xl">{(metric.avgAccuracy * 100).toFixed(1)}%</div>
                </div>
                <div className="stat">
                  <div className="stat-title">Total Cost</div>
                  <div className="stat-value text-2xl">${metric.totalCost.toFixed(2)}</div>
                </div>
              </div>
            </div>
          ))}
        </div>
      )}
      
      {view === 'alerts' && data?.alerts && (
        <div className="space-y-2">
          {data.alerts.length === 0 ? (
            <div className="alert alert-success">
              <span>All systems operating normally</span>
            </div>
          ) : (
            data.alerts.map((alert: any, idx: number) => (
              <div key={idx} className={`alert alert-${alert.severity === 'critical' ? 'error' : 'warning'}`}>
                <span>{alert.message}</span>
              </div>
            ))
          )}
        </div>
      )}
      
      {view === 'cache' && data?.cacheStats && (
        <div className="stats shadow">
          <div className="stat">
            <div className="stat-title">Cache Entries</div>
            <div className="stat-value">{data.cacheStats.totalEntries}</div>
          </div>
          <div className="stat">
            <div className="stat-title">Total Hits</div>
            <div className="stat-value">{data.cacheStats.totalHits}</div>
          </div>
          <div className="stat">
            <div className="stat-title">Cost Saved</div>
            <div className="stat-value">${data.cacheStats.costSaved.toFixed(2)}</div>
          </div>
        </div>
      )}
      
      {view === 'ab-test' && data?.results && (
        <div className="space-y-4">
          <div className="grid grid-cols-2 gap-4">
            {data.results.map((result: any) => (
              <div key={result.variantId} className="card bg-base-200 p-4">
                <h3 className="text-lg font-bold">Variant: {result.variantId}</h3>
                <div className="stats stats-vertical">
                  <div className="stat">
                    <div className="stat-title">Sample Size</div>
                    <div className="stat-value">{result.sampleSize}</div>
                  </div>
                  <div className="stat">
                    <div className="stat-title">Confidence</div>
                    <div className="stat-value">{(result.confidence * 100).toFixed(1)}%</div>
                  </div>
                </div>
              </div>
            ))}
          </div>
          {data.winner && (
            <div className="alert alert-success">
              <span>Winner determined: {data.winner}</span>
            </div>
          )}
        </div>
      )}
    </div>
  );
}

Real-time dashboard component for monitoring agent performance.
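
Since the component uses TanStack Query, it needs a QueryClientProvider somewhere above it. A minimal page wrapper; the file path and the @/ import alias are assumptions about your project layout:

// app/monitoring/page.tsx (illustrative placement)
'use client';

import { useState } from 'react';
import { QueryClient, QueryClientProvider } from '@tanstack/react-query';
import MonitoringDashboard from '@/components/monitoring-dashboard';

export default function MonitoringPage() {
  // Create the client once per mount so the query cache survives re-renders
  const [queryClient] = useState(() => new QueryClient());

  return (
    <QueryClientProvider client={queryClient}>
      <MonitoringDashboard />
    </QueryClientProvider>
  );
}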

Conclusion

We've built a comprehensive evaluation and monitoring system that handles the complexities of non-deterministic agent behaviors. The basic example demonstrates fundamental monitoring with LLM-based evaluation, while the advanced example shows production-ready patterns including distributed tracing, semantic caching, A/B testing, and real-time dashboards. These patterns enable you to deploy reliable agent systems with confidence, maintain visibility into their behavior, and continuously optimize performance based on real-world data.