When running AI moderation in production, it’s relatively easy to make sure the model works on day one, but much harder to make sure it’s still working three months later. Patterns shift over time. The gap between what the model learned during training and what it encounters in production widens, often without anyone noticing until something breaks. By the time the problem is visible, it’s already been compounding for a while. We wanted to catch it sooner.
Our AiMod product is an ML model that adapts continuously by learning from the decisions moderators make, which means the quality and speed of those decisions directly affect the model and how quickly we can improve it. This creates a collaborative relationship: our customers train moderators and ensure consistent, high-quality decisions, while our team monitors, evaluates, and improves model performance. Because both the model and moderators are making decisions rapidly while user behavior evolves in real time, we need a shared way to see when things drift. We landed on agreement with moderators as the clearest signal of production health. If AiMod and human reviewers are making the same calls on the same accounts, things are very likely working. If they start diverging, something changed, and the next step is figuring out what.
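At its core, the agreement signal is just the share of reviewed accounts where the model and the human made the same call. A minimal sketch of that idea; the decision labels and data shapes here are illustrative, not AiMod's actual schema:

```python
def agreement_rate(pairs):
    """pairs: list of (model_decision, moderator_decision) tuples.

    Returns the fraction of pairs where both made the same call,
    or None when there is nothing to compare.
    """
    if not pairs:
        return None
    matches = sum(1 for model, human in pairs if model == human)
    return matches / len(pairs)

# Hypothetical reviewed decisions: model call vs. moderator call per account.
reviews = [
    ("suspend", "suspend"),
    ("suspend", "warn"),
    ("no_action", "no_action"),
    ("warn", "warn"),
]
print(agreement_rate(reviews))  # 0.75
```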
To create more visibility into this agreement metric, we built an Agreement Observability tool that does two things: monitoring and simulation.
Monitoring
The monitoring view tracks day-over-day agreement between AiMod decisions and moderator decisions. Stable agreement means the system is healthy. A sudden drop points to something specific that changed, like a new abuse pattern, a policy update, or a data problem. A slow decline over weeks is trickier because it suggests gradual drift that needs investigation before it gets worse.
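Day-over-day tracking boils down to bucketing reviewed decisions by day and flagging sharp drops against the prior day. A rough sketch with made-up record shapes and an arbitrary 5-point drop threshold, not our production alerting logic:

```python
from collections import defaultdict

def daily_agreement(records):
    """records: (day, model_decision, moderator_decision) tuples.

    Returns {day: agreement_rate} for each day that has reviews.
    """
    matches, totals = defaultdict(int), defaultdict(int)
    for day, model, human in records:
        totals[day] += 1
        if model == human:
            matches[day] += 1
    return {d: matches[d] / totals[d] for d in totals}

def sudden_drops(series, threshold=0.05):
    """Flag days where agreement fell by more than `threshold` vs. the prior day."""
    days = sorted(series)  # ISO date strings sort chronologically
    return [d for prev, d in zip(days, days[1:])
            if series[prev] - series[d] > threshold]

records = [
    ("2024-05-01", "warn", "warn"),
    ("2024-05-01", "suspend", "suspend"),
    ("2024-05-02", "warn", "no_action"),
    ("2024-05-02", "suspend", "suspend"),
]
series = daily_agreement(records)
```

A sudden drop shows up as a flagged day; a slow decline would need a longer lookback window than the one-day comparison shown here.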

The top-line agreement number is useful, but we also wanted to understand where disagreements concentrate. Not all mismatches mean the same thing. If one moderator disagrees with AiMod far more often than their peers, that's probably a calibration issue on the moderator side, not a model problem. If mismatches cluster around a specific policy area, the model likely needs attention there. We added filtering by moderator and policy category so these patterns show up fast.
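Slicing disagreements this way is essentially a group-and-count over mismatched decisions. A sketch using hypothetical field names, not AiMod's real data model:

```python
from collections import Counter

def mismatch_breakdown(decisions, key):
    """Count disagreements grouped by `key`, e.g. 'moderator' or 'policy'.

    decisions: dicts with at least 'model', 'human', and the grouping field.
    """
    return Counter(d[key] for d in decisions if d["model"] != d["human"])

# Illustrative reviewed decisions.
reviewed = [
    {"moderator": "ana", "policy": "spam", "model": "warn", "human": "warn"},
    {"moderator": "ana", "policy": "spam", "model": "warn", "human": "no_action"},
    {"moderator": "ben", "policy": "harassment", "model": "suspend", "human": "warn"},
]

by_moderator = mismatch_breakdown(reviewed, "moderator")  # one each for ana, ben
by_policy = mismatch_breakdown(reviewed, "policy")
```

If one moderator or one policy dominates the resulting counts, that is where the calibration or model work should start.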
We also track pending decisions, meaning accounts that received a decision recommendation but haven't been reviewed yet. Pending volume on its own is really just a capacity metric. But combined with agreement trends, it tells you more. If agreement is dropping at the same time the backlog is growing, you've got two problems feeding each other: the model is making worse decisions and the queue is getting longer. On the flip side, if agreement is high and there's no pending backlog, the system is in a healthy state.
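Combining the two signals into a coarse health status could look something like the following; the comparison windows and cutoffs below are illustrative placeholders, not the ones we ship:

```python
def queue_health(agreement_now, agreement_baseline, pending_now, pending_baseline):
    """Coarse status from agreement trend plus backlog trend.

    Thresholds are arbitrary examples: a 3-point agreement drop or a
    20% backlog increase counts as a warning sign on its own.
    """
    agreement_falling = agreement_now < agreement_baseline - 0.03
    backlog_growing = pending_now > pending_baseline * 1.2
    if agreement_falling and backlog_growing:
        return "critical"  # worse decisions and a longer queue, feeding each other
    if agreement_falling or backlog_growing:
        return "watch"
    return "healthy"
```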
Simulation
The second tab is for threshold simulation. Most teams pick a confidence threshold at launch and don't touch it again. But the right threshold depends on how much moderator capacity you have, how much risk you're comfortable with, and how the model's confidence distribution looks right now. All of those things change over time. We wanted a way to explore "what if" scenarios without touching anything in production.
The simulation lets you set new thresholds and immediately see the tradeoffs: what percentage of decisions would be auto-actioned, and what precision and recall would look like at that cutoff. There's no single "correct" threshold. The point is to make the tradeoffs visible so teams can choose deliberately instead of guessing.
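A simulation along these lines can be sketched in a few lines: given past decisions as (confidence, moderator-agreed) pairs, apply a candidate cutoff and read off the tradeoffs. The data shape is an assumption for illustration, not the tool's actual interface:

```python
def simulate_threshold(samples, threshold):
    """samples: (confidence, moderator_agreed) pairs, where moderator_agreed
    is True when a human reviewer made the same call as the model.

    Returns (auto_action_share, precision, recall) at the given cutoff.
    """
    auto = [(c, ok) for c, ok in samples if c >= threshold]
    total_positives = sum(1 for _, ok in samples if ok)
    auto_share = len(auto) / len(samples) if samples else 0.0
    tp = sum(1 for _, ok in auto if ok)
    precision = tp / len(auto) if auto else None
    recall = tp / total_positives if total_positives else None
    return auto_share, precision, recall

# Hypothetical historical decisions.
samples = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)]
share, precision, recall = simulate_threshold(samples, threshold=0.85)
```

Sweeping `threshold` over a range of values gives the full tradeoff curve: raising the cutoff typically improves precision while lowering recall and the auto-actioned share, which is exactly the tension teams need to see before choosing.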

What We're Learning
A few things we've noticed from using this internally:
Agreement is a lagging indicator, but it lags a lot less than the alternatives. User complaints and appeals take even longer to surface. Tracking agreement continuously compresses the feedback loop from weeks down to days.
The mismatch patterns often tell you more than the agreement number by itself. A 90% agreement rate sounds fine, but if that remaining 10% is concentrated in one policy area or around one moderator, you've found something you can actually act on. The filtering is what makes this useful.
Threshold simulation has changed how teams talk about tradeoffs. Instead of abstract back-and-forth about precision vs. recall, you can show exactly what happens at different cutoffs using real data. It moves conversations away from opinion and toward evidence.
Next Steps
We're continuing to explore how agreement observability can support AiMod in production. Some open questions we're thinking about:
- How do agreement patterns differ across policy areas, and can we use that to prioritize where we improve the model first?
- What's the right cadence for threshold review? Should it be continuous monitoring, or are periodic check-ins enough?
- Can we catch drift earlier by watching confidence distributions alongside agreement?
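On that last question, one common way to compare confidence distributions is a population stability index over bucketed scores. A sketch of the general technique, not something AiMod currently does:

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between two score samples in [0, 1).

    Near 0 means the distributions match; larger values mean more drift.
    A small epsilon keeps empty buckets from producing log(0).
    """
    def bucket_shares(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        return [(c + 1e-6) / (len(scores) + bins * 1e-6) for c in counts]

    b, c = bucket_shares(baseline), bucket_shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Run against a rolling baseline of confidence scores, a rising PSI could flag drift before agreement itself starts to move.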
We hope this gives a useful picture of how we're approaching production monitoring for AI moderation. We'd love to hear what you think.