How Zapier Used Lab to Build AutomationBench
Zapier connects over 9,000 apps into automated workflows, helping millions of businesses replace repetitive manual work with reliable, trigger-based automation. AI agents increasingly need to take real actions inside these workflows: reading data from one app, making decisions, and writing results to another. That puts Zapier's platform at the center of what enterprise AI looks like in practice.
Existing benchmarks for long-horizon tool use evaluate models on isolated capabilities like calling APIs, following multi-step instructions, and navigating single applications. But real automation requires all of these at once. An agent has to discover and authenticate against unfamiliar APIs, coordinate actions across multiple applications, and follow business-specific policies about what it can and cannot do, all within a single workflow. No existing benchmark tests this real-world combination. Zapier's launch of AutomationBench fills that gap.
The Problem
A good benchmark needs more than a set of tasks and expected answers. Each task requires a realistic environment: sandboxed applications, APIs, and verifiers that can tell the difference between a correct solution and one that just looks correct. Building this infrastructure from scratch would take weeks of engineering before a team could even start experimenting.
Zapier chose Verifiers, Prime Intellect's open-source framework for agentic evaluations, because it provides the right abstractions: tool environments, reward functions, multi-turn rollout execution, and structured output parsing. These map directly onto the structure of automation tasks, so the Zapier team could focus on encoding their domain expertise (like what makes a workflow correct, what constitutes a realistic edge case, what policies an agent should follow) rather than rebuilding evaluation infrastructure from the ground up.
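To make those abstractions concrete, here is a minimal sketch of how a tool environment and a reward function line up with an automation task. All names here (`ToolEnv`, `copy_row`, `reward`) are hypothetical and illustrative, not the actual Verifiers API:

```python
# Hypothetical sketch of the abstractions described above -- illustrative
# names only, not the actual Verifiers API.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolEnv:
    """A sandboxed environment exposing app APIs as callable tools."""
    tools: dict[str, Callable] = field(default_factory=dict)
    state: dict = field(default_factory=dict)

    def call(self, name: str, **kwargs):
        return self.tools[name](self.state, **kwargs)

def copy_row(state, record_id: str):
    """Toy 'automation': read a record from app A, write it to app B."""
    state["app_b"][record_id] = state["app_a"][record_id]
    return "ok"

def reward(state) -> float:
    """Reward inspects the resulting state, not what the agent claims it did."""
    return 1.0 if state["app_b"].get("r1") == state["app_a"]["r1"] else 0.0

env = ToolEnv(
    tools={"copy_row": copy_row},
    state={"app_a": {"r1": {"name": "Ada"}}, "app_b": {}},
)
env.call("copy_row", record_id="r1")
print(reward(env.state))  # 1.0 -- the row really landed in app B
```

The point of the sketch is the division of labor: the framework supplies the environment and rollout machinery, while the domain team writes only the tools and the state-checking reward.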
Stress Testing
Designing tasks is just the beginning; the harder problem is making sure those tasks are correct: that the reward function actually rewards the right behavior, that edge cases don't produce false positives, and that the benchmark is robust against reward hacking.
Because AutomationBench is built on Verifiers, Zapier got access to Prime Intellect Lab's full suite of tools with no additional infrastructure setup. Lab handles GPU orchestration, model serving, RL training loops, and monitoring, which would otherwise require dedicated infra engineering and ML expertise.
This turned out to be critical for catching bugs during benchmark development. Lab's dashboard and rollouts viewer show per-step training statistics as live charts that update with each training step. During one of their RL runs, the Zapier team spotted something off: the api_fetch_calls metric dropped to near-zero, but the reward curve stayed flat instead of dropping with it. In a correct benchmark, fewer API calls should mean fewer correct answers and lower reward. The fact that these two signals diverged meant the reward function was giving credit to attempts that never actually completed the task. Without the ability to move between real-time, aggregate metrics and individual rollouts, this kind of bug can go unnoticed and undermine the entire benchmark.
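The failure mode reduces to a small example. Below is a hypothetical sketch (not Zapier's actual reward code) of a reward that credits well-formed output regardless of completion, next to one gated on verified task state:

```python
# Hypothetical sketch of the bug class described above: a reward that
# credits a well-formatted answer even when the task was never done.

def buggy_reward(rollout: dict) -> float:
    # Gives partial credit for a parseable final answer, even if the
    # agent made zero API calls and changed nothing in the sandbox.
    score = 0.0
    if rollout["answer_parsed"]:
        score += 0.5
    if rollout["task_completed"]:
        score += 0.5
    return score

def fixed_reward(rollout: dict) -> float:
    # Format credit only counts once the task verifiably completed.
    if not rollout["task_completed"]:
        return 0.0
    return 0.5 + (0.5 if rollout["answer_parsed"] else 0.0)

# A rollout where the policy learned to skip the API calls entirely:
hacked = {"api_fetch_calls": 0, "answer_parsed": True, "task_completed": False}

print(buggy_reward(hacked))  # 0.5 -- reward stays flat as fetch calls hit zero
print(fixed_reward(hacked))  # 0.0 -- gating removes the divergence
```

Under the buggy reward, a policy that stops calling APIs still scores 0.5, which is exactly the flat-reward, zero-fetch-calls signature the team saw on the dashboard.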

Feedback Loops
AutomationBench is a live environment that plugs directly into the RL training loop. The same Verifiers environment that defines the benchmark tasks and reward functions can be used to train models with reinforcement learning on Lab, creating a tight loop between evaluation and improvement.
With Lab, environments are first-class objects that serve as both evals and training environments. Teams can build a benchmark, run baseline evaluations against frontier models, identify capability gaps, and then train a custom model to close those gaps, all on the same platform using the same environment definition. For Zapier, this means AutomationBench doesn't just measure where models fall short on business automation; it also provides the training signal to close those gaps.
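Conceptually, the eval/train duality looks like the sketch below. The names (`Environment`, `evaluate`, `train_step`) are hypothetical stand-ins, not the Verifiers or Lab API; the point is that one environment definition supplies rollouts and rewards to both loops:

```python
# Hypothetical sketch: one environment object serving both benchmark
# evaluation and RL training. Illustrative names, not the real API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Environment:
    tasks: list[str]
    run_task: Callable[[Callable, str], str]  # rollout: policy attempts a task
    score: Callable[[str, str], float]        # verifier: reward for an attempt

def evaluate(env: Environment, policy: Callable) -> float:
    """Benchmark mode: average reward across all tasks."""
    rewards = [env.score(t, env.run_task(policy, t)) for t in env.tasks]
    return sum(rewards) / len(rewards)

def train_step(env: Environment, policy: Callable, update: Callable) -> None:
    """RL mode: the same rollouts and rewards drive the policy update."""
    for t in env.tasks:
        attempt = env.run_task(policy, t)
        update(t, attempt, env.score(t, attempt))

# Toy instance: the "task" is simply to echo its own name.
env = Environment(
    tasks=["t1", "t2"],
    run_task=lambda policy, t: policy(t),
    score=lambda t, attempt: 1.0 if attempt == t else 0.0,
)
print(evaluate(env, policy=lambda t: t))  # 1.0
```

Because `evaluate` and `train_step` share `run_task` and `score`, any improvement the trainer finds is measured by exactly the criteria the benchmark enforces.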
Next Steps
Our collaboration with Zapier goes beyond AutomationBench. Over the past several months, Zapier has been a key design partner across multiple parts of our stack:
- Integrating Verifiers for internal evals and environments used across Zapier;
- Working together to add GEPA support for environment-based prompt optimization;
- Battle-testing our Lab platform to train powerful customized agents with RL.
We're working toward a world where any team can use the tools previously only available to frontier labs to build, evaluate, and improve AI agents.
Prime Intellect Lab is a major unlock for our RL workflows. We spun up experiments with very little setup to pressure-test our new AutomationBench benchmark, which surfaced reward-hacking opportunities we could then fortify against. Kicking off a training run was a single command. We built our benchmark from the ground up on their Verifiers framework, and we're using it increasingly across our other evals. If you can measure something, you can train a model at it. You get to pick exactly what it's good at. We're doing it at Zapier. You should try it too!
— Daniel Shepard, Zapier