Software Development Automation Benchmark
Evaluating the automation of software roles
April 2026 · Emulated, Inc.
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.6 | 13.9% |
| 2 | GPT-5.4 | 7.5% |
| 3 | Gemini 3.1 Pro | 7.5% |
| 4 | Grok 4.20 Beta | 4.0% |
AI coding agents have become remarkably capable at generating code and fixing bugs in open-source repositories. On SWE-bench Verified, frontier models now exceed 80% solve rates. Yet these same models perform dramatically worse on real software engineering work, where the job moves beyond generating a code patch to debugging production traffic, migrating without downtime, and handling regressions during deployments.
We're introducing the Software Development Automation Benchmark: 80 tasks that grade not just code generation but the ability to deploy, debug, migrate, and maintain production systems, establishing a baseline for how independently models can move from coding assistants to virtual coworkers.
Why this benchmark
Existing coding benchmarks measure code generation and don't capture the full software development lifecycle. They proxy whether a model can produce a patch given a description, but not whether that patch can be safely deployed, survive under load, or keep existing infrastructure working.
In other words, this benchmark is centered on:
Production systems: Tasks run against live infrastructure with real services, databases, and traffic flowing through them. Agents modify systems that are running.
Engineering best practices: Grading covers not just code structure and generation but also the model's ability to recover traffic, create pipelines that build and deploy changes, and handle data and infrastructure migrations.
Long-horizon tasks: Tasks run up to 12 hours.
We believe AI's economic impact is bottlenecked by its ability to productionize code and manage it autonomously. Code that cannot be shipped doesn't generate value no matter how well it's written.
Our approach
Environments

Each environment provides the agent with a complete, runnable system and the tooling a real engineer would use. The grading infrastructure is separate and inaccessible, preventing reward hacking.
An environment consists of:
- A workspace containing the source the agent can read, edit, and execute commands against.
- Running infrastructure. The system under test is live when the agent begins. Services are serving traffic, databases are accepting connections, metrics are being collected. The agent operates on a running system, not a cold start.
- Operational tooling including metrics collection, CI/CD pipelines, monitoring and alerting, load generators, and deployment systems. The specific tooling reflects the environment variant: Docker environments typically use self-hosted dashboards and metric exporters; cloudboxes use managed cloud observability and CI/CD services.
- A traffic generator driving synthetic but realistic load through the system to surface behavioral issues that only appear under production conditions.
Each task is a self-contained system with its own codebase, infrastructure, and validation. Environments are seeded with deterministic state to ensure reproducibility across runs.
Grading
The benchmark uses two complementary grading signals that combine into a single composite score per task.
Behavioral tests verify observable outcomes. Did traffic recover? Does the pipeline build and deploy? Do connections reuse properly? Is the migrated data intact?
These tests run automatically, produce binary pass/fail results, and are resistant to reward hacking because they validate system behavior, not code structure. Tests are designed to accept multiple valid solutions. For example, several different approaches can resolve a circuit breaker livelock, and we grade all of them equally, as long as traffic recovers.
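A behavioral test of this kind typically reduces to a small predicate over observed system metrics. The sketch below is illustrative, not the benchmark's actual harness; the function name, trace values, and thresholds are hypothetical:

```python
def traffic_recovered(error_rates, window=5, threshold=0.01):
    # Behavioral check: the system must hold a low error rate for a
    # sustained window, not just produce a single good sample.
    tail = error_rates[-window:]
    return len(tail) == window and all(r < threshold for r in tail)

# A hypothetical incident trace: errors spike, then recover after the fix.
trace = [0.0, 0.42, 0.55, 0.30, 0.02, 0.005, 0.0, 0.0, 0.0, 0.0]
traffic_recovered(trace)  # → True
```

Because the predicate only inspects outcomes (error rates over time), any fix that actually restores traffic passes, regardless of how the agent structured its code.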
Engineering quality rubrics evaluate whether the agent understood the system it was modifying and maintained sound engineering practices. Each task defines weighted criteria that an LLM judge evaluates against the agent's code and execution trace. Rubrics target the errors that separate careful engineering from surface-level attempts:
- Did the agent handle edge cases that only manifest under specific conditions, or assume the happy path?
- Did the agent maintain the codebase's existing conventions, or introduce inconsistencies?
- Did the agent verify that its changes worked under realistic conditions, or stop at compilation?
The per-task composite score combines three weighted components:
- Feature correctness (60%): behavioral tests that verify the system produces the required outcomes.
- Operational correctness (30%): rubrics on deployment, verification, and monitoring behavior.
- Engineering correctness (10%): rubrics on code hygiene, conventions, and regression-test quality.
The benchmark-level score is the average composite across all tasks, scaled to 0–100.
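The scoring arithmetic described above can be sketched in a few lines, using only the weights stated in the list (component scores are assumed to be normalized to [0, 1]):

```python
# Weights from the task composite: feature 60%, operational 30%, engineering 10%.
WEIGHTS = {"feature": 0.60, "operational": 0.30, "engineering": 0.10}

def composite_score(feature, operational, engineering):
    """Per-task composite in [0, 1] from the three component scores."""
    parts = {"feature": feature, "operational": operational, "engineering": engineering}
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)

def benchmark_score(task_scores):
    """Average composite across all tasks, scaled to 0-100."""
    return 100.0 * sum(task_scores) / len(task_scores)
```

For example, a task with perfect behavioral tests but zero rubric credit scores 0.6, and a benchmark of two tasks scoring 0.6 and 0.4 yields 50.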
What we measure
The benchmark evaluates agents across five categories of production software engineering work. Each category targets activities that require reasoning about live systems, not just editing code.
- Infrastructure debugging. Diagnosing and resolving failures in network infrastructure, load balancers, connection pools, and distributed systems.
- Migrations and upgrades. Replacing, upgrading, or porting production systems between technologies while maintaining correctness and minimizing disruption to live traffic.
- CI/CD and deployment. Constructing deployment pipelines, performing safe rollouts, detecting regressions from live metrics, rolling back, and redeploying fixes.
- Observability and incident response. Triaging degraded systems using metrics, dashboards, and logs.
- Distributed systems. Working with real multi-node systems where correctness and performance emerge only under genuine concurrency, network partitions, and failure modes.
Tasks
The benchmark evaluates agents across 80 tasks organized into the five categories above. Tasks are derived from real production incidents, performance optimization projects, and feature development work, not contrived scenarios. All codebases are from production environments; we reconstruct real engineering situations and calibrate them for difficulty.
Representative tasks
Connection pooler migration. Migrate a PostgreSQL connection pooling layer from one technology to another where configuration models are semantically different (time units differ, authentication models differ, and pool identity is defined differently).
The agent must understand these differences, stop the original pooler, deploy the replacement with a correct configuration, and verify that the application maintains connection pooling semantics, session affinity, and authentication under production traffic. Naive syntactic translation produces configs that look correct but fail behavioral tests.
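The time-unit pitfall is easy to illustrate. The sketch below is hypothetical (the config keys and unit conventions are invented, not those of the actual task's poolers): a syntactic copy that keeps the number while the target expects milliseconds silently shrinks every timeout by 1000x, which is exactly the kind of "looks correct, fails behavioral tests" config this task punishes.

```python
def translate_timeouts(src_cfg):
    """Convert second-based timeout keys to a target that expects milliseconds.

    Hypothetical key names; the point is that the *value* must be rescaled,
    not just carried over.
    """
    unit_scale = 1000  # seconds -> milliseconds
    out = {}
    for key, value in src_cfg.items():
        if key.endswith("_timeout"):
            out[key + "_ms"] = value * unit_scale
        else:
            out[key] = value
    return out

src = {"server_idle_timeout": 60, "max_client_conn": 200}
translate_timeouts(src)  # → {'server_idle_timeout_ms': 60000, 'max_client_conn': 200}
```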
SDK code generator migration. Port a production OpenAPI code generator across languages, preserving behavioral equivalence across cumulative tiers that progressively expand scope from core generation logic through CLI tooling, mock infrastructure, and package workflow orchestration.
Each tier is validated against reference implementations compiled from the original source. The source and target languages have fundamentally different type systems, concurrency models, and code generation idioms, so the agent must reason about what the original code achieves and reproduce that behavior natively in the target. In the final tier, the generated SDK must successfully call against a running API endpoint.
Database engine feature and live-traffic cutover. Implement timing statistics in a database engine's source code, build a patched binary, deploy it to a standby node, set up replication to migrate data while the primary continues serving a mixed read-write workload, and cut the application over to the new node without data loss.
In harder variants, the agent receives only a vague ticket and must discover the infrastructure topology, determine connection parameters, and decide on a migration strategy independently. Agents routinely fail by skipping verification before cutover, ignoring write traffic during migration, or misconfiguring remote access on the target node.
Distributed training system deployment. Implement an asynchronous reinforcement learning training algorithm, then deploy the three-component distributed system with an orchestrator, inference server, and trainer to a GPU-enabled Kubernetes cluster on real cloud infrastructure.
The agent must provision the cluster, configure GPU scheduling, set up shared storage, deploy via Helm, and verify that the training loop converges, producing a model that demonstrably learned the target task. This environment runs on actual cloud resources, not containers on a single machine.
Slow query performance incident. The agent receives a PagerDuty-style alert about elevated API latency and must determine whether the bottleneck is in the application layer or the database layer before acting. The environment exposes both API-level and database-level latency metrics side by side, testing the agent's ability to compare signals across layers rather than jumping straight to query optimization.
Once the database is confirmed as the bottleneck, the agent must identify the specific slow query through execution statistics and query plans, implement the correct fix, and verify the improvement without touching application code. Harder variants require diagnosing multiple simultaneous issues where fixing one reveals another.
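The triage decision, application layer versus database layer, can be sketched as a simple attribution check over the two latency signals. The function name and threshold below are illustrative assumptions, not the benchmark's grader:

```python
def bottleneck_layer(api_p95_ms, db_p95_ms, db_share_threshold=0.7):
    """Attribute request latency: if database time dominates total API
    time, the database is the prime suspect; otherwise look at the app."""
    if api_p95_ms <= 0:
        return "unknown"
    share = db_p95_ms / api_p95_ms
    return "database" if share >= db_share_threshold else "application"

bottleneck_layer(480, 430)  # → 'database' (~90% of request time in the DB)
bottleneck_layer(480, 60)   # → 'application'
```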
Deployment failure with data integrity issue. The agent is tasked with rolling back a recent CI/CD pipeline run that deployed a database migration that introduced silent data corruption. The application returns wrong results, but the pipeline is incorrectly green and doesn't raise errors.
The agent must trace the symptom back through the application layer to the database, then through the database state to the specific migration in the pipeline history that caused it. Engineering best practices are enforced: after the fix is made, it must be delivered through the pipeline, not applied directly to the database.
Example: Database engine feature and live-traffic cutover
To illustrate our evaluation methodology concretely, we trace a single task that spans the full deployment lifecycle: implementing a database engine feature, deploying it to production infrastructure, and cutting over a live workload.
Environment

The agent begins with a running source database actively serving a mixed read-write workload, a target database built from source but not yet running, and the PostgreSQL 17.4 source tree in its workspace.
The task is broken up into multiple steps. After implementing four cumulative timing columns in a system statistics view (derived from a real PostgreSQL 18 commit touching ~130 lines across 8 files), the agent must build a patched binary, deploy it to the target, migrate all data (including ongoing writes), then cut the workload over.
In its native form the task is intractable for current models, so we expose four difficulty variants that progressively withhold information (prompt detail, deployment hints, topology discovery, migration strategy), giving a gradient models can climb.
Condensed agent trajectory
A strong solution mirrors real operational deployment behavior:
- Discover the environment. The agent inspects the containers and SSH connectivity. Confirms the source database is healthy and serving traffic.
- Read existing source. Examines the PostgreSQL stats infrastructure to understand how per-relation statistics are stored, reported, and exposed through system views.
- Implement the feature. Modifies the stats struct, reporting functions, SQL function definitions, system view definitions, and catalog metadata. Writes regression tests.
- Build. Compiles the patched binary with the correct configuration flags and version suffix.
- Deploy to target. Copies the binary to the target container via SCP, initializes a cluster, configures remote access and authentication. Starts Postgres.
- Set up replication. Creates a logical publication on the source and subscription on the target. The subscription automatically copies existing data and begins streaming ongoing changes, including the workload's writes and CDC marker rows.
- Wait for sync. Monitors replication lag until the target has caught up.
- Cut over. Updates the workload's endpoint configuration to point at the target. The workload begins sending traffic to the patched database.
- Verify. Confirms the workload is healthy on the target, timing columns return correct values, and CDC marker rows written during migration are present.
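The "wait for sync" step above comes down to comparing write-ahead log positions (LSNs), for example the source's `pg_current_wal_lsn()` against the subscription's reported progress. A sketch of the arithmetic, assuming LSN strings in PostgreSQL's standard `hi/lo` hex format:

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN like '0/16B3740' to an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lag_bytes(source_lsn, replayed_lsn):
    """How many bytes of WAL the target still has to apply."""
    return parse_lsn(source_lsn) - parse_lsn(replayed_lsn)

def caught_up(source_lsn, replayed_lsn, tolerance=0):
    return lag_bytes(source_lsn, replayed_lsn) <= tolerance

lag_bytes("0/16B3740", "0/16B3628")  # → 280 bytes behind
```

An agent that cuts over while `lag_bytes` is large risks exactly the failure mode the rubric penalizes: losing writes that happened during migration.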
Grading
Behavioral tests
| Test | What it verifies |
|---|---|
| Patched binary deployed | The target is running a Postgres binary with the correct version suffix. |
| Timing fields exist | Four timing columns are queryable in the system statistics view. |
| Timing values accurate | VACUUM/ANALYZE operations produce non-zero, consistent timing values. |
| Data integrity | Row counts match between source and target; foreign key relationships intact. |
| CDC markers replicated | Marker rows written during migration are present on the target, proving ongoing writes were captured. |
| Workload healthy | The application is successfully running queries against the target. |
| Old node not routing | Traffic has fully migrated; no queries are hitting the source. |
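The data-integrity and CDC-marker checks in the table reduce to count and set comparisons. A minimal sketch, with hypothetical table names and marker IDs:

```python
def data_integrity_ok(source_counts, target_counts):
    """Per-table row counts must match exactly between source and target."""
    mismatched = {t: (c, target_counts.get(t))
                  for t, c in source_counts.items()
                  if c != target_counts.get(t)}
    return (len(mismatched) == 0, mismatched)

def cdc_markers_ok(written_ids, replicated_ids):
    """Every marker row written during migration must reach the target,
    proving ongoing writes were captured, not just the initial snapshot."""
    missing = set(written_ids) - set(replicated_ids)
    return (len(missing) == 0, missing)
```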
Engineering quality rubric
| Criterion | What it evaluates |
|---|---|
| Maintained codebase conventions | Did the implementation follow existing patterns (naming, macro style, function signatures) in the Postgres source? |
| Verified before cutover | Did the agent confirm the target was healthy and data was intact before redirecting traffic? |
| Handled write traffic correctly | Did the agent use a replication mechanism that captures ongoing writes, or default to a point-in-time snapshot that misses changes? |
| Regression tests meaningful | Did the regression tests actually exercise the feature (run VACUUM, check timing > 0), or just check column existence? |
This task exists as both dockerized and cloudbox variants (see Cloudboxes). In the cloudbox variant, the agent must discover compute instances in an isolated, sandboxed cloud account and use tools to interact with them.
Example: CI/CD pipeline and canary rollout
To illustrate our evaluation across operational environments, we trace a second task: building a CI/CD pipeline and safely deploying a new version of an adtech-optimized load balancer.
Environment

The agent begins with a running Kubernetes cluster, a deployed load balancer actively serving traffic, and observability tooling collecting metrics. Tekton is installed but no CI/CD pipeline is defined. A self-hosted Git repository contains the load balancer source, including a branch (v2-timeout-refactor) with a subtle bug:
```rust
// v1 (correct): timeout wraps the actual network call
let resp = tokio::time::timeout(
    Duration::from_millis(100),
    client.send_request(req), // ← real I/O happens here
).await;

// v2 (broken): timeout wraps a completed value
let resp = client.send_request(req).await; // ← I/O already done
let resp = tokio::time::timeout(
    Duration::from_millis(100),
    async { Ok(resp) }, // ← resolves instantly, timeout is a no-op
).await;
```

This bug only manifests under load, when some backends respond slowly. At low QPS with fast backends everything looks fine, but under production load with variable latency, requests to slow backends hang indefinitely, causing p99 latency spikes.
Task prompt
We need a CI/CD pipeline for our load balancer. The cluster already has CI/CD infrastructure installed, but nothing is wired together. Build a pipeline that can build, test, and deploy the service from our git repository.
Once the pipeline is operational:
1. Verify the system is healthy under load.
2. Deploy a new version of the load balancer from the v2-timeout-refactor branch.
3. Monitor the deployment and handle any issues that arise.
Condensed agent trajectory
A strong solution mirrors real on-call deployment behavior:
- Understand the environment. The agent inspects the cluster, repository, and existing observability. Confirms the baseline system is healthy by querying Prometheus for latency and error rate metrics.
- Build the pipeline. Defines CI pipeline resources that clone the repository, build a container image, and deploy to the cluster. Configures triggers from git events.
- Deploy baseline. Runs the pipeline against the known-good revision. Confirms traffic is flowing and metrics are stable.
- Deploy the refactor. Rolls out the v2 branch using the pipeline. Observes live metrics during the rollout.
- Detect the regression. Notices p99 latency diverging under load. Correlates the timing with the deployment. Identifies that tail latency is climbing while throughput and error rates look normal.
- Roll back. Restores the previous version through the pipeline and confirms recovery in metrics.
- Diagnose and fix. Inspects the code diff between v1 and v2. Identifies the async timeout wrapping issue: the timeout wraps a value that has already been awaited, making it a no-op.
- Redeploy and verify. Pushes the fix through the pipeline. Validates clean behavior under the same load conditions, specifically checking that slow backends are properly timed out.
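The root cause in the "diagnose and fix" step reproduces directly in Python's asyncio, which may make the no-op-timeout pattern easier to see. This is an analogue of the Rust bug, not the task's actual code; `backend()` stands in for real network I/O:

```python
import asyncio

async def backend(delay):
    await asyncio.sleep(delay)  # stands in for a real network call
    return "ok"

async def v1_correct(delay):
    # Timeout wraps the real I/O: a slow backend is cut off at 100ms.
    return await asyncio.wait_for(backend(delay), timeout=0.1)

async def v2_broken(delay):
    # The I/O has already completed before wait_for runs, so the timeout
    # guards an instantly-resolving coroutine: a no-op. Slow backends hang.
    resp = await backend(delay)
    async def done():
        return resp
    return await asyncio.wait_for(done(), timeout=0.1)
```

With a 500 ms backend, `v1_correct` raises a timeout error while `v2_broken` waits the full 500 ms and returns normally, which is exactly why the regression only shows up as p99 latency under load rather than as errors.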
Grading
Behavioral tests
| Test | What it verifies |
|---|---|
| Pipeline operational | The agent built a working CI/CD pipeline that can build and deploy from the git repository. |
| Regression detected | The agent identified a p99 latency regression and took corrective action (rollback, fix, or both). |
| Rollback restores health | After corrective action, the load balancer returns to healthy metrics (p99 <50ms, error rate <1%). |
| Fix deployed and healthy | The agent deployed a corrected v2 that meets the same health criteria as v1 under load. |
| Timeouts enforced | Under a grading harness that adds 500ms backend latency, the fixed v2 properly times out instead of hanging. |
Engineering quality rubric
| Criterion | What it evaluates |
|---|---|
| Safe rollout strategy | Did the agent use a canary, staged, or monitoring-gated rollout, or deploy directly to 100% traffic? |
| Metrics monitored during rollout | Did the agent actively watch Prometheus/Grafana during and after deployment? |
| Correct root cause identified | Did the agent identify the async timeout wrapping issue specifically, not just surface symptoms? |
| Tested fix under failure conditions | Did the agent verify the fix by re-triggering the failure scenario (slow backends) and confirming recovery? |
Example: Distributed training system on cloud infrastructure
This example runs on one of our cloudboxes, a sandboxed cloud environment provisioned from scratch for each agent run. The agent must implement a distributed RL training algorithm, provision a GPU-enabled Kubernetes cluster, and deploy the system end-to-end.
Environment

The agent begins with the source code for a distributed RL training framework, a Helm chart for Kubernetes deployment, and AWS credentials, but no running infrastructure and no algorithm implementation.
Four source files are stubs: the loss function, data loader, sequence packer, and training loop. The agent must implement the algorithm from a paper description, provision cloud infrastructure from scratch, deploy the system, and verify that it trains successfully.
The training framework uses asynchronous off-policy RL, where the inference server can be multiple steps ahead of the trainer. This means the agent must handle importance ratio correction, multi-level masking, and gradient routing. These are details that matter for training convergence but produce no errors if implemented incorrectly.
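The sequence-level importance ratio mentioned above, a geometric mean of token-level ratios, is most safely computed in log space. The sketch below shows only that masked log-space mean; the stop-gradient trick is framework-specific (detaching the ratio in the policy-gradient term) and is omitted here, and the function signature is an illustrative assumption rather than the framework's API:

```python
import math

def sequence_importance_ratio(logp_new, logp_old, mask):
    """Geometric mean of token-level ratios over masked-in tokens.

    exp(mean(logp_new - logp_old)) over tokens where mask == 1,
    computed in log space for numerical stability.
    """
    total = sum((n - o) * m for n, o, m in zip(logp_new, logp_old, mask))
    count = max(sum(mask), 1)  # guard against empty sequences
    return math.exp(total / count)
```

Masking matters: padding tokens in packed sequences must not contribute to the mean, or the ratio is silently wrong while producing no errors, precisely the failure mode the task is designed to surface.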
Condensed agent trajectory
- Read the codebase. The agent examines existing implementations (supervised fine-tuning trainer, other loss functions) to understand the framework's patterns, interfaces, and conventions.
- Understand the algorithm. Reads the paper description provided in the prompt. Identifies the key distinction from standard approaches: sequence-level importance ratios computed as a geometric mean of token-level ratios, with a stop-gradient trick for correct gradient routing.
- Implement the algorithm. Writes the loss function, data loader, sequence packer, and training loop, following existing codebase conventions. Handles numerical stability (log-space computation, clamping before exponentiation) and edge cases (empty sequences, variable-length packed inputs).
- Verify locally. Runs syntax checks and basic tests, confirms imports resolve, shapes are correct, gradients flow through the right variables.
- Provision infrastructure. Creates an EKS cluster with CPU and GPU node groups. Installs the NVIDIA GPU operator. Sets up EFS shared storage with the correct access mode.
- Deploy. Installs the training system via Helm. Verifies all three pods reach running state and GPUs are allocated correctly.
- Monitor training. Watches the training loop for 20 steps. Confirms loss decreases and the model is learning.
- Evaluate. Runs the evaluation benchmark against the final checkpoint. Confirms the model achieves the minimum performance threshold, demonstrating it actually learned the target task, not just that the code ran without errors.
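The "loss decreases" check in the monitoring step can be as simple as comparing windows of the loss curve. The window sizes and threshold below are illustrative, and the check assumes a positive loss scale:

```python
def loss_is_decreasing(losses, head=5, tail=5, min_improvement=0.05):
    """Compare the mean of the first `head` and last `tail` steps and
    require a minimum relative improvement (assumes positive losses)."""
    start = sum(losses[:head]) / head
    end = sum(losses[-tail:]) / tail
    return start > 0 and (start - end) / start >= min_improvement
```

A flat curve fails this check even though every step "completed without failure", which is why the final evaluation against a reward threshold, not mere completion, is the real learning signal.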
Grading
Behavioral tests
| Test | What it verifies |
|---|---|
| Cluster provisioned | EKS cluster is active with CPU and GPU node groups. |
| GPU operator installed | NVIDIA GPU operator is running and GPUs are allocatable. |
| Storage configured | EFS filesystem with ReadWriteMany PVC is accessible to all pods. |
| All pods running | Orchestrator, inference, and trainer pods are in Running state. |
| Trainer unit tests pass | Loss function, data loader, packer, and training loop pass unit tests covering correctness, shapes, and edge cases. |
| Training completes | 20 training steps complete without failure. |
| Model learned | Final checkpoint achieves minimum reward threshold on the evaluation benchmark, proving the model learned the target task. |
Engineering quality rubric
| Criterion | What it evaluates |
|---|---|
| Algorithm understanding | Did the agent identify the core algorithmic distinction before implementing, or treat it as a generic loss function? |
| Gradient routing correct | Did the agent use the stop-gradient trick for importance ratios, or write a naive implementation that computes different gradients? |
| Codebase conventions maintained | Does the implementation follow the same patterns as existing loss functions (parameter names, return types, utility usage)? |
| Cluster state verified before deploy | Did the agent confirm nodes were healthy and GPUs were available before deploying the training workload? |
Key findings
- Model capability doesn't transfer to cloud. We evaluated models on containerized and cloudbox variants of the same tasks. These tasks are based on challenges IaaS companies face, and models struggle to understand and work with cloud infrastructure.
- Models wait for failures they could have seen coming. Agents often fail to query monitoring tooling before making changes or to poll it during a rollout, and so miss latency and error-rate regressions that would be obvious to a human watching a dashboard.
- Verification is a weak link. Agents frequently fix the core issue and then stop. They rarely re-run the scenario that exposed the bug, rarely check that metrics have recovered, and often claim success against tests they never ran.
Cloudboxes
On top of Docker environments, Emulated provides full-fidelity simulations using cloudboxes: sandboxed clouds that reproduce production infrastructure on AWS, Azure, and GCP.
Cloudboxes support real SaaS and IaaS services: ecommerce apps, distributed databases, network infra, serverless runtimes. A task in a cloudbox runs against real cloud APIs, not mocked ones.
Cloudboxes introduce stateful, evolving environments that require long-horizon reasoning about distributed services. Models must track service state as it evolves during the task, parse cloud API responses rather than container output, and operate cloud-native tooling.
Get in touch
If you're interested in partnering, evaluating your models, or contributing environments, reach out to josephgenw@gmail.com. Samples available upon request.