PRIME Intellect
Launching FrontierSWE on the Environments Hub

Today, FrontierSWE is launching on the Environments Hub. FrontierSWE is a coding benchmark by Proximal, containing ultra-long-horizon, open-ended technical challenges such as optimizing compilers or training SOTA models for protein prediction. On average, agents run for 11 hours per task and fail to solve almost all of them.

The goal of the Environments Hub is to give environments a true home that is open and, most importantly, powered by the community. Since its launch, researchers, tinkerers, and companies have contributed over 1,000 environments spanning all kinds of domains and problems.

Beyond the integration in the Environments Hub, we collaborated with Proximal on the granite_inf task for FrontierSWE. This task measures an agent's ability to optimize a model implementation inside an inference engine end to end. At Prime Intellect, we are interested in long-horizon environments like this one for training and evaluating general agent scaffolds.

Performance Engineering as a Long-Horizon Task

While LLM-based ML performance engineering is prominently studied in the context of kernel engineering benchmarks like KernelBench, optimizing entire end-to-end model implementations remains understudied. These optimization problems are particularly impactful for real-world applications: for example, GPU MODE has an ongoing end-to-end model optimization competition with AMD and a $1M+ prize pool.

From an environment perspective, we are interested in adding new tasks to the Environments Hub that are inherently long-horizon. Recent methods have emerged to automate longer-horizon tasks, and providing new environments to train and evaluate these capabilities is important.

The granite_inf task involves optimizing the speed of a forward pass of IBM's Granite-Mamba-2 layer. We intentionally chose a model that had already been optimized by human engineers using existing specialized kernels, so that measurable gains require novel performance engineering beyond the low-hanging fruit.

Which Models Did Well

Only two granite runs reached near parity with the hidden optimized baseline: one Codex run, and one Gemini run close behind. The rest of the field remained well below it.

The strongest runs converged on the same broad strategy: prefill and decode had to be treated as separate problems, and both had to be moved off the eager reference path onto fused Mamba kernels. The strongest Codex run used a more integrated timed path, routing prefill through mamba_split_conv1d_scan_combined(...) and routing decode through causal_conv1d_update(...) and selective_state_update(...).

Gemini got close by rebuilding much of the same kernel family more directly. It reconstructed the varlen metadata path, pushed prefill into the chunked scan interface, and routed decode through the recurrent selective update kernel. Agents that could identify and assemble the right kernel path from the visible building blocks, and keep it numerically stable end to end, were the ones that made real progress on granite.
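The prefill/decode routing the strong runs converged on can be illustrated in miniature. The sketch below is purely illustrative: the stubs stand in for fused kernels (a chunked-scan prefill path and a recurrent single-token decode update, named in the text above); the real implementations operate on GPU tensors and layer caches, not Python lists.

```python
# Illustrative sketch only: stubs stand in for fused GPU kernels.
def fused_prefill(tokens):
    """Stand-in for the fused chunked-scan prefill kernel."""
    return [t * 2 for t in tokens]

def fused_decode_step(token, cache):
    """Stand-in for the recurrent single-token state update kernel."""
    cache.append(token)
    return token * 2

def forward(tokens, cache):
    """Treat prefill and decode as separate problems, as the strong runs did."""
    if not cache:
        # Prefill: process the whole prompt at once and populate the cache.
        cache.extend(tokens)
        return fused_prefill(tokens)
    # Decode: one new token against the existing cache state.
    return [fused_decode_step(tokens[-1], cache)]
```

The point of the split is that the two phases have different shapes: prefill is one large batched scan over the whole prompt, while decode is a long sequence of tiny cache updates, and a single eager path serves neither well.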

Failure Cases

Most granite failures were not about architecture comprehension. The Granite task was not prescriptive about exact precision choices. Before speed was measured, though, each candidate still had to match a trusted reference on hidden prefill and decode workloads, including hidden states, cache state, and final logits. Agents generally understood the Mamba layer, the cache structure, and the need for specialized kernels. The hard part was turning that understanding into a fast path that stayed numerically acceptable under hidden scoring. Many runs found local improvements, especially on decode, but could not extend those gains to prefill or could not preserve the exact behavior needed to hold the full pipeline together.
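The shape of that correctness gate can be sketched in a few lines. Names and tolerances here are illustrative assumptions; the actual FrontierSWE harness and its hidden workloads are not public.

```python
def close_enough(candidate, reference, rtol=1e-3, atol=1e-5):
    """Standard element-wise tolerance check: |c - r| <= atol + rtol * |r|.
    (Tolerances are made up for illustration.)"""
    return all(abs(c - r) <= atol + rtol * abs(r)
               for c, r in zip(candidate, reference))

def score(candidate_logits, reference_logits, timed_run):
    # Correctness gate first: a fast path that drifts numerically is never timed.
    if not close_enough(candidate_logits, reference_logits):
        raise ValueError("candidate diverges from reference; speed not measured")
    return timed_run()
```

This ordering is what made the task hard: a local speedup on decode only counted if hidden states, cache state, and final logits all stayed within tolerance across the full pipeline.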

Missed Opportunities

The Granite task ran in a full CUDA development environment, not just a Triton-only sandbox. Raw PTX, custom CUDA extensions, and deeper low-level kernel work were available in principle. None of the final granite runs used them. Across the final artifacts, no submission shipped .cu, .ptx, or torch.utils.cpp_extension machinery, and among the recoverable trajectories no run invoked the low-level CUDA toolchain directly. The closest any agent came was a strong Claude run that experimented with Triton inline assembly to force mul.rn.f32 and avoid FMA drift. This seems like one of the clearest places granite can still be hill-climbed. The agents mostly stopped at the Triton and wrapper layer, while the environment left room for more aggressive Blackwell-specific optimization.
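The rounding concern behind that inline-assembly experiment can be demonstrated without a GPU: a fused multiply-add rounds once, while a separate round-to-nearest multiply (the effect of mul.rn.f32) followed by an add rounds twice, and the two can disagree in the last bit. A minimal sketch, emulating float32 rounding in pure Python with values constructed so the two paths differ:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (double precision) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Chosen so the exact product a*a needs more than float32's 24 mantissa bits.
a = f32(1.0 + 2.0**-12)   # exactly representable in float32
c = f32(1.0 + 2.0**-11)

# Two roundings: round the product to float32 (like mul.rn.f32), then subtract.
separate = f32(f32(a * a) - c)   # the product rounds down to exactly c -> 0.0

# One rounding: a*a is exact in double precision, so rounding only the final
# result emulates a fused multiply-add (FMA).
fused = f32(a * a - c)           # 2**-24, the bit the separate path lost
```

This last-bit drift is exactly what can push a candidate outside the hidden tolerance band, which is why forcing one rounding behavior or the other can be the difference between passing and failing the correctness gate.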

Evaluating FrontierSWE Using the Environments Hub

As FrontierSWE is hosted on the Environments Hub, the fastest way to get started is using the Prime CLI:

# Install uv (if not installed already)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install the Prime CLI
uv tool install prime
# Authenticate
prime login
# Run an evaluation on FrontierSWE
prime eval run proximal/frontier-swe

For more information, see our documentation and the FrontierSWE environment. We cannot wait to see the progress on this benchmark!