Today we are launching INTELLECT-2: the first 32B-parameter globally decentralized Reinforcement Learning training run where anyone can permissionlessly contribute their heterogeneous compute resources.
This new test-time scaling paradigm offers, for the first time, the possibility for decentralized training to reach state-of-the-art performance.
With the release of OpenAI's o1 and DeepSeek's R1, a second scaling paradigm beyond pre-training emerged last year: one that allows models to spend more time reasoning, optimized through reinforcement learning. In our previous release, we argued why we believe these reasoning models, trained via reinforcement learning, are even better suited for decentralized training approaches than standard LLM pre-training. Today, we are proving this with INTELLECT-2.
Over the past year, we’ve been building all the crucial open-source components needed to scale INTELLECT-2 to a distributed 32B-parameter reinforcement learning training run—with frontier reasoning performance, heterogeneous compute nodes, and permissionless contributions:
Our INTELLECT-2 infrastructure consists of three main components:
This infrastructure enables a decentralized training setup with the following unprecedented features:
Reinforcement learning is inherently more asynchronous than traditional large language model pre-training. In distributed RL, data collection can be decoupled from network training: multiple workers operate in parallel environments, each collecting experiences asynchronously, while a central learner receives and processes these experiences. Because experiences arrive at different times and from different parts of the state space, the collection and learning stages can proceed at their own rates.
Successful applications of asynchronous RL can be seen in Tulu 3 and Llama 4, where a one-step asynchronous RL approach was used to improve training efficiency.
Our own ablation experiments have shown that we can reproduce the results of DeepScaleR without model degradation—even when running with four-step asynchrony (i.e., the policy model used by inference workers is four steps behind). This level of asynchrony enables us to fully hide communication behind computation, even with the relatively weak global interconnect in our decentralized RL training setup.
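To make the idea concrete, here is a minimal single-process sketch of n-step asynchrony. It illustrates the concept rather than the prime-rl implementation; `generate_rollouts` and `train_step` are placeholder hooks.

```python
from collections import deque
import copy

ASYNC_STEPS = 4  # inference may use a policy up to four optimizer steps old

def run_async_rl(policy, generate_rollouts, train_step, num_steps):
    """Toy model of n-step asynchronous RL in a single process.

    `generate_rollouts(snapshot)` stands in for the decentralized inference
    workers and `train_step(policy, rollouts)` for the central learner; in the
    real system both run concurrently, with new checkpoints broadcast to the
    inference workers while they keep generating.
    """
    # Keep the last ASYNC_STEPS + 1 policy versions; the oldest one is what
    # the inference workers are assumed to be sampling from right now.
    snapshots = deque([copy.deepcopy(policy)], maxlen=ASYNC_STEPS + 1)
    for _ in range(num_steps):
        stale_policy = snapshots[0]                 # up to ASYNC_STEPS versions behind
        rollouts = generate_rollouts(stale_policy)  # "inference workers"
        policy = train_step(policy, rollouts)       # "central learner"
        snapshots.append(copy.deepcopy(policy))
    return policy
```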
Beyond enabling a decentralized training setup, the asynchronous RL approach also yields additional efficiency gains by allowing the training and inference engines to be optimized separately. For example, in our prime-rl library, rollout workers can take advantage of vLLM and its full suite of inference optimizations.
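As a rough illustration of the inference side, a rollout worker built on vLLM might look like the sketch below; the checkpoint path, prompt, and sampling settings are placeholders rather than the actual prime-rl configuration.

```python
from vllm import LLM, SamplingParams

# Load the latest policy checkpoint (in the real run, delivered by the
# checkpoint broadcast layer) and generate several completions per prompt
# for GRPO-style training.
llm = LLM(model="/checkpoints/policy-step-0042")
sampling = SamplingParams(n=8, temperature=1.0, max_tokens=4096)

prompts = ["Solve the following problem and show your reasoning: ..."]
outputs = llm.generate(prompts, sampling)

for request in outputs:
    for completion in request.outputs:
        # Each completion would be scored by a verifier and sent back
        # to the trainer together with its token ids.
        print(len(completion.token_ids), completion.text[:80])
```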
Our fully asynchronous online RL training framework prime-rl is open-source, allowing anyone to start globally distributed reinforcement learning runs.
A critical component of our infrastructure is the mechanism for broadcasting new policy models from training workers to all decentralized inference workers as quickly as possible. To achieve this, we built Shardcast, a library for distributing large files via an HTTP-based tree-topology network. It consists of the following components:
A few weeks ago, we released TOPLOC: a locality-sensitive hashing scheme for verifiable inference, designed to detect malicious modifications during inference. It enables us to:
We are excited to test TOPLOC in production with INTELLECT-2. This allows us to truly open up participation—enabling anyone in the world to contribute GPU resources in a fully permissionless way.
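For intuition, the toy sketch below illustrates the general flavor of activation-commitment verification; it is a heavily simplified stand-in and not the actual TOPLOC algorithm or API.

```python
import numpy as np

TOP_K = 128
TOLERANCE = 1e-2  # allow for benign numerical differences across hardware

def commit(hidden_states: np.ndarray) -> dict:
    """Prover side: commit to the indices and values of the top-k activations
    of the (flattened) final hidden states produced during inference."""
    flat = hidden_states.ravel()
    idx = np.argsort(flat)[-TOP_K:]
    return {"indices": idx, "values": flat[idx].copy()}

def verify(commitment: dict, recomputed: np.ndarray) -> bool:
    """Verifier side: recompute the hidden states with a trusted model and
    accept only if the committed values match within tolerance."""
    flat = recomputed.ravel()
    diff = np.abs(flat[commitment["indices"]] - commitment["values"])
    return bool(np.all(diff < TOLERANCE))
```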
A few weeks ago, we announced our public protocol testnet to enable a truly sovereign open-source AI ecosystem.
Today, we are opening up the first permissionless Compute Pool, enabling anyone to run a protocol testnet worker on their GPUs to contribute to INTELLECT-2. Registration, compute resource attestation, slashing of malicious behavior, and more are all settled on the public Ethereum Base testnet. This allows for:
Beyond a number of infrastructure improvements that enable permissionless joining of our compute pools, we have also made other key developments to the protocol.
We integrate TOPLOC verification into our worker so that contributors can attest to the work they've produced and we can efficiently verify the legitimacy of their attestations. This allows us to identify attempts to contribute fake GPUs or to poison the dataset.
To mitigate dishonest behavior, we're experimenting with economic incentives that discourage malicious actions like faking GPUs or submitting fake data. Workers are required to stake an upfront amount of capital, in the form of tokens on Base/Ethereum, that can be slashed (taken away) if a worker is deemed to have acted dishonestly.
In addition, we leave a 24-hour window for a node's work to be validated, or invalidated and slashed, which means we slash at most 24 hours' worth of rewards if a node acts maliciously or tries to game the mechanism.
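The sketch below models this mechanism in plain Python purely for illustration; the amounts, data structures, and function names are assumptions and do not reflect the on-chain contracts.

```python
import time
from dataclasses import dataclass, field

VALIDATION_WINDOW_S = 24 * 60 * 60  # rewards stay slashable for 24 hours

@dataclass
class WorkerAccount:
    stake: float                                  # tokens staked on Base testnet
    pending: list = field(default_factory=list)   # (timestamp, reward) awaiting validation

    def submit_work(self, reward: float) -> None:
        self.pending.append((time.time(), reward))

    def slash(self, penalty: float) -> None:
        # A failed validation burns part of the stake and discards all
        # rewards still inside the 24-hour window.
        self.stake = max(0.0, self.stake - penalty)
        self.pending.clear()

    def settle(self) -> float:
        # Rewards older than the validation window become final and are paid out.
        now = time.time()
        finalized = sum(r for t, r in self.pending if now - t >= VALIDATION_WINDOW_S)
        self.pending = [(t, r) for t, r in self.pending if now - t < VALIDATION_WINDOW_S]
        return finalized
```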
Since we are still in testnet and building on top of Base's testnet chain, these rewards and tokens do not translate to any real value or money, but they demonstrate how payments and economic incentives can be incorporated into a real-world decentralized training setting.
The goal of INTELLECT-2 is to train a state-of-the-art reasoning model with a controllable thinking budget. This means that users and developers can specify, through the system prompt, how many tokens the model should spend thinking about a problem before arriving at its final solution.
This approach makes it much more efficient to serve the resulting model in production: recent work such as ThinkPrune, L1, and DeepScaleR has demonstrated that models specifically trained to reason under tight constraints can solve nearly all problems solvable by unconstrained reasoning models, but do so faster and at substantially lower inference cost. Controlling the reasoning budget via a prompt lets users take advantage of this while still being able to opt for longer inference on highly challenging problems.
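As a hypothetical example of what a budget-conditioned request could look like (the exact control string used by INTELLECT-2 may differ):

```python
# Hypothetical budget-conditioned request; the actual prompt format used by
# INTELLECT-2 may differ.
THINKING_BUDGET = 2048  # maximum number of reasoning tokens

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful reasoning assistant. Think about the problem "
            f"for at most {THINKING_BUDGET} tokens before giving your final answer."
        ),
    },
    {"role": "user", "content": "Prove that the sum of two odd integers is even."},
]
```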
To train a model capable of this, we use QwQ-32B as our base model and follow DeepSeek-R1's approach of applying GRPO with verifiable rewards from the domains of mathematics and coding. During our preliminary experiments, the following components were crucial for achieving good performance while controlling the model's thinking budget:
On top of task rewards that grade model outputs according to the correctness of their responses, we incorporate length rewards to teach the model to adhere to thinking budgets specified in user prompts. We largely follow the work on L1, which samples target lengths from a specified range, injects them into the prompt, and then assigns rewards according to the difference between the target length and the actual response length. Unlike L1, we do not sample target lengths from a continuous range but from a small set of predefined target lengths, as we have observed that this is an easier objective for the model to learn.
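A minimal sketch of such a length reward, with the discrete target set and penalty scale chosen purely for illustration:

```python
import random

TARGET_LENGTHS = [1024, 2048, 4096, 8192]  # illustrative discrete thinking budgets (tokens)
ALPHA = 1e-3                               # illustrative penalty per token of deviation

def sample_target_length() -> int:
    # Draw the budget from a small predefined set (rather than a continuous
    # range) and inject it into the prompt.
    return random.choice(TARGET_LENGTHS)

def length_reward(target_len: int, actual_len: int) -> float:
    # Zero when the response matches the budget, increasingly negative otherwise.
    return -ALPHA * abs(target_len - actual_len)

def total_reward(task_reward: float, target_len: int, actual_len: int) -> float:
    # Task (correctness) reward plus the length-adherence term.
    return task_reward + length_reward(target_len, actual_len)
```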
Beyond making the model more useful, training with length control lets us use heterogeneous inference hardware more efficiently: for each rollout, we assign small thinking budgets to problems handled by inference workers with less GPU memory and compute power, and large thinking budgets to problems on higher-capacity inference workers. This way, we can set a lower maximum generation length on slower nodes and thus achieve roughly equal processing times for our rollouts despite using heterogeneous hardware.
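A toy version of this assignment, with thresholds that are illustrative rather than the values used in the run:

```python
def budget_for_worker(gpu_memory_gb: float) -> int:
    """Assign larger thinking budgets (and max generation lengths) to
    higher-capacity inference workers; thresholds are illustrative."""
    if gpu_memory_gb >= 80:   # e.g. H100 / A100-80GB class nodes
        return 8192
    if gpu_memory_gb >= 48:
        return 4096
    return 2048               # smaller GPUs handle short-budget problems
```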
During our experiments, we found careful data filtering to be crucial for model performance. We found that DeepSeek-R1-Distill-Qwen-7B does not improve when trained with the original DeepScaleR dataset and recipe. We fixed this by filtering heavily for difficulty: by keeping only problems that the model did not solve 100% of the time, rewards improved during training and the resulting model improved on math benchmarks.
To filter the training dataset for INTELLECT-2, we sampled eight responses to each of our problems using DeepSeek-R1-Distill-Qwen-7B to estimate each problem's difficulty. To ensure that we keep only challenging problems for training, we use only problems with a solve rate of 75% or lower.
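In pseudocode terms, this offline filter looks roughly like the following; `sample_completions` and `is_correct` are placeholder hooks for the inference engine and the verifier:

```python
N_SAMPLES = 8        # completions sampled per problem with the reference model
MAX_SOLVE_RATE = 0.75

def filter_by_difficulty(problems, sample_completions, is_correct):
    """Keep only problems the reference model solves at most 75% of the time."""
    kept = []
    for problem in problems:
        completions = sample_completions(problem, N_SAMPLES)
        solve_rate = sum(is_correct(problem, c) for c in completions) / N_SAMPLES
        if solve_rate <= MAX_SOLVE_RATE:
            kept.append(problem)
    return kept
```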
Online Advantage Filtering: During training, problems for which all completions receive the same reward carry no training signal, as their advantages (and therefore their losses) are zero. We filter out these problems and continue running inference until we have a full batch with non-zero advantages. This makes training more efficient, as we don't waste training compute on uninformative samples, and it requires us to spend more time on inference than on training, making the approach highly suitable for leveraging decentralized inference workers.
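A minimal sketch of this filter; `sample_group` is a placeholder for requesting one problem's group of completions and rewards from the inference workers:

```python
def has_signal(rewards) -> bool:
    # If every completion in a group received the same reward, all GRPO
    # advantages are zero and the group contributes no gradient.
    return len(set(rewards)) > 1

def fill_batch(sample_group, batch_size: int):
    """Keep requesting rollout groups until the batch is full of groups
    with non-zero advantages."""
    batch = []
    while len(batch) < batch_size:
        group = sample_group()  # dict with "problem", "completions", "rewards"
        if has_signal(group["rewards"]):
            batch.append(group)
    return batch
```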
For INTELLECT-2, we largely focus on verifiable mathematics and coding problems. As proven effective in prior work such as DeepCoder, we use a subset of tasks from SYNTHETIC-1 that was heavily filtered for quality and difficulty. Our full training dataset can be found on Hugging Face.
INTELLECT-2 is our first truly permissionless run where anyone can join with their own compute resources.
Check out the video for a detailed onboarding flow:
Dashboard: https://app.primeintellect.ai/intelligence (watch the run and contribute compute)
The launch of INTELLECT-2 marks the beginning of large-scale decentralized Reinforcement Learning. Now that the foundational infrastructure is in place, it’s up to all of us to scale it to the highest-impact domains.
Over the coming months, we’ll be:
Check out our team’s presentation from the Decentralized AI Day, which we hosted in San Francisco, for more details on where we're headed next.
A future of truly open superintelligence demands shared effort. Let’s build it together.