INTELLECT-2: Launching the First Globally Distributed Reinforcement Learning Training of a 32B Parameter Model

Today we are launching INTELLECT-2: the first 32B parameter globally decentralized Reinforcement Learning training run where anyone can permissionlessly contribute their heterogeneous compute resources.

This new test-time scaling paradigm offers, for the first time, the possibility for decentralized training to reach state-of-the-art performance.

Paradigm Shift for Decentralized Training

With the release of OpenAI's o1 and DeepSeek's R1, a second scaling paradigm beyond pre-training emerged last year—one that allows models to spend more time reasoning, optimized through reinforcement learning. In our previous release, we argued that these reasoning models, trained via reinforcement learning, are even better suited to decentralized training than standard LLM pre-training. Today, we are proving this with INTELLECT-2.

INTELLECT-2

Over the past year, we’ve been building all the crucial open-source components needed to scale INTELLECT-2 to a distributed 32B-parameter reinforcement learning training run—with frontier reasoning performance, heterogeneous compute nodes, and permissionless contributions:

  • prime-RL: Our new open-source library for fully async distributed reinforcement learning, built on top of our fault-tolerant decentralized training framework, prime.
  • SYNTHETIC-1 & GENESYS: Our dataset and library for crowdsourcing tasks and verifier environments for RL.
  • TOPLOC: Our approach to efficient, verifiable inference—used to validate the computations of all decentralized rollout workers in INTELLECT-2.
  • Protocol Testnet: Provides the infrastructure and economic incentives to aggregate and coordinate global compute resources, enabling a truly sovereign open-source AI ecosystem.
INTELLECT-2 Distributed RL Training Infrastructure

Prime-RL: Our Decentralized Training Framework

Our INTELLECT-2 infrastructure consists of three main components:

  • Inference Rollout Workers: A decentralized swarm of nodes that collect reasoning rollouts from their environment using the latest policy model and compute the corresponding rewards.
  • TOPLOC Validators: Efficiently verify the inference computations of our permissionless rollout workers, enabling a trustless system.
  • GRPO Training Workers: Once we’ve collected newly generated data from the decentralized rollout workers, we train on it using DeepSeek’s GRPO training approach. After training, these trainer nodes broadcast the updated weights to all inference workers via our Shardcast library to enable the next round of data collection (a minimal sketch of this loop follows below).
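
To make the interplay between these components concrete, here is a minimal sketch of the control loop. All names are illustrative rather than the actual prime-rl API: rollout workers generate data with a possibly stale policy, validators check it, and the trainer consumes verified rollouts and broadcasts new weights.

```python
# Illustrative sketch of the INTELLECT-2 control flow. All names are
# hypothetical; this is not the actual prime-rl API.
import random
from collections import deque

POLICY_LAG = 4   # inference may run on a policy up to four steps old
BATCH_SIZE = 8   # rollouts consumed per training step


def generate_rollout(policy_version: int) -> dict:
    """Stand-in for a decentralized inference worker producing one rollout."""
    return {
        "policy_version": policy_version,
        "completion": "<reasoning trace>",
        "reward": random.random(),  # task + length reward, computed worker-side
    }


def verify(rollout: dict) -> bool:
    """Stand-in for a TOPLOC validator checking the rollout's inference proof."""
    return True


def train_step(policy_version: int, batch: list) -> int:
    """Stand-in for a GRPO update on the trainer nodes; returns the new version."""
    return policy_version + 1


def broadcast(policy_version: int) -> None:
    """Stand-in for pushing updated weights to inference workers via Shardcast."""


rollout_buffer = deque()
policy_version = 0

for step in range(3):
    # Inference workers generate data asynchronously, possibly with a stale policy
    # (the worst-case staleness of POLICY_LAG steps is used here for illustration).
    worker_policy = max(0, policy_version - POLICY_LAG)
    for _ in range(BATCH_SIZE):
        rollout = generate_rollout(worker_policy)
        if verify(rollout):
            rollout_buffer.append(rollout)

    # Trainer nodes consume verified rollouts and broadcast the updated policy.
    batch = [rollout_buffer.popleft() for _ in range(BATCH_SIZE)]
    policy_version = train_step(policy_version, batch)
    broadcast(policy_version)
```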

This infrastructure enables a decentralized training setup with the following unprecedented features:

  • Removes communication overhead from the critical path: By leveraging asynchronous reinforcement learning, the broadcast of new policy models is fully overlapped with ongoing inference and training—eliminating communication as a bottleneck.
  • Supports heterogeneous inference nodes: Anyone can generate reasoning traces at their own pace—there’s no requirement for uniform processing speeds across nodes.
  • Low resource requirements: Inference workers, which constitute the majority of compute in this setup, can run on consumer-grade GPUs; for example, a machine with 4×RTX 3090 GPUs is sufficient to contribute to a 32B-parameter model training run.
  • Enables efficient validation: Inference computations can be verified without introducing training bottlenecks.

Asynchronous Reinforcement Learning

Reinforcement learning is inherently more asynchronous than traditional large language model pre-training. In distributed RL, data collection can be decoupled from network training: multiple workers operate in parallel environments, each collecting experiences asynchronously, while a central learner receives and processes these experiences. Because experiences arrive at different times and from different parts of the state space, data collection and policy updates can proceed at their own rates.

Successful applications of asynchronous RL can be seen in Tulu 3 and Llama 4, where a one-step asynchronous RL approach was used to improve training efficiency.

Our own ablation experiments have shown that we can reproduce the results of DeepScaleR without model degradation—even when running with four-step asynchrony (i.e., the policy model used by inference workers is four steps behind). This level of asynchrony enables us to fully hide communication behind computation, even with the relatively weak global interconnect in our decentralized RL training setup.
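
A rough way to see why this level of asynchrony is enough (a simplification, treating step times as constant): with an asynchrony of $k$ steps, the weight broadcast only has to complete within $k$ rollout/training steps rather than before the very next one, so communication stays off the critical path as long as

$$t_{\text{broadcast}} \leq k \cdot t_{\text{step}},$$

where $t_{\text{step}}$ is the duration of one training step and $k$ is the allowed policy delay (here, $k = 4$).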

Comparison of synchronous DeepScaleR training vs asynchronous Prime-RL under varying asynchrony levels. Even with increased delay (up to four steps), Prime-RL matches the performance of synchronous baselines.

Beyond enabling a decentralized training setup, the asynchronous RL approach also provides additional efficiency gains by allowing the training and inference engines to be optimized separately. For example, in our prime-rl library, rollout workers can take advantage of vLLM and its full suite of inference optimizations.

Our fully asynchronous online RL training framework prime-rl is open-source, allowing anyone to start globally distributed reinforcement learning runs.

Shardcast

A critical component of our infrastructure is the mechanism for broadcasting new policy models from training workers to all decentralized inference workers as quickly as possible. To achieve this, we built Shardcast—a library for distributing large files via an HTTP-based tree-topology network. It consists of the following components (a minimal sketch follows below):

  1. Origin Server: The root node that shards a large file and serves the shards via HTTP.
  2. Middle Nodes: Intermediate servers that download shards from upstream servers and re-serve them in a pipelined fashion.
  3. Client Nodes: Leaf nodes that download and reassemble the shards into the original file.
Overview of Shardcast
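
The sketch below illustrates this design under stated assumptions; the function names and shard size are hypothetical, not the actual shardcast API. The origin splits a checkpoint into fixed-size shards, middle nodes cache and re-serve shards as they arrive, and clients reassemble the original file.

```python
# Illustrative sketch of Shardcast's tree-topology relay. Names and constants
# are hypothetical; this is not the actual shardcast API.
from pathlib import Path

SHARD_SIZE = 64 * 1024 * 1024  # 64 MiB shards (illustrative)


def shard_file(path: Path, out_dir: Path) -> list:
    """Origin server: split a large checkpoint into fixed-size shards."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = []
    with path.open("rb") as f:
        for i, chunk in enumerate(iter(lambda: f.read(SHARD_SIZE), b"")):
            shard = out_dir / f"{path.name}.shard{i:05d}"
            shard.write_bytes(chunk)
            shards.append(shard)
    return shards


def relay(shard_bytes: bytes, cache_dir: Path, name: str) -> Path:
    """Middle node: cache a shard downloaded from upstream and re-serve it
    (over HTTP in practice) while later shards are still arriving."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / name
    cached.write_bytes(shard_bytes)
    return cached


def reassemble(shards: list, out_path: Path) -> None:
    """Client node: download all shards and reconstruct the original file."""
    with out_path.open("wb") as f:
        for shard in sorted(shards):
            f.write(shard.read_bytes())
```

Because each shard can be relayed as soon as it is downloaded, the broadcast is pipelined down the tree rather than waiting for the full checkpoint at each hop.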

TOPLOC Validation

A few weeks ago, we released TOPLOC: a locality-sensitive hashing scheme for verifiable inference, designed to detect malicious modifications during inference (a toy illustration follows the list below). It enables us to:

  • Detect modifications to models, prompts, or precision during inference.
  • Remain robust to GPU hardware non-determinism—one of the main challenges in verifiable computing, as GPU operations are typically non-deterministic. TOPLOC works reliably across GPU types, tensor parallel configurations, and attention kernels.
  • Perform validation significantly faster than generation.
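
The toy check below conveys the general flavor of activation-based verification under stated assumptions; it is not the TOPLOC algorithm, and all names, shapes, and tolerances are illustrative. The prover commits to the top-k values of some intermediate activations, and the validator recomputes them (via a pass that is much cheaper than generation) and accepts if they match within a tolerance that absorbs GPU non-determinism.

```python
# Toy illustration of activation-commitment checking. This is NOT the TOPLOC
# algorithm; it only sketches the general idea under stated assumptions.
import numpy as np

TOP_K = 128
TOLERANCE = 1e-2  # slack for GPU non-determinism (illustrative value)


def commit(activations: np.ndarray, k: int = TOP_K):
    """Prover: commit to the indices and values of the k largest-magnitude activations."""
    flat = activations.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]


def validate(recomputed: np.ndarray, commitment) -> bool:
    """Validator: recompute the activations and check the committed values within tolerance."""
    idx, values = commitment
    return bool(np.allclose(recomputed.ravel()[idx], values, atol=TOLERANCE))


# Honest prover: the same computation reproduced with slightly different numerics passes.
acts = np.random.randn(4, 4096).astype(np.float32)
proof = commit(acts)
noise = (np.random.randn(4, 4096) * 1e-4).astype(np.float32)
assert validate(acts + noise, proof)
```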

We are excited to test TOPLOC in production with INTELLECT-2. This allows us to truly open up participation—enabling anyone in the world to contribute GPU resources in a fully permissionless way.

Protocol Testnet

A few weeks ago, we announced our public protocol testnet to enable a truly sovereign open-source AI ecosystem.

Today, we are opening up the first permissionless Compute Pool, enabling anyone to run a protocol testnet worker on their GPUs to contribute to INTELLECT-2. Registration, compute resource attestation, slashing of malicious behavior, and more are all settled on the public Base testnet (an Ethereum L2). This allows for:

  • Global-scale compute aggregation: Our worker is designed so that anyone can run it on any compute in the world, join the distributed network, and ultimately be rewarded for the work their node contributes. This allows us to scale and permissionlessly onboard datacenters from around the world.
  • Sets the foundation for fully decentralized training: All workers joining the compute pool communicate and coordinate in a peer-to-peer manner. This sets the foundation for fully decentralized and permissionless training and fine-tuning of open models, which is essential to enable a truly sovereign open-source AI ecosystem.

Beyond a number of infrastructure improvements that enable permissionless joining of our compute pools, we have also made other key developments on the protocol:

  1. Mechanisms to detect and mitigate attacks & fraud

We integrate our TOPLOC verification into our worker so that contributors can attest to the work they’ve produced, and we can efficiently verify the legitimacy of their attestations. This allows us to identify attempts to contribute fake GPUs or poison the dataset.

  2. Incentive design to encourage honest behaviour

To mitigate dishonest behaviour, we’re experimenting with economic incentives that discourage malicious actions like faking GPUs or submitting fake data. This is done by requiring workers to stake an upfront amount of capital — in the form of tokens on Base/Ethereum — that can be slashed (taken away) if a worker is deemed to have acted dishonestly.

In addition, we leave a 24-hour window for a node’s work to be validated, or invalidated and slashed — which allows us to slash up to 24 hours’ worth of rewards if a node acts maliciously or tries to game the mechanisms.
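
As a toy model of this accounting (illustrative only; the names and numbers are assumptions, not the protocol's actual on-chain logic), rewards for newly submitted work sit in a pending window, settle once the window passes, and are forfeited together with a stake penalty if the work is invalidated in time:

```python
# Toy model of stake, pending rewards, and a 24-hour validation window.
# Illustrative only; not the protocol's actual on-chain accounting.
from dataclasses import dataclass, field

VALIDATION_WINDOW_HOURS = 24.0


@dataclass
class WorkerAccount:
    stake: float
    settled_rewards: float = 0.0
    pending: list = field(default_factory=list)  # list of (hours_since_submission, amount)

    def accrue(self, amount: float) -> None:
        """Reward for newly submitted work enters the pending window."""
        self.pending.append((0.0, amount))

    def tick(self, hours: float) -> None:
        """Advance time; rewards older than the window settle and can no longer be slashed."""
        aged = [(h + hours, a) for h, a in self.pending]
        self.settled_rewards += sum(a for h, a in aged if h >= VALIDATION_WINDOW_HOURS)
        self.pending = [(h, a) for h, a in aged if h < VALIDATION_WINDOW_HOURS]

    def slash(self, stake_penalty: float) -> float:
        """Invalid work detected inside the window: forfeit pending rewards plus a stake penalty."""
        penalty = min(stake_penalty, self.stake)
        lost = sum(a for _, a in self.pending) + penalty
        self.stake -= penalty
        self.pending.clear()
        return lost
```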

Since we are still in testnet and building on top of Base’s testnet chain, these rewards and tokens do not translate to any real value or money, but they demonstrate how payments and economic incentives can be incorporated into a real-world decentralized training setting.

INTELLECT-2 Model & Training Details

The goal of INTELLECT-2 is to train a state-of-the-art reasoning model with a controllable thinking budget. This means that users and developers can, through its system prompt, specify how many tokens the model should spend thinking about a problem before arriving at its final solution.

This approach makes it much more efficient to serve the resulting model in production: recent work such as ThinkPrune, L1, and DeepScaleR has demonstrated that models specifically trained to reason under tight constraints can solve nearly all problems solvable by unconstrained reasoning models, but do so faster and at substantially lower inference cost. Controlling the reasoning budget via the prompt lets users take advantage of this while still allowing them to selectively choose longer inference times for particularly challenging problems.

Results of “L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning”, demonstrating that reasoning models can be trained to adhere to token counts specified in their prompt and model performance predictably improves with higher reasoning budgets. We replicated the paper’s results independently using our framework prime-rl.

To train a model capable of this, we use QwQ-32B as our base model and follow DeepSeek-R1’s approach of applying GRPO with verifiable rewards from the domains of mathematics and coding. During our preliminary experiments, the following components were crucial for achieving good performance while controlling our model’s thinking budget:
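
For reference, GRPO computes advantages without a value network by normalizing each completion's reward against the other completions sampled for the same prompt: for a group of $G$ completions with verifiable rewards $r_1, \dots, r_G$,

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},$$

so a completion is rewarded for being better than its siblings rather than against an absolute baseline.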

Controllable Thinking Budgets via Length Rewards

On top of task rewards that grade model outputs according to the correctness of their responses, we incorporate length rewards in order to teach the model to adhere to the thinking budgets specified in user prompts. We largely follow L1, which samples target lengths from a specified range, injects them into the prompt, and then assigns rewards according to the difference between the target length and the actual response length. Unlike L1, we do not sample target lengths from a continuous range but from a small set of predefined target lengths, as we have observed that this is an easier objective for the model to learn.
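
A minimal sketch of this reward shaping, in the spirit of L1's length penalty; the target set, the prompt wording, and the penalty weight below are illustrative placeholders, not our exact training configuration:

```python
# Illustrative length-reward shaping in the spirit of L1. The target lengths,
# prompt wording, and penalty weight are placeholders, not our training config.
TARGET_LENGTHS = [2_000, 4_000, 8_000]  # small set of predefined thinking budgets (tokens)
LENGTH_PENALTY = 0.001                  # per-token penalty weight (illustrative)


def build_prompt(question: str, target_length: int) -> str:
    """Inject the thinking budget into the prompt so the model can condition on it."""
    return f"{question}\n\nThink for {target_length} tokens."


def reward(task_correct: bool, target_length: int, actual_length: int) -> float:
    """Correctness reward minus a penalty proportional to the length miss."""
    task_reward = 1.0 if task_correct else 0.0
    return task_reward - LENGTH_PENALTY * abs(target_length - actual_length)
```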

Beyond making the model more useful, training with length control lets us use heterogeneous inference hardware more efficiently: for each rollout, we assign small thinking budgets to problems handled on inference workers with less GPU memory and compute power, and large thinking budgets to problems on higher-capacity inference workers. This way, we can set a lower maximum generation length on slower nodes and thus get roughly equal processing times for our rollouts despite using heterogeneous hardware.
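
A small sketch of how such an assignment might look; the capacity thresholds are illustrative, not our production scheduler:

```python
# Illustrative mapping from worker capacity to thinking budget; the memory
# thresholds are placeholders, not our production scheduling logic.
def assign_budget(gpu_memory_gb: int, targets=(2_000, 4_000, 8_000)) -> int:
    """Give smaller thinking budgets (and lower max generation lengths) to smaller nodes."""
    if gpu_memory_gb < 48:
        return targets[0]
    if gpu_memory_gb < 96:
        return targets[1]
    return targets[2]
```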

Offline Data Filtering

During our experiments, we found careful data filtering to be crucial for model performance. We found that DeepSeek-R1-Distill-Qwen-7B does not improve when trained on the original DeepScaleR dataset and recipe. This was fixed by heavy filtering for difficulty: by keeping only problems that the model did not solve 100% of the time, rewards improved during training and the resulting model improved on math benchmarks.

Reward trajectory of DeepSeek-R1-Distill-Qwen-7B trained on the unfiltered (left) and difficulty-filtered (right) versions of the DeepScaleR dataset.

To filter the training dataset for INTELLECT-2, we sampled eight responses to each of our problems using DeepSeek-R1-Distill-Qwen-7B in order to estimate each problem’s difficulty. To ensure that we keep only challenging problems for training, we use only problems with a solve rate of 75% or lower.
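
In code, this filtering step amounts to estimating a per-problem solve rate from the eight samples and dropping the easy problems (a minimal sketch; the verifier results themselves come from sampling DeepSeek-R1-Distill-Qwen-7B):

```python
# Minimal sketch of offline difficulty filtering: keep only problems that the
# reference model solves at most 75% of the time.
N_SAMPLES = 8
MAX_SOLVE_RATE = 0.75


def keep_problem(correct_flags: list) -> bool:
    """correct_flags: verifier results for N_SAMPLES completions of one problem."""
    solve_rate = sum(correct_flags) / len(correct_flags)
    return solve_rate <= MAX_SOLVE_RATE


# A problem solved 7 of 8 times (solve rate 0.875) is dropped; 6 of 8 (0.75) is kept.
assert not keep_problem([True] * 7 + [False])
assert keep_problem([True] * 6 + [False] * 2)
```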

Online Advantage Filtering

During training, problems for which all completions have received the same reward carry no training signal, as their advantages (and therefore losses) are zero. We filter out these problems and continue running inference until we have a full batch with non-zero advantages. This makes training more efficient, as we don’t waste training compute on meaningless samples, and it furthermore requires us to spend more time on inference than on training, making the setup highly suitable for leveraging decentralized inference workers.
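
Concretely, because group-normalized advantages are zero whenever all rewards in a group are identical, the filter only needs to check for reward diversity and keep sampling until the batch is full (an illustrative sketch, not the actual prime-rl implementation):

```python
# Sketch of online advantage filtering: drop rollout groups whose completions all
# received the same reward, and keep sampling until a full batch remains.
# Illustrative; not the actual prime-rl implementation.
def has_signal(group_rewards: list) -> bool:
    """A group carries training signal only if its completions got different rewards."""
    return len(set(group_rewards)) > 1


def fill_batch(sample_group, batch_size: int) -> list:
    """Keep requesting rollout groups from inference workers until the batch is full."""
    batch = []
    while len(batch) < batch_size:
        group_rewards = sample_group()  # one reward per completion for a single problem
        if has_signal(group_rewards):
            batch.append(group_rewards)
    return batch
```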

Training Tasks & Verifiers

For INTELLECT-2, we largely focus on verifiable mathematics and coding problems. As shown to be effective in prior work such as DeepCoder, we use a subset of tasks from SYNTHETIC-1 that was heavily filtered for quality and difficulty. Our full training dataset can be found on Hugging Face.

How To Contribute Compute

INTELLECT-2 is our first truly permissionless run where anyone can join with their own compute resources.

Check out the video for a detailed onboarding flow:

Dashboard: https://app.primeintellect.ai/intelligence (watch the run and contribute compute)

Next Steps

The launch of INTELLECT-2 marks the beginning of large-scale decentralized Reinforcement Learning. Now that the foundational infrastructure is in place, it’s up to all of us to scale it to the highest-impact domains.

Over the coming months, we’ll be:

  • Training agents end-to-end with RL: To fully leverage inference-time compute for scientific and research progress, we need to teach reasoning models to use tools such as code interpreters and other software. Reinforcement learning offers a powerful framework to optimize agent tool use in an end-to-end fashion.
  • Crowdsourcing tasks & verifier environments: We believe open-source has a unique advantage here. Distributed RL is still in its early days, and with the right community and contributions, open-source AI can outpace the closed labs.

Check out our team’s presentation from the Decentralized AI Day, which we hosted in San Francisco, for more details on where we're headed next.

Join Us

A future of truly open superintelligence demands shared effort. Let’s build it together.

Acknowledgements

Thanks to Sami, Justus, Jackmin, Apaz, Felix, Mike, Kushal, Grad, and Johannes for their work on the decentralized RL research; to Manveer, Matthew, Jannik, and Kemal for their work on the protocol and platform; and to Michael Luo for his advice on replicating the DeepScaleR results.