Decentralized Training in the Inference-Time-Compute Paradigm
We have started working on LLM reasoning and have a few updates to share:
Research preview: By fine-tuning on synthetic reasoning traces, we can provide better policy initializations for online reinforcement learning. We use this insight to train INTELLECT-MATH, a state-of-the-art model for mathematical reasoning in the 7B parameter regime (prior to the release of Deepseek-R1) that reduces the training time required to match the previous state-of-the-art by 10x. Additionally, we release NuminaMath-QwQ-CoT-5M, a dataset of five million reasoning traces across 860k mathematics questions, verified for correctness. All of our code, weights, and data are open source.
An Outlook on Decentralized Training in the Inference-Compute Paradigm: The huge number of forward passes required during reinforcement learning lowers communication requirements and fundamentally shifts what compute infrastructure is needed, making globally distributed training the path forward. We explain why and share our research plans for the coming months.
We are hiring: This year, we will train a fully open reproduction of o3, enabling everyone globally to contribute compute resources permissionlessly. If you want to be part of this, join our research team.
Research Preview: 10x Faster Training for State-of-the-Art Mathematical Reasoning
OpenAI’s o1 and o3 represent a major shift in how language models are trained and used - by spending more time thinking about problems during inference, they can better handle complex tasks like solving math problems or writing code.
Reasoning models like these are trained differently compared to traditional chat models: Given a set of problems, they generate candidate solutions that are assigned rewards based on the solutions’ correctness, which allows the models to learn from their own answers and iteratively self-improve without training on human task demonstrations. This iterative process is called online reinforcement learning (RL).
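To make this loop concrete, here is a minimal, self-contained sketch of the data flow in verifier-based online RL. The `generate` and `verify` stubs and the toy questions are our own placeholders rather than PRIME-RL's actual interface, and the gradient update itself is omitted.

```python
# Minimal sketch of the online RL data flow described above (not PRIME-RL's API).
def generate(policy, question, n_samples=4):
    """Sample n candidate solutions from the current policy (stubbed here)."""
    return [f"candidate {i} for: {question}" for i in range(n_samples)]

def verify(candidate, ground_truth):
    """Assign a binary reward: 1.0 if the candidate contains the correct answer."""
    return 1.0 if ground_truth in candidate else 0.0

def rl_step(policy, batch):
    """One online RL step: roll out candidates and score them with a verifier.
    A real implementation would then update the policy with a reward-weighted
    gradient; here we only return the scored rollouts."""
    rollouts = []
    for question, answer in batch:
        for candidate in generate(policy, question):
            rollouts.append((question, candidate, verify(candidate, answer)))
    return rollouts

if __name__ == "__main__":
    batch = [("What is 2 + 2?", "4"), ("What is 3 * 5?", "15")]
    for _, candidate, reward in rl_step(policy=None, batch=batch):
        print(reward, candidate)
```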
PRIME-RL, a recent piece of work from Tsinghua University and UIUC, introduces an online RL algorithm that leverages rewards from implicit process reward models. Using this approach, the authors trained Eurus-2-7B-PRIME, which (prior to the release of Deepseek-R1) substantially outperformed all other 7B parameter models as well as larger models like GPT-4o on mathematical reasoning benchmarks.
While PRIME-RL and related methods lead to great performance, online reinforcement learning tends to be unstable and quite expensive compared to supervised training: since we have to sample solutions from our model at every training step, the whole training process takes a very long time.
Better Initializations with Synthetic Reasoning Data
The online RL procedure that led to Eurus-2-7B-PRIME was initialized with Eurus-2-7B-SFT, a model that was fine-tuned on 230k math questions and solutions in a supervised manner. The solutions for this dataset were generated with Llama-3.1-70B-Instruct, with a specific system prompt that asks the model to reason step by step and indicate what action it is currently taking by outputting strings such as “[ASSESS]”, “[ADVANCE]” or “[VERIFY]” as part of its chain of thought.
While this system prompt helps generate data that nudges the model towards structured reasoning, it is clearly possible to generate higher-quality SFT data with the help of existing reasoning models and thus train a better SFT model. We hypothesize that this can be very impactful: by starting RL from a better policy that already imitates desired reasoning behaviors such as self-correction, we won’t have to wait for these behaviors to emerge from reinforcement learning alone, and will thus get to better model performance faster.
To obtain a synthetic fine-tuning dataset, we used QwQ-32B, the best open reasoning model at the time of writing this article. We started from a set of mathematics questions with ground-truth solutions obtained from the NuminaMath dataset and then used QwQ to generate responses. The responses were filtered for correctness (by comparing to the ground-truth solution) and length (as our base model has a context window of 4096 tokens); no further post-processing was applied.
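As a rough illustration of this filtering step, the sketch below keeps a generated trace only if its final boxed answer matches the ground truth and the trace fits into the context window. The helper names, the simple regex (which ignores nested braces), and the characters-per-token heuristic are assumptions for illustration; a real pipeline would use the base model's tokenizer and a more robust answer comparison.

```python
import re

MAX_TOKENS = 4096  # context window of the base model

def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} expression in a response, if any.
    Simplified: does not handle nested braces like \\boxed{\\frac{1}{2}}."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def approx_token_count(text: str) -> int:
    """Rough token estimate (~4 characters per token); a real pipeline would
    use the base model's tokenizer instead."""
    return len(text) // 4

def keep_trace(response: str, ground_truth: str) -> bool:
    """Keep a QwQ trace only if its final answer matches the ground truth
    and the full trace fits into the context window."""
    answer = extract_boxed(response)
    return (
        answer is not None
        and answer == ground_truth.strip()
        and approx_token_count(response) <= MAX_TOKENS
    )
```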
The resulting SFT dataset consisted of 733k questions with corresponding responses. Following the work on Eurus, we fine-tuned the base model Qwen-7B-Math on this dataset and then applied PRIME-RL using the exact same setup and RL data as the original work.
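As a rough sketch of what fine-tuning on such reasoning traces amounts to, the snippet below computes a prompt-masked cross-entropy loss for a Hugging Face-style causal language model. It is a generic, simplified example, not the actual training code behind INTELLECT-MATH-7B-SFT.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    """Cross-entropy over the response tokens only, so the model learns to
    imitate the reasoning trace rather than the question. Assumes the prompt
    tokenization is a prefix of the tokenization of prompt + response."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    logits = model(full_ids).logits
    # Standard causal-LM shift: token t predicts token t + 1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```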
Performance
Using the described methodology, we train INTELLECT-MATH, a model that outperforms Eurus-2-7B-PRIME on several mathematical reasoning benchmarks and was the best 7B parameter model for mathematical reasoning prior to the release of Deepseek-R1. Furthermore, we show that an earlier checkpoint of the model matched the performance of Eurus-2-7B-PRIME with 10x fewer GPU hours (47 RL steps for INTELLECT-MATH vs. 592 for Eurus-2-PRIME).
Compared to Eurus-2-PRIME, INTELLECT-MATH-7B-SFT and INTELLECT-MATH-7B exhibit the more natural reasoning behavior seen in models like o1 or Deepseek-R1: they summarize the question at hand, verify their reasoning, and “think out loud”, whereas Eurus-2-7B-PRIME has to be specifically instructed to do so in its system prompt. Notably, the main difference between INTELLECT-MATH-7B-SFT and INTELLECT-MATH-7B is that the former is much more verbose and engages in too much unnecessary self-questioning, which often leads to it getting stuck in infinite reasoning loops. During reinforcement learning, we saw its response length decrease, which indicates that the RL stage refines the reasoning behavior learned through SFT rather than merely reinforcing imitation.
Eurus-2-PRIME
System Prompt
When tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process. [ASSESS] [ADVANCE] [VERIFY] [SIMPLIFY] [SYNTHESIZE] [PIVOT] [OUTPUT] You should strictly follow the format below: [ACTION NAME] # Your action step 1 # Your action step 2 # Your action step 3 ... Next action: [NEXT ACTION NAME]
User
Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2,$ respectively. The graphs of both polynomials pass through the two points $(16,54)$ and $(20,53).$ Find $P(0) + Q(0).$
Eurus-2-PRIME
[ASSESS]
We are given two quadratic polynomials $P(x)$ and $Q(x)$ with leading coefficients $2$ and $-2$, respectively. The graphs of both polynomials pass through the points $(16,54)$ and $(20,53)$. We need to find the value of $P(0) + Q(0)$.
[ADVANCE]
Let's express the quadratic polynomials $P(x)$ and $Q(x)$ in the general form:
$ P(x) = 2x^2 + bx + c $
$ Q(x) = -2x^2 + dx + e $
Since the graphs of both polynomials pass through the points $(16,54)$ and $(20,53)$, we can substitute these points into the equations of $P(x)$ and $Q(x)$ to form a system of equations.
For $P(x)$:
$ P(16) = 2(16)^2 + 16b + c = 54 $
$ 512 + 16b + c = 54 $
$ 16b + c = -458 \quad \text{(Equation 1)} $
$ P(20) = 2(20)^2 + 20b + c = 53 $
$ 800 + 20b + c = 53 $
$ 20b + c = -747 \quad \text{(Equation 2)} $
For $Q(x)$:
$ Q(16) = -2(16)^2 + 16d + e = 54 $
$ -512 + 16d + e = 54 $
$ 16d + e = 566 \quad \text{(Equation 3)} $
$ Q(20) = -2(20)^2 + 20d + e = 53 $
$ -800 + 20d + e = 53 $
$ 20d + e = 853 \quad \text{(Equation 4)} $
[ADVANCE]
Now, we can solve the system of equations for $P(x)$: Subtract Equation 1 from Equation 2:
$ (20b + c) - (16b + c) = -747 - (-458) $
$ 4b = -289 $
$ b = -\frac{289}{4} $
Substitute $b = -\frac{289}{4}$ into Equation 1:
$ 16\left(-\frac{289}{4}\right) + c = -458 $
$ -1156 + c = -458 $
$ c = 698 $
So, for $P(x)$, we have:
$ P(x) = 2x^2 - \frac{289}{4}x + 698 $
$ P(0) = 698 $
Next, we can solve the system of equations for $Q(x)$: Subtract Equation 3 from Equation 4:
$ (20d + e) - (16d + e) = 853 - 566 $
$ 4d = 287 $
$ d = \frac{287}{4} $
Substitute $d = \frac{287}{4}$ into Equation 3:
$ 16\left(\frac{287}{4}\right) + e = 566 $
$ 1148 + e = 566 $
$ e = -582 $
So, for $Q(x)$, we have:
$ Q(x) = -2x^2 + \frac{287}{4}x - 582 $
$ Q(0) = -582 $
[ADVANCE]
Finally, we can find $P(0) + Q(0)$:
$ P(0) + Q(0) = 698 + (-582) = 116 $
So, the value of $P(0) + Q(0)$ is $\boxed{116}$.
INTELLECT-MATH-7B-SFT (ours)
System Prompt
Solve the following math problem efficiently and clearly. Think carefully and step by step about your response and reason before providing a final response. Conclude your response with: \n\nTherefore, the final answer is: $\\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.
User
Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2,$ respectively. The graphs of both polynomials pass through the two points $(16,54)$ and $(20,53).$ Find $P(0) + Q(0).$
INTELLECT-MATH-7B-SFT
So I've got this problem here about quadratic polynomials $P(x)$ and $Q(x)$. $P(x)$ has a leading coefficient of $2$, and $Q(x)$ has a leading coefficient of $-2$. Both polynomials pass through the points $(16,54)$ and $(20,53)$. I need to find $P(0) + Q(0)$.
First, since both $P(x)$ and $Q(x)$ are quadratic polynomials, I can express them in the general form:
$P(x) = 2x^2 + bx + c$
$Q(x) = -2x^2 + dx + e$
Wait, actually, the leading coefficient is the coefficient of the highest power of $x$, which for a quadratic is $x^2$. So yes, $P(x) = 2x^2 + bx + c$ and $Q(x) = -2x^2 + dx + e$.
Now, both polynomials pass through the points $(16,54)$ and $(20,53)$. That means when $x=16$, $P(16)=54$, and when $x=20$, $P(20)=53$. Similarly for $Q(x)$.
So, I can set up equations using these points.
First, for $P(x)$:
$P(16) = 2(16)^2 + b \cdot 16 + c = 54$
Similarly, $P(20) = 2(20)^2 + b \cdot 20 + c = 53$
Let me calculate $16^2$ and $20^2$:
$16^2 = 256$
$20^2 = 400$
So,
$P(16): 2 \cdot 256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 \Rightarrow 16b + c = -458$
$P(20): 2 \cdot 400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 \Rightarrow 20b + c = -747$
Now, I have two equations:
1) $16b + c = -458$
2) $20b + c = -747$
I can subtract equation 1 from equation 2 to eliminate $c$:
$(20b + c) - (16b + c) = -747 - (-458)$
$20b + c - 16b - c = -747 + 458$
$4b = -289$
So, $b = -289 / 4 = -72.25$
Wait, but coefficients are usually integers, but maybe not. Let me check.
Now, plug $b$ back into one of the equations to find $c$. Let's use equation 1:
$16(-72.25) + c = -458$
$-1156 + c = -458$
$c = -458 + 1156 = 698$
So, $P(x) = 2x^2 - 72.25x + 698$
Hmm, $P(0)$ would be $c$, which is $698$.
Now, let's do the same for $Q(x)$.
$Q(16) = -2*(16)^2 + d \cdot 16 + e = 54 \Rightarrow -2 \cdot 256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 \Rightarrow 16d + e = 566$
$Q(20) = -2*(20)^2 + d \cdot 20 + e = 53 \Rightarrow -2 \cdot 400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 \Rightarrow 20d + e = 853$
Now, equations for $Q(x)$:
1) $16d + e = 566$
2) $20d + e = 853$
Subtract equation 1 from equation 2:
$(20d + e) - (16d + e) = 853 - 566$
$4d = 287$
$d = 287 / 4 = 71.75$
Now, plug $d$ back into equation 1:
$16 \cdot 71.75 + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 - 1148 = -582$
So, $Q(x) = -2x^2 + 71.75x - 582$
Therefore, $Q(0) = e = -582$
Now, $P(0) + Q(0) = 698 + (-582) = 116$
Wait, but the problem says "present the answer in LaTeX format: \boxed{Your answer}"
So, I should write $\boxed{116}$
But let me double-check my calculations to make sure I didn't make any mistakes.
First, for $P(x)$:
$P(16) = 2(256) + 16b + c = 512 + 16b + c = 54 \Rightarrow 16b + c = -458$
$P(20) = 2*(400) + 20b + c = 800 + 20b + c = 53 \Rightarrow 20b + c = -747$
$Q(16) = -2*(256) + 16d + e = -512 + 16d + e = 54 \Rightarrow 16d + e = 566$
$Q(20) = -2*(400) + 20d + e = -800 + 20d + e = 53 \Rightarrow 20d + e = 853$
Subtracting: $4d = 287 \Rightarrow d = 71.75$
Then, $e = 566 - 16*71.75 = 566 - 1148 = -582$
So, $P(0) = 698$, $Q(0) = -582$, sum is $116$.
Seems correct.
Final Answer
$\boxed{116}$
INTELLECT-MATH-7B (ours)
System Prompt
Solve the following math problem efficiently and clearly. Think carefully and step by step about your response and reason before providing a final response. Conclude your response with: \n\nTherefore, the final answer is: $\\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.
User
Quadratic polynomials $P(x)$ and $Q(x)$ have leading coefficients $2$ and $-2,$ respectively. The graphs of both polynomials pass through the two points $(16,54)$ and $(20,53).$ Find $P(0) + Q(0).$
INTELLECT-MATH-7B
So I have two quadratic polynomials, $P(x)$ and $Q(x)$, with leading coefficients 2 and -2, respectively. Both pass through the points $(16,54)$ and $(20,53)$. I need to find $P(0) + Q(0)$.
First, since both are quadratic polynomials, I can express them in the general form:
$P(x) = 2x² + bx + c$
$Q(x) = -2x² + dx + e$
But since both pass through the points $(16,54)$ and $(20,53)$, I can set up equations using these points.
For $P(x)$:
$P(16) = 2*(16)² + b \cdot 16 + c = 54$
$P(20) = 2(20)² + b \cdot 20 + c = 53$
Similarly for $Q(x)$:
$Q(16) = -2(16)² + d \cdot 16 + e = 54$
$Q(20) = -2(20)² + d \cdot 20 + e = 53$
Let's compute the squares:
$16² = 256$
$20² = 400$
So for $P(x)$:
$2 \cdot 256 + 16b + c = 54 \Rightarrow 512 + 16b + c = 54 \Rightarrow 16b + c = 54 - 512 = -458 \quad ...(1)$
$2 \cdot 400 + 20b + c = 53 \Rightarrow 800 + 20b + c = 53 \Rightarrow 20b + c = 53 - 800 = -747 \quad ...(2)$
For $Q(x)$:
$-2 \cdot 256 + 16d + e = 54 \Rightarrow -512 + 16d + e = 54 \Rightarrow 16d + e = 54 + 512 = 566 \quad ...(3)$
$-2 \cdot 400 + 20d + e = 53 \Rightarrow -800 + 20d + e = 53 \Rightarrow 20d + e = 53 + 800 = 853 \quad ...(4)$
Now, I have two systems of linear equations:
For $P(x)$:
$16b + c = -458 \quad ...(1)$
$20b + c = -747 \quad ...(2)$
Subtract equation (1) from equation (2):
$(20b + c) - (16b + c) = -747 - (-458)$
$4b = -289 \Rightarrow b = -\frac{289}{4}$
Now plug $b$ back into equation (1):
$16\left(-\frac{289}{4}\right) + c = -458 \Rightarrow -1156 + c = -458 \Rightarrow c = -458 + 1156 = 698$
So $P(x) = 2x² - \frac{289}{4}x + 698$
For $Q(x)$:
$16d + e = 566 \quad ...(3)$
$20d + e = 853 \quad ...(4)$
Subtract equation (3) from equation (4):
$(20d + e) - (16d + e) = 853 - 566$
$4d = 287 \Rightarrow d = \frac{287}{4}$
Now plug $d$ back into equation (3):
$16\left(\frac{287}{4}\right) + e = 566 \Rightarrow 1148 + e = 566 \Rightarrow e = 566 -1148= -582$
So $Q(x)=-2x^{2}+\frac{287}{4}x-582$.
Now, $P(0)=c=698$.
$Q(0)=e=-582$.
Therefore, $P(0)+Q(0)=698+(-582)=116$.
Final Answer
$\boxed{116}$.
Revisiting Our Results After the Release of Deepseek-R1
With the release of Deepseek-R1, we now know much more about how to train a strong reasoning model. For the most part, our results are in line with the findings of yesterday’s paper: the Deepseek team reports that incorporating cold-start data for SFT was important for model performance and made R1 a much better model than R1-Zero, and that distillation alone from a strong teacher model, without further RL, is highly effective.
One result that differs from the findings of the Deepseek team is the effect of reinforcement learning on the length of model responses. Deepseek-R1’s responses became longer as RL training progressed, whereas our RL stage made the model outputs less verbose and shorter on average. Whether this is due to the use of PRIME-RL instead of GRPO for reinforcement learning, differences in our SFT data, or another reason entirely is something we will have to study.
NuminaMath-QwQ-CoT-5M: A Large Synthetic Reasoning Dataset
Motivated by the good results of our approach, we decided to go beyond the 766k samples in our SFT data and use QwQ-32B to generate a larger open dataset for the research community. NuminaMath-QwQ-CoT-5M contains five million reasoning traces, verified for correctness, across 860k mathematics questions from NuminaMath. Along with our checkpoints, we hope this data will be helpful for researchers fine-tuning models and working on other aspects of LLM reasoning, such as process reward models.
Going forward, we are particularly excited about leveraging globally distributed compute for generating trillion-token-scale synthetic datasets: Models such as Qwen-Coder, WizardLM, Phi-4, and most notably models distilled from Deepseek-R1, were trained on very large synthetic datasets from strong teacher models and have shown that this is extremely effective for enhancing capabilities.
Dzmitry Bahdanau, the inventor of the attention mechanism, thinks that “mining verifiable reasoning problems […] in massive numbers” is an important contribution to progress in reasoning. With Deepseek-V3 and Deepseek-R1, we now have open-weight models that are competitive with leading closed chat and reasoning models, which means we are well positioned to leverage our decentralized network of GPUs to generate and openly release huge synthetic datasets. Stay tuned for announcements if you want to contribute compute and collaboratively advance open-source AI progress.
5 reasons why open-source reasoning will progress fast: 1⃣ evaluation is easy and quick 2⃣ verifiable reasoning problems can be mined in massive numbers 3⃣ H100 hour now costs as low as 2$ 4⃣ we have scalable open training frameworks 5⃣ we have great open inference software
An Outlook: Decentralized Training in the Inference-Compute Paradigm
With INTELLECT-1, we’ve demonstrated that large-scale model training is no longer confined to large corporations but can be achieved through distributed, community-driven approaches.
Our research on scaling decentralized training to frontier model sizes is making significant progress, and recent developments in inference-time scaling may make this goal even more attainable. We believe that the methods used to train reasoning models are even better suited for decentralized approaches than the standard pre-training paradigm.
During post-training, the model runs many forward passes for every backward pass; how many depends on how much of the generated data is pruned or rejected before each update. This leads to an extremely high ratio of forward to backward passes, whereas in pre-training the ratio is 1:1.
This heavily changes the infrastructure requirements for training. For example, a single, large, fully connected scale-out fabric may not be as necessary as it was in the past. One big positive is that training across geographically distributed datacenters becomes easier, because many of them can focus purely on data generation and pruning rather than on updating the model.
Current approaches to scaling decentralized training to frontier model sizes are constrained by the amount of communication required to synchronize gradients across model replicas. Because the inference-compute paradigm increases the ratio of forward passes to backward passes, communication becomes less of a bottleneck.
Concretely, to compute a single gradient update, we first perform a rollout phase in which we generate large volumes of samples from our model and assign rewards to them using a reward model or automated verifiers. This rollout phase is time-consuming, leading to far fewer gradient updates than in pre-training.
For example, using PRIME-RL, we sampled four responses for each of 256 questions at every rollout step to obtain rewards and compute gradients. In this setup, a single training step took more than 10 minutes! With hyperparameter adjustments (particularly larger rollout phases), it is entirely feasible to design a decentralized RL run that is equivalent to a centralized setup.
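To put rough numbers on this, here is a back-of-the-envelope calculation based on the setup above; the pre-training step time is an illustrative assumption, not a measurement.

```python
# Rollout-heavy RL amortizes each gradient synchronization over many forward passes.
QUESTIONS_PER_STEP = 256      # questions per rollout step (setup above)
SAMPLES_PER_QUESTION = 4      # responses sampled per question (setup above)
RL_STEP_MINUTES = 10          # lower bound we observed per PRIME-RL step
PRETRAIN_STEP_SECONDS = 5     # assumed typical pre-training step time (illustrative)

rollouts_per_update = QUESTIONS_PER_STEP * SAMPLES_PER_QUESTION   # 1024 generations
rl_syncs_per_hour = 60 / RL_STEP_MINUTES                          # at most ~6 syncs
pretrain_syncs_per_hour = 3600 / PRETRAIN_STEP_SECONDS            # ~720 syncs

print(f"{rollouts_per_update} rollouts amortize every gradient synchronization")
print(f"RL: <= {rl_syncs_per_hour:.0f} syncs/hour vs. pre-training: ~{pretrain_syncs_per_hour:.0f} syncs/hour")
```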
This means one thing: there’s no inherent disadvantage to a decentralized training setup. With access to sufficient datasets, enough crowdsourced GPUs and the same knowledge as frontier labs, it will be possible to train frontier reasoning models in a decentralized fashion and make them accessible to everyone.
What’s Next?
We are continuing to work on research with the goal of building open frontier models. Here are some of the problems we are currently tackling to improve models’ reasoning capabilities:
Scaling Distributed Reinforcement Learning to o3-scale: To replicate o3, we will need to scale reinforcement learning to larger base models and datasets, and train for a much longer time. The compute required to train reasoning models will likely exceed even pre-training requirements, necessitating the pooling of global compute resources. This comes with many interesting design challenges, especially in a low-communication regime. Therefore, we are currently extending our training framework prime to support globally distributed reinforcement learning.
Trillion-Token Synthetic Datasets: We want to collect more synthetic data to improve model capabilities. To do so, we need to design new recipes to enhance data quality and set up infrastructure that enables data generation in a distributed network of GPUs.
Training Agents end-to-end: To fully leverage inference-time compute for scientific progress, we need to teach reasoning models to use tools such as code interpreters and other software. Reinforcement learning offers a great framework to optimize agents with tool use end-to-end.
If this sounds exciting to you, join us. We are a small, talent-dense and highly ambitious team and are looking for engineers and researchers to help us build open source AGI.