We are excited to share a preview of our peer-to-peer decentralized inference stack—engineered for consumer GPUs and the 100ms latencies of the public internet—plus a research roadmap that scales it into a planetary-scale inference engine.
At Prime Intellect, we’re building towards an open and decentralized AGI future—one where anyone with consumer-grade hardware and a network connection can meaningfully contribute to and benefit from AGI. This means designing for the real world: heterogeneous GPUs, public internet latency, and unreliable but abundant FLOPs. With the rise of reinforcement learning for reasoning models like DeepSeek R1, inference has moved to center stage, and is now a core component of the entire AI stack:
That’s why our next step is decentralizing inference itself.
However, decentralizing inference is a non-trivial problem, as it is heavily constrained by network communication. In the remainder of this blog post, we go into the technical details of why this is the case and share ideas on how to circumvent the key bottlenecks that we identify. Here is the TL;DR:
Alongside the blog post, we are open-sourcing three research codebases that build towards latency-aware pipeline parallelism:
With today's releases you can already connect private hardware to run inference over public networks. We are actively working on integrating these components into our own stack, like our Protocol testnet and TOPLOC verification scheme, to enable our largest public synthetic data run, SYNTHETIC-2.
In this section, we introduce mathematical notation to reason about the runtime of operations on GPUs to analyze bottlenecking factors during inference workloads on a single node. Much of the notation is borrowed from How To Scale Your Model by Google DeepMind.
If you are already familiar with traditional inference math, skip to Decentralized Inference.
Whenever we run an operation on an accelerator, there are two things that take time:
Computation: How long it takes to compute all the floating-point operations (FLOPs) required for the operation
$$T_{\text{comp}}=\frac{\text{Compute Volume [FLOPs]}}{\text{Compute Speed [FLOPs/s]}}$$
Communication: How long it takes to move all bytes required for the operation from an external device to the compute cores of the accelerator
$$T_{\text{comm}}=\text{Communication Latency}+\frac{\text{Communication Volume [Bytes]}}{\text{Communication Bandwidth [Bytes/s]}}$$
Depending on the communication workload, $T_{\text{comm}}$ is typically dominated by one of the two terms, i.e., it is well approximated by either the latency or bandwidth term. Further, communication happens through all stages of the memory and network hierarchy, but in general we focus on two types of communication.
On-Chip Communication: Data transfers between on-chip memory and compute cores
$$T_{\text{mem}}=\text{Memory Latency}+\frac{\text{Communication Volume [Bytes]}}{\text{Memory Bandwidth [Bytes/s]}}$$
Note that on-chip memory moves for inference workloads are highly bandwidth-bound, so this term is well approximated by the bandwidth term.
$$T_{\text{mem}}\approxeq \frac{\text{Communication Volume [Bytes]}}{\text{Memory Bandwidth [Bytes/s]}}$$
Network Communication: Data transfers between multiple hosts
$$T_{\text{net}}=\text{Network Latency}+\frac{\text{Communication Volume [Bytes]}}{\text{Network Bandwidth [Bytes/s]}}$$
Depending on the communication requirements of the operation, this term may be dominated by latency, bandwidth, or both. We will revisit it in the section Decentralized Inference.
How do compute and communication interact? In the worst case, computation and data transfer cannot be overlapped, so the total time for an operation to complete is the sum of the times each part takes, i.e.
$$T=T_{\text{comp}}+T_{\text{mem}}+T_{\text{net}}$$
In the best case, all computation and communication can be overlapped, and so the total time for the operation to complete is equivalent to the part that takes the longest, i.e.
$$T=\max(T_{\text{comp}},T_{\text{mem}},T_{\text{net}})$$
For the rest of our discussion, we'll use this latter notation of taking the maximum time. This allows us to identify different performance regimes based on which factor becomes the bottleneck - this type of analysis is called roofline analysis:
Generally, we always want to be compute-bound. If not, we are wasting FLOPs, which often equates to wasting money.
For operations running on a single accelerator (no network communication), a good way to tell whether an operation is likely compute-bound or memory bandwidth-bound is to look at its FLOPs per byte, sometimes called the operation intensity, defined as
$$\text{Intensity(OP)}=\frac{\text{Compute Volume [FLOPs]}}{\text{Communication Volume [Bytes]}}$$
It’s useful to think of this as the amount of computation done per memory move. Intuitively, if this ratio is high, we are likely compute-bound, else memory bandwidth-bound. We can quantify precisely in which regime we are by comparing to the accelerator intensity, defined as
$$\text{Intensity}(\textsf{Accelerator})=\frac{\text{Compute Speed [FLOPs/s]}}{\text{Memory Bandwidth [Bytes/s]}}$$
This constant is the peak FLOPs per byte that we can expect from our hardware. For example, for an H100 the accelerator intensity is $989\times 10^{12} / 3.35\times 10^{12} \approx 295$ (GitHub Gist). By simple re-arrangement, we get that we are in the compute-bound regime for operation intensities that exceed our accelerator intensity.
$$\begin{align*}T_{\text{comp}}>T_{\text{mem}}&\iff \frac{\text{Compute Volume [FLOPs]}}{\text{Compute Speed [FLOPs/s]}}>\frac{\text{Communication Volume [Bytes]}}{\text{Memory Bandwidth [Bytes/s]}} \\&\iff \frac{\text{Compute Volume [FLOPs]}}{\text{Communication Volume [Bytes]}}>\frac{\text{Compute Speed [FLOPs/s]}}{\text{Memory Bandwidth [Bytes/s]}} \\&\iff \text{Intensity(OP)} > \text{Intensity(Accelerator)}\end{align*}$$
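This check is easy to script. Below is a minimal sketch, assuming the H100 peak bf16 compute and HBM bandwidth figures quoted above; the helper names are ours.

```python
# Minimal roofline sketch: an operation is compute-bound when its
# FLOPs-per-byte (operation intensity) exceeds the accelerator intensity.
H100_FLOPS = 989e12      # peak dense bf16 FLOPs/s (assumed, as in the text)
H100_MEM_BW = 3.35e12    # HBM bandwidth in bytes/s (assumed, as in the text)

def accelerator_intensity(flops_per_s: float, mem_bw: float) -> float:
    return flops_per_s / mem_bw          # peak FLOPs per byte of the hardware

def op_intensity(compute_flops: float, comm_bytes: float) -> float:
    return compute_flops / comm_bytes    # FLOPs done per byte moved

def is_compute_bound(compute_flops: float, comm_bytes: float) -> bool:
    return op_intensity(compute_flops, comm_bytes) > accelerator_intensity(H100_FLOPS, H100_MEM_BW)

print(accelerator_intensity(H100_FLOPS, H100_MEM_BW))  # ~295 FLOPs/byte
```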
Let’s look at a simple example: Consider the dot product ($\circ$) between two $n$-dimensional vectors $u, v\in \mathbb{R}^{n}$ defined as $u\circ v=\sum_i u_i v_i$. We have the following tensor signature:
$$\textsf{DotProduct}= \text{bf16}[N], \text{bf16}[N]\rightarrow \text{bf16}[1]$$
How many FLOPs do we need to compute and how many bytes do we need to move? Let's count:
FLOPs: $N$ multiplications and $N-1$ additions, for a total of $2N-1$ FLOPs
Bytes: $2N$ bytes to load each of the two bf16 input vectors and $2$ bytes to write the bf16 output, for a total of $4N+2$ bytes
Counting up all the FLOPs and all the bytes, we get the following operation intensity as $N\rightarrow\infty$
$$\text{Intensity}(\textsf{DotProduct})=\frac{N+N-1}{2N+2N+2}=\frac{2N-1}{4N+2}\rightarrow\frac{2N}{4N}=\frac{1}{2}$$
We only do half a FLOP per byte moved. This is pretty bad: we perform fewer FLOPs than byte loads for each dot product, even though GPUs are designed for massively parallel computation. Since $1/2<\text{Intensity}(\textrm{Accelerator})$ for almost all accelerators, dot products are notoriously memory bandwidth-bound (GitHub Gist).
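As a quick numeric sanity check (the vector length is chosen arbitrarily):

```python
# Operation intensity of a bf16 dot product of length N, using the counts above.
N = 4096
flops = 2 * N - 1           # N multiplications + (N - 1) additions
bytes_moved = 4 * N + 2     # two bf16 input vectors + one bf16 output value
print(flops / bytes_moved)  # ~0.5, far below the ~295 FLOPs/byte of an H100
```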
All frontier models generate text auto-regressively: the model processes a context of tokens to obtain a probability distribution over all possible next tokens. It then samples from this distribution, appends the new token, and repeats the process. Naively implemented, the entire context is re-processed at every decoding step, so generating $n$ tokens is an $\mathcal{O}(n^2)$ operation for the feed-forward block and an $\mathcal{O}(n^3)$ operation for the self-attention block. Such order-of-growth is prohibitive for most practical contexts (read: long generations).
To address this, it is standard practice to use a KV cache: instead of recomputing keys and values for all context tokens at each step, we store them in memory. This works efficiently because at each decode step, we only need two small matrix-vector products: one between the current token's query vector and the past keys, and another between the resulting attention weights and the past values. By avoiding the costly recomputation of keys and values for past tokens, we reduce the order-of-growth by a linear factor, to $\mathcal{O}(n)$ for the feed-forward block and $\mathcal{O}(n^2)$ for the attention block.
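To illustrate the mechanism, here is a toy single-head, single-sequence decode step in NumPy. It is a minimal sketch of the caching idea, not an inference engine, and all names and shapes are ours.

```python
import numpy as np

# Toy single-head KV cache: at each decode step we compute q, k, v only for the
# newest token, append k and v to the cache, and attend over all cached entries.
D = 64                                    # head dimension (illustrative)
Wq, Wk, Wv = (np.random.randn(D, D) * 0.02 for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x):
    """x: hidden state of the newest token, shape (D,)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                     # cached instead of recomputed next step
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)       # shapes (t, D)
    scores = K @ q / np.sqrt(D)                       # dot products with all past keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                                # weighted sum of past values

out = decode_step(np.random.randn(D))     # one decode step; the cache now holds one token
```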
With caching, inference splits into two distinct phases:
Prefill: the prompt is processed in one parallel forward pass, computing and caching keys and values for all context tokens
Decode: tokens are generated one at a time, with each step attending over the cached keys and values
As we will see, these two operations are vastly different in terms of their computational needs. To see this, we analyze the operation intensities of one prefill and decode step.
Pre-filling and decoding are forward passes through the model, involving token embedding, positional encoding, self-attention blocks, feed-forward networks, and a language model head. We can focus primarily on matrix multiplications in linear layers since they account for most FLOPs. During training, modeling these matrix multiplications provides accurate estimates of the training step time. For (long-context) inference, we must also consider the self-attention block because of heavy memory moves due to the KV cache. This leaves two main operations to focus on: the matrix multiplications in the linear layers and the self-attention computation itself.
Getting an idea of whether we are compute or memory bandwidth-bound in these two operations will allow us to lower bound the time it takes for one prefill and decode step.
All linear layers are matrix multiplications $XW$ between activations $X\in\mathbb{R}^{B\times T\times D}$ and weights $W\in\mathbb{R}^{D\times F}$. In a forward pass, we typically have the following tensor signature:
$$\textsf{MatMul}: \text{bf16}[B, T, D], \text{bf16}[D,F] \rightarrow \text{bf16}[B,T,F]$$
Like with the dot product, let's check the FLOPs and memory moves required:
FLOPs: each of the $BT$ output rows requires $DF$ multiplications and roughly $DF$ additions, for a total of $2BTDF$ FLOPs
Bytes: we load $2BTD$ bytes of activations and $2DF$ bytes of weights, and write $2BTF$ bytes of output
This gives us the following operation intensity of a matrix multiplication:
$$\text{Intensity}(\textsf{MatMul})=\frac{2BTDF}{2BTD+2DF+2BTF}\approxeq_{BT\ll D,F} \frac{BTDF}{DF}=BT$$
We get a nice simplification of the denominator under the assumption that the total number of batched tokens (the product of the batch and sequence dimensions) is very small in comparison to the input and output dimensions (which is reasonable for Transformers). The operation intensity scales with the number of tokens that are being processed in parallel, i.e., the larger $BT$, the higher the compute density (GitHub Gist). This can be intuited by the fact that the cost of matrix multiplication scales as $\mathcal{O}(n^3)$, while the memory cost only scales as $\mathcal{O}(n^2)$. Recall that we have $T_{\text{comp}}>T_{\text{mem}} \iff \text{Intensity(MatMul)} > \text{Intensity(Accelerator)}$. Assuming an H100, this means that as long as $BT>B_{\text{crit}}:=295$, we are in our happy zone and compute-bound. Now, how likely are we to get into this regime during our two phases of inference?
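Here is a small sketch of how the intensity grows with the number of batched tokens; the feed-forward dimensions below are roughly those of Llama-2 13B and are chosen for illustration.

```python
# Operation intensity of a bf16 matmul X @ W with X: (B, T, D) and W: (D, F).
def matmul_intensity(B, T, D, F):
    flops = 2 * B * T * D * F
    bytes_moved = 2 * B * T * D + 2 * D * F + 2 * B * T * F
    return flops / bytes_moved

# Decoding (T = 1) with Llama-2 13B-like dimensions (D = 5120, F = 13824):
for B in (1, 64, 256, 512):
    print(B, round(matmul_intensity(B, 1, 5120, 13824)))  # grows roughly like B * T
# Compute-bound on an H100 only once B * T exceeds ~295.
```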
In self-attention, we compute attention scores between $T$ query tokens and $S$ key-value tokens and produce an output given by $\text{softmax}(QK^T/{\sqrt{D}})V$. Typically, we have multiple attention heads, but we focus on a single head for simplicity. The tensor signature is given by
$$\textsf{Attention}: \text{bf16}[B, T, D], \text{bf16}[B,S,D], \text{bf16}[B,S,D] \rightarrow \text{bf16}[B,T,D]$$
Let's count FLOPs and bytes for a single head. We will omit element-wise operations, like the scaling factor and softmax, which are negligible and often fused into other operations.
FLOPs: computing the scores $QK^T$ takes $2BSTD$ FLOPs, and multiplying the scores with $V$ takes another $2BSTD$ FLOPs, for a total of $4BSTD$
Bytes: we load $2BTD$ bytes for $Q$, $2BSD$ bytes each for $K$ and $V$, and write $2BTD$ bytes of output, for a total of $4BSD+4BTD$
Putting this together, we get the following operation intensity:
$$\text{Intensity(Attention)}=\frac{4BSTD}{4BSD+4BTD}=\frac{ST}{S+T}$$
The FLOPs-to-bytes ratio depends on the number of key-value tokens $S$ and query tokens $T$ (a quick numeric check follows below):
Prefill: $S=T$, so the intensity is roughly $T/2$ and grows with the number of prompt tokens processed in parallel
Decode: $T=1$, so the intensity is roughly $S/(S+1)\approx 1$, far below the intensity of any modern accelerator
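A tiny check of these two regimes (sequence lengths chosen for illustration):

```python
# Attention operation intensity ST / (S + T) for prefill vs. decode.
def attention_intensity(S, T):
    return S * T / (S + T)

print(attention_intensity(4096, 4096))  # prefill: S = T = 4096 -> 2048 FLOPs/byte
print(attention_intensity(4096, 1))     # decode:  T = 1        -> ~1 FLOP/byte
```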
During decoding, we are memory-bandwidth bound. We can move closer to the compute-bound regime for matrix multiplications by increasing the number of sequences decoded in parallel. However, unlike in training, memory pressure grows with larger batches because each sequence in a batch has a unique KV cache. Let’s compute the maximum batch size that we can fit.
Model weights: We have to store each parameter in the model. Assuming half precision, we get:
$$\text{Model Size}:= 2\cdot P$$
KV cache: Assuming regular multi-head attention (MHA), we have to store $H$-dimensional keys and values across all $L$ layers and $K$ key-value heads for each token. Again, assuming half precision, we get:
$$\text{KV Cache Size} := 2\cdot 2\cdot H\cdot K\cdot L$$
Let's check what this means for Llama-2 13B. The memory for the weights is trivial: we have $P=13\text{B}$, so we need roughly $26\text{ GB}$ to store them. What about the KV cache? The model uses regular MHA and has $H=128$, $K=40$ and $L=40$. Plugging in, we get:
$$2\cdot2 \cdot 128 (H) \cdot 40 (K)\cdot 40 (L)\approx 0.82 \text{ MB}$$
We need almost one megabyte for every token that we cache! If we assume that we generate sequences of length $4096$ (the maximum context of the model), we need $3.35\text{ GB}$ for each sequence that we decode. This is bad news! We would like to decode many sequences in parallel (large $B$) to make better use of our FLOPs but our maximum batch size $B_{max}$ is limited by the KV cache. For some constant $C$ reserved for intermediate activations, we get:
$$B_{max}:=\left\lfloor\frac{\text{Accelerator Memory}-\text{Model Size}-C}{T\cdot \text{KV Cache Size}}\right\rfloor$$
Let’s check the maximum batch sizes for each model in the Llama-2 series for different accelerator memory capacities (GitHub Gist).
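The computation behind these numbers can be sketched as follows; the activation headroom $C$ and the 80 GB H100 example are our own illustrative choices.

```python
# Max decode batch size from the formula above, for Llama-2-style MHA models.
def max_batch_size(accel_mem_bytes, n_params, H, K, L, seq_len=4096, C=2e9):
    model_size = 2 * n_params              # bf16 weights
    kv_per_token = 2 * 2 * H * K * L       # bytes of KV cache per token
    free = accel_mem_bytes - model_size - C
    return max(int(free // (seq_len * kv_per_token)), 0)

# Llama-2 13B on an 80 GB H100 (C is a rough guess for activation headroom):
print(max_batch_size(80e9, 13e9, H=128, K=40, L=40))  # ~15 sequences, well below B_crit = 295
```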
No single GPU can serve the 70B model (without quantization) because just storing the weights requires $140\text{ GB}$ of accelerator memory. All GPUs can run the 7B model, but even for the smallest model we are nowhere near the critical batch size required to be compute-bound on modern accelerators, i.e., $B_{max}\ll B_{crit}$. Thus, we are fundamentally memory bandwidth-bound during auto-regressive decoding.
This precise finding has put a lot of optimization pressure on improving the memory usage of LLMs during inference, which has led to many innovations, such as MQA, GQA, MLA, Sliding Window Attention, Paged Attention, and many more. We will not cover these in detail here, but consider traditional (unoptimized) architectures using multi-head attention (MHA).
Given these findings, how can we approximate final inference performance? For the type of workload that we are interested in — synthetic data generation, reinforcement learning rollouts, and evals — we have a large batch of "requests" available at all times and wish to optimize for total token throughput, typically measured in tokens per second. How can we upper bound this quantity? With each prefill or decoding step, we generate one new token for each sequence in the batch. Thus, we generate $B$ tokens in the time it takes for a prefill or decode step, i.e.
$$\text{Throughput}=\frac{B}{\text{Step Time}}$$
By obtaining a lower bound on the step time, we can upper bound throughput. We have already worked towards lower bounding this quantity in the previous section. Recall that the time for a single decode step is dominated by the time spent on matrix multiplications and on self-attention, i.e.
$$\text{Decode Time}\approxeq T(\textsf{Linear})+T(\textsf{Attention})$$
Further, because a prefill step processes $T\geq 1$ tokens per sequence while a decode step processes only one, so that $BT\geq B$, we always have
$$\text{Prefill Time} \geq \text{Decode Time}$$
And so we can lower bound the step time as
$$\text{Step Time}\geq \text{Decode Time}\approxeq T(\textsf{Linear})+T(\textsf{Attention})$$
As we have seen, $T(\textsf{Linear})$ may be compute-bound or memory-bandwidth bound during decode, depending on the batch size $B$ used. Therefore, we have
$$T(\textsf{Linear})=\max(T_{\text{comp}}(\textsf{Linear}), T_{\text{mem}}(\textsf{Linear}))$$
How many FLOPs and bytes do we spend on matrix multiplications for a full forward pass? A handy rule of thumb is that we roughly do $2BP$ FLOPs and $2P$ byte moves (for a thorough analysis, see Transformer Inference Arithmetic). Given this, we get pretty simple formulas:
$$T_{\text{comp}}(\textsf{Linear})=\frac{2BP}{\text{Compute Speed [FLOPs/s]}}$$
$$T_{\text{mem}}(\textsf{Linear})=\frac{2P}{\text{Memory Bandwidth [Bytes/s]}}$$
Approximating $T(\textsf{Attention})$ is even easier because we have already shown that this operation is fundamentally bandwidth-bound during decoding and dominated by the cost to move the KV cache. To decode a token at step $t=1,\dots,T$, we need to load $t-1$ past keys and values. Therefore, we can approximate the average time for generating $T$ tokens as:
$$T_{\text{mem}}(\textsf{Attention})=\frac{B\cdot T\cdot \text{KV Cache Size}}{2 \cdot \text{Memory Bandwidth [Bytes/s]}}$$
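Before looking at the full picture, here is a minimal sketch that combines these bounds for one configuration. The H100 figures are the same simplifying assumptions as above, and the single-sequence result reproduces the Llama-2 13B figure quoted later in the text.

```python
# Upper bound on decode throughput (tokens/s) from the step-time lower bound above.
H100_FLOPS, H100_MEM_BW = 989e12, 3.35e12   # assumed peak bf16 compute and HBM bandwidth

def decode_throughput_bound(B, n_params, kv_bytes_per_token, seq_len):
    t_linear_comp = 2 * B * n_params / H100_FLOPS       # matmul FLOPs
    t_linear_mem = 2 * n_params / H100_MEM_BW           # moving the weights
    t_linear = max(t_linear_comp, t_linear_mem)
    t_attention = B * seq_len * kv_bytes_per_token / (2 * H100_MEM_BW)  # moving the KV cache
    return B / (t_linear + t_attention)

# Llama-2 13B, single sequence (B = 1), 4096-token generations:
kv_bytes = 2 * 2 * 128 * 40 * 40
print(decode_throughput_bound(1, 13e9, kv_bytes, 4096))  # ~121 tokens/s
```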
Putting all of these formulas together, we can lower bound the step time and, consequently, upper bound throughput. Below we are showing the peak theoretical throughput as a function of batch size (up to the maximum batch size that can be fit) for different models running on different GPUs.
The plot reveals several key insights:
Frontier models rarely fit on a single accelerator, and so it is imperative to shard the model itself to reduce the memory requirements per-device. There are mainly two types of model parallelism:
Tensor Parallelism (TP) shards weight matrices; computes and communicates partial results
Pipeline Parallelism (PP) shards layers; computes layers sequentially and communicates intermediate activations/generated tokens
Depending on the type of parallelism, there are different network synchronization points during a forward pass, and so the network communication contributes to the step time differently. Assuming $N$-way parallelism, the number of communication synchronization points during a single forward pass scales as $\mathcal{O}(NL)$ in TP, and $\mathcal{O}(N)$ in PP. The additional factor $\mathcal{O}(L)$ in TP stems from the fact that it requires communication within each layer to accumulate partial results of sharded matrices. In decentralized settings, we have to opt for the parallelism technique with the smallest communication requirements. This leaves pipeline parallelism as the only viable option for running extremely large models on consumer hardware over public networks. Let's understand pipeline parallelism in detail.
In PP, each device handles a contiguous sequence of layers, called a model shard. Assuming $N$ devices, this intuitively means that each device now handles $1/N$ of the model.
One of the main advantages of pipeline parallelism is that the memory decreases proportional to the degree of parallelism $N$. Enumerating devices as $i=1,\dots,N$, we get
$$\text{Local Model Size}_i\approxeq\frac{\text{Model Size}}{N},\quad\text{Local KV Cache Size}_i=\frac{\text{KV Cache Size}}{N}$$
The memory required to store the shard's parameters and cache decreases! This enables running models of arbitrary size on arbitrary hardware. For example, we can run Llama-2 70B on 2×H100s with $2\cdot 80=160\text{ GB}$, 4×A100s with $4\cdot 40=160\text{ GB}$, or 8×4090s with $8\cdot 24=192\text{ GB}$ of combined memory. However, sharding the model layer-wise introduces communication synchronization points. Nodes are arranged sequentially and have to communicate intermediate results with adjacent pipeline stages. Because of auto-regressive generation, the last stage device has to send the next generated token back to the first stage device before the subsequent decoding step, leading to a ring network topology.
One of the simplest pipeline schedules is a synchronous blocking schedule, where each device waits for the remaining pipeline workers to complete their work.
There is a strict sequential dependency in this process. At any point in time, only one of the $N$ workers is working while all others sit idle! How does the step time compare to a single device? Recall that on single accelerators, we lower-bounded the step time as $\text{Step Time}\geq T(\textsf{Linear})+T(\textsf{Attention})$. For $N$-way pipeline parallelism, both $T(\textsf{Linear})$ and $T(\textsf{Attention})$ scale with the number of layers $L$, and are therefore reduced by a factor $N$ when pipelining. Thus, we also get a factor $N$ reduction in the per-device step time.
$$\text{Local Step Time}_i=\frac{\text{Step Time}}{N}$$
But due to sequential processing, the overall step time is the sum of all local steps, and so we find that the step time is $N\cdot \text{Local Step Time}_i=\text{Step Time}$. This implies that we cannot hope for higher throughput at a fixed batch size as we increase the degree of parallelism. In fact, we cannot even hope to do as well as the single-device step time because of network overhead. The per-step network overhead in the synchronous schedule can be approximated as the sum of the latencies between the pipeline stages:
$$T_{\text{net}}\approxeq N\cdot \text{Network Latency}$$
Note that we make the simplifying assumption that the network cost is dominated by the latency as opposed to bandwidth. Because we assume blocking send and receive operations, the communication cost is additive, and we get the following true step time.
$$\text{Step Time}_{PP}=N\cdot \text{Network Latency} + \text{Step Time}$$
Let's assume a common cross-continental latency of 100ms. Even if we completely ignore the per-device processing time, the peak theoretical throughput is bound by network latency.
$$\text{Throughput}_{PP} = \frac{B}{\text{Step Time}_{PP}} \leq \frac{B}{N \cdot \text{Network Latency}}$$
To see just how much we are bottlenecked, let's check the maximal theoretical throughput for an $N=2$ pipeline decoding a single sequence ($B=1$).
$$\frac{1}{2\cdot (100/1000)} \approx 5\text{ tokens/s}$$
The network overhead puts a hard ceiling on the maximum throughput we can hope for! No optimization which improves the on-device compute and memory transfer time will be able to break this ceiling (we already assumed zero on-device time). Recall that on an H100 we got a theoretical upper bound of 121 tokens/s for single-sequence throughput of Llama-2 13B.
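The same ceiling for a few pipeline depths, as a quick sketch (100 ms latency and single-sequence decoding assumed):

```python
# Hard ceiling on pipeline-parallel throughput from network latency alone:
# each generated token must traverse all N inter-stage hops, even with zero
# on-device compute and memory time.
def pp_throughput_ceiling(N, latency_s=0.1, batch_size=1):
    return batch_size / (N * latency_s)

for N in (2, 4, 8):
    print(N, pp_throughput_ceiling(N))  # 5.0, 2.5, 1.25 tokens/s at 100 ms latency
```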
Trading-Off Memory and Latency
However, we are most interested in increasing peak throughput, which occurs when decoding $B_{max}$ sequences in parallel. Because pipeline parallelism reduces the total memory footprint by a factor $N$, we can scale $B_{max}$ by at least the same factor. Thus, with every doubling of our pipeline, we can (at least) double the maximum batch size. At the same time, scaling the pipeline size introduces more network communication, further increasing latency. Is this trade-off worth it?
$$
\begin{align*}
\text{Throughput}_{PP} &> \text{Throughput} \\
\iff & \frac{NB_{max}}{N \cdot \text{Latency} + \text{Step Time}} > \frac{B_{max}}{\text{Step Time}} \\
\iff & NB_{max} \cdot \text{Step Time} > B_{max} \cdot (N \cdot \text{Latency} + \text{Step Time}) \\
\iff & N \cdot \text{Step Time} > N \cdot \text{Latency} + \text{Step Time} \\
\iff & (N-1) \cdot \text{Step Time} > N \cdot \text{Latency} \\
\implies & \text{Step Time} > \text{Latency} \quad (\text{approx. for large } N)
\end{align*}
$$
Scaling the pipeline only improves peak throughput if the step time is larger than the latency. However, when running over public networks, the expected latency is often higher than the step time (which is often below 20ms).
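A quick numeric check of this condition, with an illustrative 15 ms decode step time:

```python
# Scaling the pipeline from 1 to N devices only improves peak throughput when
# (N - 1) * step_time > N * latency, i.e. step_time > N / (N - 1) * latency.
def pp_scaling_helps(step_time_s, latency_s, N):
    return (N - 1) * step_time_s > N * latency_s

# 15 ms step time vs. 100 ms public-internet latency: never worth it.
print(any(pp_scaling_helps(0.015, 0.100, N) for N in range(2, 65)))   # False
# 15 ms step time vs. ~0.1 ms datacenter latency: always worth it.
print(all(pp_scaling_helps(0.015, 0.0001, N) for N in range(2, 65)))  # True
```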
The sequential nature of data flow through the pipeline is its main disadvantage: naively implemented, it incurs significant device idle time. For training workloads, a large body of literature increases the level of parallelism by overlapping micro-batched forward and backward passes through advanced pipeline schedules.
Can we use a similar idea for inference? Assuming $N$ devices and no latency, the following algorithm is natural: split the full batch into $N$ micro-batches and stagger them through the pipeline, so that at any point in time every device is working on a different micro-batch.
For $N=4$, such an asynchronous micro-batch pipeline schedule would look like this.
While there is some unavoidable pipeline bubble at the beginning and the end of the inference process, this overhead is constant and amortizes over long generations. Can we hope for the same strong scaling of throughput as during training? Unfortunately, as long as we are memory bandwidth-bound rather than compute-bound (which is the case during decoding), the time to process a micro-batch is roughly the same as the time to process the full batch; let's denote this time by $\tau$. The total time is $N\tau$ whether a device processes all $N$ micro-batches back to back or processes the full batch once and waits $(N-1)\cdot\tau$ for pipeline completion.
Synchronous batched schedule: schedule for 4 devices processing a single full batch (instead of the original four micro-batches), unrolled over 4 decoding steps, assuming no latency. At each step, each device synchronously receives, processes, and sends its intermediate results, resulting in idle time.
Either way, a single step takes the same time: we can process the full batch and wait, or process all micro-batches back to back. Thus, we trade increased parallelism for a smaller per-step batch and gain nothing. The reason is that we are memory bandwidth-bound: increasing the batch size yields essentially "free" throughput because the operation time is dominated by moving parameters, not by the actual computation, and micro-batching gives up exactly this benefit.
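A tiny timing model of this argument, with an illustrative 8 ms per-stage (micro-)batch time:

```python
# In the memory bandwidth-bound regime, a micro-batch takes roughly as long to
# process as the full batch, so both schedules need the same time per decoding step.
def step_time_full_batch(N, tau):
    return tau + (N - 1) * tau   # process the full batch once, then wait for the pipeline

def step_time_micro_batched(N, tau):
    return N * tau               # process all N micro-batches back to back

N, tau = 4, 0.008                # 4 pipeline stages, 8 ms per (micro-)batch per stage
print(step_time_full_batch(N, tau), step_time_micro_batched(N, tau))  # both 0.032 s
```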
We will use these findings to work towards a decentralized inference engine designed for real-world networks with high latency and heterogeneous hardware. Achieving higher throughput in such environments will require reimagining accepted truths in inference optimization. Our research will focus on: