BLOG

OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

Introducing OpenDiLoCo, an open-source implementation and scaling of DeepMind’s Distributed Low-Communication (DiLoCo) method, enabling globally distributed AI model training.

Last week, we released the first step in our masterplan by launching the Prime Intellect Compute Exchange to aggregate and orchestrate global compute resources.

Today, we are thrilled to announce a major step forward on the second part by open-sourcing our distributed training framework to enable collaborative model development across globally distributed hardware.

We provide a reproducible implementation of DeepMind's DiLoCo experiments, offering it within a scalable, decentralized training framework. We demonstrate its effectiveness by training a model across two continents and three countries while maintaining 90-95% compute utilization. Furthermore, we scale DiLoCo to 3x the size of the original work, demonstrating its effectiveness for billion-parameter models.

Paper: https://arxiv.org/abs/2407.07852

Code: https://github.com/PrimeIntellect-ai/OpenDiLoCo


OpenDiLoCo Training run visualized

Large language models have revolutionized AI, but training them traditionally requires massive, centralized compute clusters. This concentration of resources has limited who can participate in AI development and slowed the pace of innovation.

Recently, we published a detailed blog post exploring the state-of-the-art in decentralized AI training. In the post, we highlight the most promising approaches and several key challenges that still need to be overcome:

  • Slow interconnect bandwidth
  • Ensuring fault-tolerant training
  • Non-homogeneous hardware settings
  • And more…

OpenDiLoCo is one of our research efforts to overcome the first of these challenges by facilitating efficient training across multiple, poorly connected devices globally.

Key Contributions

  • Replication and Scaling: We have successfully reproduced the original DiLoCo experiments and extended them to the billion-parameter model scale.
  • Open-Source Implementation: We are releasing a scalable implementation built on top of the Hivemind library, making decentralized training accessible to a wide range of developers and researchers. Our framework enables single DiLoCo workers to scale to hundreds of machines through our integration with PyTorch FSDP.
  • Global Decentralized Training: We demonstrated OpenDiLoCo's real-world potential by training a model across two continents and three countries, achieving 90-95% compute utilization.
  • Efficiency Insights: Our ablation studies provide valuable insights into the algorithm's scalability and compute efficiency, paving the way for future improvements.

DiLoCo

Recent work by Google DeepMind has introduced an approach that enables the training of language models on islands of devices that are poorly connected. This method allows for data parallel training on these different islands, requiring synchronization of pseudo gradients only every 500 steps.

DiLoCo introduces an inner-outer optimization algorithm that combines local and global updates. Each worker independently updates its weights many times using a local AdamW optimizer (inner optimization). Every ~500 steps, the algorithm performs an outer optimization using Nesterov momentum, which averages all workers' pseudo gradients (the difference between a worker's weights before and after its inner optimization phase).

This approach significantly reduces the frequency of communication (up to 500 times), thus lowering the bandwidth requirements for distributed training.
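To make the inner-outer structure concrete, here is a minimal sketch of the DiLoCo update in plain PyTorch. It is illustrative rather than the OpenDiLoCo implementation: model, get_batch, loss_fn, and num_outer_steps are assumed to be defined elsewhere, torch.distributed is assumed to be initialized with one rank per DiLoCo worker, and the outer learning rate and momentum are placeholder values in line with the settings reported for DiLoCo.

# Minimal sketch of the DiLoCo inner-outer optimization loop (illustrative, not
# the OpenDiLoCo code). Assumes `model`, `get_batch`, `loss_fn`, and
# `num_outer_steps` are defined, and torch.distributed is initialized with one
# rank per DiLoCo worker.
import torch
import torch.distributed as dist

H = 500  # inner (local) steps between synchronizations

inner_opt = torch.optim.AdamW(model.parameters(), lr=4e-4)

# The outer optimizer operates on a separate copy of the weights.
outer_params = [p.detach().clone() for p in model.parameters()]
outer_opt = torch.optim.SGD(outer_params, lr=0.7, momentum=0.9, nesterov=True)

for outer_step in range(num_outer_steps):
    # Inner optimization: each worker trains independently for H steps.
    for _ in range(H):
        loss = loss_fn(model(get_batch()))
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Pseudo gradient: difference between the weights before and after the
    # inner phase, averaged across all DiLoCo workers.
    for p_outer, p_local in zip(outer_params, model.parameters()):
        pseudo_grad = p_outer.data - p_local.data
        dist.all_reduce(pseudo_grad, op=dist.ReduceOp.SUM)
        pseudo_grad /= dist.get_world_size()
        p_outer.grad = pseudo_grad

    # Outer optimization with Nesterov momentum, then copy the updated weights
    # back into the local model for the next inner phase.
    outer_opt.step()
    outer_opt.zero_grad()
    for p_outer, p_local in zip(outer_params, model.parameters()):
        p_local.data.copy_(p_outer.data)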

OpenDiLoCo

To foster collaboration in this promising research direction to democratize AI, we have released our code for OpenDiLoCo under an open-source license: https://github.com/PrimeIntellect-ai/OpenDiLoCo.

Our implementation is built on top of the Hivemind library. Instead of using torch.distributed for worker communication, Hivemind uses a distributed hash table (DHT) spread across all workers to communicate metadata and synchronize them. The DHT is implemented using the open-source libp2p project. We use Hivemind for inter-node communication between DiLoCo workers and PyTorch FSDP for intra-node communication within each DiLoCo worker.
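As a minimal illustration of the discovery layer, the snippet below starts or joins a Hivemind DHT. It is a sketch only, not OpenDiLoCo's full training integration, and assumes the bootstrap peer's multiaddress is provided via a PEER environment variable (as in the run commands further below).

# Minimal sketch: start or join the Hivemind DHT used for peer discovery
# (illustrative; not OpenDiLoCo's full training integration).
import os
import hivemind

# The first node starts with no initial peers; every other worker passes the
# multiaddress of an existing DHT node via the PEER environment variable.
initial_peers = [os.environ["PEER"]] if "PEER" in os.environ else []

dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/30001"],  # address this node listens on
    initial_peers=initial_peers,
    start=True,
)

# Multiaddresses that other workers can use to reach this node.
for addr in dht.get_visible_maddrs():
    print(addr)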

Our integration with Hivemind enables a real-world decentralized training setup for DiLoCo, making many of its inherent properties usable, such as:

  • On/Off ramping of resources: The amount of available compute can be varied during the training, with new devices and clusters joining and leaving in the middle of training.
  • Fault tolerance: For decentralized training, some devices may be less reliable than others. Through Hivemind's fault-tolerant training, a device could become unavailable at any time without stopping the training process.
  • Peer-to-Peer: There is no master node. All communication is done in a peer-to-peer fashion.

Main Results

As a first step, we replicated the main experimental results from DiLoCo. We trained a model with 150 million parameters on a language modeling task using the C4 dataset.

We show that DiLoCo with 8 replicas significantly outperforms the baseline without any replicas and matches the performance of a stronger baseline with the same compute budget, despite having a 500x lower communication requirement.

Pre-training a 150 million parameter Llama model with 8 DiLoCo workers yields significantly lower perplexity than the baseline without DiLoCo, and even lower than the baseline using an 8 times larger batch size with the same compute budget, while communicating 500 times less.

In addition to replicating the results, we ran several ablation studies for DiLoCo, focusing on the algorithm's scalability with the number of workers and compute efficiency. We also demonstrated that DiLoCo pseudo gradients can be effectively all-reduced in FP16 without any performance degradation.
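As a rough sketch of what FP16 reduction of the pseudo gradients looks like (illustrative, not OpenDiLoCo's exact code path): the tensors are downcast to half precision before the all-reduce, halving the communication volume, and upcast again for the outer optimizer step.

# Hypothetical helper: average FP32 pseudo-gradient tensors across workers
# while communicating in FP16 (illustrative sketch).
import torch
import torch.distributed as dist

def allreduce_pseudo_gradients_fp16(pseudo_grads):
    for g in pseudo_grads:
        g16 = g.to(torch.float16)                  # downcast before communication
        dist.all_reduce(g16, op=dist.ReduceOp.SUM)
        g16 /= dist.get_world_size()
        g.copy_(g16.to(g.dtype))                   # upcast back for the outer optimizer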

For more details, check out the paper on arXiv.

Scaling DiLoCo to Billion Parameter Models

The original DiLoCo paper by DeepMind only experimented with model sizes of up to 400 million parameters. In our work, we scale the method to a model with 1.1 billion parameters. We adopt the same hyperparameters as TinyLlama and use a total batch size of roughly 8 million tokens (a batch size of 8192 sequences with a sequence length of 1024, i.e. about 8.4M tokens). Because this batch size is 4x larger than in our previous experiment, we train for only 44k steps in this experiment.

We compare our results against two baselines: a weak baseline without DiLoCo and without replicas, and a stronger baseline using a 4× larger batch size with data parallelism.

When using 500 local steps (synchronizing between workers every 500 steps), as in the experiment with the 150 million parameter model, we observe suboptimal convergence in the early stages of training. The training dynamics improve in the later stages and might have matched our baseline had we continued training to the full 88k steps.

We also ran an experiment with a local step size of 125. Under this regime, the training dynamics were actually better during the early stages of training. The DiLoCo run with 125 local steps nearly matches the performance of the stronger baseline with the same compute budget, while communicating 125 times less.

1.1B Scaling Experiment: Comparing a 1.1B training with OpenDiLoCo with 4 workers syncing every 500 local steps and every 125 local steps against the two baselines.

While we demonstrate that DiLoCo works at the billion-parameter scale, we believe further work is needed to make it effective with larger batch sizes and increased local steps.

Globally Distributed Training Setting

To showcase decentralized training with OpenDiLoCo across different continents, we used four DiLoCo workers, each with eight H100 GPUs, located in Canada, Finland, and two different states within the United States. The figure below shows the network bandwidth between the workers, which varies between 127 and 935 Mbit/s. We trained our 1.1B parameter model with 500 local steps, all-reducing the pseudo gradients in FP16. Due to the large number of local steps, the four workers run independently for around 67.5 minutes before communicating for gradient averaging. For the outer optimizer step, our experiment shows an average all-reduce time between the workers of 300 seconds.

Network bandwidth between workers: Average bidirectional network bandwidth between the four workers in our decentralized training setup (in Mbit/s) measured using iperf. The nodes were located in three different countries and hosted by different cloud providers: Canada (Hyperstack), Finland (DataCrunch), Texas, US (Voltage Park), Delaware, US (Runpod).

Due to DiLoCo’s significant reduction in communication time, the all-reduce bottleneck only accounts for 6.9% of the training time, minimally impacting the overall training speed.
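As a rough back-of-the-envelope check based on the timings above: each outer step consists of about 67.5 minutes (roughly 4,050 seconds) of independent computation followed by about 300 seconds of communication, i.e. 300 / (4,050 + 300) ≈ 6.9% of wall-clock time spent on the all-reduce.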

Additional training time is spent idling by the fastest worker in our scenario. In future work, we will address this issue by exploring DiLoCo in an asynchronous setting.

Running OpenDiLoCo

Running the code is simple. The only requirement is access to at least two GPUs; they do not need to be co-located. After setting up the environment, create the initial DHT node using:

python ../hivemind_source/hivemind/hivemind_cli/run_dht.py \
   --identity_path fixed_private_key.pem \
   --host_maddrs /ip4/0.0.0.0/tcp/30001

In another terminal, you can start DiLoCo workers using the command below, making sure to set PEER, NUM_DILOCO_WORKERS and WORLD_RANK appropriately:

export PEER=/ip4/192.168.100.20/tcp/30001/p2p/Qmbh7opLJxFCtY22XqwETuo6bnWqijs76YXz7D69MBWEuZ
# change the IP above to your public IP if using across nodes connected via internet
export NUM_DILOCO_WORKERS=4
export WORLD_RANK=0

torchrun --nproc_per_node=8 \
   train_fsdp.py \
   --per-device-train-batch-size 16 \
   --total-batch-size 2048 \
   --total-steps 88_000 \
   --project OpenDiLoCo \
   --lr 4e-4 \
   --model-name-or-path PrimeIntellect/llama-1b-fresh \
   --warmup-steps 1000 \
   --hv.averaging_timeout 1800 \
   --hv.skip_load_from_peers \
   --hv.local_steps 500 \
   --hv.initial-peers $PEER \
   --hv.galaxy-size $NUM_DILOCO_WORKERS \
   --hv.world-rank $WORLD_RANK \
   --checkpoint_interval 500 \
   --checkpoint-path 1b_diloco_ckpt
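With four DiLoCo workers as configured above, the same command is repeated on each of the other nodes, changing WORLD_RANK to 1, 2, and 3 and keeping PEER pointed at the publicly reachable multiaddress of the DHT node.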

You can find more information about running OpenDiLoCo in the README of the GitHub repository.

Running OpenDiLoCo on the PI Compute Platform

Setting up the global orchestration layer for a DiLoCo training run can still be quite challenging. Our PI Compute Platform makes this much easier thanks to our prebuilt OpenDiLoCo Docker image, which comes with all dependencies preinstalled, allowing you to spawn DiLoCo workers with ease.

In future work, we are excited to build an integrated open-source stack into the compute platform that offers smooth solutions for orchestration across multiple clusters, efficiency optimizations, handling node failures, infrastructure monitoring, and much more.

Conclusion & Future Directions

We successfully reproduced the main experiment results of DiLoCo, scaled the method to three times the parameter size of the original work, and demonstrated its application in a real-world decentralized training setting.

For future work, we aim to scale DiLoCo to larger models on a greater number of distributed workers. A few interesting directions include model merging techniques that could improve stability and convergence speed. Additionally, compute idle time could be reduced by implementing methods that perform weight averaging communication asynchronously, interleaving them with the computation for the next outer optimization step.

We are excited about the immediate practical applications of this technique and look forward to building on it for the third part of our masterplan soon: to collaboratively train and contribute to open AI models in high-impact domains like language, agents, code, and science for collective ownership of AI models.

Join Us in Building the Open Future of AI

The power to shape the future of AI should not be concentrated in the hands of a few, but open to anyone with the ability to contribute. We invite you to join us in building a more distributed and impactful future for AI.

Our OpenDiLoCo work has also been accepted at the ES-FoMo workshop at ICML. Reach out if you’re coming to Vienna!


We want to thank Max Ryabinin for his guidance and help with the Hivemind library. His insights have been very helpful for our project. We would also like to thank Arthur Douillard for his work on DiLoCo and for helping us figure out the details of reproducing the original experiments.