INTELLECT–1: Launching the First Decentralized Training of a 10B Parameter Model
We're excited to launch INTELLECT-1, the first decentralized training run of a 10-billion-parameter model, inviting anyone to contribute compute and participate. This brings us one step closer towards open source AGI.
Recently, we published OpenDiLoCo, an open-source implementation and scaling of DeepMind’s Distributed Low-Communication (DiLoCo) method, enabling globally distributed AI model training. We not only replicated and open sourced this work but successfully scaled it to the 1B parameter size.
Now, we are scaling it up by a further 10x to the 10B-parameter model size, roughly 25x beyond the original DiLoCo research. This brings us to the third step of our masterplan: collaboratively training frontier open foundation models, from language and agents to scientific models.
Our goal is to solve decentralized training step by step, ensuring that AGI will be open-source, transparent, and accessible, preventing control by a few centralized entities, and accelerating human progress.
Launch Partners and Contributors
We are excited and grateful to be joined by leading open-source AI players like Hugging Face, SemiAnalysis, Arcee, Hyperbolic, Olas, Akash, Schelling AI and many others contributing compute to this decentralized training run.
How To Contribute Compute
Anyone can now contribute compute to advance open-source AI through our platform, and, later on, even more easily with their own hardware.
As Jack Clark, co-founder of Anthropic, highlighted, no model has yet been efficiently trained at the scale of 10B parameters across globally distributed workers. Our initial OpenDiLoCo run broke through the 1B parameter barrier, and with INTELLECT-1, we are reaching a new level of scale for decentralized training.
"Have there been any distributed training runs above 10bn parameters? All the papers I read about distributed training (e.g., OpenDiLoCo) seem to involve ~1bn parameter LLMs."
DiLoCo enables the training of AI models on islands of poorly connected devices. The method allows for data parallel training on these different islands, requiring synchronization of pseudo-gradients only every few hundred steps. It significantly reduces the frequency of communication (up to 500 times), thus lowering the bandwidth requirements for distributed training.
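To make this concrete, here is a minimal sketch of the DiLoCo loop in PyTorch. It is a simplification, not the OpenDiLoCo implementation: the optimizer hyperparameters follow the published DiLoCo setup (AdamW inner, Nesterov SGD outer), torch.distributed is assumed to already be initialized, and the data loader is assumed to yield enough batches.

```python
import torch
import torch.distributed as dist

def diloco_train(model, data_loader, loss_fn, inner_steps=500, outer_rounds=10):
    # Inner optimizer runs locally; the outer optimizer applies the averaged
    # pseudo-gradient (DiLoCo uses AdamW inner, Nesterov SGD outer).
    inner_opt = torch.optim.AdamW(model.parameters(), lr=4e-4)
    outer_opt = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
    data_iter = iter(data_loader)  # assumed to yield enough batches

    for _ in range(outer_rounds):
        # Snapshot the parameters at the start of the outer round.
        start_params = [p.detach().clone() for p in model.parameters()]

        # Inner loop: purely local training, no cross-node communication.
        for _ in range(inner_steps):
            x, y = next(data_iter)
            loss_fn(model(x), y).backward()
            inner_opt.step()
            inner_opt.zero_grad()

        # Pseudo-gradient = parameter delta over the inner loop; this
        # all-reduce is the only communication, once per outer round.
        for p, p0 in zip(model.parameters(), start_params):
            p.grad = p0 - p.detach()
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
            p.data.copy_(p0)  # rewind to the round's starting point

        outer_opt.step()      # apply the averaged pseudo-gradient
        outer_opt.zero_grad()
```

With the settings above, a worker communicates once every 500 steps instead of every step, which is where the up-to-500x reduction in communication frequency comes from.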
Prime: Our Decentralized Training Framework
Since our initial open-source release, we have improved our distributed training framework across two key dimensions:
1. Algorithmic progress
Many of our ablations, building on our OpenDiLoCo work, have shown great promise in further reducing communication requirements. Notably, by combining int8 quantization of the pseudo-gradients with an outer optimizer synchronization only every 500 steps, our experiments have reduced bandwidth requirements by up to 2000x. These results have held at smaller scales, and we are excited to scale them to larger model sizes.
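As a rough illustration of where the 2000x figure comes from (a sketch with a hypothetical per-tensor quantizer, not our production kernel): int8 payloads are 4x smaller than fp32, and synchronizing every 500 steps cuts communication frequency by a further 500x compared to per-step data parallelism.

```python
import torch

def quantize_int8(pseudo_grad: torch.Tensor):
    # Simple symmetric per-tensor quantization of a pseudo-gradient.
    scale = pseudo_grad.abs().max().clamp(min=1e-12) / 127.0
    q = (pseudo_grad / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

pg = torch.randn(1024)                                     # toy pseudo-gradient
q, scale = quantize_int8(pg)
payload_reduction = pg.element_size() / q.element_size()   # 4.0 (fp32 -> int8)
sync_reduction = 500                                       # sync every 500 steps vs. every step
print(payload_reduction * sync_reduction)                  # 2000.0
```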
2. Scalable Decentralized Training Framework
Distributed training is both an engineering and research challenge. Achieving fault-tolerant training across distributed data centers is something even the largest AI labs are striving to solve today.
We are excited to announce the release of a new decentralized training framework called Prime. Prime enables fault-tolerant training, supports dynamic on-/off-ramping of compute resources, and optimizes communication and routing across a globally distributed network of GPUs.
This framework forms the foundation of our open-source tech stack, designed to support our own and other decentralized training algorithms beyond OpenDiLoCo. By building on this infrastructure, we aim to push the boundaries of what is possible with globally distributed AI training.
Key Features:
ElasticDeviceMesh for Fault Tolerant Training:
In Prime, we’ve added a new distributed abstraction called ElasticDeviceMesh which encapsulates dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node or datacenter.
The ElasticDeviceMesh manages the resizing of the global process groups when nodes join or leave, unlike the standard DeviceMesh in torch distributed, which will crash and require a cold restart to resize the process group.
To know when to resize the process groups, we use a heartbeat mechanism to discover dead nodes and remove them from the process group. A crashing node will attempt a best-effort deathrattle to fail its own heartbeat quickly, sparing its peers the timeout.
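A minimal sketch of the heartbeat idea (an in-memory stand-in, not Prime's actual implementation, which has to work across the internet): workers periodically bump a timestamp, a monitor evicts ranks that have gone silent past a timeout, and a crashing worker marks itself dead on the way out so peers do not have to wait for the timeout.

```python
import threading
import time

HEARTBEAT_INTERVAL = 2.0   # seconds between heartbeats
HEARTBEAT_TIMEOUT = 10.0   # evict a rank after this much silence

heartbeats: dict[int, float] = {}   # rank -> last heartbeat time
dead: set[int] = set()              # ranks that sent a deathrattle

def heartbeat_loop(rank: int, stop: threading.Event):
    try:
        while not stop.is_set():
            heartbeats[rank] = time.monotonic()
            time.sleep(HEARTBEAT_INTERVAL)
    finally:
        # Best-effort deathrattle: mark ourselves dead immediately so peers
        # can resize the process group without waiting for the timeout.
        dead.add(rank)

def live_ranks() -> list[int]:
    # Ranks that are not marked dead and whose heartbeat is still fresh.
    now = time.monotonic()
    return [r for r, t in heartbeats.items()
            if r not in dead and now - t < HEARTBEAT_TIMEOUT]
```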
Asynchronous distributed checkpointing
Due to the size of the model, checkpointing can be an expensive operation, taking up to 20 minutes on the nodes we tested. This would reduce our compute utilization if it blocked the main training process.
In order to minimize the blocking time, we first checkpoint into /dev/shm which is a RAM backed filesystem. This operation is much faster and we can unblock the main training process once the checkpoint has been created in /dev/shm.
We then use two subprocesses to asynchronously copy the checkpoint out of /dev/shm into the checkpoint directory on disk as well as upload it to the remote.
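A sketch of this two-stage checkpoint (the paths and helpers here are hypothetical): the blocking save goes to RAM-backed /dev/shm, training resumes immediately, and background processes handle the slow copy to disk and the upload to remote storage.

```python
import multiprocessing as mp
import os
import shutil
import torch

SHM_PATH = "/dev/shm/ckpt/latest.pt"      # placeholder paths
DISK_PATH = "/checkpoints/latest.pt"

def copy_to_disk(src: str, dst: str):
    shutil.copy(src, dst)

def upload_to_remote(src: str):
    pass  # placeholder: push to object storage, deployment-specific

def checkpoint(model, optimizer):
    os.makedirs(os.path.dirname(SHM_PATH), exist_ok=True)
    state = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
    torch.save(state, SHM_PATH)           # fast: /dev/shm is RAM-backed
    # Training is unblocked as soon as torch.save returns; the slow I/O
    # happens in separate processes.
    mp.Process(target=copy_to_disk, args=(SHM_PATH, DISK_PATH), daemon=True).start()
    mp.Process(target=upload_to_remote, args=(SHM_PATH,), daemon=True).start()
```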
Live checkpoint recovery
Nodes that wish to join the run mid-training need to obtain the most recent state of the model and optimizer before they can contribute to training. They must complete this operation within the time window between two outer steps; otherwise, the checkpoint they receive would be stale.
To do this quickly, we have joining nodes request checkpoints from their peers, each of which hosts a sidecar HTTP server serving the latest checkpoint out of /dev/shm.
Once the joining node has downloaded and initialized the model, it skips the inner steps and joins the outer step with zero pseudo-gradients. This prevents the joining node from stalling the existing nodes: if it also performed the inner steps, it would arrive at the outer step late by however long the checkpoint download and load took, reducing the cluster's compute utilization.
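A sketch of the join flow (the peer address, URL layout, and state-dict keys are hypothetical): the joining node downloads the newest /dev/shm checkpoint from a peer's sidecar HTTP server, loads it, and enters the next outer step with zero pseudo-gradients.

```python
import os
import torch
import urllib.request

def join_mid_training(model, outer_optimizer, peer_addr: str):
    # Peers serve their latest /dev/shm checkpoint over a sidecar HTTP server.
    local_path = "/dev/shm/ckpt/latest.pt"
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    urllib.request.urlretrieve(f"http://{peer_addr}/latest.pt", local_path)

    state = torch.load(local_path, map_location="cpu")
    model.load_state_dict(state["model"])
    outer_optimizer.load_state_dict(state["outer_optimizer"])

    # Skip the inner steps for this round and contribute zero pseudo-gradients,
    # so the first outer all-reduce we participate in is not stalled by us.
    for p in model.parameters():
        p.grad = torch.zeros_like(p)
```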
Custom Int8 All-Reduce Kernel
In our experiments, we found that we are able to perform int8 quantization on the pseudo gradients without any impact on the loss curves. This means that we can reduce the payload size of each outer step all-reduce by 4x if we communicate the pseudo-gradients in int8 instead of fp32.
However, we need to accumulate the reduction in fp32, dequantizing and re-quantizing intermediate results during the all-reduce, which no existing collective communication library supports.
We thus implemented our own fully pipelined ring-reduce kernel in C++ which is JIT compiled as a custom operator using the torch library.
However, with the amount of quantization work we needed to perform, using the built-in torch ops (quantize_per_tensor, scatter_add, index, etc.) was too slow, resulting in underutilization of our target network bandwidth of 4 Gbps.
We thus implemented our own multithreaded uint8 ops in C++ to perform the quantization and dequantization operations, improving the quantization speed by more than 60x.
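The kernel itself is C++, but the numerics it implements can be sketched in a few lines of Python (hypothetical helpers, with no pipelining or multithreading shown): payloads travel as int8, while each hop dequantizes, accumulates in fp32, and re-quantizes so quantization error does not compound in int8.

```python
import torch

def quantize(x: torch.Tensor):
    scale = x.abs().max().clamp(min=1e-12) / 127.0
    return (x / scale).round().clamp(-127, 127).to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

def int8_reduce(chunks):
    """Reduce a list of (int8 tensor, scale) pairs the way one ring step would."""
    acc = torch.zeros(chunks[0][0].shape, dtype=torch.float32)
    for q, scale in chunks:
        acc += dequantize(q, scale)   # accumulate in fp32, never in int8
    acc /= len(chunks)                # average across workers
    return quantize(acc)              # re-quantize before sending onward
```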
Maximizing bandwidth utilization:
By sharding our DiLoCo pseudo-gradients within a node, we can maximize network bandwidth utilization by opening multiple connections at the same time when performing the all-reduce. This yielded a transfer speed improvement of 8x on some nodes.
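A sketch of the sharding trick (the ranks, groups, and shapes are hypothetical): each GPU in a node takes one shard of the flattened pseudo-gradient and all-reduces only that shard over the WAN, so the node keeps several connections busy simultaneously.

```python
import torch
import torch.distributed as dist

def sharded_all_reduce(pseudo_grad: torch.Tensor, local_rank: int,
                       local_world: int, global_group) -> torch.Tensor:
    # Each local rank owns one contiguous shard of the flattened tensor.
    shards = list(pseudo_grad.flatten().chunk(local_world))
    my_shard = shards[local_rank].contiguous()
    # All local ranks reduce their shards concurrently over the internet,
    # opening multiple connections per node instead of one.
    dist.all_reduce(my_shard, group=global_group)
    return my_shard   # shards are recombined locally afterwards
```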
Relying on public IP forwarding resulted in poor or unstable peer-to-peer bandwidth on some compute providers. To mitigate this, we employ VPN technology to optimize the routing of packets between nodes, allowing us to better utilize the available internet bandwidth.
We've improved bandwidth utilization between nodes in similar data center settings by up to 40x compared to our OpenDiLoCo release, achieving connections of up to 4 Gb/s between data centers across the United States.
PyTorch FSDP2 / DTensor ZeRO-3 implementation
To fit the 10B-parameter model training within our given memory resources, we had to shard the model weights, gradients, and optimizer states across the GPUs within each node.
We achieved this using the fully_shard API from PyTorch FSDP2, which wraps the model parameters as DTensors and registers hooks to schedule all-gather and reduce-scatter on the tensors when they are used. FSDP2 also optimizes the collectives by bucketing the parameters into FSDPParamGroups. This allows us to execute the collectives on larger tensors, improving the protocol-to-payload ratio and the overlap from pipelining. We employ the same trick for our pseudo-gradients, bucketing them by layer.
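A sketch of this setup under stated assumptions: the fully_shard import path below matches recent PyTorch releases (it has moved between versions), the model is assumed to expose a Llama-style model.layers list, and torch.distributed is assumed to be initialized.

```python
import torch
from torch.distributed._composable.fsdp import fully_shard  # FSDP2
from torch.distributed.device_mesh import init_device_mesh

def shard_model(model: torch.nn.Module, gpus_per_node: int) -> torch.nn.Module:
    # Shard only across the GPUs inside a node (ZeRO-3 style); cross-node
    # synchronization is handled by the DiLoCo outer step instead.
    mesh = init_device_mesh("cuda", (gpus_per_node,))
    for block in model.layers:           # assumed Llama-style block list
        fully_shard(block, mesh=mesh)    # parameters become sharded DTensors
    fully_shard(model, mesh=mesh)        # shard the remaining root parameters
    return model
```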
CPU Off-Loading
Our DiLoCo optimizer does not add any GPU overhead: all of the tensors it requires are offloaded to CPU memory.
Since we only perform a global sync every few hundred steps, the slower speed of copying and computing the pseudo-gradients on the CPU is negligible relative to the time spent executing the inner steps and the all-reduce.
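A sketch of the offloaded outer step (a simplification of the idea, not Prime's code): the outer optimizer and its master copy of the weights live in pinned CPU memory, the pseudo-gradients are computed on the CPU, and the refreshed weights are copied back to the GPU once per outer round.

```python
import torch

def make_outer_state(model):
    # Master (outer) weights kept on the CPU, pinned for faster transfers.
    cpu_params = [p.detach().to("cpu").pin_memory() for p in model.parameters()]
    outer_opt = torch.optim.SGD(cpu_params, lr=0.7, momentum=0.9, nesterov=True)
    return cpu_params, outer_opt

def outer_step(model, cpu_params, outer_opt):
    # Pseudo-gradient on CPU: outer weights minus the current GPU weights.
    for cpu_p, gpu_p in zip(cpu_params, model.parameters()):
        cpu_p.grad = cpu_p - gpu_p.detach().to("cpu")
    # (The int8 all-reduce of the pseudo-gradients would happen here.)
    outer_opt.step()
    outer_opt.zero_grad()
    # Push the updated outer weights back onto the GPU.
    for cpu_p, gpu_p in zip(cpu_params, model.parameters()):
        gpu_p.data.copy_(cpu_p, non_blocking=True)
```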
For more information, check out the repo, and stay tuned for our upcoming research paper.
All of this enables INTELLECT-1 to train at the 10B-parameter scale with 98% compute utilization across multiple distributed workers. For our production training run, we chose to synchronize every 100 steps, which, on islands of 8xH100 nodes, takes roughly 40 minutes to complete. We quantize the pseudo-gradients to int8, which, combined with synchronizing only every 100 steps, reduces communication requirements by 400x (a 4x smaller payload times 100x fewer synchronizations). The all-reduce synchronization for the DiLoCo outer optimizer using our new framework takes less than 1 minute, keeping communication between nodes to just 1-2% of the total training time.
INTELLECT-1: The First Decentrally Trained 10B Model
INTELLECT-1 is a 10B parameter model based on the Llama-3 architecture.
It will be the first model at this scale to be extensively trained on one of the highest-quality open-source datasets, Fineweb-Edu by Hugging Face.
As some of these datasets contain shards with correlated data, we pre-shuffled them by randomly sampling from 12 streaming dataset iterators and resharding the result; a sketch of this interleaving is shown below. Our pre-shuffled datasets can be found on the Hugging Face Hub. Tokenized with the Llama-3 tokenizer, our data mix comprises over 6 trillion tokens.
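A sketch of the pre-shuffle using the Hugging Face datasets library (the subset names and weights below are placeholders, not our actual mix): randomly sampling from 12 streaming iterators interleaves correlated shards before the data is re-sharded and uploaded.

```python
from datasets import interleave_datasets, load_dataset

SUBSETS = [f"subset_{i}" for i in range(12)]   # placeholder subset names

streams = [
    load_dataset("HuggingFaceFW/fineweb-edu", name=subset,
                 split="train", streaming=True)
    for subset in SUBSETS
]

# Sample uniformly at random from the 12 iterators rather than reading them
# in order, so correlated shards end up mixed in the re-sharded output.
shuffled = interleave_datasets(
    streams,
    probabilities=[1 / len(streams)] * len(streams),
    seed=42,
    stopping_strategy="all_exhausted",
)
```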
We are using the WSD learning rate scheduler, which maintains a constant learning rate after an initial warm-up phase. This approach offers flexibility in the number of tokens we can train on, depending on the number of compute contributions. Toward the end of training, we plan to implement a cool-down phase using a high-quality dataset to further boost performance, as well as post-training optimizations.
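For reference, a WSD (warmup-stable-decay) schedule can be expressed as a simple LambdaLR (a sketch with hypothetical step counts): a linear warm-up, then a constant plateau of indefinite length, with the cool-down phase attached only once the total token budget is known.

```python
from typing import Optional

import torch

def wsd_lambda(warmup_steps: int, decay_start: Optional[int] = None,
               decay_steps: int = 1):
    def fn(step: int) -> float:
        if step < warmup_steps:
            return step / max(warmup_steps, 1)     # linear warm-up
        if decay_start is None or step < decay_start:
            return 1.0                             # stable plateau
        # Linear cool-down (run on high-quality data at the end of training).
        return max(0.0, 1.0 - (step - decay_start) / max(decay_steps, 1))
    return fn

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=4e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, wsd_lambda(warmup_steps=1000))
```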
Next Steps:
INTELLECT-1 is just the first step. We will continue to make progress on our roadmap to scale decentralized training to the largest and most powerful open frontier scientific, reasoning and coding models.
Our roadmap includes:
Scaling to larger, more powerful open frontier models in scientific, reasoning, and coding domains.
Developing a system that allows anyone to contribute their own compute resources, using proof mechanisms to ensure secure and verifiable contributions to decentralized training.
Creating a framework that enables anyone to initiate a decentralized training run, open for contributions from others.
Join Us in Building the Open Future of AI
Open-source AI is the key to countering the risks of centralization, but it requires coordinated efforts to harness distributed compute, talent, and capital to compete with leading closed-source labs.
Join us to advance open and decentralized AI:
If you are relentlessly ambitious and want to make this happen, apply for our open roles.
Acknowledgements
Thanks to Sami, Jackmin, and Johannes for their work on the decentralized research; Manveer, Jannik, and Kemal for their work on the platform; Elie Bakouch for his help with composing the dataset; Arthur Douillard et al. for their work on DiLoCo; Tristan Rice and Junjie Wang for discussions and ideas on fault-tolerant training; Chien-Chin Huang and Iris Zhang for ideas and discussions related to asynchronous distributed checkpointing; Yifu Wang for discussions about Tensor Parallel; and Andrew Gu for convincing us to switch to FSDP2 to get rid of some weird memory allocation issues we were facing with FSDP1.