Open-source AI development still faces significant challenges in keeping pace with its closed-source competitors, who are rapidly expanding their capabilities by deploying co-located H100 clusters of up to hundreds of thousands of interconnected GPUs to train their state-of-the-art models. To match Big Tech's rapid build-out of GPU clusters, the open-source community must overcome the limitations of traditional computing infrastructure and find ways to train across thousands of smaller, globally distributed GPU clusters.
At Prime Intellect, we're committed to addressing this gap by building infrastructure for decentralized AI development at scale. Our platform aggregates global compute resources and enables researchers to collaboratively train state-of-the-art models through distributed training across clusters.
This post explores various novel decentralized training approaches and how they can enable effective AI model training across globally distributed GPUs. We will discuss specific techniques, their trade-offs, and our development plans over the course of this year to democratize access to large-scale AI computing resources.
In a typical distributed training setup, there are three main methods for training models across multiple GPUs:

- Data parallelism: every GPU holds a full copy of the model and processes a different slice of the data; gradients are averaged across GPUs after each step.
- Tensor parallelism: individual layers (weight matrices) are split across GPUs, requiring communication during every forward and backward pass.
- Pipeline parallelism: the model's layers are split into sequential stages placed on different GPUs, with microbatches passed from stage to stage.
Each method has its disadvantages: data parallelism alone cannot train models that exceed the memory of a single GPU (or node); tensor parallelism requires a lot of inter-GPU communication; and pipeline parallelism, though effective, is complex to implement and requires advanced scheduling to avoid GPU idle time.
These methods are often combined (along with more advanced optimizations like optimizer sharding, activation recomputation, etc.) to achieve high model FLOPs utilization (MFU).
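As a concrete reference point, here is a minimal sketch of plain data parallelism with PyTorch's DistributedDataParallel; `MyModel` and `dataloader` are placeholders rather than anything from the original post. The gradient all-reduce triggered on every backward pass is exactly the per-step communication that becomes problematic once GPUs are no longer co-located.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal data-parallel sketch. MyModel and dataloader are placeholders.
dist.init_process_group(backend="nccl")
model = DDP(MyModel().cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for batch in dataloader:
    loss = model(batch).loss   # assumes a model that returns an object with .loss
    loss.backward()            # DDP all-reduces gradients across workers here
    optimizer.step()           # every worker applies the same averaged update
    optimizer.zero_grad()
```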
Decentralized training across globally distributed GPUs introduces a radical shift in how we need to approach distributed AI model training.
In the decentralized training paradigm, we have access to relatively cheap compute power, but communication between instances is costly. For instance, we could harness unused compute (e.g., cheap spot instances) from various sources, but with the drawback that these instances can be located all around the world.
This poses several technical challenges:

- Limited bandwidth and high latency between nodes, which make frequent gradient synchronization prohibitively slow.
- Heterogeneous hardware, with GPU types and speeds varying across providers and locations.
- Unreliable nodes: spot and volunteer instances can be preempted or fail at any time.
Addressing these issues requires novel approaches that reduce communication overhead and enhance the flexibility and fault tolerance of training processes. The following sections outline some of these approaches:
Recent work by Google DeepMind has introduced DiLoCo (Distributed Low-Communication training), an approach that enables the training of language models on islands of poorly connected devices. The method allows data-parallel training across these islands while requiring gradient synchronization only every ~500 steps.
In data parallelism, all workers typically synchronize their gradients after each step, just before updating the model's weights. The gradient matches the model weights in shape and size, so for a 1B-parameter model this means 2GB of data in bfloat16 that each worker needs to send at every step. With a typical bandwidth in a decentralized training setup, such as 1Gb/s (equivalent to 0.125GB/s), communicating the gradient takes at least 16 seconds, leading to significant idle time for the GPUs.
However, what if, instead of syncing the gradient at every step, you synced only every 500 steps? This would spread the 16 seconds of communication over minutes of computation, maximizing GPU utilization and reducing training time.
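A quick back-of-the-envelope calculation makes the amortization concrete (a sketch using the 1Gb/s example from above):

```python
# Communication cost of naive data parallelism vs. syncing every 500 steps.
params = 1e9                          # 1B-parameter model
bytes_per_param = 2                   # bfloat16
grad_bytes = params * bytes_per_param # 2 GB per worker per synchronization
bandwidth = 1e9 / 8                   # 1 Gb/s = 0.125 GB/s

sync_time = grad_bytes / bandwidth    # seconds per synchronization
print(sync_time)                      # 16.0

# Syncing every step adds 16 s of communication to every training step.
# Syncing every 500 steps amortizes the same 16 s over minutes of compute,
# i.e. roughly 0.03 s of communication per step on average.
print(sync_time / 500)                # 0.032
```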
Doing so is not easy, because you cannot trivially delay the gradient update. DiLoCo introduces an inner-outer optimization algorithm that combines local and global updates. Each worker independently updates its weights multiple times using a local AdamW optimizer (inner optimization). Every ~500 updates, the algorithm performs an outer optimization with a Nesterov momentum optimizer, which synchronizes the workers' pseudo-gradients (the difference between each worker's weights before and after its inner optimization phase).
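Below is a rough, per-worker sketch of this inner-outer scheme in PyTorch-style pseudocode. It is not the paper's implementation: `model`, `local_data`, `num_outer_steps`, the learning rates, and the use of `torch.distributed` for averaging the pseudo-gradients are all assumptions for illustration.

```python
import copy
import torch
import torch.distributed as dist

inner_steps = 500                                          # local steps between syncs
global_params = copy.deepcopy(list(model.parameters()))    # replica of the shared weights
inner_opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
outer_opt = torch.optim.SGD(global_params, lr=0.7, momentum=0.9, nesterov=True)

for outer_step in range(num_outer_steps):
    # Inner optimization: purely local AdamW updates, no communication.
    for _ in range(inner_steps):
        loss = model(next(local_data)).loss
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Outer optimization: the pseudo-gradient is the difference between the
    # shared weights and the locally updated weights, averaged across workers.
    for g, p in zip(global_params, model.parameters()):
        g.grad = g.data - p.data
        dist.all_reduce(g.grad, op=dist.ReduceOp.AVG)
    outer_opt.step()                                       # Nesterov momentum update
    outer_opt.zero_grad()

    # Every worker resets its local weights to the new shared weights.
    with torch.no_grad():
        for g, p in zip(global_params, model.parameters()):
            p.copy_(g)
```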
In addition to the reduced communication volume, Douillard et al. show that DiLoCo is very robust to scaling individual workers up or down and to a varying amount of total compute. This flexibility could enable a decentralized training solution to dynamically adjust the total compute based on availability and competitive pricing by adding or removing nodes during training. For example, if a training job were running across 8 globally distributed H100 nodes and a more cost-effective node became available, an orchestration solution could replace an existing, more expensive node, or decide to increase the compute power to complete the job faster.
Strengths:
Limitations:
DiLoCo faces the same limitation as traditional data parallelism: it cannot train models that exceed the memory capacity of its co-located GPUs. To address this, a follow-up work by the same authors at DeepMind, DiPaCo, extends DiLoCo to support the training of sparse models (i.e., mixture-of-experts, or MoE, models) on poorly connected islands of compute:
Sparse models (MoEs) make it possible to train gigantic models while keeping training costs manageable, as they do not use all weights during each forward and backward pass. In these models, different inputs take different paths through the model based on a routing mechanism. Traditional open sparse models (like Mixtral or Databricks' DBRX) implement token-level routing at each transformer block, which requires frequent communication for every token in the sequence at each block and therefore won't work efficiently across poorly connected devices.
To avoid that much communication, DiPaCo uses a coarse routing mechanism. Routing is done at the sequence level (in contrast to the token level), greatly reducing the amount of communication needed at each step. Additionally, the routing decisions are made offline before training, allowing the data to be pre-sharded; each worker then processes data specific to one path only.
In their experiments, they trained a sparse model with 256 possible paths, each with 150M active parameters, which outperforms a 1B-parameter dense baseline. These paths are not totally independent; otherwise, the approach would be equivalent to training 256 separate models. Instead, the paths share common blocks that are kept in sync using DiLoCo.
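As a toy illustration of what coarse, offline routing could look like in practice (the trivial hashing router, `route_fn`, and `corpus` are placeholders, not DiPaCo's actual routing model):

```python
import collections

num_paths = 256
shards = collections.defaultdict(list)

def route_fn(sequence: str) -> int:
    # Stand-in for DiPaCo's offline router (e.g. clustering sequence embeddings).
    # The key property: the decision is made once, before training,
    # at the sequence level rather than per token.
    return hash(sequence) % num_paths

# Pre-shard the corpus: each worker will only ever see the data for its path.
for sequence in corpus:
    shards[route_fn(sequence)].append(sequence)

# During training, a worker assigned to path k trains only on shards[k], while
# the blocks shared between paths are kept loosely in sync with DiLoCo.
```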
Google has successfully trained their flagship model, Gemini Ultra, using a distributed training infrastructure across multiple data centers. This training infrastructure underscores the challenges faced by even the largest organizations in maintaining all hardware co-located within a single location, highlighting the need for globally distributed training across clusters.
Specifically, Google leveraged data parallelism across multiple TPUv4 Superpods—each consisting of 4096 TPUs—located in various data centers. The best-in-class network latencies and bandwidths of Google’s infrastructure allowed them to exploit model parallelism within superpods and data-parallelism across superpods.
They do not disclose any further details on the training infrastructure setup. However, a simple back-of-the-envelope calculation based on the rumoured 1e26 hardware FLOPs used in the final Gemini Ultra training run, along with a training time of ~100 days and a reasonable hardware FLOPs utilization of 60%, would imply the use of about 18 superpods across different data centers.
(math: 1e26 FLOPs / (275 TFLOPs per TPU v4 in bfloat16 * ~4000 TPUs per superpod * 60% HFU * 100 days)
→ 1e26 / (275e12 * 4000 * 0.6 * 100 * 24 * 60 * 60) ≈ 17.5)
Assuming that Gemini Ultra is a 2T parameter MoE model, similar to GPT-4, and trained using mixed-precision with bfloat16, synchronizing gradients with a simple ring all-reduce method after each training step would require each superpod to transmit and receive 4TB of data. Without more advanced strategies to interleave communication and computation, this extensive data transfer volume would significantly slow down the training speed, even with Google’s great cross data center interconnect speeds.
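For transparency, here are both back-of-the-envelope numbers as a small calculation; all inputs are the rumoured or assumed figures from above, not disclosed values.

```python
# Rough estimate of the number of TPUv4 superpods used for Gemini Ultra.
total_flops = 1e26                    # rumoured hardware FLOPs of the final run
tpu_flops = 275e12                    # TPU v4 peak bfloat16 FLOPs
tpus_per_superpod = 4000              # rounded down from 4096 for a round number
hfu = 0.6                             # assumed hardware FLOPs utilization
seconds = 100 * 24 * 60 * 60          # ~100 days of training

superpods = total_flops / (tpu_flops * tpus_per_superpod * hfu * seconds)
print(superpods)                      # ≈ 17.5, i.e. about 18 superpods

# Gradient volume for a hypothetical 2T-parameter model in bfloat16.
grad_terabytes = 2e12 * 2 / 1e12
print(grad_terabytes)                 # 4.0 TB per synchronization
```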
Next, we explore promising decentralized training paradigms that do not rely on Google’s world-class proprietary infrastructure. These methods, such as SWARM parallelism, enable globally distributed training on poorly connected, heterogeneous and unreliable devices:
SWARM parallelism is a model-parallel training algorithm designed for devices with poor connectivity and varying reliability. Ryabinin et al. show the feasibility of training billion parameter scale LLMs on preemptible instances and a network bandwidth of less than 200Mb/s while achieving high training throughput.
This method presents a more flexible form of pipeline parallelism. The path for completing a full forward and backward pass is not fixed and may vary over time: each worker may send its output to any other worker in the subsequent stage. Faster devices receive more tasks, which prevents GPU idle time and also enables the use of heterogeneous hardware. If a worker dies, its tasks are redirected to others, making the algorithm fault-tolerant.
Paths are determined stochastically and on the fly. Each worker in a stage is placed in a priority queue based on its recorded performance in the pipeline over time. Workers that consistently complete tasks faster, whether due to better hardware or co-location with preceding workers, are more likely to be picked. If a worker fails to respond to a request, it is temporarily excluded from the queue until it re-announces itself. This dynamic allows the system to operate on preemptible instances and to adaptively rebalance nodes within the swarm.
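The scheduling logic can be pictured with a short, hypothetical sketch; the `Peer` attributes and the timeout handling below are illustrative and not the actual SWARM code.

```python
import random
from dataclasses import dataclass

@dataclass
class Peer:
    address: str
    ema_throughput: float      # running estimate of this worker's speed
    is_responsive: bool = True

def pick_next_stage_peer(peers: list[Peer]) -> Peer:
    """Sample a worker from the next pipeline stage, weighted by measured
    throughput, so faster or better-connected workers get more microbatches."""
    alive = [p for p in peers if p.is_responsive]      # failed peers are skipped
    weights = [p.ema_throughput for p in alive]
    return random.choices(alive, weights=weights, k=1)[0]

# After finishing its part of the forward/backward pass, a worker forwards the
# activations to the sampled peer; if the send times out, that peer is marked
# unresponsive and the microbatch is re-sent to another sampled peer.
```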
Additionally, they have experimented with gradient and activation quantization (to 8-bit) to further minimize the communication overhead.
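As an illustration of the idea (not SWARM's exact scheme), a simple symmetric, per-tensor 8-bit quantizer for activations or gradients could look like this:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization: 4x smaller than float32 payloads."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale          # send q (int8) plus one float scale over the network

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```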
In their experiments, the researchers simulate low-bandwidth and high-latency environments and show that it is feasible to train billion-scale transformer models with less than 200Mb/s bandwidth.
Strengths:
Limitations:
Unlike Google's proprietary efforts, SWARM parallelism is open-source, allowing anyone to begin decentralized training across clusters. Although it is primarily a research codebase not intended for production use, its contributions to the field are significant. You can explore and contribute to the project here: https://github.com/yandex-research/swarm. SWARM is built upon the hivemind project:
Hivemind is a framework for running various decentralized training algorithms. It has been used to train a text-to-image model across a pool of volunteer compute. At its core, Hivemind uses a distributed hash table (DHT), spread across all workers, to communicate metadata and keep them synchronized. The DHT is implemented using the libp2p open-source project, with an alternative version using IPFS. The GitHub repo: https://github.com/learning-at-home/hivemind.
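As a rough illustration of that core primitive, here is a minimal DHT sketch based on hivemind's documented quickstart API; exact signatures may differ between versions, so treat it as an assumption rather than a reference.

```python
import hivemind

# Start the first DHT node of a new swarm.
dht = hivemind.DHT(start=True)
print("Other peers can join via:", dht.get_visible_maddrs())

# Store and retrieve a piece of metadata shared across all workers.
dht.store("run_id", "my-decentralized-run",
          expiration_time=hivemind.get_dht_time() + 600)
print(dht.get("run_id"))

# A second peer would join the same swarm with something like:
# hivemind.DHT(initial_peers=["/ip4/<host>/tcp/<port>/p2p/<peer_id>"], start=True)
```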
Hivemind, as a framework, supports several decentralized training algorithms. The most recent one, SWARM, was discussed earlier. Originally, the project was designed with specific research contributions in mind, including Moshpit SGD, Decentralized Mixture-of-Experts, and Training Transformers Together.
If you are looking to deploy a decentralized Hivemind training run today, the simplest method is through our decentralized compute platform at app.primeintellect.ai by selecting the "Prime Intellect Hivemind" base image for deploying your workloads.
Another notable framework is Petals, which is built on top of Hivemind. It implements a variant of SWARM parallelism designed with inference and light fine-tuning in mind for several recent LLMs such as Mixtral or Llama.
Similar to SWARM parallelism, Varuna uses a combination of pipeline and data parallelism to train models over relatively slow interconnects (compared to InfiniBand connections) and also works on spot instances. However, it differs by using multi-GPU instances instead of the single V100/T4s used in SWARM parallelism, and it significantly scales the number of pipeline stages, facilitating the training of large models.
Varuna does not train in a decentralized manner across multiple clusters, but rather within a single cluster with low interconnect speeds. Additionally, it lacks some of the more recent features of SWARM parallelism, such as the "temporary randomized" pipeline approach that supports training on heterogeneous devices. Despite these differences, it is, in our view, one of the most notable developments in this field, demonstrating how to efficiently scale model training across separate spot instances to hundreds of GPUs and model sizes of up to 200B parameters.
Strengths:
Limitations:
Other complementary decentralized training techniques try to lower the communication requirements between nodes by, for example:
To date, no one has successfully scaled this research to actually train state-of-the-art models.
Models like GPT-4 or Gemini are said to be up to 2T parameter sparse models with up to million-token context windows. Leading open source LLMs like Mixtral 8x22b, Cohere’s Command R or the upcoming large Llama 3 are at the 100B+ scale. Even state-of-the-art models in other fields, like protein language models, are being scaled to 100B parameter model sizes. This is still one to two orders of magnitude more than what current decentralized training efforts have been able to achieve.
At Prime Intellect we will scale this research to train SOTA models through distributed training across multiple clusters.
To make this happen, we are leveraging distributed, low-communication training approaches that enable the training of large AI models on globally distributed resources, reducing costs and democratizing AI development.
We are focused on developing decentralized training frameworks that exhibit several advantageous properties compared to current research. These include:
We believe there should be an open-source stack that offers solutions for orchestration across multiple clusters, efficiency optimizations, handling node failures, infrastructure, monitoring, and much more. We will have more to share about our research efforts in this direction soon.
Join us in creating a free market for training and inference compute, breaking free of long-term contracts and interconnect limitations.
We have recently raised a $5.5 million seed round from an awesome group of investors and angels, such as Clem from HuggingFace and Dylan Patel from SemiAnalysis, and are actively hiring stellar AI research engineers to work on this mission.