TL;DR: Workload imbalance is one of the major problems in training long-context LLMs. Imbalance across data parallel (DP) and pipeline parallel (PP) workers introduces stragglers and bubbles that cause severe slowdowns, and the problem grows worse as we scale to longer context lengths or more GPUs.

In this blog post, we show how core attention disaggregation can fundamentally eliminate this imbalance and achieve near-linear scaling for long-context LLM training. We also build a system prototype, DistCA, which achieves up to a 1.35× speedup over state-of-the-art training systems.