Abstract:
There has been increasing demand to train ML models, particularly large language models (LLMs), on multiple GPUs to reduce training time and cost. However, choosing the right training configuration (for example, the number of GPUs, the parallelism technique, and the network topology) to minimise training time and maximise resource utilisation remains challenging. Distributed ML simulators help users with capacity planning and with selecting optimal configuration knobs before training. However, state-of-the-art simulators assume homogeneous compute and network infrastructure, whereas distributed ML training infrastructure frequently consists of heterogeneous hardware, arising from generational shifts in devices or from resource sharing in cloud environments. Several training plans have been introduced in recent years to make the most of the available heterogeneous hardware and improve training performance, yet there are no simulation tools that mimic realistic training environments for these heterogeneity-aware training strategies. Generally, heterogeneity-aware training optimisations produce guided training plans that account for compute or network heterogeneity. We therefore design a heterogeneity-aware distributed ML training simulator that supports both compute and network heterogeneity. As part of our preliminary analysis, we study GPU communication flows for popular LLMs (GPT, Mixtral) on existing simulation frameworks under realistic training configurations with network heterogeneity. We observe improved completion times for the median flows under heterogeneous configurations during training. Additionally, we develop ideas for effective model partitioning strategies in light of heterogeneous compute. Finally, we briefly discuss the additional abstractions our simulator requires to leverage heterogeneous hardware effectively.