IIIT-Delhi Institutional Repository

Simulating distributed ML training under heterogeneous network infrastructure


dc.contributor.author Temura, Arjun
dc.contributor.author Shah, Rinku (Advisor)
dc.date.accessioned 2026-04-18T07:03:04Z
dc.date.available 2026-04-18T07:03:04Z
dc.date.issued 2025-05-21
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1927
dc.description.abstract There has been increasing demand to train ML models, particularly large language models (LLMs), on multiple GPUs to reduce training time and cost. However, choosing the right training configuration (for example, the number of GPUs, the parallelism technique, and the network topology) to minimise training time and maximise resource utilisation remains challenging. Distributed ML simulators help users with capacity planning and with selecting optimal configuration knobs before training. However, state-of-the-art simulators assume homogeneous compute and network infrastructure. Distributed ML training infrastructure frequently consists of heterogeneous hardware, arising from generational shifts in devices or from resource sharing in cloud environments. Several training plans have been introduced in recent years to make the best use of the available heterogeneous hardware and improve training performance. However, no existing simulation tools mimic realistic training environments for these heterogeneity-aware training strategies. In general, heterogeneity-aware training optimisations produce guided training plans that account for compute or network heterogeneity. We therefore design a heterogeneity-aware distributed ML training simulator that supports both compute and network heterogeneity. As part of our preliminary analysis, we study GPU communication flows for popular LLMs (GPT, Mixtral) on existing simulation frameworks under realistic training configurations with network heterogeneity. We observe an improvement in the completion time of the median flows under heterogeneous configurations during training. Additionally, we develop ideas for effective model partitioning strategies in the presence of heterogeneous compute. Finally, we briefly discuss the additional abstractions required for our simulator to leverage heterogeneous hardware effectively. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Machine Learning en_US
dc.subject Large Language Models (LLMs) en_US
dc.title Simulating distributed ML training under heterogeneous network infrastructure en_US
dc.type Thesis en_US


Files in this item


There are no files associated with this item.

This item appears in the following Collection(s)

