IIIT-Delhi Institutional Repository

Simulating distributed ML training under heterogeneous network infrastructure


dc.contributor.author Temura, Arjun
dc.contributor.author Shah, Rinku (Advisor)
dc.date.accessioned 2026-04-18T07:03:04Z
dc.date.available 2026-04-18T07:03:04Z
dc.date.issued 2025-05-21
dc.identifier.uri http://repository.iiitd.edu.in/xmlui/handle/123456789/1927
dc.description.abstract There has been increasing demand to train ML models, particularly large language models (LLMs), on multiple GPUs to reduce training time and cost. However, choosing the right training configuration (for example, the number of GPUs, the parallelism technique, and the network topology) to minimise training time and maximise resource utilisation remains challenging. Distributed ML simulators help users with capacity planning and with selecting optimal configuration knobs before training. However, state-of-the-art simulators assume homogeneous compute and network infrastructure. Distributed ML training infrastructure frequently consists of heterogeneous hardware, arising from generational shifts in devices or from resource sharing in cloud environments. Several training plans have been introduced in recent years to make the best use of the available heterogeneous hardware and improve training performance. However, no existing simulation tools mimic realistic training environments for these heterogeneity-aware training strategies. In general, heterogeneity-aware training optimisations produce guided training plans that account for compute or network heterogeneity. We therefore design a heterogeneity-aware distributed ML training simulator that supports both compute and network heterogeneity. As part of our preliminary analysis, we study GPU communication flows for popular LLMs (GPT, Mixtral) on existing simulation frameworks under realistic training configurations with network heterogeneity. We observe an improvement in the completion time of the median flows under heterogeneous configurations during training. Additionally, we develop ideas for effective model partitioning strategies in the presence of heterogeneous compute. Finally, we briefly discuss the additional abstractions required for our simulator to leverage heterogeneous hardware effectively. en_US
dc.language.iso en_US en_US
dc.publisher IIIT-Delhi en_US
dc.subject Machine Learning en_US
dc.subject Large Language Models (LLMs) en_US
dc.title Simulating distributed ML training under heterogeneous network infrastructure en_US
dc.type Thesis en_US


Files in this item


There are no files associated with this item.

This item appears in the following Collection(s)

