High-Performance Computing in Deep Learning: Distributed Training Strategies for Transformer Models in Natural Language Processing
Keywords:
distributed training, transformer models, heterogeneous clusters, gradient sparsification, fault tolerance

Abstract
Distributed training of large Transformer models is increasingly conducted on heterogeneous high-performance computing (HPC) clusters, where variability in compute capacity and network topology degrades efficiency and stability. Existing systems rely on static partitioning or uniform gradient compression, leading to communication bottlenecks, suboptimal convergence, and poor fault tolerance. To address these limitations, we propose an adaptive distributed training framework that integrates topology-aware model placement, layer-wise adaptive sparsification based on gradient variance, and error feedback with hybrid parallelism. Evaluated on a 1.3-billion-parameter Transformer across 32 GPUs (including RTX 4090 and V100), our method achieves a throughput of 2,268 ± 29 samples/sec (23.1% higher than Megatron-LM) and reduces the time to reach the target validation loss (< 2.85) to 12.8 ± 0.2 hours, 12.9% shorter than Megatron-LM and 25.1% shorter than DeepSpeed ZeRO-2 (p < 0.001). Communication volume is lowered to 2.03 ± 0.02 GB/step (approximately 58% lower than Megatron-LM), and the robustness score reaches 0.92 ± 0.01. The approach maintains competitive out-of-domain perplexity (PubMed: 14.2; GitHub: 18.7) and recovers from 5% node failures within 30 ± 3 steps. These results demonstrate a practical path toward efficient, stable, and deployable large-model training in shared, heterogeneous infrastructure.
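To make the layer-wise adaptive sparsification and error-feedback components of the abstract concrete, the sketch below shows one way such a scheme can be wired into a training step. It is a minimal illustration only, assuming a PyTorch setting; the class name LayerWiseSparsifier, the variance-based keep-ratio heuristic, and all numeric defaults are assumptions for exposition, not the paper's actual implementation.

import torch

class LayerWiseSparsifier:
    """Top-k gradient sparsification with per-layer keep ratios and error feedback."""

    def __init__(self, base_ratio=0.01, min_ratio=0.001, max_ratio=0.1):
        self.base_ratio = base_ratio  # nominal fraction of gradient entries kept
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio
        self.residuals = {}           # per-layer error-feedback buffers

    def keep_ratio(self, grad):
        # Heuristic: layers whose gradients have higher variance relative to their
        # mean squared magnitude keep more entries (assumed mapping, not the paper's).
        var = grad.var().item()
        mean_sq = grad.pow(2).mean().item() + 1e-12
        scale = min(max(var / mean_sq, 0.1), 10.0)
        return min(max(self.base_ratio * scale, self.min_ratio), self.max_ratio)

    def sparsify(self, name, grad):
        # Error feedback: add the residual dropped in the previous step, then keep
        # only the largest-magnitude entries of the corrected gradient.
        residual = self.residuals.get(name, torch.zeros_like(grad))
        corrected = grad + residual
        k = max(1, int(self.keep_ratio(corrected) * corrected.numel()))
        flat = corrected.flatten()
        _, idx = torch.topk(flat.abs(), k)
        sparse = torch.zeros_like(flat)
        sparse[idx] = flat[idx]
        sparse = sparse.view_as(corrected)
        self.residuals[name] = corrected - sparse  # carry the dropped mass forward
        return sparse

# Usage: replace each layer's gradient with its sparsified version before the
# gradient exchange (a single CPU model stands in for one worker here).
model = torch.nn.Linear(128, 64)
sparsifier = LayerWiseSparsifier()
loss = model(torch.randn(32, 128)).pow(2).mean()
loss.backward()
for name, param in model.named_parameters():
    param.grad = sparsifier.sparsify(name, param.grad)

In a distributed run, the sparsified gradients (values plus indices) would be exchanged instead of dense tensors, which is where a reduction in per-step communication volume of the kind reported above would come from.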
References

1. S. Wang, H. Zheng, X. Wen, and S. Fu, "Distributed high-performance computing methods for accelerating deep learning training," Journal of Knowledge Learning and Science Technology, vol. 3, no. 3, pp. 108-126, 2024.
2. L. Chen, P. H. Lin, T. Vanderbruggen, C. Liao, M. Emani, and B. De Supinski, "Lm4hpc: Towards effective language model application in high-performance computing," In International Workshop on OpenMP, September, 2023, pp. 18-33.
3. S. Sarkar, M. F. Babar, M. M. Hassan, M. Hasan, and S. K. Karmaker Santu, "Processing Natural Language on Embedded Devices: How Well Do Transformer Models Perform?," In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, May, 2024, pp. 211-222. doi: 10.1145/3629526.3645054
4. S. Dash, I. R. Lyngaas, J. Yin, X. Wang, R. Egele, J. A. Ellis, and P. Balaprakash, "Optimizing distributed training on frontier for large language models," In ISC High Performance 2024 Research Paper Proceedings (39th International Conference), May, 2024, pp. 1-11. doi: 10.23919/isc.2024.10528939
5. Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, and D. K. Panda, "Demystifying the communication characteristics for distributed transformer models," In 2024 IEEE Symposium on High-Performance Interconnects (HOTI), August, 2024, pp. 57-65. doi: 10.1109/hoti63208.2024.00020
6. F. Zeng, W. Gan, Y. Wang, and P. S. Yu, "Distributed training of large language models," In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), December, 2023, pp. 840-847. doi: 10.1109/icpads60453.2023.00126
7. M. Aach, E. Inanc, R. Sarma, M. Riedel, and A. Lintermann, "Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks," Journal of Big Data, vol. 10, no. 1, p. 96, 2023. doi: 10.1186/s40537-023-00765-w
8. A. Rahali, and M. A. Akhloufi, "End-to-end transformer-based models in textual-based NLP," AI, vol. 4, no. 1, pp. 54-110, 2023. doi: 10.3390/ai4010004
9. P. Liang, Y. Tang, X. Zhang, Y. Bai, T. Su, Z. Lai, and D. Li, "A survey on auto-parallelism of large-scale deep learning training," IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 8, pp. 2377-2390, 2023.
10. A. Kasoju, and T. Vishwakarma, "Optimizing Transformer Models for Low-Latency Inference: Techniques, Architectures, and Code Implementations," International Journal of Science and Research (IJSR), vol. 14, pp. 857-866, 2025.
11. M. Z. Hossain, and S. Goyal, "Advancements in Natural Language Processing: Leveraging Transformer Models for Multilingual Text Generation," Pacific Journal of Advanced Engineering Innovations, vol. 1, no. 1, pp. 4-12, 2024. doi: 10.70818/pjaei.2024.v01i01.02
12. L. Chen, A. Bhattacharjee, N. Ahmed, N. Hasabnis, G. Oren, V. Vo, and A. Jannesari, "Ompgpt: A generative pre-trained transformer model for openmp," In European Conference on Parallel Processing, August, 2024, pp. 121-134. doi: 10.1007/978-3-031-69577-3_9
13. S. Zhang, X. Yi, L. Diao, C. Wu, S. Wang, and W. Lin, "Expediting distributed DNN training with device topology-aware graph deployment," IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 4, pp. 1281-1293, 2023. doi: 10.1109/tpds.2023.3243261
14. B. Hanindhito, B. Patel, and L. K. John, "Bandwidth characterization of deepspeed on distributed large language model training," In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), May, 2024, pp. 241-256. doi: 10.1109/ispass61541.2024.00031
15. Y. Wang, X. Han, W. Zhao, G. Zeng, Z. Liu, and M. Sun, "H3T: Efficient integration of memory optimization and parallelism for large-scale transformer training," Advances in Neural Information Processing Systems, vol. 36, pp. 38311-38334, 2023.

