December 9-10
Friday, December 10 • 14:05 - 14:40
ML Training Acceleration with Heterogeneous Resources in ByteDance - Deliang Fan & Tao Xin, ByteDance

ByteDance runs a large number of deep learning training jobs on vast CPU and GPU resources of many types and specifications. Using these heterogeneous resources effectively is a critical problem, especially for large-scale distributed models. This talk discusses how to accelerate model training from a system perspective by making full use of heterogeneous resources at ByteDance. The main work includes: 1. Empowering model training by fully utilizing GPU resources through multiple GPU sharing mechanisms. 2. A deep dive into NUMA-affinity resource allocation (covering CPU, memory, GPU, and NIC) for better training performance. 3. Integrating an RDMA CNI for high-throughput network communication using Intel SR-IOV technology.

Deliang Fan

Tao Xin

Software Engineer, ByteDance

Friday December 10, 2021 14:05 - 14:40 CST
KubeCon + CloudNativeCon Lecture Hall