December 9-10

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit China 2021 - Virtual to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in China Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change.
Thursday, December 9 • 14:05 - 14:40
Bagua: Lightweight Distributed Learning on Kubernetes - Xiangru Lian & Xianghong Li, Kuaishou

Bagua is a project developed by Kuaishou Technology and ETH Zürich to support high-performance distributed deep learning on Kubernetes without requiring special network devices or restrictive scheduling. Thanks to Bagua's innovative communication algorithms and its integration with Kubernetes, users can scale training horizontally, with an excellent speedup guarantee, on a Kubernetes cluster connected by ordinary Ethernet. Bagua's effectiveness has been validated across a range of scenarios and models, including ResNet on ImageNet, BERT-Large, and very large industrial applications at Kuaishou, such as recommendation model training with dozens of TB of parameters, video/image understanding with more than 1 billion images/videos, and automatic speech recognition (ASR) with TB-level datasets. In a production Kubernetes cluster, Bagua can outperform PyTorch-DDP, Horovod, and BytePS in end-to-end training time by a significant margin (up to 1.95×) across a diverse range of tasks.

Xiangru Lian

Senior Staff Research Scientist, Kuaishou Technology

Xianghong Li

Senior Architect, Kuaishou Technology
Xianghong Li currently serves as a senior architect at Kuaishou Technology, focusing on a cloud-native machine learning platform based on Kubernetes and on large-scale AI system performance acceleration solutions, in order to help algorithm engineers deploy production-ready machine learning...

Thursday December 9, 2021 14:05 - 14:40 CST
KubeCon + CloudNativeCon Lecture Hall