Please note: This schedule is automatically displayed in China Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change.
Bagua is a project developed by Kuaishou Technology and ETH Zürich to support high performance distributed deep learning on Kubernetes without requiring special network devices and restrictive scheduling. Benefiting from Bagua's innovative communication algorithms and integration with Kubernetes, users can scale the training horizontally with excellent speedup guarantee, on a Kubernetes cluster with just ordinary ethernet connection. Bagua's effectiveness has been validated in various scenarios and models, including ResNet on ImageNet, Bert Large, and huge scale industrial applications at Kuaishou such as ● recommendation model training with dozens of TB parameters, ● video/image understanding with >1 billion images/videos, ● ASR with TB level datasets, etc. As for end to end performance, in a production Kubernetes cluster, Bagua can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time by a significant margin (up to 1.95×) across a diverse range of tasks.
Xianghong Li currently serves as a senior architect at Kuaishou Technology, focusing on cloud-native machine learning platform based on Kubernetes, and large scale AI system performance acceleration solutions, in order to help algorithm engineers deploy production ready machine learning... Read More →