The ability to find and locate problems quickly is the cornerstone of a fast-recovery system. Only once a problem has been discovered and located can we talk about how to solve it and minimize losses for users. So how do we find and locate problems before users do in complex, large-scale scenarios? In this talk I will share our experience and practice in quickly discovering and locating problems while managing large-scale Kubernetes (K8s) clusters, and how we addressed the stability challenges of those clusters by building KubeProbe, a general-purpose tool that combines link detection with directional inspection.

The talk covers:

Link detection: simulate generalized user behavior and detect whether the links and processes involved are abnormal.
Directional inspection: check the cluster for abnormal indicators and find risk points that already exist or may arise in the future.
System enhancements: the efficiency and speed of problem discovery, root cause analysis after a problem is discovered, and ChatOps.
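To make the link-detection idea concrete, here is a minimal sketch in Python. It is not KubeProbe's actual implementation, which the abstract does not describe. It assumes a probe is an ordered list of named steps that imitate what a user would do, and that the useful output is the first step that fails or runs too slowly. The names `run_link_probe`, `first_failure`, `StepResult`, and the `slow_after` threshold are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    name: str                    # step label, e.g. "create_pod" (hypothetical)
    ok: bool                     # True if the step finished without error in time
    seconds: float               # wall-clock duration of the step
    error: Optional[str] = None  # description of the failure, if any


def run_link_probe(
    steps: List[Tuple[str, Callable[[], None]]],
    slow_after: float = 5.0,
) -> List[StepResult]:
    """Run the steps in order and stop at the first abnormal one.

    A step is abnormal if it raises an exception or if it takes longer
    than `slow_after` seconds (an assumed threshold, not a KubeProbe value).
    """
    results: List[StepResult] = []
    for name, action in steps:
        start = time.monotonic()
        error: Optional[str] = None
        try:
            action()
        except Exception as exc:  # a probe should report failures, not crash
            error = f"{type(exc).__name__}: {exc}"
        elapsed = time.monotonic() - start
        if error is None and elapsed > slow_after:
            error = f"step took {elapsed:.1f}s, limit is {slow_after:.1f}s"
        results.append(StepResult(name, error is None, elapsed, error))
        if error is not None:
            break  # later steps depend on this one, so stop here
    return results


def first_failure(results: List[StepResult]) -> Optional[StepResult]:
    """Return the first failed step, or None if the whole link is healthy."""
    for result in results:
        if not result.ok:
            return result
    return None
```

In a real cluster each step would call the Kubernetes API, for example to create a pod, wait for it to run, and delete it. Stopping at the first abnormal step keeps the report pointed at where the link broke, which is the "locate" half of discover-and-locate.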
Nanguang Peng is a platform development engineer at Alibaba Cloud, currently focusing on large-scale Kubernetes cluster management and stability engineering.