Virtual
December 9-10

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon + Open Source Summit China 2021 - Virtual to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in China Standard Time (UTC +8). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change.
Thursday, December 9 • 12:10 - 12:45
How We Discover and Locate K8s Cluster Problems Before Users at Alibaba - Peng Nanguang, Alibaba

The ability to quickly detect and locate problems is the cornerstone of fast system recovery. Only once a problem has been detected and located can we talk about how to solve it and keep user losses to a minimum. So how do we actually detect and locate problems before users do in complex, large-scale scenarios? I will share some of our experience and practice in quickly detecting and locating problems while managing large-scale K8s clusters: how we use KubeProbe, our self-developed tool that combines general-purpose link probing with directional inspection, to meet the stability challenges of the large-scale clusters we run.
Link probing: simulate generalized user behavior and detect whether links and the system are abnormal
Directional inspection: check abnormal cluster indicators and find risk points that exist or may arise in the future
System enhancements: faster, more efficient problem discovery; root cause analysis
After a problem is found: post-checks and self-healing, Chat-Ops
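
The link-probing idea in the abstract (simulate a user action, then judge whether the path behaved normally) could be sketched roughly as follows. This is only an illustration under stated assumptions, not KubeProbe's actual design or API: the names `ProbeResult`, `run_probe`, `run_all`, and the `slow_after` threshold are hypothetical, and each probe action is a plain Python callable standing in for a real cluster operation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    """Outcome of one simulated user action (hypothetical structure)."""
    name: str
    ok: bool
    seconds: float
    detail: str


def run_probe(name: str, action: Callable[[], None],
              slow_after: float = 5.0) -> ProbeResult:
    """Run one action and classify it as ok, failed, or slow.

    `slow_after` is an assumed latency threshold in seconds; an action
    that succeeds but takes longer is still reported as abnormal.
    """
    start = time.monotonic()
    try:
        action()
    except Exception as exc:  # any error on the path counts as a failed probe
        return ProbeResult(name, False, time.monotonic() - start,
                           f"failed: {exc}")
    elapsed = time.monotonic() - start
    if elapsed > slow_after:
        return ProbeResult(name, False, elapsed, "slow")
    return ProbeResult(name, True, elapsed, "ok")


def run_all(probes: Dict[str, Callable[[], None]],
            slow_after: float = 5.0) -> List[ProbeResult]:
    """Run every probe in insertion order and collect the results."""
    return [run_probe(name, action, slow_after)
            for name, action in probes.items()]


if __name__ == "__main__":
    def create_and_delete_workload() -> None:
        # Placeholder for a real user-like operation against the cluster.
        pass

    for result in run_all({"workload-lifecycle": create_and_delete_workload}):
        print(result)
```

In a sketch like this, treating a slow success as abnormal is a deliberate choice: a path that still works but has degraded is exactly the kind of problem one would want to see before users report it.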

Speakers

Nanguang Peng

Software Engineer, Alibaba Cloud
Nanguang Peng is a platform development engineer at Alibaba Cloud, currently focusing on large-scale Kubernetes cluster management and stability engineering.



Thursday December 9, 2021 12:10 - 12:45 CST
KubeCon + CloudNativeCon Lecture Hall