Going to KubeCon in Search of Reliability Talks? The Ultimate Guide.

If you’re into SLOs and you’re going to KubeCon + CloudNativeCon North America on November 17 through 20, this is the guide for you. We scoured the agenda and picked out the talks we think will be most interesting and valuable.

An SLO-Driven Approach to Enhance Kubernetes Cluster Reliability
Wednesday, November 18 • 2:45pm – 3:20pm PT

Summary: This talk first outlines the philosophy behind the SLO-driven approach to reliability engineering, followed by a deep dive into how SREs define SLOs for one of the world’s largest Kubernetes clusters at Ant Financial.

Comment: Looks like a great intro to SLOs, with hands-on examples for Kubernetes clusters.

Evolution of Metric Monitoring and Alerting: Upgrade Your Prometheus Today – Bartlomiej Płotka, Red Hat, Björn Rabenstein & Richard Hartmann, Grafana Labs, & Julius Volz, PromLabs
Wednesday, November 18 • 12:00pm – 12:35pm PT

Summary: Prometheus maintainers will introduce you to the universe of reliable monitoring and alerting with metrics via Prometheus, with specific and actionable examples. After that, we will make sure more experienced users can learn as well by explaining advanced Prometheus usage patterns and the new, useful features available in the newest versions.

Comment: Learn about monitoring using Prometheus and Thanos, essential tools for any Kubernetes user.

A High-Schooler’s Guide to Kubernetes Network Observability – Drew Ripberger, Nirmata
Wednesday, November 18 • 12:00pm – 12:35pm PT

Summary: Before his internship, Drew had not used Kubernetes, or even heard of eBPF or Prometheus. This talk will take you through the creation of kube-netc and his journey from hacks and workarounds to utilizing everything that the CNCF ecosystem has to offer.

Comment: Look at Kubernetes and Prometheus through the fresh eyes of a high-school beginner.

Eating Your Vegetables: How to Manage 2.5 Million Lines of YAML – Daniel Thomson & Jesse Suen, Intuit
Wednesday, November 18 • 12:00pm – 12:35pm PT

Summary: This session will explain our journey, the hard lessons we learned managing YAML at scale, and where Intuit thinks the future of Kubernetes configuration management needs to head.

Comment: All that YAML can be daunting. How do you keep track of it all?

Supercharged Analytics for Prometheus Metrics with Spark, Presto, & Superset – Rob Skillington & Gibbs Cullen, Chronosphere
Thursday, November 19 • 12:45pm – 1:20pm PT

Summary: We’ll walk through a working example that runs Superset and Presto in Docker, connected to a remote Prometheus, to perform advanced SQL queries of arbitrary size reliably without timeouts. We’ll also demo joining metrics data, using the Kubernetes node name Prometheus label, to detailed Kubernetes object metadata (events, pods, etc.) collected by Fluentd, using a simple SQL join thanks to Presto’s query federation capabilities.

Comment: Advanced SQL over Prometheus metrics helps with creating rich, customer-focused SLOs.

How the OOM-Killer Deleted My Namespace, and Other Kubernetes Tales – Laurent Bernaille, Datadog
Thursday, November 19 • 1:50pm – 2:25pm PT

Summary: In this talk, Laurent and Tabitha will share some of their stories, including a favorite: how a complex interaction between familiar Kubernetes components allowed an OOM-killer invocation to trigger the deletion of a namespace.

Comment: War stories of unreliability are fun, and a great way to learn.

Whatever Can Go Wrong, Will Go Wrong – Rook/Ceph and Storage Failures – Sagy Volkov, Red Hat
Thursday, November 19 • 1:50pm – 2:25pm PT

Summary: In this presentation we’ll go over the basics of storage demands (RPO/RTO), how different types of replication in Ceph impact our recovery time, and how component failures, such as a drive, node, or cluster failure, determine how long we are at risk. We’ll include a live demo of a Rook/Ceph recovery process from a failed component. We’ll show which components of Rook are recreated, how Ceph behaves while components and pods are recreated, and what the impact on the application is while these failures occur (in our case, the application will be MariaDB).

Comment: Live demo of failure and recovery!

GitOps Is Likely More Than You Think It Is – Cornelia Davis, Weaveworks
Friday, November 20 • 12:10pm – 12:45pm PT

Summary: In this session Cornelia will cover the four key principles of GitOps, and she’ll demo those concepts with specific tools including Flux. She’ll also talk about use cases including cluster-api (CAPI).

Comment: Looks like a great intro to GitOps, starting with the four key principles.


Image Credit: Frank Eiffert on Unsplash
