SRE / DevOps / Kubernetes Weekly Collection #88 (Week 40, 2021)

Oct 10, 2021

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #562 October 3rd, 2021
SRE Weekly Issue #290 October 3rd, 2021
KubeWeekly #280 October 8th, 2021

DEVOPS WEEKLY ISSUE #562 October 3rd, 2021

News

A solid argument for a more specific definition of observability, and why the definition matters to solving the problems at hand.

The title is “Observability: The 5-Year Retrospective”.
The article covers the following points.
○ From the mouths of Tweeters: Observability vs. Monitoring
○ A historical taxonomy of definitions for observability
○ Observability as defined by what you can actually do with it
○ Observability must be a clear concept

A post on data-center scope monitoring, including the evolution from SNMP-based systems to streaming telemetry.

The title is “Expanding the Observable Universe🌌 (or Scalable Model-Driven Telemetry with SR Linux Custom Agents and gNMI)”.
The article covers the following points, with figures.
○ Streaming Telemetry
○ Periodic sampling versus “on change” events
○ Custom conditional “on change” alerts using agents
○ Custom agents: Edge computing for data centers
○ Prototype: The SRL Docter Agent
○ Expanding your adjacent possible with SR Linux
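
As a note to myself on the "periodic sampling versus on change" point, here is a minimal Python sketch of the difference. It is not taken from the article and does not use gNMI or SR Linux; the sample shape and the readings are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical sample shape: (timestamp in seconds, observed value).
Sample = Tuple[int, int]


def periodic(samples: Iterable[Sample], interval: int) -> Iterator[Sample]:
    """Report a sample every `interval` seconds, whether or not it changed."""
    for ts, value in samples:
        if ts % interval == 0:
            yield ts, value


def on_change(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Report a sample only when its value differs from the last reported one."""
    last = object()  # sentinel that never equals a real value
    for ts, value in samples:
        if value != last:
            yield ts, value
            last = value


# Hypothetical per-second readings of an interface state (1 = up, 0 = down).
readings = [(0, 1), (1, 1), (2, 0), (3, 1), (4, 1), (5, 1), (6, 1)]

periodic_reports = list(periodic(readings, interval=3))
change_reports = list(on_change(readings))

print(periodic_reports)  # samples at t=0, 3, 6 only
print(change_reports)    # one report per state transition
```

With these made-up readings, the 3-second periodic sampler never sees the brief "down" state at t=2, while the on-change stream reports it with fewer messages overall.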

A post on the concept of zero trust supply chain security, with a full worked example using cosign and sigstore.

The title is “Zero Trust Supply Chain Security”.
As the editor’s comment above says, it walks through an example using cosign and sigstore in the following points. The comparison at the beginning, between running docker pull or npm install and picking up a USB stick that has fallen on the roadside, is easy to picture.
○ Background
○ Zero Trust
○ Getting Started
○ The Pieces
○ Tracing a Container
○ Wrapping Up

A quick introduction to SLOs and SLIs and why you should care about them. Nice quick tips as well on where to start out.

The title is “SLOs and why you should care”.
The article covers the following points, along with the principles of SRE.
○ Ok great …. and why should I care?
○ Hmm … sounds interesting … where to start?
○ Latency
○ Traffic
○ Errors
○ Saturation
○ Happy SLO’ing!
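
To make the SLI/SLO relationship concrete for myself, here is a minimal Python sketch of a latency SLI checked against an SLO target. It is not from the article; the threshold, the target, and the observed latencies are made-up values.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Hypothetical request latencies in milliseconds.
observed = [120, 80, 95, 310, 150, 60, 700, 110, 90, 130]

sli = latency_sli(observed, threshold_ms=300)
ok = meets_slo(sli, slo_target=0.9)

print(f"SLI = {sli:.2f}, SLO met: {ok}")
```

Here 8 of the 10 hypothetical requests finish within 300 ms, so the SLI is 0.8 and a 90% target is missed.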

A post filled with hard-won observations and recommendations for improving the operability of a legacy application.

The title is “Production issues: the owl effect”.
The “owl effect” in the title refers to small bugs hidden in the system, and the article explains this theme in the following points.
○ A — On the road to a good mindset
○ B — How to scale your application?
○ C — Main lessons learned

If you’re working with Kubernetes you’re likely to be familiar with the kubectl command line tool. kubectl supports a plugin mechanism, and there are quite a few handy plugins for administrators covered in this post.

The title is “Making Kubernetes Operations Easy with kubectl Plugins”.
It introduces plugins that extend kubectl with additional subcommands.
If you would like to check out more plugins, go to the “awesome-kubectl-plugins” repository on GitHub.
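
As a reminder of how the plugin mechanism works: kubectl treats any executable on the PATH whose file name starts with "kubectl-" as a plugin, so an executable named kubectl-foo is invoked as "kubectl foo". The following is a rough Python sketch of that discovery rule, similar in spirit to "kubectl plugin list"; the function name is my own and it is not kubectl's actual implementation.

```python
import os
from typing import Iterable, List

PREFIX = "kubectl-"


def find_kubectl_plugins(path_dirs: Iterable[str]) -> List[str]:
    """Return plugin names for executables named kubectl-* in the given directories."""
    found = []
    for directory in path_dirs:
        if not os.path.isdir(directory):
            continue  # PATH often contains entries that do not exist
        for entry in sorted(os.listdir(directory)):
            full = os.path.join(directory, entry)
            if (entry.startswith(PREFIX)
                    and os.path.isfile(full)
                    and os.access(full, os.X_OK)):
                # Strip the prefix: "kubectl-foo" is run as "kubectl foo".
                found.append(entry[len(PREFIX):])
    return found


if __name__ == "__main__":
    dirs = os.environ.get("PATH", "").split(os.pathsep)
    for name in find_kubectl_plugins(dirs):
        print(name)
```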

Tools

Connaisseur is a Kubernetes admission controller that integrates container image signature verification and trust pinning into a cluster. Under the hood it supports Notary V1 and Sigstore for signing.

The GitHub page of “Connaisseur”, an admission controller that integrates container image signature verification and trust pinning into a Kubernetes cluster.
This one felt familiar, and I found that I had previously covered the article on the version 2.0 release in KubeWeekly #270 July 30th, 2021.
As of 2021/10/03, v2.1.2 is the latest version.

Damon is a terminal user interface for Nomad. It provides functionality to observe and interact with Nomad resources such as Jobs, Deployments, or Allocations.

The GitHub page of “Damon”, a terminal user interface for Nomad, as described above.

Automated Cloud Advisor is a tool for facilitating cost optimization in AWS, by collecting data for resources that are under utilized.

A web page for “Automated Cloud Advisor”, an extensible tool that collects data from underutilized resources and facilitates AWS cost optimization.
Click here for the GitHub page.

Tremor is an event processing system for unstructured data with rich support for structural pattern-matching, filtering and transformation with features like aggregation, rollups, an ETL language, and a built-in query language.

As mentioned above, the GitHub page of the “Tremor” project, an early-stage event processing system for unstructured data that provides rich support for structural pattern matching, filtering, and transformation.
Click here for the web page.
This one also felt familiar, and I found that I had previously introduced “Tremor Con 2021” as event information.

A new version of the Puppet Development Kit is in the works, providing a new set of tools for Puppet developers, starting out with Puppet Content Templates.

The GitHub page of “Puppet Content Templates (PCT)”.
It is currently in the “EXPERIMENTAL” phase, so be careful when evaluating and using it.

SRE Weekly Issue #290 October 3rd, 2021

Articles

Postmortem: Partial RavenDB Cloud outage

Despite carefully testing how they would handle this week’s expiration of the root CA that cross-signed Let’s Encrypt’s CA certificate, they had an outage. The reason? Poor behavior in OpenSSL. See the next article for a deeper explanation of what went wrong with OpenSSL.

Oren Eini — RavenDB

A postmortem of the outage in the title, which the author experienced on 2021/9/24. The cause is as described in the editor’s comment above.

Path Building vs Path Verifying: The Chain of Pain

This article explains why some versions of OpenSSL are unable to validate certificates issued by Let’s Encrypt now, even though the certificates should be considered valid.

Ryan Sleevi

Prompted by the unexpected impact of the expiration of the self-signed “AddTrust External CA Root” certificate, the author describes in detail, in the following points, the open-source libraries involved, why things went bad, why they’re still bad, and what can be done about it. The article is dated 2020/06/24.
○ Understanding The Problem
○ How to Avoid the Problem
○ More Ways to Go Wrong
○ Key Elements of a Successful Implementation
○ Built for the Internet
○ Open Source Roundup

Stop adopting multicloud to achieve application resilience, says Honeycomb’s Charity Majors

This says it all:

It turns out that the path to safety isn’t increased complexity.

Matt Asay — TechRepublic

It briefly explains that adopting multicloud for the purpose named in the title amounts to “Making hard things even harder”.

Reliability is not an engineering metric

The thrust of this article is that reliability applies to and should matter to the entire company, not just engineering. I really like the term “pitchfork alerting”.

Robert Ross — FireHydrant

The author’s conclusion is “Reliability is a business metric.”

How HTTP Keep-Alive can cause TCP race condition

Lesson learned: always make your application server’s timeout longer than your reverse proxy’s.

Ivan Velichko

It explains why the problem in the title, where clients get HTTP 502 responses, occurs and how to deal with it. The TL;DR is as follows.
○ TL;DR: HTTP Keep-Alive between a reverse proxy and an upstream server combined with some misfortunate downstream- and upstream-side timeout settings can make clients receiving HTTP 502s from the proxy.
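
The lesson in the editor’s comment can be written down as a simple configuration check. This is only a sketch under my own assumptions: the function name, the optional safety margin, and the timeout values are hypothetical and do not come from any particular proxy or application server.

```python
def keepalive_race_possible(proxy_idle_timeout_s: float,
                            upstream_idle_timeout_s: float,
                            margin_s: float = 0.0) -> bool:
    """Return True when the upstream may close an idle keep-alive connection
    while the proxy still considers it reusable.

    If the upstream's idle timeout is not longer than the proxy's (plus an
    optional safety margin), the proxy can send a request on a connection the
    upstream is closing at that moment, and the client gets a 502.
    """
    return upstream_idle_timeout_s <= proxy_idle_timeout_s + margin_s


# Hypothetical settings, in seconds.
risky = keepalive_race_possible(proxy_idle_timeout_s=60, upstream_idle_timeout_s=5)
safe = keepalive_race_possible(proxy_idle_timeout_s=60, upstream_idle_timeout_s=75)

print(f"upstream 5s behind proxy 60s, race possible: {risky}")
print(f"upstream 75s behind proxy 60s, race possible: {safe}")
```

The idea is that the side that reuses connections (the proxy) should give up on an idle connection first, so it never writes to a connection the other side has already decided to close.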

The strange beauty of strange loop failure modes

Who deploys the deploy tool? The deploy tool, obviously — unless it’s down.

Lorin Hochstein

As the editor’s comment above suggests, it describes the importance of having, and practicing, a way to roll back that does not depend on the deploy tool (here, the Spinnaker UI) in case the deploy tool itself goes down.

Partitioning GitHub’s relational databases to handle scale

Their approach: group tables into “schema domains”, make sure that queries don’t span schema domains, and then move a schema domain to its own separate database cluster.

Thomas Maurer — GitHub

GitHub was built on Ruby on Rails over a decade ago, with a single MySQL database storing most of its data. To support growth and ever-evolving resiliency requirements, the article describes how, since 2019, GitHub has carried out a plan to improve its tooling and its ability to partition relational databases.
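
To picture the “schema domains” rule from the editor’s comment, here is a minimal Python sketch of a check that a query’s tables all belong to one domain. The table names, the mapping, and the function names are hypothetical illustrations, not GitHub’s actual schema or tooling.

```python
from typing import Dict, Iterable, Set

# Hypothetical table-to-domain mapping; names are illustrative only.
SCHEMA_DOMAINS: Dict[str, str] = {
    "users": "users",
    "user_emails": "users",
    "repositories": "repositories",
    "issues": "repositories",
    "gists": "gists",
}


def domains_for(tables: Iterable[str]) -> Set[str]:
    """Return the set of schema domains touched by the given tables.

    Raises KeyError for a table that has not been assigned to a domain.
    """
    return {SCHEMA_DOMAINS[table] for table in tables}


def spans_domains(tables: Iterable[str]) -> bool:
    """True when a query touches tables from more than one schema domain."""
    return len(domains_for(tables)) > 1


same_domain = spans_domains(["users", "user_emails"])
cross_domain = spans_domains(["users", "issues"])

print(f"users + user_emails spans domains: {same_domain}")
print(f"users + issues spans domains: {cross_domain}")
```

Once no query spans domains, each domain’s tables can in principle be moved to a separate database cluster without breaking joins or transactions, which is the approach summarized above.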

Groot: eBay’s Event-graph-based Approach for Root Cause Analysis

Groot is about helping figure out what’s wrong during an incident, not about analyzing an incident after the fact. I totally get why they need this tool, since they have over 5000 microservices!

Hanzhang Wang — eBay

It introduces “Groot”, a framework that delivers strong coverage and performance across various incident triage scenarios and outperforms other state-of-the-art root cause analysis techniques.

SRE is not a monolithic role

SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly.

Ash P — Cruform

It refutes widespread misconceptions about SREs among senior stakeholders.

Outages

Heroku
(also this one) Heroku had a major outage that coincided with an Amazon EBS failure in a single availability zone in us-east-1. Customers of Heroku such as Dead Man’s Snitch were impacted.
Slack
Slack had a big disruption related to DNSSEC. Here’s an interesting analysis of what may have gone wrong (link).
Let’s Encrypt
Let’s Encrypt saw heavy traffic as everyone clamored to renew their certificates, causing certificate issuance to slow down.
Microsoft 365
Apple’s “Find My” service
Signal
Xero
This one coincided with the same Amazon EBS outage mentioned above. Xero also had another outage on October 1.

KubeWeekly #280 October 8th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Last chance to register for KubeCon + CloudNativeCon North America!

This fall’s biggest event is almost here, KubeCon + CloudNativeCon North America 2021 at the Los Angeles Convention Center (LACC) or virtually from anywhere in the world.

Did you know that this year’s event will bring together 200+ sessions, 70+ project maintainer presentations, 12 cloud native tracks, and 29 co-located events? Not to mention a variety of experiences including diversity and inclusion and interactive sessions (with both in-person and virtual options.)

Get ahead of the game and begin planning your schedule today! Not registered yet? Don’t worry, there is still space available! Check out your in-person and virtual options.

Please note: KubeWeekly will be on a brief break next week for KubeCon + CloudNativeCon North America 2021 and the week after. We will resume publishing on October 29.

KubeCon + CloudNativeCon North America 2021 will be held soon. If you haven’t planned which sessions, presentations, or events to attend, you can do it right now. As for me, no, I haven’t yet.
As mentioned above, KubeWeekly will take a two-week break, the week of KubeCon + CloudNativeCon North America 2021 and the week after, so it will resume on October 29th.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Kubernetes 1.22 release

Savitha Raghunathan, James Laverack, & Jesse Butler, Kubernetes 1.22 Release Team

A 68-minute session explaining important changes and new features in the Kubernetes 1.22 release, as well as project-wide news and updates.

Next generation observability using open source monitoring

Scott Fulton, Opscruise

A 61-minute session that explains how to achieve observability using open source tools, covering the following three points.

Get deep insights into your application from open-source CNCF monitoring
Leverage real-time analytics for proactively detecting, isolating and resolving problems, and
Learn how Ops teams can stay on top of your modern applications and infrastructure

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Global load balancer approaches

Sanjeev Rampal and Raffaele Spazzoli, Red Hat

It raises one of the considerations when using Kubernetes or OpenShift in a multi-cluster (possibly hybrid cloud) deployment. It describes a global load balancer as a way to forward traffic to applications deployed across these clusters.

Announcing Linkerd 2.11: Policy, gRPC retries, performance improvements, and more!

William Morgan, Linkerd

Marking the release of Linkerd 2.11, it announces the free “Upgrading to Linkerd 2.11” workshop to be held on Thursday, Oct 23rd, 9am PT, and explains the changes in this release and what comes next.

Making Kubernetes operations easy with kubectl plugins

Martin Heinz

Since it is covered in DEVOPS WEEKLY #562 above, I will skip it.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Flux October 2021 update

Daniel Holbach, Flux

The October update from Flux. It recaps September in the following items.
○ Flux Project Facts
○ News in the Flux family
○ Upcoming events
○ In other news
○ Over and out

Do your demos like a boss at KubeCon

Alex Ellis

It briefly explains the origins of the live conference demos, some of the people who do them best, why having traffic to localhost may be beneficial to your talk and how you could go about getting real traffic into your local applications.

Services don’t have to be eight-9s reliable, with Liz Fong-Jones

Kongcast

A 38-minute podcast session that introduces the concept of an SLO’s error budget and explains how to accelerate observable software delivery.
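
To get a feel for what “eight-9s” would mean as an error budget, here is a small Python sketch that converts a number of nines into allowed downtime. It is my own arithmetic, not something from the podcast; the 30-day window is an assumption.

```python
def allowed_downtime_seconds(nines: int, window_days: int = 30) -> float:
    """Error budget, as seconds of full downtime, for an availability SLO
    with the given number of nines over the window.

    An SLO of N nines allows a failure fraction of 1 / 10**N.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds / 10 ** nines


for n in (2, 3, 4, 8):
    budget = allowed_downtime_seconds(n)
    print(f"{n} nines over 30 days: {budget:.5f} seconds of downtime")
```

Three nines over 30 days leaves about 43 minutes of budget, while eight nines leaves only a few hundredths of a second, which helps explain the title.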

Kubernetes cluster API reaches production readiness with version 1.0

Cluster API team

Cluster API v1.0 is production-ready and is officially moving to the v1beta1 API. It also briefly covers points worth checking in the FAQ section and what comes next.

Upcoming CNCF Online Programs

No Online Programs will be hosted next week due to KubeCon + CloudNativeCon North America 2021! We hope to see you there.

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How about those articles? Are you interested in any of them?

Actually, there is some content here that I cannot digest at this stage, so I’ll use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara