SRE / DevOps / Kubernetes Weekly Collection #93 (Week 45, 2021)

Nov 14, 2021

In this blog post series, I collect items from the following three weekly mailing lists that I subscribe to, adding some comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #567 November 7th, 2021
SRE Weekly Issue #295 November 7th, 2021
KubeWeekly #283 November 12th, 2021

DEVOPS WEEKLY ISSUE #567 November 7th, 2021

News

An excellent post on the subtleties of building trust in systems, including the technical systems and the people that make complex software work.

The title is “In our systems we trust”.
It explains how they maintain user and stakeholder trust in their systems and products in an environment where failure is part of the journey, and how team members build trust internally by feeling confident in the quality of each other’s work. The story covers the following points.
○ Health of code
○ Health of relation
○ Let’s talk about trust.
○ A word of advice
○ Final thoughts

Another great post, this one on a long term effort to reduce the cost and improve the developer experience of a large, and growing, CI infrastructure.

The title is “Infrastructure Observability for Changing the Spend Curve”.
It is a deep dive into how they achieved an order-of-magnitude change in their spend (a 10x reduction compared to baseline growth) over the last two years through iterative understanding and changes in Slack’s Continuous Integration (CI) infrastructure.
The following three ideas described in Takeaways and Hacklang were interesting.

Adaptive capacity to decrease the cost of each test by changing the infrastructure runtime.
Circuit breakers to decrease the number of tests by changing the infrastructure workflow.
Pipeline changes to decrease the number of tests by changing our user workflows.
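
As a note for myself, the circuit-breaker idea can be sketched roughly as follows. This is only a minimal illustration in Python, not Slack's actual implementation: the class name, the threshold of consecutive failures, and the list of test outcomes are all hypothetical.

```python
class CircuitBreaker:
    """Stops dispatching work after too many consecutive failures (hypothetical sketch)."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        # "Open" means the breaker has tripped and no more work should run.
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success):
        # A success resets the streak; a failure extends it.
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


def run_tests(outcomes, breaker):
    """Run queued test jobs until the breaker opens; return how many were executed."""
    executed = 0
    for ok in outcomes:
        if breaker.is_open:
            break  # skip the remaining tests instead of paying for them
        breaker.record(ok)
        executed += 1
    return executed
```

With outcomes such as one pass followed by three failures, the remaining jobs in the queue are skipped, which is how the number of executed tests (and their cost) goes down.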

A new programming language based on Lua! Luau is described as a fast, small, safe, gradually typed embeddable scripting language. Lots of use cases for this, I hope it attracts an active community.

An introductory article on “Luau”, a fast, small, safe, and gradually typed embeddable scripting language derived from Lua. Roblox game developers use it to write game code, and the company’s engineers use it to implement most of the user-facing application code and parts of the editor (Roblox Studio) as plugins. It has now been open-sourced.
Click here for the GitHub page.

CI systems are so key to modern software development that some companies develop their own custom solutions. Not something most folks should do, but an interesting post from one team that took the custom approach.

The title is “Developing Databricks’ Runbot CI Solution”.
It explores the motivations behind developing their own CI solution “Runbot”, the core design decisions that went into it, and how they used it to greatly improve the experience of all the developers within the Databricks engineering organization.

A handy list of observability themed talks from the recent KubeCon event. Compiled by https://monitoring.love.

A playlist, “KubeCon 2021 o11y Talks”, that collects the observability sessions from KubeCon + CloudNativeCon NA 2021.

Nginx was always one of my favourite bits of software to manage. This post looks into how to monitor it and provides an overview of various mainly SaaS solutions that can help, written by one of those SaaS providers.

The title is “NGINX Monitoring: Best Tools and Key Metrics You Should Know About”.
The explanation focuses on NGINX’s Key Metrics and “The Top 7 NGINX Monitoring Tools” below.

Sematext
Prometheus and Grafana
New Relic
Datadog
AppDynamics
SolarWinds Server & Application Monitor
Dynatrace

A post introducing the monitoring golden signals (latency, traffic, errors and saturation) from first principles.

The title is “Golden Signals — Monitoring from first principles”.
The first article in a three-part blog series. It describes the four major SRE golden signals for metric-driven measurements and their role in the overall context of monitoring.
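
To make the four signals concrete for myself, here is a minimal sketch in Python that derives them from a batch of request records. The record fields, the choice of a nearest-rank 95th percentile for latency, and CPU usage as the saturation measure are my own assumptions for illustration, not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Compute the four golden signals for one observation window.

    Assumes `requests` is non-empty and all requests fall inside the window.
    """
    latencies = sorted(r.latency_ms for r in requests)
    n = len(latencies)
    # Nearest-rank p95 using integer arithmetic: rank = ceil(0.95 * n)
    rank = (95 * n + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n,
        "saturation": cpu_used / cpu_capacity,
    }
```

In practice these values would come from a metrics system rather than from raw records, but the definitions stay the same.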

Tools

A useful public GitHub template for bootstrapping an AWS EKS cluster using Terraform. Good accompanying blog post as well about the usefulness of such boilerplate templates.

An article on “Boilerplate for a basic AWS infrastructure with EKS cluster” using Terraform.
Click here for the GitHub page.

SRE Weekly Issue #295 November 7th, 2021

Articles

MTTR is a Misleading Metric — Now What?

I love this crystal clear argument based on statistics and research. MTTR as a metric is simply meaningless.

Courtney Nash — Verica

Part 2 of a series that focuses on each of the key findings of the VOID Report 2021. It makes the argument stated in the title and in the editor’s comment above, and considers what to use instead of MTTx metrics.
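
One reason a mean such as MTTR can mislead is that incident durations tend to be heavily skewed, so a single long outage dominates the average. The following is a minimal sketch in Python with made-up durations (they are not data from the VOID report) that shows how far the mean drifts from the median.

```python
from statistics import mean, median


def summarize(durations_minutes):
    """Return the mean and median of a list of incident durations."""
    return {"mean": mean(durations_minutes), "median": median(durations_minutes)}


# Hypothetical incident durations in minutes: six short ones and one long outlier.
durations = [12, 15, 18, 20, 25, 30, 580]

with_outlier = summarize(durations)
without_outlier = summarize(durations[:-1])
```

Here the mean with the outlier is 100 minutes while the median is 20 minutes, and dropping the single long incident brings the mean down to 20 minutes. A "mean time" that moves this much because of one event says little about the typical incident.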

Five steps to better customer communication

Their steps for better communication during an outage:

* Provide context to minimise speculation
* Explain what you’re doing to demonstrate you’re ‘on it’
* Set some expectations for when things will return to normal
* Tell people what they should do
* Let folks know when you’ll be updating them next

Chris Evans — incident.io

As the title says, the explanation focuses on the five steps that the editor excerpted above.

Heroku Incident 2365 Follow-Up

Despite checking in advance to be sure their systems would support the new Let’s Encrypt certificate chain, they ran into trouble.

[…] we discovered that several HTTP client libraries our systems use were using their own vendored root certificates.

Heroku

A retrospective on the incident that occurred on September 30, 2021. It was triggered by the expiration of the old root certificate in the Let’s Encrypt certificate chain.

Multicloud failover is almost always a terrible idea

This is the best case I’ve seen yet against multi-cloud infrastructure. I really like the airline analogy.

Lydia Leong

As the title says, it explains that the huge cost and complexity of a multi-cloud implementation are effectively a negative distraction from what you should actually be doing to improve your uptime and reduce your risks.

An Update on Our Outage — Roblox

Roblox had a major, several-day outage starting on October 28. I don’t usually include game outages in the Outages section since they’re so common and there’s not usually much information to learn from, but I sure do like a good post-incident report. Thanks, folks!

David Baszucki — Roblox

In line with Roblox’s key value of “Respect the Community”, the outage update states that they will continue to be transparent in their post-mortem.

40 Ms Bug

When you’re sending small TCP packets, two optimizations can conspire to introduce an artificial 40 millisecond (not megasecond…) delay.

Vorner

It shares the story of tracking down a bug that occurred in the production environment of an application written in Rust, covering the following points.
○ A bit of backstory
○ The increased latencies
○ It was acting really weird
○ The benchmarks
○ Configuration options
○ Overriding the defaults of http1_writev
○ Splitting vectored writes
○ Nagle’s algorithm
○ Ok, but why 40ms?
○ Conclusion
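
The two optimizations that typically interact this way are Nagle's algorithm on the sender and delayed ACKs on the receiver. One common mitigation is to disable Nagle's algorithm with the TCP_NODELAY socket option. The sketch below shows this in Python only as an illustration; the article's application is written in Rust and the author may have settled on a different fix, and the helper names are hypothetical.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 sends small writes immediately instead of
    # waiting to coalesce them with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock):
    """Return True if TCP_NODELAY is set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

Disabling Nagle's algorithm trades a few more small packets for lower latency, so batching writes in the application is another option worth considering.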

Google Incident report — Meet

Here’s Google’s follow-up report for their October 25–26 Meet outage.

If you are directed to the English page, click here for the information in Japanese or other languages.
The improvements identified are as follows.
○ Increase resource allocation for the backend message delivery system in the short term, and automatically detect message delivery overload in the long term.
○ Enhancements to monitoring systems to capture real time data on quality of livestreams for all volumes of traffic.
○ Alert logic updates to capture spikes in rebuffering rate proactively to help mitigate before any customer impact.

/r/sre — How to deal with retries in SLIs

Should you count failed requests toward your SLI if the client retries and succeeds? A good argument can be made on either side.

u/Sufficient_Tree4275 and other Reddit users

As the editor comments above, it is interesting to see both sides of the argument about how to treat retries in SLIs.
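
To see how much the choice matters, here is a minimal sketch in Python that computes an availability SLI both ways from the same data. The data shape (each logical request is a list of attempt outcomes) and the function names are my own assumptions, not something proposed in the thread.

```python
def sli_per_attempt(requests):
    """Count every attempt, so failed attempts hurt the SLI even if a retry succeeds."""
    attempts = [ok for request in requests for ok in request]
    return sum(attempts) / len(attempts)


def sli_per_request(requests):
    """Count a logical request as good if any of its attempts succeeded."""
    good = sum(1 for request in requests if any(request))
    return good / len(requests)


# Hypothetical traffic: two first-try successes, one success after a retry,
# and one request that failed twice and gave up.
traffic = [[True], [True], [False, True], [False, False]]
```

For this sample, the per-attempt SLI is 0.5 and the per-request SLI is 0.75, so the two definitions can tell quite different stories about the same period.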

What the SRE team wants to achieve with the development team

Mercari restructured its SRE team, moving toward an embedded model to adapt to their growing microservice architecture.

ShibuyaMitsuhiro — Mercari

This is the English translation of a blog post published on January 29, 2021. About 10 months have passed since it was written, so I think they have moved on to the next stage, and I am curious about the current situation.
○ This article is a translation of the Japanese article written on Jan. 29th, 2021.

Episode 1: Honeycomb and the Kafka Migration — The VOID

There’s a really great discussion in this episode about leaving slack in the system in the form of bits of capacity and inefficiency that can be drawn upon to buy time during an outage.

Courtney Nash, with guests Liz Fong-Jones and Fred Hebert — Verica

A podcast of approximately 31 minutes that goes behind the scenes of Honeycomb’s meta-analysis of a series of outages related to the Kafka architecture migration. Beyond the editor’s comments above and the specific technical details, it covers the following.
○ Complex socio-technical systems and the kinds of failures that can happen in them (they’re always surprises)
○ Transparency and the benefits of companies sharing these outage reports
○ Safety margins, performance envelopes, and the role of expertise in developing a sense for them
○ Honeycomb’s incident response philosophy and process
○ The cognitive costs of responding to incidents
○ What we can (and can’t) learn from incident reports

Why a ‘Reliability Mindset’ Must Be Adopted Beyond SRE

Here’s how non-SREs can use SRE principles to improve their systems.

Laurel Frazier — Transposit

The explanation focuses on the following six ways in which non-SREs can begin to adopt a “reliability mindset”.

Be prepared
Embrace automation
Let the data do the talking
Debrief without blame
Close the feedback loop
Be customer-centric

Outages

Facebook, Messenger and Instagram
Or Meta or whatever.
Google Nest

KubeWeekly #283 November 12th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

How Pokemon Go creator builds on Kubernetes for developers

B. Cameron Gain, The New Stack

In this latest episode of The New Stack Makers podcast, Ria Bhatia, senior product manager of Niantic, discusses why the Pokemon Go platform remains relevant today to developer customers and why Kubernetes will remain an integral part of the platform.

It explains why the Pokemon Go platform remains relevant and why Kubernetes will remain an integral part of the platform as the company hopes to bring in more “developer customers”. The podcast and a YouTube video are both embedded.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Breaking tradition: The future of package management with Kubernetes

Aaron Hurley & Dmitriy Kalinin, VMware

A one-hour session that explores in depth the idea introduced in the keynote of the same name at KubeCon + CloudNativeCon NA 2021.
It explains in detail how the Carvel project team is re-imagining package management for Kubernetes to bring you a modern, declarative way to automate end-to-end lifecycle management of packaged applications and their dependencies.

Improve core-to-edge mobility and resiliency for cloud native applications

Ben Morrison, Trilio

A 53-minute session that introduces how to achieve mobility and resiliency for cloud-native apps at the edge, covering the following points.
○ How to further simplify deployment and management of cloud-native applications to improve resiliency and availability for edge clouds and help customers better curate their data for competitive advantage.
○ How to protect and migrate workloads between core and edge using enterprise and lightweight Kubernetes with data management tooling.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

How Krateo PlatformOps integrates Backstage

Diego Brag, Kiratech

As the title suggests, it introduces how the open source project “Krateo PlatformOps”, which is based on a comprehensive and modular architecture for centrally creating and managing all types of resources, integrates “Backstage”, an open platform created by the Spotify IT team. It explains the value and importance of this integration. Backstage is a CNCF sandbox project.

Kube-lineage: A CLI tool for visualizing Kubernetes object relationships

Justin Toh

It introduces the CLI tool “kube-lineage” that visualizes the relationships between Kubernetes objects.

Multifactor SSO authentication for Postgres on Kubernetes

Jonathan Katz, Crunchy Data

PostgreSQL 12 made it possible to provide multi-factor authentication for databases, and this article explains how to set it up.

Flux security audit has concluded

Daniel Holbach, Weaveworks

It shares the results of the now-completed security audit, by CNCF and OSTIF (the Open Source Technology Improvement Fund), of “Flux”, a CNCF incubating project.
The main purpose of the audit was to assess Flux’s basic security posture and identify the next steps in its security story.

Horizontal pod autoscaling with custom metrics in Kubernetes

Natalie Serrino, Pixie

As the title suggests, it explains how to autoscale a Kubernetes Deployment using kind: HorizontalPodAutoscaler with custom application metrics.
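
For reference, the Kubernetes documentation describes the core scaling rule of the Horizontal Pod Autoscaler as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). Below is a simplified sketch of that rule in Python. It deliberately ignores min/max replica bounds, stabilization windows, and pods that are not ready, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Simplified version of the HPA scaling rule for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band, the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, with 4 replicas, a custom metric averaging 150 requests per second per pod, and a target of 100, the rule asks for 6 replicas. At an average of 105 the ratio is inside the tolerance, so the count stays at 4.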

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Announcing the 2021 Steering Committee election results

Kaslin Fields, Google

As the title suggests, it announces the results of the 2021 election for the Steering Committee, whose members are elected from the Kubernetes community for two-year terms. It introduces the four new and re-elected members from this election and the three continuing members who were not up for election, and thanks the people involved.

Kubernetes podcast from Google: Knative 1.0, with Ville Aikas

Craig Box & Jimmy Moore

The Kubernetes Podcast by Google employees. This time the host is Craig Box, with guest host Jimmy Moore.
To mark the release of Knative 1.0, they welcome Ville Aikas, a member of Knative’s Steering Committee and co-founder of Chainguard Inc.
This time they go straight to the guest interview without the news of the week.

Security microservices, configuration and observability take the stage at KubeCon NA 2021

Patrick Nelson, SiliconANGLE

It explains the following five important points from the keynotes and security sessions of KubeCon + CloudNativeCon NA 2021.

Modern security practices take hold
Configuration is more important in elaborate environments than cyberattack prevention
Supply chain hacks are escalating, and in the spotlight
Streamlining app deployment to Kubernetes
Costs getting reined in

Key takeaways from KubeCon: deeper focus on FinOps, GitOps

Charlotte Dunlap, GlobalData

It excerpts and explains the key takeaways related to FinOps and GitOps from KubeCon + CloudNativeCon NA 2021.
There are two Summary Bullets:
○ The Open Source Security Foundation (OpenSSF), a new group focused on software security supply chain problems, added $10 million in vendor funding.
○ Google Cloud recently joined the FinOps Foundation, representing the first major cloud provider to commit.

Kubernetes operators: Cruise control for managing cloud native apps

Senthil Raja Chermapandian, Ericsson

It explains the benefits of Kubernetes Operators and gives an overall picture of them.

Upcoming CNCF Online Programs

Live Webinar

November 16 @10amPT: Intro to open source observability with Grafana, Prometheus, Loki, and Tempo presented by Richard Hartmann, Grafana Labs — RSVP

Cloud Native Live

November 17 @9amPT: Building, analyzing, optimizing, and securing containerized apps presented by Martin Wimpress, Slim.AI — RSVP

On-demand

November 18: InfluxDB + Telegraf Operator: Easy Kubernetes Monitoring presented by Pat Gaughen & Wojciech Kocjan, InfluxData — RSVP

YouTube playlist submissions

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How about those articles? Are you interested in any of them?

Actually, there is some content that I cannot digest at this stage, so I’ll use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara