SRE / DevOps / Kubernetes Weekly Collection #90 (Week 42, 2021)

Oct 24, 2021

In this blog post series, I collect the following three weekly mailing lists I subscribe to, adding some comments as an aide-memoire along with useful links. (KubeWeekly is off this week.)
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #564 October 17th, 2021
SRE Weekly Issue #292 October 17th, 2021
KubeWeekly #281 October 29th, 2021 ← 2 weeks off due to KubeCon + CloudNativeCon NA 2021

DEVOPS WEEKLY ISSUE #564 October 17th, 2021

News

A good post introducing custom validation in Terraform, and why failing early is so important with any cloud automation.

The title is “FAILING FASTER WITH TERRAFORM”.
As the editor notes above, it introduces Terraform validation from the perspective of a Terraform beginner.

Not everyone has found devops practices easy to adopt or scale, and there are often tensions within operations teams. This post summarises some of those quite well, even if I don’t believe those issues are inevitable.

The title is “Operations is not Developer IT”.
The article vividly describes the heavy load placed on operations teams by the adoption of DevOps, Docker, Kubernetes, and various vendor tools.

An interesting thought experiment considering how various large scale incidents with the gigantic AWS us-east-1 region would be handled.

The title is “Worst Case”.
As the editor notes above, it is an interesting thought experiment about what would happen if various incidents hit us-east-1, a region with a great influence on AWS.

A look under the hood of distribution software packaging, looking at the far reaching implications and challenges of upgrading OpenSSL to the latest version. Good insight into the tension between centralised distributions and distributed development.

The title is “The long-term consequences of maintainers’ actions”.
The good news is that OpenSSL 3 has entered Alpine, but it also has a downside. The article explains the package dependencies that can become a pitfall.

A solid argument that if you’re building a Terraform module you should strive to make it opinionated. Focus on use cases rather than on monolithic modules just around a particular piece of software.

The title is “Your Terraform Module Needs an Opinion”.
The author, who has strong opinions about what a Terraform module should be, explains that you should not build a Swiss Army knife, should not build a complicated wrapper, and so on.

Another post on Terraform modules, this one focused on patterns you can adopt to build useful, maintainable modules.

The title is “Terraform Module Patterns”.
Another Terraform module article. This one explains patterns you can apply to your modules, so it is worth reading together with the previous article.

Observations from an updated study on container adoption within one (large) ecosystem. Growing adoption of auto-scaling and stateful workloads, adoption of containerd and fargate, popular images and more.

The title is “10 TRENDS IN REAL-WORLD CONTAINER USE”.
As the title suggests, it explains the following 10 trends. It is presented in Datadog's usual easy-to-read style, and each trend is interesting.

Nearly 90 percent of Kubernetes users leverage cloud-managed services
Amazon ECS users are shifting to AWS Fargate
The average number of pods per organization has doubled
Host density is 3 times higher on Kubernetes than on Amazon ECS
Pod auto-scaling is becoming more popular
Organizations are deploying more stateful workloads on containers
Organizations running container environments create more monitors
Organizations are starting to replace Docker with containerd as their preferred runtime for Kubernetes
OpenShift adoption is growing rapidly
NGINX, Redis, and Postgres are the top three container images

An interesting interview with one of the founders of Kubernetes, covering some of the original philosophy of the project and other observations about software development.

The title is “Kubernetes Co-founder Joe Beda: ‘Software development is a team sport’”.
An interview with many interesting stories on topics such as Internet Explorer at Microsoft, Kubernetes, and work-life balance.

Tools

A very interesting new user interface for Kubernetes, Kui, mixes the best of CLI and GUI tools. It’s also a framework for building similar tools, so it will be interesting to see if integrations emerge here.

The GitHub page of “Kui”, a framework that enhances CLI with graphics.

Panther is an event consolidation and management application that centralizes and manages events from IT systems, networks and applications in a single console.

The GitHub page of Panther, an event integration and management app that centralizes and manages events from IT systems, networks, and applications from a single console.

age is a simple, modern and secure file encryption tool, format, and Go library. It features small explicit keys, no config options, and UNIX-style composability.

As the editor mentions above, this is the GitHub page of “age”, a simple, modern and secure file encryption tool, format, and Go library.

Kdigger is a new context discovery tool for Kubernetes, intended for discovery when conducting a penetration test. Nice documentation explaining what and why.

The GitHub page of “Kdigger” (short for Kubernetes digger), a context detection tool for Kubernetes penetration testing.
An introductory blog post is also available.

SRE Weekly Issue #292 October 17th, 2021

Articles

Four lessons every company should learn from the back-to-back Facebook outages

The lessons:

1. Acknowledge human error as a given and aim to compensate for it
2. Conduct blameless post-mortems
3. Avoid the “deadly embrace”
4. Favor decentralized IT architectures

There have been quite a few of these “lessons learned” articles that I’ve passed over, but I feel like this one is worth reading.

Anurag Gupta — Shoreline.io

Niall Murphy

As the title and the editor's comments above show, the article extracts and explains four lessons learned from Facebook’s recent outages.
It is important to have a culture and organizational atmosphere in which you can look back on an outage and have a conversation like the following:
“We’ve already paid for this outage. What benefit can we get from that expenditure?”

Worst Case

Could us-east-1 go away? What might you do about it? Let’s catastrophize!

I love catastrophizing!

Tim Bray

Since it is covered in DEVOPS WEEKLY ISSUE #564 above, I will skip it.

What Managed Kubernetes Service is Best for SREs?

When evaluating options, this article focuses on reliability, both of the service itself and the options it provides for building reliable services on it.

Quentin Rousseau — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

It explores the five most popular Kubernetes services (Amazon EKS, Azure AKS, Google Cloud GKE, SUSE Rancher, Red Hat OpenShift) and gives an overview of how they stack up on the reliability engineering front.

SRE Toolkit: Failure Domains

This one answers the questions: what are failure domains, and how can we structure them to improve reliability?

brandon willett

This is the first article in a short series named “SRE Toolkit”, with each entry being a friendly introduction to one concept the author has consistently found useful in the quest to make software sturdier.
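As a memo for myself, here is a minimal Python sketch of one way failure domains can shape a design: spreading replicas across domains (zones, racks, and so on) so that losing any single domain takes out as few replicas as possible. This is my own illustration, not code from the article, and the function names, zone names, and node names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def spread_replicas(replicas: int, nodes_by_domain: Dict[str, List[str]]) -> List[str]:
    """Pick nodes round-robin across failure domains (hypothetical helper)."""
    # Ignore domains that have no nodes; sort for a deterministic order.
    domains = sorted(d for d, nodes in nodes_by_domain.items() if nodes)
    if not domains:
        raise ValueError("no nodes available in any failure domain")
    capacity = sum(len(nodes_by_domain[d]) for d in domains)
    if replicas > capacity:
        raise ValueError("not enough nodes for the requested replicas")

    placement: List[str] = []
    round_index = 0
    while len(placement) < replicas:
        # Take at most one node from each domain per round.
        for domain in domains:
            nodes = nodes_by_domain[domain]
            if round_index < len(nodes) and len(placement) < replicas:
                placement.append(nodes[round_index])
        round_index += 1
    return placement


def worst_case_loss(placement: List[str], nodes_by_domain: Dict[str, List[str]]) -> int:
    """Return how many replicas disappear if the most loaded domain fails."""
    node_to_domain = {n: d for d, nodes in nodes_by_domain.items() for n in nodes}
    counts = Counter(node_to_domain[n] for n in placement)
    return max(counts.values(), default=0)


# Hypothetical inventory: three zones with uneven node counts.
NODES = {"zone-a": ["a1", "a2"], "zone-b": ["b1", "b2"], "zone-c": ["c1"]}
```

With this inventory, asking for three replicas puts one in each zone, so a single zone failure costs only one replica. Packing all replicas into one zone would be simpler but would make that zone a single point of failure.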

SRE top interview questions to land an SRE role

It’s a great list of questions, and it covers a lot of ground. SREs wear many hats.

Opsera

It’s a good list for preparing for an SRE role, and I think it is also useful for interviewers and for quickly checking your own understanding against your current SRE skill set.

How Time Series Databases Work — and Where They Don’t

I’ve always been curious about how Prometheus and similar time-series DBs compress metric data. Now I know!

Alex Vondrak — Honeycomb

It details how time series databases (TSDBs) work and why Honeycomb could not limit itself to a TSDB-based implementation.
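The editor's comment mentions how Prometheus-style databases compress metric data. As a memo for myself, one technique commonly used for timestamps in this family of databases is delta-of-delta encoding: because samples usually arrive at a regular interval, storing the change in the interval yields mostly zeros, which compress very well. The Python sketch below is my own simplified illustration, not code from the article; the function names are hypothetical and the real formats pack these values into variable-length bit fields rather than lists of integers.

```python
from typing import List


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """Encode as: first timestamp, first delta, then changes between deltas."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev = timestamps[0]
    prev_delta = 0  # assumed starting delta, so the second entry is the raw delta
    for ts in timestamps[1:]:
        delta = ts - prev
        encoded.append(delta - prev_delta)
        prev, prev_delta = ts, delta
    return encoded


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    prev_delta = 0
    for dod in encoded[1:]:
        delta = prev_delta + dod
        timestamps.append(timestamps[-1] + delta)
        prev_delta = delta
    return timestamps


# Samples scraped roughly every 10 seconds, with one arriving a second late.
SAMPLES = [1000, 1010, 1020, 1030, 1041]
ENCODED = delta_of_delta_encode(SAMPLES)  # [1000, 10, 0, 0, 1]
```

After the first two entries the encoded values stay close to zero, which is what makes regularly scraped metrics cheap to store.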

An UPDATE without a WHERE, or something close to it

This one has some unconfirmed (but totally plausible!) deeper details about what might have gone wrong in the Facebook outage, sourced from rumors.

rachelbythebay

It picks up and explains another rumor about the cause of Facebook’s outage, different from the rumors covered last week.

Turning Safety vs. Profits Into a Fair Fight

There’s a really intriguing discussion here about why organizations might justify a choice of profit at the expense of safety, and how the deck is stacked.

Rob Poston

As the title suggests, the article explains the steps and challenges in improving safety, from the perspective of “Why does a powerful concept such as HRO (high reliability organization) fail to spread in hospitals?”

Outages

KubeWeekly #281 October 29th, 2021 ← 2 weeks off due to KubeCon + CloudNativeCon NA 2021

How about those articles? Did any of them catch your interest?

There is some content I have not been able to digest at this stage, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara