SRE / DevOps / Kubernetes Weekly Collection #74 (Week 26, 2021)

Jul 3, 2021

In this blog post series, I collect the following three weekly mailing lists I subscribe to and leave some comments as an aide-memoire, along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #548 June 27th, 2021
SRE Weekly Issue #276 June 27th, 2021
KubeWeekly #267 ← KubeWeekly is off this week due to a US holiday.

DEVOPS WEEKLY ISSUE #548 June 27th, 2021

News

The Twelve Factor App methodology is 10 years old. But how does it hold up to modern application needs? This post explores each of the factors.

The title is “Revisiting the Twelve-Factor App Methodology”.
Since it was covered in KubeWeekly #266 last week, I will skip it.

A post on the need to watch out for the pitfall of counterfactuals when analyzing incidents, with good examples of potential issues.

The title is “Counterfactuals are not Causality”.
Since it was covered in last week’s SRE Weekly Issue #275, I will skip it.

We’re definitely seeing higher and higher levels of abstraction when it comes to cloud services, but the user interface for developers is still key. A post considering this issue and the influence of Heroku.

The title is “What AWS Tells Us About Heroku 2.0”.
It walks through Heroku’s history and contrasts it with AWS’s approach in the same area.

The Kyverno policy engine for Kubernetes can be used as a mutating webhook as well as a validating one, which opens up several use cases. This post looks at ensuring pull secrets are available in new namespaces and podspecs.

The title is “How I tackle Docker Hub rate-limiting policy with a policy engine Kyverno”.
As the title suggests, it shares a way to work around Docker Hub’s troublesome rate limiting using Kyverno and a Docker Hub Pro account.
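
For my own notes, here is a rough Python sketch of the effect such a mutation has on a Pod manifest. It is not Kyverno policy syntax and not code from the article; the function name `add_pull_secret` and the secret name `regcred` are hypothetical, and only the `spec.imagePullSecrets` field comes from the Kubernetes Pod spec.

```python
import copy


def add_pull_secret(pod, secret_name):
    """Return a copy of a Pod manifest with secret_name in spec.imagePullSecrets.

    The input manifest is left untouched, and the secret is not added twice.
    """
    mutated = copy.deepcopy(pod)
    spec = mutated.setdefault("spec", {})
    secrets = spec.setdefault("imagePullSecrets", [])
    if not any(s.get("name") == secret_name for s in secrets):
        secrets.append({"name": secret_name})
    return mutated


# Hypothetical Pod manifest without any pull secret.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo", "namespace": "team-a"},
    "spec": {"containers": [{"name": "app", "image": "nginx"}]},
}
patched = add_pull_secret(pod, "regcred")
```

A mutating admission webhook applies this kind of change when the Pod is created, so individual teams do not have to remember to add the pull secret themselves.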

A post on getting Gatekeeper violation information from your Kubernetes cluster into Prometheus and Grafana for improved visibility.

The title is “Expose Open Policy Agent/Gatekeeper Constraint Violations for Kubernetes Applications with Prometheus and Grafana”.
It explains how to use Prometheus and Grafana to give platform users a concise view of Gatekeeper constraint violations.

Tools

An online editor and visualisation tool, along with a built-in tutorial, for writing Kubernetes network policies.

The GitHub page of “Network Policy Editor”, which I covered earlier in this blog. It has since been improved, for example with more tutorial patterns.

Rocky Linux is a new Linux Operating System designed to be a drop-in replacement for CentOS, operating in the same manner CentOS did previously as a downstream project.

The web page of the community-led “Rocky Linux”, which is designed to be a drop-in replacement for CentOS.
There is also a GitHub page.

Kube Karp is a handy tool with a specific purpose, to add a floating virtual IP to Kubernetes cluster nodes to make load balancing easy.

The GitHub page of “Kube Karp”, which provides automatic failover for the Kubernetes API server by sharing a common virtual IP address among Kubernetes cluster nodes. It runs as a DaemonSet in the cluster.

SRE Weekly Issue #276 June 27th, 2021

Articles

@GergelyOrosz on blaming the intern

HBO accidentally sent an email to a bunch of people, and they tweeted (jokingly?) blaming their intern. This is a link to a short, thoughtful response thread.

Gergely Orosz

When a mistake happens, the thread asks whether responsibility lies with the individual or with the organization, how the system could have prevented it, and whether mentoring and onboarding are working.

The stack overflow of death. How we lost DNS and what we’re doing to prevent this in the future.

This is the story of the Bunny CDN outage linked below. Great read, thanks folks!

Dejan Grofelnik Pelzel — Bunny

It looks back at the Bunny CDN outage, as the Editor commented above. A very transparent article with details such as which regions traffic was moved to, what informed the decisions, and the status of support tickets.

Navigating the 8 fallacies of distributed computing

There’s never a bad time to review the fallacies of distributed computing. This article introduces them with examples and discussion of each.

Alex Diaconu — Ably

As the title suggests, it explains the following eight fallacies: what they are, how they came about, and how to deal with them when designing a reliable distributed system.

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.
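
As a small illustration of the first two fallacies (this sketch is mine, not taken from the article): code that assumes the network is reliable tends to make a single call and hope for the best. A minimal Python sketch of the opposite stance, with a bounded number of retries and exponential backoff, might look like the following. The names `call_with_retry` and `make_flaky_fetch` are hypothetical, and the failure is simulated rather than coming from a real network.

```python
import time


def call_with_retry(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError or TimeoutError.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    re-raises the last error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up: the caller must handle the failure
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_fetch(failures):
    """Build a simulated remote call that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky_fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("simulated network failure")
        return "payload"

    return flaky_fetch, state
```

For example, `call_with_retry(make_flaky_fetch(2)[0])` succeeds on the third attempt. Retries only make sense for operations that are safe to repeat, and a real client would usually also set a per-call timeout and add jitter to the delay.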

7 Essential Tools for SREs

These aren’t specific tools, but rather 7 classes of tools (with examples). They are:

* Chaos engineering
* Monitoring and alerting
* Observability
* Paging tools
* SLO management
* Infrastructure-as-Code (and everything-as-code)
* Automated incident response

Quentin Rousseau — Rootly

It describes what SREs should consider when building their toolbox, covers the seven major categories of tools listed above, and suggests specific options for each.

Designing like a joint cognitive system

Design is interpretive. We have to find common ground before we can even start to create a design, but finding that common ground is part of the design.

For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them.

Lorin Hochstein

As the Editor commented above, you have to find the “common ground” before you start designing, but the dilemma is that finding that common ground is itself part of the design.

Resilience in Action E8: Vanessa Yiu on Crafting Enterprise Architecture

This starts with a really neat moment in which the interviewer asks Yiu to talk about lessons from her jewelry-making hobby that she applies to SRE.

Kurt Andersen

An episode of the podcast “Resilience in Action”. The podcast is embedded in the web page, with some excerpts and transcripts.
She talks about the similarities between her hobby of jewelry making and SRE, as well as enterprise architecture and SREcon. It was a very positive and calm conversation, and listening to it made me feel positive.

r/WallStreetBets Incident Anthology: Reddit’s Open Systems

When Gamestop’s stock shot through the roof earlier this year, Reddit’s traffic did too. This is the first article in a short series by Reddit’s SRE team on how they handled the influx.

This article is about the ways that user actions affected their systems in unexpected ways, and how they responded.

Courtney Wang — Reddit

The Editor’s comments above include everything I want to say here. Check out the other articles in this series.

SRE Cultural Values

Recently in our Site Reliability Engineering organization in Azure, we established a set of cultural values that we hold ourselves and each other accountable to.

Bill Johnson — Microsoft

As the title and the excerpt above indicate, it shares the following cultural values established by Microsoft’s Azure SRE organization. Job openings are also linked for those who identify with these values.
○ We Are Intentional
○ We Are Kind
○ We Are Brave
○ We Are Infinitely Curious

Outages

Western Digital “My Book Live” hard drives
Amazon Prime Video and Alexa
PharmOutcomes
PharmOutcomes is a SaaS used by pharmacies.
Commonwealth Bank
Medium
I’ve gotten a few 500s from Medium while trying to review articles last week and this week. Maybe it’s this incident on their status page?
Bunny (CDN)
Reddit
This post on their status site says “API errors”, but I saw rumblings that suggested that reddit itself was down.

KubeWeekly #267 ← KubeWeekly is off this week due to a US holiday.

How did you find these articles? Did any of them catch your interest?

To be honest, there is some content I have not been able to digest yet, so I will also use this aide-memoire and these links to catch up myself.

Bye now!!

Yoshiki Fujiwara