SRE / DevOps / Kubernetes Weekly Collection #47 (Week 52)

  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments as an aide-memoire along with useful links.

DEVOPS WEEKLY ISSUE #521 December 20th, 2020
SRE Weekly Issue #249 December 20th, 2020
KubeWeekly # 245 ← No Updates

DEVOPS WEEKLY ISSUE #521 December 20th, 2020


There are lots of tools for storing data, but how do you find the right dataset for analysis? This post explores a number of different architectural approaches and discusses pros and cons.

  • The title is “Data Hub: Popular metadata architectures explained”.

A good writeup from a recent AWS re:Invent talk focused on AWS’s Serverless services. The post focuses on what this means for operations, which is often neatly ignored in the marketing.

  • The title is “Does AWS Serverless care about IT Operations? Their service naming says “no” but their breadth and quality of choice says “yes””.

Related to a large CDN outage, this post looks at some of the theory behind the Raft consensus algorithm, whether it provides liveness guarantees during network failures, and what can be done about it.

  • The title is “Raft does not Guarantee Liveness in the face of Network Faults”.

This long read introduces YOLOSec and FOMOSec as terms to describe problematic but all-too-common approaches to security strategy, driven either by short-termism or by chasing fashion.

  • The title is “On YOLOsec and FOMOsec”.

More and more teams are now needing to manage multiple Kubernetes clusters. This post takes a look at the monitoring challenges that brings, and how to solve them with Prometheus and Grafana.

  • The title is “How to monitor multi-cloud Kubernetes with Prometheus and Grafana”.

A post exploring DNS routing in Kubernetes, stepping through several potential solutions to a specific problem.

  • The title is “Forbidden lore: hacking DNS routing for k8s”.

Ensuring SSL certificates don’t expire is an essential if annoying problem, and several services exist to help. This post runs down a list of different solutions.

  • The title is “10 Best Tools to Monitor SSL Certificate Expiry, Validity & Change”.
  1. Sematext Synthetics
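
As a rough illustration of what such tools check, here is a minimal sketch using only the Python standard library. It is not taken from the linked post or from any of the listed services; the function names `days_until_expiry` and `fetch_not_after` are hypothetical, and the 14-day threshold in the usage comment is an arbitrary assumption.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days from `now` (epoch seconds) until the
    certificate's notAfter timestamp, e.g. "Jan  5 09:34:43 2018 GMT".
    A negative value means the certificate has already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field.
    Note: with the default context, an already-expired or otherwise invalid
    certificate makes the handshake fail with ssl.SSLError instead."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example usage (needs network access; the host and threshold are assumptions):
#   remaining = days_until_expiry(fetch_not_after("example.com"))
#   if remaining < 14:
#       print(f"certificate expires in {remaining:.1f} days")
```

Separating the date arithmetic from the network call keeps the expiry logic easy to test without contacting a real server.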

A look at building a Kubernetes-based platform using Argo Workflows and Argo Events.

  • The title is “Building Kubernetes Clusters using Kubernetes”.

SRE Weekly Issue #249 December 20th, 2020


Generic mitigations

Every service needs a couple of big hammers that are easy to swing.

Jennifer Mace — O’Reilly and Google

  • The concept of “generic mitigation” is explained using cute illustrations.

How Facebook keeps its large-scale infrastructure hardware up and running

Answer: automation. Lots of automation. And automation of the automation.

Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook

Tips for On Call Engineers During the Holidays

Oh, how quaint! This article was written back when people traveled for the holidays.

Ashley Roof —

  • It offers tips for on-call engineers during the holiday season.

Raft does not Guarantee Liveness in the face of Network Faults

Surprise! Fortunately, there are some ways to fix this limitation.

Heidi Howard, Ittai Abraham — Decentralized Thoughts

  • I covered it in DEVOPS WEEKLY ISSUE # 521 above, so I will skip it.

Anatomy of Unsuccessful Incident Management

A common question when a company is implementing incident management is: why do we need this process?

It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.


  • To answer the question companies often ask when implementing incident management, “Why do we need this process?”, the article describes the following characteristics of failed incident management:
    ○ Confusion about Process
    ○ Panic and Thrash
    ○ Lack of Awareness
    ○ Blame
    ○ Uncoordinated & Conflicting Response
    ○ Confusion over Ownership
    ○ Repeat Problems

Just Culture: Standardizing Fire Service Accountability

Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.

Tory Thompson — Firehouse

  • It explains that “Just Culture” is an industry term for a value-based accountability model that considers the behaviors, systems, and expectations that make up an organization.

Let’s Talk: Full-Service Ownership

Not sold yet on full service ownership for development teams? This interview may help.

Vivian Chan — PagerDuty

  • It introduces “full-service ownership” and answers questions about it in an interview format. The questions are below.
    ○ Q: First things first, what exactly is a service?
    ○ Q: So what’s the big deal about full-service ownership? Why should IT and engineering leaders care? Paint me a picture.
    ○ Q: What is one of the biggest drivers for moving to a model of full-service ownership?
    ○ Q: Where does one even start?

Supporting Grounded Incident Analysis

While ostensibly about, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.

John Allspaw — Adaptive Capacity Labs

  • An introductory article on, an analysis platform specializing in software-related incidents, by an angel investor.

Heroku incident #2130 follow-up: Heroku Connect Sync Issue

A new feature roll-out resulted in impaired service for some customers.

  • An incident report on Heroku Connect. Sync issues with Salesforce affected 25% of production connections.

Uber’s adventures in the adaptive universe

The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.

Lorin Hochstein

  • It explains what is described in the title and the editor’s comment. Since the article is based on a Twitter thread by former Uber engineer McLaren Stanley, the author highly recommends reading the original thread, as follows:
    ○ I highly recommend reading the original thread in full. My writing above is based solely on that thread, I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

The Shadow Request: Troubleshooting OkCupid’s First GraphQL Release

Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.

Another great example of making a duplicate request to a new API in the background to test it before deploying it.

Michael P. Geraci — OkCupid

  • Since they were building the GraphQL API on a whole new stack, they wanted to see how it performed under production load compared to the previous REST API, so that it would not adversely affect the user experience. This is the story of devising and releasing the “Shadow Request”.
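
The general idea of a shadow request can be sketched as follows. This is only an illustration under assumptions, not OkCupid's implementation: the names `handle_with_shadow`, `_run_shadow`, and `timings` are hypothetical, and the two handlers stand in for the old REST API and the new GraphQL API.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical in-memory store for measurements; a real system would
# send these to its metrics pipeline instead.
timings: List[Dict[str, Any]] = []

# Small background pool so shadow calls never block the user's response.
_executor = ThreadPoolExecutor(max_workers=4)


def _run_shadow(handler: Callable[[Any], Any], request: Any) -> None:
    """Call the new API, record how long it took, and swallow any error
    so a failure in the shadow path can never reach the user."""
    start = time.perf_counter()
    error = None
    try:
        handler(request)  # the result is deliberately discarded
    except Exception as exc:
        error = repr(exc)
    timings.append(
        {"api": "shadow", "seconds": time.perf_counter() - start, "error": error}
    )


def handle_with_shadow(
    request: Any,
    primary: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
) -> Tuple[Any, Future]:
    """Serve the request from the primary (old) API while sending a
    duplicate to the shadow (new) API in the background."""
    shadow_future = _executor.submit(_run_shadow, shadow, request)
    start = time.perf_counter()
    response = primary(request)  # errors here propagate as they did before
    timings.append(
        {"api": "primary", "seconds": time.perf_counter() - start, "error": None}
    )
    # The future is returned only so callers or tests can wait on the shadow call.
    return response, shadow_future
```

Comparing the "primary" and "shadow" entries in `timings` gives a side-by-side view of both APIs under the same traffic, while users only ever see the primary response.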


Failure information of each of the above companies

KubeWeekly # 245 ← No Updates

What do you think of these articles? Are any of them of interest to you?

Actually, there is some content I cannot fully digest at this stage, so I’ll use this aide-memoire and these links to catch up myself, too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece (1986–1992). #Network, #Kubernetes, #GCP, #Certified AWS SAP
