SRE / DevOps / Kubernetes Weekly Collection #45 (Week 50)

  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.

DEVOPS WEEKLY ISSUE #519 December 6th, 2020
SRE Weekly Issue #247 December 6th, 2020
KubeWeekly #243 December 11th, 2020

A post on moving from systems administration to information security, with good observations of the advantage of being new to a topic and how a variety of skills help when moving to different roles.


Hindsight is an interesting design tool for future systems. This post, from someone very familiar with its workings, looks at what applying hindsight and some opinions to Kubernetes might mean if you were to build a new container orchestrator.

  • The title is “A better Kubernetes, from the ground up”.

An interesting post on scaling CI/CD pipelines across development teams, with a focus on self-service.

A breakdown of the recent large AWS outage, based on the public information but with a useful diagram to understand the apparent architecture and a handy list of the proposed plans to avoid similar issues in the future.

  • The title is “Kinesis Outage”.

Software is always rooted in when it was first written. This post touches on GitHub’s architecture and technology choices and also how, and why, that’s evolving.

  • The title is “GitHub’s journey towards microservices and more: ‘We actually have our own version of Ruby that we maintain’”.

Details of scaling datastores, from active/active MySQL clusters to using Vitess.

The Kubernetes API is designed to be extended with new resources. This post looks at a more flexible Deployment resource which supports more fine grained control around rollout and running multiple versions of a service at the same time.

  • The title is “Introducing Kubernetes Pinned Deployments”.


Opstrace is a new horizontally-scalable metrics and logs platform, optimised for installation on cloud platforms. It exposes a prometheus-compatible API, as well as working with a variety of agents like those from Fluentd and Datadog.

  • The GitHub page of “opstrace”, the horizontally scalable metrics and logs platform described above.

Replicate is a new tool that aims to solve version control problems for machine learning. It’s a python library that allows for snapshots to be saved in S3 or Google Cloud Storage and tools for retrieving and reusing those versions.

Nydus is a set of tools that aims to improve over the current OCI image specification in terms of container launching speed, image space and network bandwidth efficiency. The tutorial is a nice way of understanding how it works.

  • The GitHub page of the project “Nydus” that implements a userspace file system on top of a container image format that improves the current OCI image specification.

SRE Weekly Issue #247 December 6th, 2020


2020 09 25 Incident: Infrastructure connectivity issue impacting multiple systems

This incident report from a September Datadog outage has an interesting tidbit about scaling external incident response in tandem with internal.

Alexis Lê-Quôc — Datadog

  • An article dated 10/06/2020. It reports that Datadog’s incident occurred in the US region between September 24, 2020, 14:27 UTC and September 25, 00:40.

Google Cloud Issue Summary — Google Drive — 2020–11–16

This is Google’s write-up for an interesting issue that involved repeated re-sending of invitations to edit a Google Drive document.


  • When a Drive resource was shared through Google Drive’s sharing web UI with “users or groups whose profile email address contains uppercase letters”, the initial notification email was sent repeatedly, as the editor’s comment above also notes.

What I Wish I Knew About Incident Management

I basically want to immediately absorb any article with this title, unless it’s just clickbait spam. This one definitely isn’t.

Ronak Nathani

  • He shares the incident management practices he has learned over the years as a LinkedIn SRE.

Scaling Datastores at Slack with Vitess

Lots of juicy details in this one about the difficulty Slack has had in scaling their DB layer and how Vitess solved their problems.

Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón — Slack

Mitigate Connection Leaks in Production via Proxies

Hitting file descriptor limits is such an annoying kind of outage. Some good tips here, clearly coming from hard-won experience.

Utsav Shah

  • It describes several approaches that can be used together to mitigate connection leaks, and the trade-offs between them.
    ○ Singletons/Dependency Injection
    ○ Client Count Metrics
    ○ File Descriptor Count Alerts
    ○ Sidecar Processes
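
To make the first three items more concrete for myself, here is a minimal sketch in Python. It is not taken from the article: `ClientRegistry`, `open_fd_count`, and `fd_alert` are hypothetical names, the 80% threshold is an arbitrary assumption, and the descriptor count relies on Linux's `/proc/self/fd` (falling back to `/dev/fd`).

```python
import os
import threading


class ClientRegistry:
    """Hypothetical helper: hands out one shared client per target
    (the singleton idea) and exposes how many clients exist (a count metric)."""

    def __init__(self, factory):
        self._factory = factory      # callable that builds a client for a target
        self._clients = {}
        self._lock = threading.Lock()

    def get(self, target):
        # Reuse the existing client instead of opening a new connection each call.
        with self._lock:
            if target not in self._clients:
                self._clients[target] = self._factory(target)
            return self._clients[target]

    def client_count(self):
        # Value one could export as a "client count" metric.
        with self._lock:
            return len(self._clients)


def open_fd_count():
    """Count this process's open file descriptors (assumes /proc or /dev/fd)."""
    for path in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(path):
            return len(os.listdir(path))
    raise OSError("no file descriptor directory found on this platform")


def fd_alert(current, soft_limit, threshold=0.8):
    """Return True when descriptor usage reaches the assumed alert threshold."""
    return current >= soft_limit * threshold
```

On a Unix system, the soft limit to compare against could be read with `resource.getrlimit(resource.RLIMIT_NOFILE)[0]`. The sidecar approach is a deployment choice rather than application code, so it is not sketched here.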

Improving the Resiliency of Our Infrastructure DNS Zone

They used two providers synced with OctoDNS.

Ryan Timken and Kiran Naidoo — Cloudflare

  • It describes how Cloudflare increased the reliability of its infrastructure DNS zone by using multiple primary name servers, combining its own DNS products running at the edge with a third-party DNS provider.

Root Cause Analysis For Reliability: A Case Study

This is all about understanding the whole system (people and technology) and building learning, rather than finding a superficial “root cause”.

Piyush Verma — Last9

  • Starting from the question “Why is RCA (Root Cause Analysis) necessary for reliability?”, the article explains the topic in the title based on the author’s experience.


KubeWeekly #243 December 11th, 2020

The Headlines

Editor’s pick of the highlights from the past week.

Kubernetes 1.20: The Raddest Release

Kubernetes 1.20 Release Team

The final Kubernetes release of 2020 is “the raddest release”, bringing 42 enhancements to the project as well as bug fixes and performance improvements. Check out the release notes or listen to the Kubernetes Podcast interview with the release team lead Jeremy Rickard.

  • An article on the Kubernetes Blog summarizing Kubernetes 1.20 under the following headings: “Major Themes / Major Changes / Other Updates / Release notes / Availability of release / Release Team / Release Logo / User Highlights / Project Velocity / Ecosystem Updates / Event Updates / Upcoming release webinar / Get Involved”. Explanations and related links are provided for each item.
    ○ 42 enhancements: 11 enhancements have graduated to stable, 15 enhancements are moving to beta, and 16 enhancements are entering alpha.

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Member webinar: Fundamentals of OpenTelemetry

Ted Young, Director of Developer Education @Lightstep

  • It tries to explain everything you need to know to get started with OpenTelemetry.

CNCF Member webinar: A look at how Hackers exploit Prometheus, Grafana, Fluentd, Jaeger & more (hacking monitoring for fun and profit)

Omer Levi Hevroni, Application Security Engineer @Snyk

  • It performs threat modeling of the tools in the title (Prometheus, Grafana, Fluentd, Jaeger & more) and checks what risks they pose.

CNCF Member webinar: Preventing Kubernetes misconfiguration: static analysis and beyond

Matt Johnson, Developer Advocate Lead @Bridgecrew

  • It describes best practices for creating, testing, and maintaining large-scale infrastructure using policy-as-code, both in CI/CD and on Kubernetes clusters at runtime, covering the following points.
    ○ Compare methods for securing infrastructure using open-source tools including Checkov
    ○ Review sample use cases that showcase the benefits of different approaches
    ○ Cover the current state of open source repositories and Kubernetes manifests found in the wild

CNCF Member webinar: SPIFFE and SPIRE in practice

Dan Feldman, Principal Software Engineer @Hewlett Packard Enterprise
Umair Khan, Product Marketing Lead @Hewlett Packard Enterprise

  • It introduces the key use cases of two projects (SPIFFE and SPIRE) and how they are used to build secure systems in some of the world’s largest and most security-conscious organizations.

CNCF Member webinar: Metal³: Kubernetes-native bare metal host management

Maël Kimmerlin, Senior Software Engineer @Ericsson Software Technology
Feruzjon Muyassarov, Experienced Developer @Ericsson Software Technology
Pep Turro Mauri, Senior Software Engineer @Red Hat

  • It introduces the Metal³ (called “metal kubed”) project and its motives, and outlines what has been achieved so far.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Using Snyk and Podman to scan container images from development to deployment

Matt Jarvis, Red Hat

  • It focuses on the container scanning capabilities available via the “Snyk CLI” and how to integrate it with the new Podman APIs available in Podman 2.x, which is available in Red Hat Enterprise Linux 8.3.

Kubernetes: Efficient multi-zone networking with topology aware routing

Bob Killen, Google Cloud

OpenShift/Kubernetes failure stories at scale — Lessons learned from large and dense deployments

Naga Ravi Chaitanya Elluri, Red Hat

  • The author, a member of Red Hat’s Performance and Scalability team, describes what went wrong while building tools, workloads, and automation to simulate a real-world production environment and monitor cluster health. For each of the following three scenarios, the post explains how to debug the problem and how to prevent it.
    ○ Scenario 1: Rogue DaemonSet Took Down the 2000 Node Cluster
    ○ Scenario 2: Too Many Objects in Etcd Lead to Writes Being Blocked
    ○ Scenario 3: Hosting Etcd on Slower Disks Created Havoc on the Cluster

Automating volume expansion management — an operator-based approach

Raffaele Spazzoli, Red Hat

  • A Red Hat blog post filed under the “Storage, How-tos, Operators, Prometheus, automation” categories. It refers back to a previous blog post that described some best practices for the metrics used when monitoring applications.

OPA the easy way feat. Styra DAS!

Amey Deshmukh, InfraCloud Technologies

  • A hands-on walkthrough focused on configuring OPA as an admission controller for Kubernetes clusters using Styra DAS.

How to use a policy engine to improve your security posture


  • It touches on the fact that most of the recent security breaches are due to “misconfiguration” and explains the necessity and the role of the policy engine.
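
As a note to myself on what "policy as code" means in practice, here is a minimal sketch in plain Python. It is only a stand-in for a real policy engine: it does not use OPA, Rego, or Styra DAS, and the function name `check_pod` and the two rules (no privileged containers, no unpinned image tags) are my own illustrative assumptions, not rules taken from the article.

```python
def check_pod(manifest):
    """Return a list of policy violations for a Pod-like manifest (a dict).

    Hypothetical rules, chosen only for illustration:
      1. containers must not run in privileged mode
      2. images must be pinned to a tag other than "latest" (or to a digest)
    """
    violations = []
    containers = manifest.get("spec", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")

        # Rule 1: reject privileged containers.
        security = container.get("securityContext", {})
        if security.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")

        # Rule 2: look only at the last path segment so that a registry
        # port such as "localhost:5000/app" is not mistaken for a tag.
        image_name = container.get("image", "").rsplit("/", 1)[-1]
        if ":" not in image_name or image_name.endswith(":latest"):
            violations.append(f"{name}: image must be pinned to a specific tag")
    return violations
```

A CI step or an admission hook could reject a manifest whenever the returned list is non-empty. A real policy engine adds what this sketch lacks, such as a dedicated policy language and central distribution of the rules.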

Service discovery in Kubernetes — combining the best of two worlds

Ivan Velichko

  • As the title suggests, it describes Kubernetes service discovery.

Kernel privilege escalation: how Kubernetes container isolation impacts privilege escalation attacks

Kamil Potrec, Snyk

  • It explains how the isolation of Kubernetes containers impacts privilege escalation attacks as titled.

GSoD 2020: Improving the API reference experience

Philippe Martin

  • The author announces the results of improving the Kubernetes API Reference documentation as part of the “GSoD (Google Season of Docs) 2020” project.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Why Linkerd doesn’t use Envoy

William Morgan, Linkerd

  • While paying homage to Envoy, he explains the title “Why Linkerd doesn’t use Envoy” as follows.
    ○ Let me also state upfront that this is not an “Envoy sucks” blog post. Envoy is a great project, is clearly a popular choice for many, and we have nothing but respect for the fine folks who work on it. We recommend Envoy to Linkerd users every day in the form of ingress controllers like Ambassador, and there are production systems around the world today where you can find Envoy and Linkerd working side by side.
    ○ So in this article I’m going to do my best to lay out the reasons why in a frank and engineering-focused way. After all, Linkerd is built by engineers and for engineers, and if there’s one thing I’m proud of, it’s that we’ve made decisions on the basis of engineering tradeoffs rather than marketing pressure.
    ○ In short: Linkerd doesn’t use Envoy because using Envoy wouldn’t allow us to build the lightest, simplest, and most secure Kubernetes service mesh in the world.

2021 Predictions: The year that cloud native transforms the IT core

Bill Mann, Styra

  • Articles with titles and contents like this one are a sign that the year-end and New Year holidays are near. Styra’s top 5 predictions for 2021 are as follows. I can’t help but wonder about the numbers in the various data cited.
  1. Kubernetes in production will continue to skyrocket, creating new challenges for security and compliance

Upcoming CNCF webinars

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Member Webinar: Power to the people — making root/Docker a reality inside a Gitpod Container
Christian Weichel, Chief Architect @Gitpod
Alban Crequy, Director of Kinvolk Labs @Kinvolk
Dec 11, 2020 10:00 AM Pacific Time

Member Webinar: Implementing automated managed k8s service
Mason Choi, Senior Engineer @Samsung SDS
Kangsub Song, Senior Engineer @Samsung SDS
Dec 15, 2020 10:00 AM Korea Time

Member Webinar: Reducing your Kubernetes cloud spend
Webb Brown, CEO @Kubecost
Dec 15, 2020 10:00 AM Pacific Time

Member Webinar: Argo: Real Enterprise-scale with K8s
Al Kemner, Principal Software Engineer and Architect @New Relic
Daniel Jimbel, Staff Engineer @New Relic
Caleb Troughton, Product Manager, Telemetry Data Platform @NewRelic
Dec 16, 2020 7:00 AM Pacific Time

Member Webinar: Machine learning for K8s logs and metrics
Larry Lancaster, Founder and CTO @Zebrium
Dec 16, 2020 1:00 PM Pacific Time

Member Webinar: Kubernetes configuration — Auditing for enterprise best practices through open source tooling
Kendall Miller, President @Fairwinds
John Wynkoop, Cloud CTO @IGNW
Dec 18, 2020 10:00 AM Pacific Time

How about those articles? Do any of them interest you?

Actually, there is some content I cannot fully digest at this stage, so I’ll use this aide-memoire and the links to catch up myself too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece (1986–1992). #Network, #Kubernetes, #GCP, #Certified AWS SAP
