SRE / DevOps / Kubernetes Weekly Collection #34 (Week 39)

  • In this blog post series, I collect articles from the following three weekly mailing lists I subscribe to, adding brief comments as an aide-memoire along with useful links.
  • I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #508 September 20th, 2020
SRE Weekly Issue #236 September 20th, 2020
KubeWeekly #234 September 25th, 2020

DEVOPS WEEKLY ISSUE #508 September 20th, 2020


Describing policy (or in fact configuration in general) in machine-readable form quickly gets into a conversation over whether you should prefer data, a general programming language or a DSL. This post does a good job of explaining why.

  • The title is “Anatomy of a Rule”.
  • It explains, from the two perspectives of code and data, how “oso”, an open source policy engine for authorization, handles complex authorization policies efficiently and naturally. The post is organized into the following three parts:
  • Rules as Data
  • Rules as code
  • Rules Redux
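
As a rough illustration of the data-versus-code distinction the post discusses, the sketch below expresses the same kind of authorization check both ways in Python. It is not oso's API or its policy language; every name in it (DataRule, Request, allowed_by_data, allowed_by_code) is hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


# Rules as data: each rule is a plain record that a generic evaluator interprets.
@dataclass(frozen=True)
class DataRule:
    role: str
    action: str
    resource_type: str


DATA_RULES = [
    DataRule(role="admin", action="delete", resource_type="document"),
    DataRule(role="member", action="read", resource_type="document"),
]


def allowed_by_data(role: str, action: str, resource_type: str) -> bool:
    """Allow the request when a stored rule matches it exactly."""
    return DataRule(role, action, resource_type) in DATA_RULES


# Rules as code: each rule is a predicate, so it can express conditions
# (here, ownership) that a fixed record shape cannot.
@dataclass(frozen=True)
class Request:
    user: str
    role: str
    action: str
    owner: str  # owner of the resource being accessed


CodeRule = Callable[[Request], bool]

CODE_RULES: List[CodeRule] = [
    lambda r: r.role == "admin",
    lambda r: r.action == "read" and r.role == "member",
    lambda r: r.action == "edit" and r.user == r.owner,
]


def allowed_by_code(request: Request) -> bool:
    """Allow the request when any predicate accepts it."""
    return any(rule(request) for rule in CODE_RULES)
```

The data form is easy to store, list, and audit, while the code form can express relationships such as "the user owns the resource". That trade-off is the general tension the post's title alludes to, as far as this summary goes.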

An excellent post on moving to alerts based on service-level objectives, SLOs. Covers the why and how, based on documents used internally to make the case for the change.

  • The title is “Alerting on SLOs”.
  • I will skip it because it was covered in last week’s SRE Weekly Issue #235.

A discussion of the need to test in production and an introduction to the dark canary pattern for doing so safely.

  • The title is “Production testing with dark canaries”.
  • I will skip it because it was covered in last week’s SRE Weekly Issue #235.

A look at a tool-agnostic architectural framework for building resilient systems, focused around predictability, observability, recoverability and keeping things simple.

  • The title is “PORK: A Technology Resilience Framework”.
  • This article explains why you shouldn’t waste valuable engineering time bulletproofing systems in advance. Instead, design your systems to absorb a constant rate of change and uncertainty by recovering quickly before there’s client impact.
  • PORK, the name of this resilience framework, is an acronym for the following four principles:
    ○ Predictability
    ○ Observability
    ○ Recoverability
    ○ Keep it simple

A look at a range of Kubernetes local clients/dashboards including Octant, Kubenav, Lens and more.

  • An article that briefly compares and describes common tools that developers can use as alternatives or complements to kubectl for exploring Kubernetes clusters.
  • The tools featured in this article can be installed locally, with no need to install components on the cluster. I want to try them one by one.

Even with all the talk of cloud native, it’s still super useful for lots of roles to have a solid grounding in UNIX programming. This Advanced Programming in the UNIX Environment course is now available completely online.

  • It is a programming course named “Advanced Programming in the UNIX Environment”.
  • It is an online course for learning about the Unix OS (and the operating systems in this family, including Linux, BSD, and even macOS) and how to develop complex system-level software for it in the C programming language.
  • A YouTube video is also available.

Have you ever wanted to write Python inside your SQL queries? Well now you can with Postgres using PL/Python. These posts act as an introduction and show off some interesting demos with embedded numpy.

  • The titles are “Getting Started with Postgres Functions in PL/Python” (link above) and “Exploring PL/Python: Turn Postgres Table Data Into a NumPy Array”.
  • The first article briefly describes how to create a Postgres function using PL/Python.
  • The second article runs NumPy inside a simple user-defined function that takes advantage of PL/Python’s database access functions. The function demonstrates an easy way to convert a Postgres data table to a NumPy array.

An introduction to Open Policy Agent Gatekeeper, specifically looking at addressing issues with the built-in pod security policies feature in Kubernetes.

  • The title is “Using Gatekeeper as a drop-in Pod Security Policy replacement in Amazon EKS”.
  • An AWS article that describes “Gatekeeper” as a standard way to proactively control what is allowed within a Kubernetes cluster, including Amazon EKS.

GitHub Actions is still relatively new, but there is already a huge amount of content available for it. This post looks at various actions for analyzing code for security problems.

  • The title is “GitHub Actions for Security Code Analysis”.
  • It introduces some of the author’s favorite GitHub Actions that he uses for security-focused code analysis.
  • Before getting into his favorite actions, he lists recent blog posts he has published on code analysis and security. It is a good, concisely organized article with useful reference links.


Terratag is a new CLI tool that enables users of Terraform to automatically create and maintain tags across their entire set of AWS, Azure, and GCP resources.

  • The GitHub page of Terratag, an OSS CLI tool that lets you apply tags or labels across an entire set of Terraform files.
  • It applies tags and labels to AWS, GCP, and Azure resources.

SRE Weekly Issue #236 September 20th, 2020


My first outage

A nice juicy post-incident report from the archives. Remember the first time you took down production?

Mads Hartmann — Glitch

  • A retrospective on the first time the author caused a failure in a production environment with his own hands.
  • At the time, his team was reading the SRE book at an in-house reading club, which made this a great opportunity to put the theory into practice. The postmortem is written in the format used in the SRE book.

Fault during testing of NordLink

While testing a new power transmission link, it was accidentally overloaded by a factor of ~14x, with far-reaching but ultimately well-managed effects.

Thanks to Jesper Lundkvist for this one.

  • An article about a fault that occurred during a test run of the NordLink project.
  • The NordLink project is an electricity interconnector between Norway and Germany. Regular operation is scheduled to start next year.

Throughput autoscaling: Dynamic sizing for

As Facebook moved from a static to an auto-scaled web pool, they had to try to predict their expected demand as accurately as possible.

Daniel Boeve, Kiryong Ha, and Anca Agape — Facebook

  • A Facebook article explaining throughput autoscaling for one of its main services, the “Web Tier”, which handles the HTTP requests people generate each time they interact with Facebook.
  • The Web Tier is a large-scale global service distributed across multiple data centers around the world.
  • The presentation video from Systems @ Scale 2020 is embedded. It is worth watching after reading the article. The first 20 minutes are a presentation, and the second half is a discussion.

Database migrations lessons learned

The key lesson involves ensuring that your migrations avoid using parts of the production code, which could cause their action to change down the line inadvertently.

Frank Lin — Octopus Deploy

  • It introduces database migrations and shares the following five lessons learned, drawing on several common frameworks and the author’s nearly a decade of experience.
  1. Keep your migration scripts away from your production code
  2. Keep it low-tech, don’t deserialize
  3. Write tests to exercise each migration script individually
  4. Consider running long migrations online
  5. Consider versioning your documents
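
As a minimal sketch of how lessons 1, 3, and 5 might look in practice, the Python example below migrates JSON documents stored in SQLite. The table name, the field names, and both functions are hypothetical and do not come from the article. The migration works only on raw JSON dictionaries and imports no application model classes, and the in-memory database lets the script be tested on its own.

```python
import json
import sqlite3


def migrate_v1_to_v2(conn: sqlite3.Connection) -> int:
    """Rename the 'name' key to 'display_name' in version-1 documents.

    Works on raw JSON dicts only and imports no application model classes,
    so later changes to production code cannot alter what this script does.
    Returns the number of documents that were migrated.
    """
    rows = conn.execute("SELECT id, body FROM documents").fetchall()
    migrated = 0
    for doc_id, body in rows:
        doc = json.loads(body)
        # Documents without a version field are treated as version 1.
        if doc.get("version", 1) != 1:
            continue  # already migrated, so re-running the script is safe
        doc["display_name"] = doc.pop("name", None)
        doc["version"] = 2
        conn.execute(
            "UPDATE documents SET body = ? WHERE id = ?",
            (json.dumps(doc), doc_id),
        )
        migrated += 1
    conn.commit()
    return migrated


def make_test_db() -> sqlite3.Connection:
    """Build an in-memory database holding one v1 and one v2 document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute(
        "INSERT INTO documents (body) VALUES (?)",
        (json.dumps({"name": "alice"}),),
    )
    conn.execute(
        "INSERT INTO documents (body) VALUES (?)",
        (json.dumps({"display_name": "bob", "version": 2}),),
    )
    conn.commit()
    return conn
```

A test for this single script can call make_test_db(), run migrate_v1_to_v2() once, and check that only the version-1 document changed.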

Moobot vs. Gatebot: Cloudflare Automatically Blocks Botnet DDoS Attack Topping At 654 Gbps

Cloudflare uses an interesting multi-layered approach to mitigating attacks.

Omer Yoachimik — Cloudflare

  • It explains how the DDoS attack that occurred on July 3 was automatically detected and mitigated by Gatebot, Cloudflare’s global DDoS protection system.
  • The UDP-based DDoS attack peaked at 654 Gbps. It is believed to have been generated by the Mirai-based botnet “Moobot”. I knew little about botnets or DDoS attacks from IoT devices, so it was a very helpful introduction to the topic.

Availability, Maintainability, Reliability: What’s the Difference?

The availability/reliability distinction in this article is thought-provoking.

Emily Arnott — Blameless

  • To answer the question “What does reliability mean?”, it distinguishes “reliability” from “availability” and “maintainability”, which are other indicators used in SRE.
  • Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.

Troubled Times: Episode 3

2020 has shown the value of adaptive capacity. 2021 will show whether or not adaptive capacity can be sustained.

This article (not a video or podcast despite the name) also focuses on the increasing importance of learning from incidents.

Dr. Richard Cook — Adaptive Capacity Labs

  • It lists and explains the following four crises of the current moment, focusing on how they interact and on the resilience of society.
  1. Covid-19 pandemic
  2. Economic slowdown
  3. Social disintegration
  4. Climate change

Building and revising adaptive capacity sharing for technical incident response: A case of resilience engineering

What is resilience engineering? What does a resilience engineer do? Are there principles of resilience engineering? If so, what are they? What makes it possible to engineer resilience?

This academic paper uses a case study to show how a company engineered the resilience of their system in response to a series of incidents.

Richard I. Cook and Beth Adele Long — Applied Ergonomics

  • It describes some of the candidate features and conditions observed in certain cases of resilience engineering. When I read papers like this, I think, “As an engineer, I would like to clarify and dig deeper into my own specialty and themes like this.”


  • Google Drive
    This is a post-analysis for two outages, one from this past week and the other from the week before.
  • Instagram
  • Facebook
  • Discord
  • Fastly
  • Gandi
    Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6
    A layer 2 network loop was accidentally introduced, on two separate occasions.
    Sébastien Dupas — Gandi
  • Azure
    This was an outage on Sept. 14 in the UK South region. A cooling system was shut off in error during a maintenance procedure.

KubeWeekly #234 September 25th, 2020

The Headlines

Editor’s pick of the highlights from the past week.

KubeCon + CloudNativeCon Europe 2020 — Virtual Conference Transparency Report: A very successful first virtual event!

CNCF staff

The shift to a virtual KubeCon + CloudNativeCon EU wasn’t easy or even expected, but the community came together to share knowledge, learn about new projects, and play drag queen bingo. The KubeCon + CloudNativeCon transparency reports provide insight into event attendance, diversity and inclusion, and drill into the talk selection process for the events, which is run by the event co-chairs and their program selection committee.

YAML Templating Solutions: Helm & Kustomize


Writing config files by hand is like coding with Notepad instead of an IDE. There are ways to automate most of it, and this usually starts with either Helm or Kustomize. This article presents a 101-level overview of both, and helps in choosing which one’s the better fit for your use case.

  • An explanatory YouTube video is embedded, along with its transcript.
  • I laughed at the joke at the beginning.
    ○ The modern term Kubernetes engineer derives from an ancient Greek idiom that translates roughly to… YAML craftsperson. Sorry, bad joke.

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Member webinar: Critical DevSecOps considerations for multi-cloud Kubernetes

Sylvain Huguet, Sr. Product Manager — Karbon/Kubernetes @Nutanix & Loris Degioanni, CTO & Founder @Sysdig

  • Two cloud-native experts in infrastructure and security provide valuable insights on the following:
    ○ How containerized applications use compute, storage, and network resources differently than legacy applications do
    ○ Why hyper-converged infrastructure is suited for Kubernetes
    ○ How a Kubernetes stack should be instrumented for observability
    ○ Best practices for implementing system-wide security for multi-cloud Kubernetes

CNCF Member webinar: Mitigating Kubernetes attacks

Wei Lien Dang, Head of Strategy @StackRox

  • It provides recommendations for protecting your cloud, on-premises, and hybrid Kubernetes deployments, covering the following points:
    ○ Key tactics and techniques you can expect attackers to use on Kubernetes clusters
    ○ The range of Kubernetes-specific and cloud-specific controls to apply
    ○ A prioritized list of mitigation steps you should apply to give you the broadest protection
  • StackRox has investigated each of the 40 documented attack vectors and created a series of detailed mitigation procedures that can be applied to protect a Kubernetes environment.

CNCF Member webinar: Using KubeVirt in telcos

Abhinivesh Jain, Distinguished Member of Technical Staff @Wipro

  • It describes the relevance of KubeVirt to telcos, focusing on current limitations and challenges from a telco adoption perspective.
  • There are instructions on how to use KubeVirt for hosting a Windows VM.

CNCF Member webinar: AWS controllers for Kubernetes — AWS services, now kubified!

Jay Pipes, Principal Open Source Engineer @Amazon Web Services

  • A video in which one of the creators of ACK (AWS Controllers for Kubernetes) at AWS explains its design and usage.
  • You can use the Kubernetes API to describe, in your Kubernetes manifests, not only Kubernetes objects but also the AWS resources your app depends on. As the title says, AWS services are “kubified”.
  • No CloudFormation behind the scenes
  • Not EKS specific, runs on any Kubernetes

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Ingress for Anthos — Multi-cluster ingress and global service load balancing

Gokul Chandra

  • An article explaining “Ingress for Anthos”, the Google Cloud-hosted multi-cluster ingress controller for Anthos GKE clusters. It is impressive how carefully the writer explains and illustrates everything from the reader’s point of view, so that readers can understand it and try it hands-on by themselves.

Installing Kubernetes Metrics Server securely

Neil Wilson, Brightbox

  • It explains several methods and key points for installing Kubernetes Metrics Server securely.
  • Kubernetes Metrics Server is a service that runs within Kubernetes and provides container resource metrics such as CPU and RAM usage.

How we moved to Github-based Kubernetes config management

Benjamin Yolken, Segment

  • An article introducing the history of GitHub-based Kubernetes config management at Segment, published alongside the release of “kubeapply”, a lightweight tool for git-based management of Kubernetes configurations.
  • kubeapply supports configurations in raw YAML, templated YAML, Helm charts, and skycfg, and facilitates a complete change workflow that includes configuration expansion, validation, diff generation, and application.

GSoC 2020 — Building operators for cluster addons

Somtochi Onyekwere

  • The story of the author participating in the Google Summer of Code and contributing to the cluster addons of Kubernetes.
  • It also made me feel that I have reached an age where I am happy to cheer on the positive challenges and achievements of younger students like this. (When I was a student, I was in the School of Political Science and Economics and mainly studied international law, so it was a completely different major.)

Detecting CVE-2020-14386 with Falco and mitigating potential container escapes

Kaizhe Huang, Sysdig

  • It explains the details of CVE-2020-14386, which was reported with severity “high” on 2020/09/04, and how to detect it with Falco and Sysdig Secure.

Containing a real vulnerability

Fabricio Voznika, gVisor

  • Following the announcement of the above vulnerability (CVE-2020-14386), this post notes that gVisor is not vulnerable to this particular issue and uses it as an interesting case study for continuing the examination of gVisor’s security.
  • Although gVisor is not affected by this vulnerability, the post also describes some of the steps taken to minimize the impact of vulnerabilities and to fix them when they are found.

Yes, you can run VMs on Kubernetes with KubeVirt

Bryant Son, Red Hat correspondent

  • It explains how to use KubeVirt via the locally runnable open source Kubernetes platform “Minikube”.
  • It has embedded videos that explain how to use KubeVirt, how to install Minikube, and the rest of the tutorial.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Grafana, with Torkel Ödegaard

Craig Box and Adam Glick, Kubernetes Podcast from Google

CommunityBridge Spotlight: Get the most out of the CommunityBridge program

Sonia Singla

  • The author, who graduated from the “Community Bridge Program” with the Thanos community through the Linux Foundation, describes their experience and offers suggestions, with plenty of emojis, to help future Community Bridge interns make the most of the internship.

Cloud native ecosystem feels COVID-19 crunch

Dan Meyer, SDxCentral

  • Software developers and the cloud-native community tend to be seen as an isolated environment unaffected by the outside world, but this article shows that the ongoing COVID-19 pandemic is affecting them too, using the release of Kubernetes 1.19 as an example. It cites the release being postponed to the end of August and the term “forcing function” used by Kelsey Hightower.

DevOps 049: DevOps, Open Source, and OpenShift with Chris Short

Adventure in DevOps Podcast

  • DevOps-themed podcast.
  • The guest is Chris Short, editor of KubeWeekly. He talks about OpenShift and the new streaming media channel that has been orchestrated over the past few months.

Ask the Product Manager Office Hours: Top 5 problems with Kubernetes and how we are fixing them

Mike Barrett and Chris Short, Red Hat

  • A YouTube video from Red Hat’s OpenShift team. Chris Short, editor of KubeWeekly, and Mike Barrett (Senior Director of Product Management) discuss the topic in the title. The story of Mike’s career path at the beginning was also interesting.

Air Force to demo updating software on a jet in flight, official says

Mila Jasper, Nextgov

  • Nextgov article from Government Executive Media Group.
  • A demonstration in which a US Air Force jet updates its software in flight is to be held within a few weeks (as of 9/15).

Upcoming CNCF webinars

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Member Webinar: VanillaStack as a platform for a truly vendor-agnostic open-source ecosystem
Karsten Samaschke, CEO @Cloudical
Sept 29, 2020 10:00 AM Pacific Time

Member Webinar: Effective disaster recovery strategies for Kubernetes
Rasheed Amir, CEO @Stakater AB
Sept 30, 2020 7:00 AM Pacific Time

Member Webinar: Self service Kubernetes for enterprises
Jim Bugwadia, Founder and CEO @Nirmata
Sept 30, 2020 10:00 AM Pacific Time

Member Webinar: Dapr, Lego for microservices
Mark Chmarny, Principal Program Manager @Microsoft
Oct 1, 2020 10:00 AM Pacific Time

Member Webinar: Transactional microservices — The final frontier
Daniel Kozlowski, Minister of Engineering @PlanetScale
Oct 2, 2020 10:00 AM Pacific Time

Member Webinar: Multi-Cluster & multi-cloud service mesh with CNCF’s Kuma and Envoy
Marco Palladino, CTO & Co-Founder @Kong
Oct 6, 2020 10:00 AM Pacific Time

Member Webinar: The evolution of cloud orchestration systems from ephemeral to persistent storage
Boyan Krosnov, CPO @StorPool
Oct 7, 2020 8:00 AM Pacific Time

Member Webinar: Kubernetes native two-level resource management for AI/ML workloads
Diana Arroyo, Software Engineer @IBM Research
Alaa Youssef, Manager, Container Cloud Platform @IBM Research
Oct 7, 2020 10:00 AM Pacific Time

Member Webinar: Building dynamic machine learning pipelines with KubeDirector
Tom Phelan, Fellow, Software Organization @Hewlett Packard Enterprise
Oct 8, 2020 10:00 AM Pacific Time

Member Webinar: A full application environment for every PR–before you merge to master!
Vishal Biyani, CTO @InfraCloud
Jono Spiro, Staff Software Engineer, Engineering Operations @OpenGov
Oct 14, 2020 10:00 AM Pacific Time

Member Webinar: S&P experience report: multi-cloud serverless on Knative
Evan Anderson, Software Engineer @VMware
Mark Wang, Head of Cloud Engineering @S&P Global Ratings
Oct 15, 2020 10:00 AM Pacific Time

Member Webinar: How to migrate NF or VNF to CNF without vendor lock-in
Grzegorz Sikora, VP Business Development @OVOO
Oct 20, 2020 10:00 AM Pacific Time

Member Webinar: Deploying Kubernetes to bare metal using cluster API
Seán McCord, Principal Senior Software Engineer @Talos Systems, Inc.
Oct 21, 2020 1:00 PM Pacific Time

Member Webinar: K8s audit logging deep dive
Randy Abernethy, Managing Partner @RX-M
Oct 22, 2020 10:00 AM Pacific Time

Member Webinar: Building 12 factor streaming data apps on Kubernetes
Stelios Charmpalis, Frontend Engineer
Francisco Perez, Senior Backend Engineer
Oct 23, 2020 10:00 AM Pacific Time

Member Webinar: Developer-friendly platforms with Kubernetes and infrastructure as code
Lee Briggs, Staff Software Engineer @Pulumi
Nov 6, 2020 10:00 AM Pacific Time

How about those articles? Are you interested in any of them?

To be honest, there is some content I cannot fully digest at this stage, so I’ll use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece (1986–1992). #Network, #Kubernetes, #CKA, #CKAD, #Certified AWS SAP
