SRE / DevOps / Kubernetes Weekly Collection #52 (Week 4, 2021)

  • In this blog post series, I go through the following three weekly mailing lists I subscribe to, and leave some comments as an aide-memoire along with useful links.
  • I have already published the same content on my Japanese blog, and I am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #526 January 24th, 2021
SRE Weekly Issue #254 January 24th, 2021
KubeWeekly #248 January 29th, 2021

DEVOPS WEEKLY ISSUE #526 January 24th, 2021


A post on the evolution of the relationship between development and security teams, proposing 4 levels of maturity.

  • The title is “Four levels of maturity that bridge the AppSec / engineering divide”.
  • It proposes integrating security work into continuous delivery (CD) as one of the most useful ways to successfully link security and engineering.
  • It describes the following four typical maturity levels that security and engineering organizations pass through when building a pipeline of continuous integration (CI) and automation.
    ○ Level 1: Security finds problems; Engineering fixes them
    ○ Level 2: Security and Engineering collaborate to produce test cases and remediations
    ○ Level 3: After the issue is fixed, Security and Engineering collaborate to find systemic fixes and develop checks
    ○ Level 4: Security and Engineering now also proactively look for new classes of issues and create systemic checks before an actual problem occurs

There are lots of interesting things about the rise of ARM for server workloads, but one that will likely drive adoption is price/performance. This post looks at a series of PostgreSQL benchmarks.

  • The title is “PostgreSQL on ARM-based AWS EC2 Instances: Is It Any Good?”.
  • Following the announcement of AWS’s second-generation, Graviton2-based EC2 instances in May 2020, the authors tested PostgreSQL on ARM-based EC2 instances, as the title suggests.

A good web performance case study, with lots of examples, discussion of tools, code samples and improvements made.

  • The title is “How We Improved Smashing Mag Performance”.
  • This blog post details efforts to improve web pages running on the JAMstack with React. The team optimized web performance and improved Core Web Vitals metrics.
  • Core Web Vitals is a subset of Web Vitals. Web Vitals, announced by Google in 2020, provides unified guidance on the quality signals that are essential to delivering a great user experience on the web.

Rust is picking up lots of interest recently, especially for systems work or low-level CLI tooling. But it might not be suitable, as a language or an ecosystem, yet for higher-level work like web development and APIs.

  • The title is “Rust is a hard way to make a web API”.
  • After touching on Rust’s strengths at the beginning, the author explains the struggles the title alludes to, based on first-hand experience.

Lots of software benefits from a custom installer, but what makes for a good user experience for this kind of software? This post shares some thoughts and examples.

  • The title is “Design choices for a declarative installer”.
  • If you only need to install, upgrade, and remove a set of Kubernetes components, you can use the off-the-shelf software below, depending on your target environment, but integrating multiple components still requires tweaking. The post notes at the beginning that the sheer amount of configuration can cause frustration and errors, and become a nightmare to deal with.
    ○ For Kubernetes apps there is Helm and Continuous Delivery systems like Argo that can manage applications lifecycle described simply in naked yaml.
    ○ For pure operators there’s Operator Lifecycle Manager (OLM).
    ○ For more general infrastructure there is Terraform.
  • It presents the design choices that led the authors to their current approach to solving the above problems and creating a hopefully better user experience.

I think we’re starting to see a new rise of distributions, several related tools that solve a larger problem when combined together. Konveyor is a new project focused on migration, with tools for moving between and to Kubernetes, moving virtual machines and more.

  • As the Editor mentioned above, this is the web page of the “Konveyor” community, which focuses on tool-based migration. They are working on the following tools.
    ○ crane: Migrate namespaces between Kubernetes clusters.
    ○ forklift: Migrate virtual machines to KubeVirt.
    ○ move2kube: Migrate from Cloud Foundry or Docker Swarm to Kubernetes.
    ○ pelorus: Measure the four critical measures of software delivery performance.
    ○ windup: Analyze applications for modernization paths.


Cinc is a community project to build a free distribution of the Chef software stack (currently including the Infra, Workstation and Inspec tools), released under an Apache 2.0 license.

  • The web page of the “Cinc” project, which has the following two goals.
  1. Making Chef Software Inc’s open source products easily distributable, by anyone
  2. Creating free distributions of Chef Software Inc’s open source products
  • The phrase “CINC is not Chef” under the logo is reminiscent of “YAML Ain’t Markup Language”.

Web Assembly is a low level technology which is likely to have wide ranging influence. A good example of the kinds of innovation it makes possible are things like Artichoke, a new Ruby language which compiles to a WASM binary.

PolicyHub CLI is a CLI tool that provides a simple discovery engine for finding useful Rego policies for Open Policy Agent.

  • The GitHub page of “PolicyHub CLI”, which gives policy creators a standard format for sharing policies so that they become searchable.

Biome is a community distribution of Chef Habitat released under the Apache 2.0 license.

SRE Weekly Issue #254 January 24th, 2021


Coinbase Incident Post Mortem: January 6–7, 2021

This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then when the traffic came flooding back, the back-end was ill-prepared. We can all learn from this.


  • As stated in the title, this is Coinbase’s post-mortem, now updated to the full version. It details the causes of the downtime, how it was remediated, and the steps taken to prevent similar outages.
  • The outage affected coinbase.com and the API used to serve mobile apps, but did not affect exchange trading through the API or the health of the underlying market.
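
As an aide-memoire for the scale-down effect the Editor describes, here is a minimal toy sketch. It is not Coinbase’s actual system: the scaling rule, the one-tick lag, the capacity of 100 requests per replica, and the traffic numbers are all assumptions made up for illustration.

```python
import math


def desired_replicas(requests_per_sec, capacity_per_replica, min_replicas=1):
    """Naive rule (assumed): keep just enough replicas for the load seen right now."""
    return max(min_replicas, math.ceil(requests_per_sec / capacity_per_replica))


def simulate(traffic, capacity_per_replica=100, min_replicas=1):
    """Return one (replicas, dropped_requests) pair per tick.

    The replica count used in a tick was decided from the previous tick's
    load, which models the back-end reacting late to what the front-end sends.
    """
    replicas = desired_replicas(traffic[0], capacity_per_replica, min_replicas)
    history = []
    for load in traffic:
        served = min(load, replicas * capacity_per_replica)
        history.append((replicas, load - served))
        # Scale for the next tick based on the load observed in this tick.
        replicas = desired_replicas(load, capacity_per_replica, min_replicas)
    return history


if __name__ == "__main__":
    # Hypothetical traffic: normal, then the front-end is blocked, then it recovers.
    for tick, (replicas, dropped) in enumerate(simulate([1000, 1000, 50, 50, 1000, 1000])):
        print(f"tick={tick} replicas={replicas} dropped={dropped}")
```

In this toy model, the tick where traffic returns is served by the scaled-down replica count, so most of the requests in that tick are dropped. A higher `min_replicas` floor is one knob that softens the effect here.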

Soar: Simulation for Observability, reliAbility, and secuRity

Cloudflare has what amounts to a sophisticated staging environment for testing new code.

Yan Zhai — Cloudflare

  • It describes simulation, one of the techniques used to fight software complexity.
  • Cloudflare’s simulation system “SOAR”, which also appears in the title, is described as follows.
    ○ Simply put, it’s a data center built specifically for simulations. It runs the same software stack as our production data centers, but without any production traffic. Within SOAR, there are end-user servers, product servers, and origin servers (Figure 2). The product servers behave exactly the same as servers in our production edge network, and they are the targets that we want to test.

Failing to make progress under excess request load

Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.

Rachel By the Bay

  • The author shares a problem she experienced. What happened is as described in the title and in the Editor’s comment above.

Google Cloud Issue Summary — Google Meet — 2021–01–08

Here’s Google’s follow-up on a Google Meet outage earlier this month.


  • A summary of the outage in the title. During the Google Meet outage, the landing page could not be accessed. When the new landing page was released, a redirect was set up between the old and new landing pages, and redirect loops occurred there.
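
To make the redirect-loop idea concrete for myself, here is a minimal sketch that follows a table of redirects and reports a loop if one exists. It is only an illustration, not Google’s setup: the function name and the `/old` and `/new` paths are hypothetical.

```python
def find_redirect_loop(redirects, start):
    """Follow redirects from `start`; return the looping path, or None if the chain ends.

    `redirects` maps a source path to the path it redirects to.
    """
    seen = []
    url = start
    while url in redirects:
        if url in seen:
            # We came back to a path we already visited: report the cycle.
            return seen[seen.index(url):] + [url]
        seen.append(url)
        url = redirects[url]
    return None


if __name__ == "__main__":
    # Hypothetical rules: the old page points to the new one, and a leftover
    # rule on the new page points back to the old one.
    print(find_redirect_loop({"/old": "/new", "/new": "/old"}, "/old"))
    print(find_redirect_loop({"/old": "/new"}, "/old"))
```

A check like this could run against the redirect rules before a rollout, although whether that would have caught this particular outage is not something the summary tells us.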

The Next Gen Database Servers Powering Let’s Encrypt

Those are some seriously big database servers.

Josh Aas and James Renken — Let’s Encrypt

  • It explains that Let’s Encrypt achieved satisfactory results with the database server upgrade carried out in late 2020.

Incident Management in 2021: from Basics to Best Practices

A great general overview of all aspects of incident response, including definitions and best practices.

Better Uptime

  • The article covers the topic under two headings, “5 parts of the incident management process” and “5 steps to a bulletproof incident management process”, and goes through the following best practices.
  1. Best incident monitoring practices
  2. Best on-call practices
  3. Best incident alerting practices
  4. Best incident communication practices
  5. Best incident response practices

Using GPT-3 for plain language incident root cause from logs

Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.

Larry Lancaster — Zebrium

  • It gives a glimpse of what the author’s team has done with OpenAI’s GPT-3 language model. It shares a couple of straightforward results obtained with basic prompts only.

Taming Operational Load with VMware CRE

The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.

Gustavo Franco — VMware

  • VMware’s CRE (Customer Reliability Engineering) team describes the following results from a recently completed operational load assessment.
    ○ As a result, we significantly reduced that load, improved our team well-being, and increased the amount of spare time and energy we have to invest in reliability engineering projects to improve Tanzu.

Slack RCA for outage on January 4, 2021

This is Slack’s RCA for their outage earlier this month. This is a great example of a complex incident with many contributing factors — certainly no single “root cause” here.

  • As the title suggests, this is the final version of Slack’s RCA for the outage.
  • As the Editor commented above, there are many items under “Corrective Actions” because the outage was caused by multiple factors and has no single root cause. Some of the corrective actions are being taken in cooperation with the cloud provider.

KubeWeekly #248 January 29th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Maintainer Spotlight: Kevin Wang of KubeEdge and Volcano

CNCF blog

This month’s spotlight focuses on Kevin Wang, a contributor in the CNCF community since its beginning, leader of the cloud native open source team at Huawei, and co-founder of the KubeEdge and Volcano projects. Read the blog to learn more about Kevin’s experience with the CNCF community over the past five years.

  • Kevin Wang, who is in the spotlight this time, is also running in the CNCF TOC election. When I checked the GitHub page of TOC Elections for 2021, the schedule was as follows, so the results should come out soon.
    ○ Election closes Feb 1, announced at noon

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Operator integration testing for Operator Lifecycle Manager

Taneem Ibrahim, Red Hat

  • As the title suggests, it describes the steps required to test an Operator’s integration with OLM (Operator Lifecycle Manager). The demo uses a simple Operator that outputs test messages to the shell.
  • The tools required for the local development environment used in this hands-on are as follows. It also suggests using a free Red Hat account.
    ○ Red Hat CodeReady Containers (CRC)
    ○ Podman, or a Docker daemon process running on the local machine
    ○ Operator SDK toolkit, v1.0.0 or higher (optional)
    ○ Operator Package Manager (OPM)
    ○ OpenShift Container Platform, cluster version 4.5 or higher

Kubernetes and GitOps with Flux CD V2.0


  • It walks through the title’s topic hands-on, following the official instructions. The author had been recommended to try Flux CD, and a colleague had started a good reference project, k8s-gitops. Because he wanted to fully understand how to use Flux CD, he chose to start from scratch with the official instructions, and it did not take him long to fully enable GitOps on his cluster.

Kubernetes at scale using Rancher Fleet

Saiyam Pathak, Civo

  • A YouTube webinar video explaining the content of the title.
  • The speaker, Saiyam Pathak, is a CNCF Ambassador and Director of Technical Evangelism at @civocloud. He energetically publishes livestreamed interviews from events and webinar videos like this one, which make good references, so I have now subscribed to his YouTube channel.

Database Migrations Using Screwdriver and Kubernetes

Zhongkai Liu, Software Dev Engineer II & Palash Agrawal, Principal Software Dev Engineer

  • It explains the differences between the previous and current DB migration processes at Yahoo Sports. As a tool, they use Screwdriver, an open source CD (continuous delivery) platform.
  • It clarifies that, in this article, the term “migration” denotes any change made to the database, including but not limited to inserting or deleting tables, populating data into the database, or removing entries from it.
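
To remind myself what “migration” means in this broad sense, here is a minimal sketch of versioned migrations using Python’s built-in `sqlite3` module. It is not the Screwdriver-based process from the article: the `schema_version` table, the `teams` table, and the `apply_migrations` helper are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical migrations: one schema change and one data change,
# both of which count as "migrations" under the article's definition.
MIGRATIONS = [
    (1, "CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "INSERT INTO teams (name) VALUES ('example')"),
]


def apply_migrations(conn, migrations):
    """Apply migrations not yet recorded in schema_version; return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue  # Already applied in an earlier run.
        with conn:  # Commit on success, roll back on error.
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (version,)
            )
        newly_applied.append(version)
    return newly_applied


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(apply_migrations(connection, MIGRATIONS))  # first run applies both
    print(apply_migrations(connection, MIGRATIONS))  # second run applies nothing
```

Recording the applied versions is what makes a second run a no-op, which is the property a CD pipeline would typically rely on when it reruns a migration job.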

Firecracker: start a VM in less than a second

Julia Evans

  • The author explains how to use Firecracker from a more DIY, “I just want to run some VMs” perspective.
  • When she first read about Firecracker being released, she thought it was just a tool for cloud providers. At the beginning, she explains that the following points turned out to be true.
    ○ Firecracker is relatively straightforward to use (or at least as straightforward as anything else that’s for running VMs)
    ○ The documentation and examples are pretty clear
    ○ You definitely don’t need to be a cloud provider to use it
    ○ As advertised, it starts VMs really fast!

Scaling Kubernetes to 7,500 Nodes


  • They share their findings from using Kubernetes as a very flexible platform to meet their researchers’ needs.
  • The following two items are listed as unsolved problems. Migration work seems to be underway for the metrics problem, and blog posts on the results are planned.
    ○ Metrics
    ○ Pod network traffic shaping

Docker security scanning cheat sheet 2021

Jim Armstrong, Snyk

How to unit-test your helm charts with Golang

Alistair Hey

Create Kubernetes federated clusters on AWS

Theo “Bob” Massard

  • It mentions that AWS recently introduced a new solution for orchestrating federated EKS clusters, and it starts by explaining KubeFed (Kubernetes Cluster Federation), on which this solution is based.

Self-Service Velero Backups with Kyverno

Ritesh Patel, Nirmata

  • It explains how to enable developer self-service backups in Velero using the new CNCF sandbox project, Kyverno.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

CNCF Live webinar: Kubernetes 1.20

Jeremy Rickard, VMware and Kirsten Garrison, Red Hat

  • The release team details Kubernetes 1.20 with new features and important deprecations.
  • Kubernetes 1.20 is one of the largest releases with over 40 different enhancements.

This Week in Cloud Native: Cloud Native Infrastructure in the Data Center with Cluster API & Tinkerbell (CAPT) (livestream)

Jason DeTiberus, Equinix and Manny Mendez, Equinix

  • Based on the following challenges, they explain how to use the Cluster API and Tinkerbell to introduce real cloud-native infrastructure management to your data center.
    ○ Up until now managing Kubernetes infrastructure outside of cloud providers has been difficult, and while there have been attempts to ease management of Kubernetes clusters within the data center previously we feel those attempts have been focused mostly on trying to shoehorn the management of clusters into legacy practices.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Siri, Storage, and Solutions, with Josh Bernstein

Craig Box, Kubernetes Podcast from Google

Nvidia Views Kubernetes as Key to GPU Accelerated AI Scale

Tobias Mann, SDxCentral

  • I learned a lot from this one because I had had no prior contact with the topic in the title. I found the following points in the article worth reading.
    ○ He explained that Nvidia’s work in this arena has been somewhat drowned out by webscale applications which have been and remain the primary use case for Kubernetes. However, Lamb argues there is a huge potential for GPU-accelerated Kubernetes clusters in artificial intelligence (AI) workloads, an arena where Nvidia has long dominated.
    ○ Looking to the future, Lamb expects GPUs will begin to move into the mainstream of Kubernetes, especially as “AI serving becomes a GPU-accelerated workload, which is just at the inflection point of taking off.”
    ○ “As things expand, I think most people are going to be able to just think about GPU accelerated as a fast button or an efficient button and not have to think about GPU development or programming,” he added.

Closed Box Monitoring, the Artist Formerly Known as Black Box Monitoring

Rick Rackow, Red Hat

Announcing Vitess 9

Vitess team

  • As the title suggests, this is a CNCF article announcing the release of Vitess 9.
  • The article presents the following items as Major Themes.
    ○ Compatibility (MySQL, frameworks)
    ○ Migration
    ○ Innovation
    ○ Documentation
  • See the Release Notes for details.

Mentorship Spotlight: CommunityBridge Mentee with Keptn

CNCF blog

  • An article in which the author reports completing the CommunityBridge program as a mentee with Keptn, a CNCF sandbox project.
  • The program name has changed from “Community Bridge program” to “LFX Mentorship program”.

# 63 — From Prometheus to Thanos with Simon Pasquier (in French)

Electro Monkeys podcast

  • The podcast is in French. They talk about what a project like Thanos brings to Prometheus, how it works, and what it does.

What is GitOps?

Salman Iqbal

  • A 6-minute webinar video on YouTube that explains the principles of GitOps and its benefits.

The Cloud Native Landscape: The Application Definition and Development Layer

Catherine Paganini, Buoyant and Jason Morgan, VMware

  • This post is part of an ongoing series from the Cloud Native Computing Foundation Business Value Subcommittee co-chairs, Catherine Paganini and Jason Morgan, that explains each category of the cloud native landscape to a non-technical audience as well as engineers just getting started with cloud native computing.
  • This post describes the Application Definition and Development layer of the cloud native landscape. The next article will focus on cloud-native platforms.

Kubernetes Begins Year With A Bang — And You Can Expect More

Chris Metinko, Crunchbase

  • An article that describes investments and acquisitions in the Kubernetes ecosystem, which has already been in motion since the beginning of 2021, as well as future forecasts.

Linkerd User Survey 2021 - Take the survey

  • A one-page survey on Google Forms. Check it out if you are using or interested in Linkerd.

Upcoming CNCF Online Programs

This Week in Cloud Native: Kubernetes Policies-as-code
Jim Bugwadia, Nirmata
February 3, 2021 at 11:00 am PT
Register Now

CNCF On-demand Webinar: Policy as Code to Manage Security Risk in Kubernetes Before and After Deployment
Cesar Rodriguez, Accurics
February 4, 2021
Register Now

For more information, please visit our updated Online Programs page.

  • On the page linked above, the sections “Future events”, “Past events”, and “Organizer” have been created, and upcoming events are listed.
  • Since a Group has been created for CNCF Online Programs, I registered as its 8th member.

What did you think of these articles? Did any of them catch your interest?

To be honest, there is some content I cannot fully digest at this stage, so I will also use this aide-memoire and its links to catch up myself.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #GCP, #Certified AWS SAP
