SRE / DevOps / Kubernetes Weekly Collection #50 (Week 2, 2021)

  • In this blog post series, I collect the following three weekly mailing lists that I subscribe to, adding short comments as an aide-memoire along with useful links.

DEVOPS WEEKLY ISSUE #524 January 10th, 2021
SRE Weekly Issue #252 January 10th, 2021
KubeWeekly #246 January 15th, 2021

DEVOPS WEEKLY ISSUE #524 January 10th, 2021


A post on different sources of on-going maintenance, and some discussion of ways to improve the situation.

  • The title is “Software is drowning the world”.

As development teams grow, change becomes harder. This post looks at an approach for addressing high impact changes that are spread out across teams, and how to get buy in.

  • The title is “Campaigns”.

Discussion of the evolution of frameworks in software development, in particular looking at how AWS itself can be considered a framework, providing primitives for logging, events, scaling, monitoring and more.

  • The title is “AWS as a Framework”.

FOSDEM is going virtual this year on the 6th and 7th February, and lots of the devrooms have announced sessions. I’m particularly looking forward to the software composition devroom.

  • The web page of “FOSDEM (Free and Open source Software Developers’ European Meeting)”, a two-day online event organized by volunteers to promote the widespread use of free and open source software. The link above introduces each track.

Lots of organisations will have a bunch of Perl code busily servicing critical needs. The Perl Foundation is looking for input on what they can do to better support the community.

  • The title is “Coding in Perl? What support do you need?”.

A walkthrough of setting up a build and deployment pipeline for AWS ECS using Terraform, Terragrunt and GitHub Actions.

  • The title is “CI / CD Workflow for AWS ECS via Terragrunt and GitHub Actions”.

A great resource for learning Google Cloud Platform, this repo contains comprehensive sketchnotes covering all of the main GCP services.

  • As described above, this is a GitHub repository of sketchnotes covering the main GCP services. One of the topics is “Next 2020 Summary Announcements”; a summary of the services announced at an event like that is nice to have.

A good reading list for anyone moving into more management roles in software.

  • The title is “Recommended Engineering Management Books”.

A how-to for setting up Kubernetes to use AWS EC2 spot instances to reduce cost and maintain a zero-downtime cluster as instances come and go.

  • The title is “Run Kubernetes Production Environment on EC2 Spot Instances With Zero Downtime: A Complete Guide”.

SRE Weekly Issue #252 January 10th, 2021


Building On-Call Culture at GitHub

Their on-call started out as four 24 hour shifts per person interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

  • The article covers the topic in its title under the following major headings.
    ○ Monolithic On-Call
    ○ New On-Call Culture
    ○ Continuing the Journey
  • The expression “Monolithic On-Call” and the hurdles described from various perspectives were interesting.
  • I found the text a little too dense. More line breaks would make it easier to read.

Google Cloud Issue Summary — Google Meet — 2020–12–14

A new Meet version had a higher storage usage requirement, and a backend system filled up.


  • A summary of the Google Meet outage on 2020-12-14 from 08:20 AM to 11:36 AM (PST). The cause was that storage usage surged when new features were released, depleting the resources of one data store. The recurrence prevention measures are as follows.
    ○ Review alerting processes to improve detection of data store capacity issues
    ○ Adjust automated monitoring system logs to be more concise and exact to assist in troubleshooting
    ○ Evaluate existing troubleshooting processes to determine available improvements to mitigation and resolution times.

WTF is Alert Fatigue

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times Jamie Dobson — Container Solutions

  • Since it is covered in DEVOPS WEEKLY ISSUE #524 above, I will skip it.

Announcing the Security Chaos Engineering Report

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica Kelly Shortridge — Capsul8

  • The first article in a series of multiple free O’Reilly reports.

Little Known Ways to Better Use Your Error Budgets

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

  • The article explains how error budgets can be useful to teams across the organization, beyond departmental boundaries, such as QA, legal, and executives. It also touches on how engineers can use error budgets beyond development planning. The items are as follows.
    ○ Legal teams can use error budgets as early warnings
    ○ Executives can use error budgets to take the pulse of development
    ○ Error budgets and SLOs elevate the role of QA
    ○ Error budgets provide objectivity for experimentation
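As a reminder to myself of the arithmetic behind an error budget, here is a minimal Python sketch. It is not taken from the Blameless article; it assumes a simple request-based SLO, and the function name and the sample numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Assumption: the SLO is request-based, e.g. 0.999 means 99.9% of
    requests in the window must succeed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

A single number like this is the kind of figure that non-engineering teams such as legal or executives could follow without reading raw monitoring data.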

Lessons learned in incident management

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — DropBox

  • The lessons learned at Dropbox in incident management are divided into the following six items and explained in detail.
  1. Background
  • The authors hope this article will serve as a case study of how to systematically understand your own organization’s incident response and evolve it to meet user needs.

GitHub Availability Report: December 2020

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

  • The December 2020 edition of GitHub’s monthly Availability Report, which I have covered several times in this blog.


KubeWeekly #246 January 15th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

CNCF Security Whitepaper Shows the Complexity of Securing Cloud Native Operations

Jack Wallen, The New Stack

Jack Wallen of The New Stack dives into CNCF’s Security whitepaper that focuses on the security of cloud native applications and highlights key learnings. The whitepaper discusses everything from cloud native layers, to the full lifecycle of development, to compliance (and everything in between).

  • It introduces the “Security whitepaper” released by CNCF.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Analyze Kubernetes files for errors with KubeLinter

Jessica Cherry,

  • An article describing KubeLinter, an open source project released by StackRox for finding security issues and errors in YAML files. Red Hat has announced that it signed a definitive agreement to acquire StackRox.

Getting started with Buildah

Cedric Clyburn, Red Hat

  • As the title suggests, it explains how to get started with “Buildah”. A YouTube video of the interactive session is also embedded in the web page.

Isolate a Pod in Kubernetes

Salman Iqbal

  • Continuing from last week, this is another YouTube video from Salman’s webinar series, which describes the behavior of each Kubernetes component. It is easy to watch because it runs only about 10 minutes.

Build Your Kubernetes Operator With the Right Tool

Alex Handy, Red Hat

  • It surveys the options currently available for building Kubernetes Operators and describes different approaches to simplifying the choice for your use case.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Sysdig 2021 container security and usage report: Shifting left is not enough

Aaron Newcomb, Sysdig

  • As the title suggests, this is the fourth annual report by Sysdig. It also details metric usage, popular alerts, container density trends, and Kubernetes usage patterns.

Vertical Pod Autoscaling: The Definitive Guide

Povilas Versockas

  • As the author’s “Definitive / Complete guide” label suggests, vertical scaling of pods is explained comprehensively under the following items. An article worth rereading.
    ○ Why do we need Vertical Pod Autoscaling?
    ○ Kubernetes Resource Requirements Model
    ○ What is Vertical Pod Autoscaling?
    ○ Understanding Recommendations
    ○ When to use VPA?
    ○ VPA Limitations
    ○ Real-World Examples
    ○ How does VPA work?
    ○ VPA’s Recommendation model
    ○ Lots more

What’s Your Kubernetes Maturity?

Danielle Cook, Fairwinds

  • It provides an end-to-end overview of the Kubernetes journey, the phases it passes through, and the Kubernetes Maturity Model, which describes the skills and activities you need to learn or perform in each phase.

Upcoming CNCF Online Programs

We have expanded our webinar program to Online Programs! Visit our website for the latest updates.

  • I checked the link, and “Upcoming webinars” showed “No Results Found” as of January 16, 2021, so this year’s webinars seem to be still awaiting an update.

What did you think of these articles? Did any of them catch your interest?

To be honest, there is some content here that I cannot fully digest yet, so I will use this aide-memoire and its links to catch up myself too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #GCP, #Certified AWS SAP
