SRE / DevOps / Kubernetes Weekly Collection #50 (Week 2, 2021)

Jan 19, 2021

In this blog post series, I collect the following three weekly mailing lists that I subscribe to, and add some comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #524 January 10th, 2021
SRE Weekly Issue #252 January 10th, 2021
KubeWeekly #246 January 15th, 2021

DEVOPS WEEKLY ISSUE #524 January 10th, 2021

News

A post on different sources of on-going maintenance, and some discussion of ways to improve the situation.

The title is “Software is drowning the world”.
Having worked in many organizations, the author can see what they have in common, and he discusses “technical debt” from the following perspective.
○ “Every time you decide to solve a problem with code, you are committing part of your future capacity to maintaining and operating that code. Software is never done.”

As development teams grow, change becomes harder. This post looks at an approach for addressing high impact changes that are spread out across teams, and how to get buy in.

The title is “Campaigns”.
The author proposes a tool called a “Campaign”.
It is presented as a tool / framework for coordinating large groups of people, holding groups accountable, and ultimately succeeding at paying off technical debt, making architectural changes, improving the customer experience, reducing costs, and more. According to the post, a Campaign requires the following.
○ A Goal
○ Metrics toward that goal
○ Buy-in
○ Method of Accountability
○ A “Window”
○ A Target Date

Discussion of the evolution of frameworks in software development, in particular looking at how AWS itself can be considered a framework, providing primitives for logging, events, scaling, monitoring and more.

The title is “AWS as a Framework”.
It argues the point made in the title from the following viewpoint, aiming to justify both the idea of AWS as a framework and its unique potential when fully utilized.
○ AWS doesn’t sound like an “infrastructure” provider anymore, not even a “platform” provider. It sounds like a framework!

FOSDEM is going virtual this year on the 6th and 7th February, and lots of the devrooms have announced sessions. I’m particularly looking forward to the software composition devroom.

This is the web page of “FOSDEM (Free and Open source Software Developers’ European Meeting)”, a two-day event, held online this year, organized by volunteers to promote the spread of free and open source software. The above link introduces each track.
There is also a link to “the software composition devroom” that the Editor is interested in.
○ It is usually held in Brussels (Belgium), and the site says “FOSDEM is widely recognized as the best such conference in Europe.”

Lots of organisations will have a bunch of Perl code busily servicing critical needs. The Perl Foundation is looking for input on what they can do to better support the community.

The title is “Coding in Perl? What support do you need?”.
They are running a survey that takes only a few minutes, to find out what support engineers want or need when moving to Perl or progressing within Perl.
The survey runs throughout January and the results will be announced at FOSDEM, mentioned above.

A walkthrough of setting up a build and deployment pipeline for AWS ECS using Terraform, Terragrunt and GitHub Actions.

The title is “CI / CD Workflow for AWS ECS via Terragrunt and GitHub Actions”.
The article walks through the following steps, with color-coded figures and code that are easy to follow.
○ Initial Setup
○ Workflow via GitHub Flow
○ Configure Infrastructure and Deployment Targets
○ Configure Container Environment and Secrets
○ Integration via GitHub Actions — Pytest
○ Deployment via GitHub Actions — Terragrunt
○ Conclusion

A great resource for learning Google Cloud Platform, this repo contains comprehensive sketchnotes covering all of the main GCP services.

As mentioned above, this is a GitHub repository of sketchnotes covering the main GCP services. One of the topics is “Next 2020 Summary Announcements”; it is nice to have a summary of the services announced at such an event.

A good reading list for anyone moving into more management roles in software.

The title is “Recommended Engineering Management Books”.
It introduces books recommended by the author, who has been an engineering manager for the past three and a half years.
After more than 10 years as a professional software engineer, the author took on the whole new challenge of growing as an engineering manager. This is a carefully selected list of books that helped and influenced the author in that process, and that the author highly recommends to engineering managers.
Below is the list of books. The author explains the good points of each from actual experience.
○ The Manager’s Path: A Guide for Tech Leaders Navigating Growth & Change by Camille Fournier
○ Thanks for the Feedback by Douglas Stone & Sheila Heen
○ The Hard Thing About Hard Things: Building a Business When There are No Easy Answers by Ben Horowitz
○ Accelerate: Building and Scaling High Performing Technology Organizations by Nicole Forsgren, PhD, Jez Humble, and Gene Kim
○ Dare to Lead: Brave Work. Tough Conversations. Whole Hearts. by Brene Brown
○ Switch: How to Change Things When Change is Hard by Chip Heath & Dan Heath
○ Atomic Habits: An Easy & Proven Way to Build Good Habits by James Clear

A how to for setting up Kubernetes to use AWS EC2 spot instances to reduce cost and maintain a zero-downtime cluster as instances come and go.

The title is “Run Kubernetes Production Environment on EC2 Spot Instances With Zero Downtime: A Complete Guide”.
I will skip it because it was covered in KubeWeekly #245 last week.

Events

3 a.m. wake ups are for heart surgeons, newborn parents… and SREs. Sarah Wells from the Financial Times is on a mission with Jamie Dobson, CEO of Container Solutions to keep good night rests sacred. Join them in the latest WTFinar, on Alert Fatigue and how to manage it. Sign up here:

It introduces a webinar on the theme of “Alert Fatigue”.
It was scheduled for 1/14 (Thursday) at 11:00 CET (Central European Time).

SRE Weekly Issue #252 January 10th, 2021

Articles

Building On-Call Culture at GitHub

Their on-call started out as four 24 hour shifts per person interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

The article is organized under the following major headings.
○ Monolithic On-Call
○ New On-Call Culture
○ Continuing the Journey
The expression “Monolithic On-Call” and the hurdles described from various perspectives were interesting.
I think the text is a little too dense; I would like it to be easier to read, with more line breaks.

Google Cloud Issue Summary — Google Meet — 2020–12–14

A new Meet version had a higher storage usage requirement, and a backend system filled up.

Google

A summary of the incident that occurred on Google Meet on 2020-12-14 from 08:20 AM to 11:36 AM (PST). Storage usage surged when a new feature was released, which exhausted the resources of one data store; that was the cause. The recurrence prevention measures are as follows.
○ Review alerting processes to improve detection of data store capacity issues
○ Adjust automated monitoring system logs to be more concise and exact to assist in troubleshooting
○ Evaluate existing troubleshooting processes to determine available improvements to mitigation and resolution times.
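
The first measure above, detecting data store capacity issues earlier, can be illustrated with a rough sketch. This is not Google's implementation; it assumes a simple linear projection of daily usage samples, and the function names, the gigabyte units, and the 14-day headroom threshold are all hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate the days left before the store is full.

    Uses the average daily growth between the first and last sample
    (an assumed, deliberately simple model). Returns None when usage
    is flat or shrinking, because no fill date can be projected.
    """
    if len(daily_usage_gb) < 2:
        raise ValueError("need at least two daily samples")
    latest = daily_usage_gb[-1]
    if latest >= capacity_gb:
        return 0.0  # already full
    growth_per_day = (latest - daily_usage_gb[0]) / (len(daily_usage_gb) - 1)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - latest) / growth_per_day


def should_alert(daily_usage_gb: Sequence[float], capacity_gb: float,
                 min_days_of_headroom: float = 14.0) -> bool:
    """Alert when the projected fill date is closer than the headroom threshold."""
    remaining = days_until_full(daily_usage_gb, capacity_gb)
    return remaining is not None and remaining < min_days_of_headroom


if __name__ == "__main__":
    usage = [100, 110, 120]  # hypothetical samples, one per day
    print(days_until_full(usage, capacity_gb=200))
    print(should_alert(usage, capacity_gb=200))
```

A projection like this only catches steady growth. A sudden jump caused by a new release, as in this incident, would also need an alert on the rate of change itself.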

WTF is Alert Fatigue

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times Jamie Dobson — Container Solutions

Since it is covered in DEVOPS WEEKLY ISSUE #524 above, I will skip it.

Announcing the Security Chaos Engineering Report

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica Kelly Shortridge — Capsul8

The first article in a series of multiple free O’Reilly reports.
After opening with the following line, the authors outline the report while touching on Security Chaos Engineering (SCE), SCE’s core tool “ChaoSlingr”, and so on.
○ Hope isn’t a strategy. Likewise, perfection isn’t a plan.

Little Known Ways to Better Use Your Error Budgets

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

The following items explain how error budgets can be useful for teams across departmental boundaries throughout the organization, such as QA, legal, and executives. The article also touches on how engineers can use error budgets beyond development planning.
○ Legal teams can use error budgets as early warnings
○ Executives can use error budgets to take the pulse of development
○ Error budgets and SLOs elevate the role of QA
○ Error budgets provide objectivity for experimentation
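
As a reminder of the arithmetic behind an error budget, here is a minimal sketch. It assumes a request-based availability SLO; the function names and the idea of freezing releases once the budget is fully consumed are illustrative assumptions, not something taken from the article.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute a request-based error budget for one SLO window.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests that may fail without breaking the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }


def release_allowed(budget: dict, freeze_above: float = 1.0) -> bool:
    """Hypothetical policy: allow releases until the consumed fraction reaches the limit."""
    return budget["consumed_fraction"] < freeze_above


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
    budget = error_budget(0.999, 1_000_000, 250)
    print(budget)
    print(release_allowed(budget))
```

The consumed fraction is a single number that non-engineering teams can follow, which is what makes it usable as an early warning or a pulse check in the ways the article lists.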

Lessons learned in incident management

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — DropBox

The lessons learned at Dropbox in incident management are divided into the following six items and explained in detail.

○ Background
○ The SEV process
○ Detection
○ Diagnosis
○ Recovery
○ Continuous improvement

The authors hope this article will serve as a case study for systematically understanding your own organization’s incident response and evolving it to meet user needs.
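
The idea of filing an incident automatically from an availability drop, without an engineer choosing the severity, can be sketched roughly as follows. This is not DropSEV's actual logic; the thresholds, severity labels, and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds: (minimum availability drop in percentage points, severity),
# ordered from most to least severe. The values are illustrative only.
SEVERITY_THRESHOLDS = [(10.0, "SEV1"), (5.0, "SEV2"), (1.0, "SEV3")]


@dataclass
class Incident:
    service: str
    severity: str
    drop: float  # availability drop in percentage points


def classify_drop(baseline: float, current: float) -> Optional[str]:
    """Map an availability drop to a severity, or None if it is not incident-worthy."""
    drop = baseline - current
    for min_drop, severity in SEVERITY_THRESHOLDS:
        if drop >= min_drop:
            return severity
    return None


def maybe_file_incident(service: str, baseline: float, current: float) -> Optional[Incident]:
    """Create an incident record automatically when the drop crosses a threshold."""
    severity = classify_drop(baseline, current)
    if severity is None:
        return None
    return Incident(service=service, severity=severity, drop=baseline - current)


if __name__ == "__main__":
    print(maybe_file_incident("example-service", baseline=99.9, current=93.0))
    print(maybe_file_incident("example-service", baseline=99.9, current=99.5))
```

Encoding the thresholds as data keeps the severity decision consistent and reviewable, instead of leaving it to a judgment call made on the fly during an incident.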

GitHub Availability Report: December 2020

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

This is the December 2020 edition of GitHub’s monthly Availability Report, which I have covered several times in this blog.
In December there were no incidents resulting in service downtime, so the report instead provides an overview of the incident response and follow-up details for the incident described in the November report.

Outages

○ Slack: My first couple hours of work this year were oddly quiet…
○ Heroku
○ Google Meet: This is different from the one above.
○ Fanduel
○ Twitch
○ Coinbase
○ Archive of Our Own

KubeWeekly #246 January 15th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

CNCF Security Whitepaper Shows the Complexity of Securing Cloud Native Operations

Jack Wallen, The New Stack

Jack Wallen of The New Stack dives into CNCF’s Security whitepaper that focuses on the security of cloud native applications and highlights key learnings. The whitepaper discusses everything from cloud native layers, to the full lifecycle of development, to compliance (and everything in between).

It introduces the “Security whitepaper” released by CNCF.
It digs deep into the whitepaper from an administrator’s perspective, touching on the need to develop and manage amid complexity across multiple layers of the cloud.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Analyze Kubernetes files for errors with KubeLinter

Jessica Cherry, Opensource.com

An article describing KubeLinter, an open source project released by StackRox that analyzes YAML files for security issues and erroneous code. Red Hat has announced that it has signed a definitive agreement to acquire StackRox.

Getting started with Buildah

Cedric Clyburn, Red Hat

As the title suggests, it explains how to get started with “Buildah”. A YouTube video of the interactive session is also embedded in the web page.

Isolate a Pod in Kubernetes

Salman Iqbal

Continuing from last week, this is a YouTube video from Salman’s webinar series, which describes the behavior of each Kubernetes component. It is easy to watch because each video is kept to about 10 minutes.

Build Your Kubernetes Operator With the Right Tool

Alex Handy, Red Hat

It touches on the choices currently available when building Kubernetes Operators, and describes different approaches to simplify the decision for your use case.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Sysdig 2021 container security and usage report: Shifting left is not enough

Aaron Newcomb, Sysdig

As the title suggests, this is Sysdig’s annual container security and usage report, now in its fourth year. It also details metric usage, popular alerts, container density trends, and Kubernetes usage patterns.
The numbers and proportions for each item are presented in an easy-to-understand way with figures and graphs.

Vertical Pod Autoscaling: The Definitive Guide

Povilas Versockas

As the author writes, this “Definitive / Complete guide” comprehensively explains vertical scaling of pods through the following items. It is an article worth reading again.
○ Why do we need Vertical Pod Autoscaling?
○ Kubernetes Resource Requirements Model
○ What is Vertical Pod Autoscaling?
○ Understanding Recommendations
○ When to use VPA?
○ VPA Limitations
○ Real-World Examples
○ How does VPA work?
○ VPA’s Recommendation model
○ Lots more
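
The items on recommendations in the list above concern how a resource request is derived from observed usage. As a rough illustration only, and not VPA's actual algorithm, the sketch below assumes a recommendation taken from a percentile of usage samples plus a safety margin; the percentile, the margin, and the function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in the range (0, 100])."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def recommend_request(samples: Sequence[float], pct: float = 90.0,
                      safety_margin: float = 0.15) -> float:
    """Recommend a resource request: a usage percentile plus a safety margin.

    Both defaults are illustrative assumptions, not values taken from the article.
    """
    return percentile(samples, pct) * (1.0 + safety_margin)


if __name__ == "__main__":
    # Hypothetical CPU usage samples in millicores.
    cpu_millicores = [120, 135, 150, 160, 180, 200, 210, 230, 400, 900]
    print(recommend_request(cpu_millicores))
```

Using a percentile rather than the maximum keeps a single spike from inflating the request, at the cost of occasional throttling or eviction when usage exceeds it.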

What’s Your Kubernetes Maturity?

Danielle Cook, Fairwinds

It provides an end-to-end overview of the Kubernetes journey, the phases it passes through, and the Kubernetes Maturity Model, which lays out the skills and activities you need to learn / perform in each phase.
The details of each phase are available via a link in the article; this article only provides a brief summary of each phase.
○ Phase 1 Prepare
○ Phase 2 Transform
○ Phase 3 Deploy
○ Phase 4 Build Confidence
○ Phase 5 Improve Operations
○ Phase 6 Measure & Control
○ Phase 7 Optimize & Automate

Upcoming CNCF Online Programs

We have expanded our webinar program to Online Programs! Visit our website for the latest updates.

I checked the link; “Upcoming webinars” showed “No Results Found” as of January 16, 2021, so this year’s webinars seem to be still awaiting updates.

How about those articles? Are you interested in any of them?

Actually, there is some content I cannot digest at this stage, so I’ll use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara