SRE / DevOps / Kubernetes Weekly Collection #71 (Week 23, 2021)

Jun 13, 2021

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to and add some comments as an aide-memoire, along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #545 June 6th, 2021
SRE Weekly Issue #273 June 6th, 2021
KubeWeekly #265 June 11th, 2021

DEVOPS WEEKLY ISSUE #545 June 6th, 2021

News

An excellent post on building a healthy on-call culture for developers. Lots of concrete advice, centered on respecting engineers and their time.

The title is “Building a Healthy On-Call Culture”.
As the editor’s comment above says, this is a wonderful article with very specific advice, such as the optimal on-call frequency, the number of rotations, and a reasonable length of time for handover.

ProtoBuf API v2 has some large performance implications. This post is a good primer, as well as a look at how one project solved the problem with a project-specific code generator.

The title is “A new Protocol Buffers generator for Go”.
An article on how to interact with Vitess Protocol Buffers, performance tuning, benchmarks, and more.

Is an AWS account a security boundary? This post digs into the details, showing a large number of ways services cross accounts.

The title is “AWS Accounts as Security Boundaries — 97+Ways Data Can be Shared Across Accounts”.
It describes an account strategy that assumes the AWS account itself acts as a security boundary, and it lists how each service can share its data between AWS accounts.
Click here for a list on GitHub. Click here for the spreadsheet version of the list.

Moving from a monolithic architecture to one based on many services often means having to distribute authorization. This post explores how one organization used Himeji.

The title is “Himeji: A Scalable Centralized System for Authorization at Airbnb”.
It introduces “Himeji,” Airbnb’s Zanzibar-based authorization system. The “Himeji Cache” in the figure looks just like “Himeji Castle”.
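As a rough illustration of what “Zanzibar-based” means: Zanzibar-style systems store permissions as relation tuples (an object, a relation, and a subject) and answer “may this user do X?” by looking those tuples up. The following is only a minimal sketch under that assumption. The names `RelationTuple`, `TUPLES`, and `check`, and all the sample data, are hypothetical and are not Airbnb’s or Google’s actual API.

```python
from typing import NamedTuple, Optional, Set, Tuple


class RelationTuple(NamedTuple):
    """One stored fact: `subject` holds `relation` on `obj` (hypothetical shape)."""
    obj: str
    relation: str
    subject: str  # either a user ("user:bob") or a userset ("group:x#member")


# Hypothetical sample data, invented for this sketch.
TUPLES = {
    RelationTuple("listing:1", "owner", "user:alice"),
    RelationTuple("listing:1", "viewer", "group:support#member"),
    RelationTuple("group:support", "member", "user:bob"),
}


def check(obj: str, relation: str, user: str,
          tuples: Set[RelationTuple] = TUPLES,
          _seen: Optional[Set[Tuple[str, str]]] = None) -> bool:
    """Return True if `user` holds `relation` on `obj`, following usersets."""
    _seen = set() if _seen is None else _seen
    if (obj, relation) in _seen:  # guard against cyclic group definitions
        return False
    _seen.add((obj, relation))
    for t in tuples:
        if t.obj != obj or t.relation != relation:
            continue
        if t.subject == user:  # direct grant
            return True
        if "#" in t.subject:  # indirect grant through another object's relation
            group_obj, group_rel = t.subject.split("#", 1)
            if check(group_obj, group_rel, user, tuples, _seen):
                return True
    return False


print(check("listing:1", "viewer", "user:bob"))  # bob is a member of group:support
```

A real system of this kind adds caching, consistency guarantees, and configurable rules on top; the sketch only shows the lookup idea.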

Some useful tips for building usable monitoring dashboards.

The title is “[MONITORING] How to build your monitoring dashboards?”.
It explains how to build a monitoring dashboard by comparing the following two strategies.
○ Strategy 1: Yeah, I do not know. We have metrics, we plot metrics
○ Strategy 2: Overview. Top-down. Left-right. Cohesive. Consistent.

A post on the benefits of being able to repave a datacenter, including tips on how to get started.

The title is “IS REPAVING DATA CENTERS THE WAY TO BETTER ROI?”.
It briefly answers the question in the title under the following headings.
○ Why Repave Your Data Center?
○ IT Ops become Innovation Leaders Instead of Blocker
○ Planning Considerations
○ It’s OK to Start Small
○ Real Talk

A look at a toolchain for building and publishing container images, using GitHub Actions and ECR. It’s a good example of the trade off between complexity and secure tool chains with current tooling.

The title is “A Rube Goldberg Machine for Container Workflows”.
An article that describes recent container workflows.
Click here for the GitHub page of GitHub Container Registry to Amazon Elastic Container Registry Image Sync introduced in the article.

Tools

Managing tags for cloud resources is critical but also a pretty thankless task. Yor is a new tool to help, that integrates with infrastructure as code and is intended for use in a CI pipeline.

The GitHub page for “Yor”, an open source tool that helps you add useful and consistent tags across Infrastructure-as-Code frameworks such as Terraform, CloudFormation, and Serverless.
Click here for the web page.

SRE Weekly Issue #273 June 6th, 2021

Articles

Incident Management vs. Incident Response

What indeed? It depends on who you ask.

Quentin Rousseau — Rootly

It explains the similarities and differences between several competing perspectives of Incident Management and Response, and what SRE can learn from a variety of perspectives.

Cores that don’t count

This academic paper explains Google’s efforts toward identifying “mercurial” CPU cores — cores that make erroneous computations.

[…] we observe on the order of a few mercurial cores per several thousand machines […]

This one blew my mind:

A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat — Google

As mentioned above, this is an 8-page academic paper published by Google. Focusing on mercurial cores and CEEs (Corrupt Execution Errors), it is organized into the following sections after the abstract.

Introduction
Impacts of mercurial cores
Are mercurial cores a novel problem?
The right metrics
What causes mercurial cores?
Detecting and isolating mercurial cores
Mitigating CEEs
Related work
Next steps and research directions
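To make the “self-inverting” observation quoted above more concrete, here is a toy sketch. It uses a simple XOR cipher instead of AES, and the fault model (a core that always flips one key bit) is invented purely for illustration; the paper does not describe the actual defect this way. The function names are hypothetical.

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher on a healthy core: XOR each byte with the key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def faulty_xor_cipher(data: bytes, key: bytes) -> bytes:
    """Same cipher on a hypothetical faulty core.

    The invented fault deterministically flips the lowest bit of every key
    byte, so the core consistently computes with the wrong key.
    """
    bad_key = bytes(k ^ 0x01 for k in key)
    return xor_cipher(data, bad_key)


key = b"secret"
message = b"hello"

ciphertext = faulty_xor_cipher(message, key)

# Round trip on the same faulty core: the two errors cancel out.
same_core = faulty_xor_cipher(ciphertext, key)

# Decrypting the same ciphertext on a healthy core gives the wrong bytes.
other_core = xor_cipher(ciphertext, key)

print(same_core == message)   # True
print(other_core == message)  # False
```

The point of the sketch is that a self-test which encrypts and decrypts on the same core cannot catch this kind of deterministic error; the result has to be compared against another core.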

Minimizing ossification risk is everyone’s responsibility

The decisions, non-decisions, and workarounds that we implement now can have lasting effects on the Internet as a whole.

Mark Nottingham — Fastly

Full disclosure: Fastly is my employer.

As the title says, it explains with concrete examples that minimizing ossification lets the Internet keep evolving smoothly and meet future challenges.

What is resilience engineering? A lightning talk with background information

A great intro to the topic of resilience engineering. Hint: resilience != high availability.

Piet van Dongen — Luminis Arnhem

A summary article that accompanies a lightning talk with background and reference information, explaining what resilience engineering is, why every software person should know about it, and how to get started with it.
The 11-minute lightning talk video and other reference videos are also embedded.

Dealing with new kinds of trouble

When you include people in your definition of “the system”, something that looked like a system failure where humans had to “step in” is actually a success in which the system adapted.

Lorin Hochstein

The title is explained by quoting the explanation of “graceful extensibility” from David Woods’s paper “The Theory of Graceful Extensibility: Basic rules that govern adaptive systems”.

Please don’t count outages (or SEVs, or whatever)

I find the way this author presented this argument especially convincing. My favorite part is the real-world story toward the end.

Rachel by the Bay

The point of this article is similar to that of “Stop Counting Production Incidents”, which I covered in a previous edition of this blog, so it is worth reading the two together.

How Facebook deals with PCIe faults to keep our data centers running reliably

Facebook presents their method for finding and dealing with PCIe errors in their infrastructure.

Ashwin Poojary, Bill Holland, Makan Diarra, and Ray Park — Facebook

As the editor’s comment above and the title suggest, it explains the in-house tools and workflow used as countermeasures for errors caused by PCIe (Peripheral Component Interconnect Express).

GitHub Availability Report: May 2021

Overflow of a 32-bit integer primary key caused a security issue.

Scott Sanders — GitHub

It explains the events, responses, and countermeasures for the two incidents that occurred in May.
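As a reminder of what “overflow of a 32-bit integer primary key” means in general, here is a small arithmetic sketch. It is not GitHub’s code, and how a real database reacts at the limit (an error, a wrapped value, or something else) depends on the system; the helper names below are hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit integer
INT32_MIN = -2**31


def fits_int32(value: int) -> bool:
    """Return True if `value` can be stored in a signed 32-bit column."""
    return INT32_MIN <= value <= INT32_MAX


def wrap_int32(value: int) -> int:
    """Show what two's-complement wraparound would do to `value`."""
    return (value + 2**31) % 2**32 - 2**31


next_id = INT32_MAX + 1  # the first ID that no longer fits

print(fits_int32(INT32_MAX))  # True
print(fits_int32(next_id))    # False
print(wrap_int32(next_id))    # -2147483648
```

A table that gains rows steadily will reach this limit eventually, which is why such keys are usually migrated to 64-bit integers well in advance.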

Building a Healthy On-Call Culture

This caught my eye. I’ve seldom been in an on-call rotation with shifts that were not a week or two at a time.

The optimal frequency for being on call is about three days a month.

There’s also a good discussion of paying for on-call shifts, which, in my experience, goes a long way toward making on-call more palatable.

Christine Patton — SoundCloud

Since it is covered in DEVOPS WEEKLY ISSUE #545 above, I will skip it.

Outages

HBO Max
Apple Card
Sling TV
Google Meet
GitHub
Discord
Discord had several outages this week.

KubeWeekly #265 June 11th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

CloudNative TV launched this week!

We are now one week into launch of CloudNative.tv. The launch includes shows ranging from 101 explainers, to getting started contributing to projects, and highlighting the unique people that make up the CNCF’s community of doers. Week two will bring you the rest of our hosts including Solid State with Tim Banks, Cloud Native LatinX with Leonardo Murillo, CNCFaceOff with Matt Stratton, and Certs Magic with Saiyam Pathak. You can find the whole schedule here.

It covers the launch of “CloudNative TV”, which was introduced last week, along with the program schedule.

Editor’s note: In observance of the Juneteenth holiday, there will be no KubeWeekly on June 18, 2021. We will resume publishing on June 25, 2021.

KubeWeekly will be off next week due to a holiday. I am personally grateful to receive such information in advance.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Use your favorite programming language to build your dream cloud native platform

Matt Stratton, Pulumi

An approximately one-hour live-coding session (in TypeScript) that uses Pulumi’s services to quickly launch and run apps on Kubernetes.

Tackling Customer Issues in cloud native environments

Elinor Swery, Rookout

A 30-minute session explaining the different methods and strategies that engineering managers can employ to help teams manage customer issues more effectively in a cloud native environment.

Cloud native policy enforcement with Open Policy Agent

Anders Eknert, Director

A 34-minute session explaining how OPA (Open Policy Agent) can help make policy decisions at scale on cloud-native stacks.

Persist your data in an ephemeral K8s ecosystem

Eric Zietlow, MayaData

A 14-minute session explaining what, why, and how to use Kubernetes for your data.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

A new protocol buffers generator for Go

Vicente Marti, Vitess

Since it is covered in DEVOPS WEEKLY ISSUE #545 above, I will skip it.

A deep dive into Kubernetes Schema Validation

datree.io blog

It explains how to avoid misconfigurations and which tool is best to use. It emphasizes the importance of schema validation testing and its early implementation.

I wrote a Twitter Bot using OpenFaaS to avoid missing out on CfP deadlines

Carlos Panato, OpenFaaS

It explains how the bot was written and which tools were used, so that you can build your own tools and get ideas for your own projects.

Learn how to manage your functions with kubectl

Alex Ellis, OpenFaaS

It describes alternatives to the OpenFaaS API and CLI for deploying and managing functions in Kubernetes.

“Gateway Mode” in Kuma and Kong Mesh

Cody De Arkland, Kong

It briefly describes the relationships between Kong Gateway, Kuma, and Kong Mesh products and how to use them together.
An explanation video of about 6 minutes is also embedded in the Web page.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

In the Clouds (S2E6) | CNCF’s Priyanka Sharma

Chris Short, Red Hat

A 50-minute session with CNCF GM Priyanka Sharma as a guest to explain CNCF from the beginning.
It is the most relaxed interview with Priyanka Sharma that I have ever seen.

Cloud-agnostic third party managed Kubernetes services — the unexploited opportunity

Lars Larsson, Elastisys

It explains the following five reasons to challenge the current situation, and the possibilities and ideals of cloud-agnostic managed Kubernetes services.

Ensure business continuity by letting a managed service provider handle your Kubernetes-based platform
Increase ability to target new markets, even ones that require you to run on-premise (or just not at a major cloud provider).
Increase process efficiency and minimize risk by reducing tool sprawl — in spite of deploying to multiple clouds.
Make migration much easier, because not just your application, but also your platform, is cloud agnostic.
Contribute to the cloud native community by being the community.

Introducing Kubernetes Community Days Bengaluru 2021

Neependra Khare, founder of CloudYuga and CNCF Ambassador

An introductory article about the above event, to be held on June 25-26, 2021. Last year’s event was canceled due to the pandemic, but this year it will be held virtually.
Click here to register.

Harbor operator 1.0 is available now!

Harbor project team

As the title suggests, it thanks the contributors for the GA of Harbor operator 1.0 and introduces current features, future additions, roadmaps, Harbor projects, and more.

Reminder: Take the CNCF Cloud Native Survey — Part 1 to share your thoughts on cloud, containers, and Kubernetes.

Continued reminders of “CNCF Cloud Native Survey — Part 1”.

Upcoming CNCF Online Programs

Cloud Native Live

June 16: Turbocharging AKS networking with Calico eBPF presented by Chris Tomkins, Tigera — RSVP

On-demand Webinars

June 17: Monitoring Containers in Kubernetes in a Multi-Cloud Environment presented by Amit Sharma, Splunk — RSVP
June 10: Autoscaling Event Driven Applications with Fission & Keda presented by Vishal Biyani & Gaurav Gahlot, InfraCloud — RSVP
Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How did you find these articles? Did any of them catch your interest?

To be honest, there is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara