SRE / DevOps / Kubernetes Weekly Collection#79(Week 31, 2021)

Aug 8, 2021

In this blog post series, I go through the following three weekly mailing lists I subscribe to and leave some comments and useful links as an aide-memoire.
I have already published the same content on my Japanese blog, and I am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #553 August 1st, 2021
SRE Weekly Issue #281 August 1st, 2021
KubeWeekly #271 August 6th, 2021

DEVOPS WEEKLY ISSUE #553 August 1st, 2021

News

A new report on devops and cloud trends, covering the evolution of cloud services, adoption of SRE, observability and CI/CD practices and more.

The title is “DevOps and Cloud InfoQ Trends Report — July 2021”.
Since it was covered in “KubeWeekly #269 July 23rd, 2021”, I will skip it.

There are nearly 100 DNS record types! This comprehensive post covers details of the different types, including providing an example domain files demonstrating everything.

The title is “(All) DNS Resource Records”.
As the editor’s comment and the title above indicate, this article comprehensively covers DNS Resource Records (RRs). Beyond the common RRs, it lists many that I did not recognize. I bookmarked it and will check it again.
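
As a rough illustration of what a resource record looks like in zone-file presentation format, the sketch below splits one record line into its fields. It assumes the simplest layout, with owner name, TTL, class, type, and record data all present and in that order; real zone files allow omitted fields, comments, and multi-line records, which this does not handle. The name `parse_rr` and the sample records are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class ResourceRecord:
    name: str
    ttl: int
    rr_class: str
    rr_type: str
    rdata: str


def parse_rr(line: str) -> ResourceRecord:
    """Parse one fully specified record line: name, TTL, class, type, rdata."""
    # Split into at most 5 parts so that rdata keeps its internal spaces.
    parts = line.split(None, 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    name, ttl, rr_class, rr_type, rdata = parts
    return ResourceRecord(
        name=name,
        ttl=int(ttl),
        rr_class=rr_class.upper(),
        rr_type=rr_type.upper(),
        rdata=rdata.strip(),
    )


if __name__ == "__main__":
    # Sample records use documentation-only names and addresses.
    for sample in (
        "example.com. 3600 IN A 192.0.2.1",
        "example.com. 3600 IN MX 10 mail.example.com.",
    ):
        print(parse_rr(sample))
```

Keeping the record data as a single string is a deliberate simplification: each record type defines its own rdata layout, so a fuller parser would dispatch on `rr_type`.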

A recent paper on developer productivity, with good coverage of misconceptions (about activity, individual productivity, management and more) and proposing the SPACE framework (Satisfaction, Performance, Activity, Communication and Efficiency).

The title is “The SPACE of Developer Productivity”.
It explicates several common myths and misconceptions about developer productivity and proposes the SPACE framework above.
The SPACE framework is meant to help individuals, teams, and organizations identify pertinent metrics that present a holistic picture of productivity; this will lead to more thoughtful discussions about productivity and to the design of more impactful solutions.

All projects include some level of testing. But do we agree on what we mean by testing when it comes to software development? Not always, as discussed in this post.

The title is “We need to talk about testing”.
From the perspective of “how programmers and testers can work together for a happy and fulfilling life”, several topics related to testing are taken up and explained.

Microservice environments present a novel set of reliability challenges. This post looks at some of those challenges, and some of the best practices to address them.

The title is “The Unique Reliability Engineering Requirements of Microservices”.
It discusses how managing reliability for a microservices-based app differs from working with a monolith. Its main point is: “In short, although the fundamental principles and concepts that undergird reliability engineering are the same in any context, SREs should adapt practices to the special requirements of whichever type of environment they are supporting.”

The first two posts in a series on the theory of monitoring, starting with defining terms and then discussing indicators and synthetics.

A series that explains the theory of monitoring so that readers can broaden their horizons about the world of tech monitoring by researching “from scratch”. The title of the above link is “Monitoring theory, from scratch — The definitions”. Summary/takeaways are below.
○ Monitoring is about efficiently and promptly observing things important to your business
○ Good monitoring is monitoring that allows a human to derive insights about your business’ operation, in order to prevent or minimize damage.
○ Good tech monitoring is monitoring that allows a human to ensure your business operates correctly, smoothly, efficiently, securely and in compliance while reducing false positives and false negatives to a minimum.
The title of the second article is “Monitoring theory, from scratch — Indicators and synthetics”. Summary/Takeaways are below.
○ Try to pick indicators that offer the smallest gap between your desired definition of ‘up’ and their ability to report it.
○ Document and train operators on such gaps and implementation limits.
○ Try to pick indicators that are as discrete as possible and leave as little room as possible for interpretation.
○ If you end up scratching your head much while looking at your indicators, revise and re-iterate.
○ Remember that indicators aren’t predictive by nature and need to be complemented with other measures/systems.
○ Remember that system state changes and you need to make sure your indicators are fresh and react to it. It’s always an on-going process.
○ Consider supporting synthetic transactions in your indicator strategy, in particular if you have a complex and distributed end to end system.
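
The last takeaway can be sketched in a few lines of code. The following is a minimal, hypothetical example of a synthetic transaction reduced to a discrete up/down indicator; the names `run_synthetic_check`, `CheckResult`, and `fake_checkout_flow`, and the latency threshold, are assumptions for illustration and do not come from the articles.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    up: bool
    latency_s: float
    detail: str


def run_synthetic_check(transaction: Callable[[], bool], max_latency_s: float) -> CheckResult:
    """Run one scripted transaction and reduce it to an up/down indicator."""
    start = time.monotonic()
    try:
        ok = transaction()
    except Exception as exc:
        # Any failure inside the scripted flow counts as 'down'.
        return CheckResult(False, time.monotonic() - start, f"error: {exc}")
    latency = time.monotonic() - start
    if not ok:
        return CheckResult(False, latency, "unexpected result")
    if latency > max_latency_s:
        # A correct but slow answer is also treated as 'down'.
        return CheckResult(False, latency, "too slow")
    return CheckResult(True, latency, "ok")


def fake_checkout_flow() -> bool:
    # Placeholder for a real end-to-end flow (log in, add to cart, pay).
    return True


if __name__ == "__main__":
    print(run_synthetic_check(fake_checkout_flow, max_latency_s=2.0))
```

Treating slowness as "down" is one way to narrow the gap between the desired definition of "up" and what the indicator can report; the right threshold depends on the system being monitored.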

A look at one organisations tracing infrastructure, discussing various aspects of the implementation from clients to capacity planning and other challenges.

The title is “Making Tracing as a part of Engineering DNA”.
ShareChat’s platform team has built a solution for understanding the performance and response of hundreds of microservices, and how they utilize 10,000 CPU cores and RAM under various circumstances.
They share meaningful insights from seeking detailed observability and monitoring metrics from all resources on ShareChat’s network.

Tools

Authorino is a cloud native AuthN/AuthZ enforcer for Zero Trust API protection built on top of Envoy. It provides a wide range of identity verification methods, policy enforcement options and caching options.

A GitHub page of “Authorino”, a cloud-native authentication/authorization enforcer for zero trust API protection.

Naml is a configuration management tool for describing Kubernetes configuration in Go. It has a nice tool for converting YAML (including direct from the Kubernetes API) into Go as a way to bootstrap as well.

A GitHub page of “Naml (Not another markup language)”.
You can replace Kubernetes YAML with Golang and write and deploy apps in Golang.

SRE Weekly Issue #281 August 1st, 2021

Articles

Learning from incidents — Formula 1

The incident: a formula 1 car hit the side barrier just over 20 minutes before the race was about to start. The team sprang into action with an incredibly calm, orderly and speedy incident response to replace the damaged parts faster than they ever have before.

This article is a great analysis, and there’s also an excellent 8-minute video that I highly recommend. Listen to the way the sporting director and everyone else communicates so calmly. It’s a rare treat to get video footage of a production incident like this.

Chris Evans — incident.io

It analyzes and comments on a great incident response in F1. As the editor notes above, the embedded 8-minute video is very good and highly recommended.

Observe a Service; Not a Server

The underlying components become the cattle, and the services become the new Pet that you tend to with your utmost care.

Piyush Verma — Last9

Using the “pets vs. cattle” metaphor, it covers the following topics.
○ Establishing Service KPIs
○ Features vs. Stability
○ Cascading
The images in the article are cute.

aws-samples/aws-incident-response-playbooks

AWS posted these example/template incident response playbooks for customers to use in their incident response process.

AWS

A playbook that covers some common scenarios that AWS users face. It outlines the procedure based on the NIST Computer Security Incident Handling Guide (Special Publication 800–61 Revision 2), which can be used for the following purposes:
○ Gather evidence
○ Contain and then eradicate the incident
○ Recover from the incident
○ Conduct post-incident activities, including post-mortem and feedback processes

(All) DNS Resource Records

A list with descriptions of all DNS record types, even the obscure ones. Tag yourself, I’m HIP.

Jan Schaumann

Since it is covered in DEVOPS WEEKLY ISSUE #553 above, I will skip it.

What’s a Major Incident Anyway?

This one includes a useful set of questions to prompt you as you develop your incident response and classification process.

Hollie Whitehead — xMatters

The author explains it with the aim that by knowing how to define a major incident and how to respond to it effectively, you can ensure your operations are running the way they should and focus your attention on the work that actually matters.

How to be better, together

The author of this article shows us how they communicate actively, perform incident retrospectives, and even discuss “near misses” and normal work in order to better learn how their system works — all skills that apply directly to SRE.

Jason Koppe — Learning From Incidents

Starting from the idea that “some of the approaches that I have used to get better at mountain climbing can be applied to software engineering with a similar effect”, the author explains the following points.
○ Discuss everyday work
○ Learn from incidents
○ What the software industry can learn from the climbing community

The Unique Reliability Engineering Requirements of Microservices

Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.

JJ Tang — Rootly

Since it is covered in DEVOPS WEEKLY ISSUE #553 above, I will skip it.

It’s Time to Rethink Outage Reports

This one uses Akamai’s incident report from their July 22 major outage as a jumping-off point to discuss openness in incident reports. The text of Akamai’s incident report is included in full.

Geoff Huston — CircleID

As the title suggests, it encourages rethinking what outage reports should contain, making the following point.
○ It would be a positive step forward for this industry if Akamai’s outage report were not unusual in any way. It would be good if all service providers spent the time and effort post rectification of an operational problem to produce such outage reports as a matter of standard operating procedure.

Culture & Conduct Risk: The Normalization of Deviance

Drawing from the “normalization of deviance” concept introduced in the Challenger disaster study [Diane Vaughan], this article explores the idea of studying your organization culture to catch problems early, rather than waiting to respond after they happen.

Stephen Scott

It discusses the topic in the title, making the following point.
A prevailing assumption, among firms and regulators alike, is that misconduct problems can be discovered only after they occur: a ‘detect and correct’ mindset. But we’re beginning to see the emergence of a ‘predict and prevent’ approach to managing conduct risk in organizations.

Lorin Hochstein (Netflix) [StaffEng Podcast]

This episode of the StaffEng Podcast is an interview with Lorin Hochstein, whose writings I’ve featured here numerous times. My favorite part of this episode is when they talk about doing incident analysis for near misses. One of the hosts points out that it’s much easier for folks to talk about what happened, because there was no incident so they’re not worried about being blamed.

David Noël-Romas and Alex Kessinger – StaffEng Podcast

A 48-minute podcast episode that discusses resilience with guest Lorin Hochstein, Senior Software Engineer at Netflix, whose articles are frequently featured in SRE Weekly.

Outages

Let’s Encrypt
Snapchat
Wikipedia
To fact-check this one, I looked at their grafana dashboard. Neat!
Netflix
Venmo
Blackboard Learn
eBay
reddit

KubeWeekly #271 August 6th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Kubernetes 1.22: Reaching new peaks

The Kubernetes release team is pleased to announce the release of Kubernetes 1.22, the second release of 2021!

This release consists of 53 enhancements: 13 enhancements have graduated to stable, 24 enhancements are moving to beta, and 16 enhancements are entering alpha. Also, three features have been deprecated. Learn more about the release from the blog, or listen to an interview with release team lead Savitha Raghunathan.

An article on the Kubernetes Blog about the Kubernetes 1.22 release. The changes behind the numbers above are explained under the following headings. There is a lot of information, so I will catch up gradually.
○ Major Themes
○ Major Changes
○ Other Updates
○ Release notes
○ Release Team
○ Release Logo
○ User Highlights
○ Project Velocity
○ Ecosystem Updates
○ Event Updates
○ Upcoming release webinar
○ Get Involved
The Release Logo for Kubernetes 1.22 is below.

CNCF Unveils Schedule for KubeCon + CloudNativeCon North America 2021 in Los Angeles and Virtual

The schedule for KubeCon + CloudNativeCon North America 2021 is finally here! Attendees will experience over 230 sessions, including keynotes and breakouts, with over 70 presentations hosted by project maintainers. From non-technical and end user case studies to advanced engineering deep dives — the conference has content for everyone interested in cloud native technology. Explore the schedule and start planning your experience today.

The schedule for “KubeCon + CloudNativeCon North America 2021”, which starts on October 11th, has been announced. I plan to participate online. I forgot to buy an early bird ticket, so I bought a regular ticket for $75 and registered.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Cloud Native Live: Humanising your cloud native platform

Lee Briggs, Pulumi

A one-hour session that looks at the proliferation of platforms across cloud-native organizations and explains how Pulumi’s software delivery platform and infrastructure pipeline builds can satisfy and help users.

On-demand Webinars: Securing your continuous everything strategy

Abubakar Siddiq Ango, GitLab

A 13-minute session that explains the vulnerabilities that can occur at various stages of the continuous development life cycle and how to mitigate them.

Kubernetes clusters need persistent data

James Spurin, StorageOS

A 29-minute session that focuses on CSI, which is often overlooked, and explains the following.
○ Why the CSI and the installation of an effective data plane should be a key consideration for Kubernetes deployments
○ Opportunities for improvements that include multi-tenancy, high availability, compliance with encryption at rest, ease of use with GitOps and the transition of traditional and legacy workloads, dependent on persistent data
○ Live demo, showcasing the benefits of a Kubernetes data plane
The slides and demos were very easy to follow, which was good.

Visit our Online Programs playlist on YouTube for more content.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Two year update: Building an open source marketplace for Kubernetes

Alex Ellis, OpenFaaS Ltd

It describes the journey of the last two years from the initial creation of the Kubernetes open source marketplace to the growth of the community, the acquisition of the first sponsored app, and the next step.

Implementing traffic policies in Kubernetes

Cody De Arkland, Kong

A guest post on the CNCF Blog, originally published on the Kong blog.
It details how to use Kuma, a distributed control plane bundled with Envoy Proxy, to control and monitor Kubernetes traffic.

July 2021 Flux update

Flux blog

Since it was previously covered in “KubeWeekly #267 July 9th, 2021”, I will skip it.

Encrypt your Kubernetes secrets with Mozilla SOPS

Thorsten Hans blog

It explains how to use SOPS in combination with Azure Key Vault to encrypt and decrypt Kubernetes secrets (YAML files). This allows you to commit your encrypted secrets directly to Git along with your other Kubernetes manifests.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

NSA, CISA release Kubernetes hardening guidance

National Security Agency

It details threats to the Kubernetes environment and provides configuration guidance to minimize risk.

A guide to Kubernetes chargeback

Kubecost blog

It aims to explain the benefits of the chargeback model, introduce some best practices for implementing Kubernetes chargeback reports, and help you get started with Kubecost within minutes.
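
To make the chargeback idea concrete, here is a minimal sketch that splits a shared cluster bill across namespaces in proportion to their CPU requests. This is only one possible allocation rule and is not presented as how Kubecost calculates costs; the function name `allocate_chargeback`, the namespaces, and all figures are made up for illustration.

```python
from typing import Dict


def allocate_chargeback(total_cost: float, cpu_requests: Dict[str, float]) -> Dict[str, float]:
    """Split total_cost across namespaces in proportion to requested CPU cores."""
    total_cpu = sum(cpu_requests.values())
    if total_cpu <= 0:
        raise ValueError("total CPU requests must be positive")
    return {
        namespace: total_cost * cpu / total_cpu
        for namespace, cpu in cpu_requests.items()
    }


if __name__ == "__main__":
    # Hypothetical monthly bill and per-namespace CPU requests (in cores).
    monthly_bill = 1000.0
    requests_by_namespace = {"team-a": 6.0, "team-b": 3.0, "team-c": 1.0}
    for namespace, cost in allocate_chargeback(monthly_bill, requests_by_namespace).items():
        print(f"{namespace}: {cost:.2f}")
```

A real chargeback report would usually also account for memory, storage, network, and idle capacity, so this proportional split should be read as a starting point only.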

Keeping KubeCon + CloudNativeCon Los Angeles 2021 safe for onsite attendees

CNCF

It introduces the safety measures for on-site participants of KubeCon + CloudNativeCon Los Angeles 2021. In addition to securing space by reducing capacity to one-third of what the venue can hold, various other measures are being taken in cooperation with local health authorities.

Cloud Native Wasm Day North America hosted by CNCF

CFP Closes, Monday, Aug 9 at 11:59 PM PST

It introduces Cloud Native Wasm Day. As mentioned above, the CFP closes on Monday, Aug 9 at 11:59 PM PST.

Upcoming CNCF Online Programs

Cloud Native Live

August 11 at 9am PT: Emissary and Linkerd — How to integrate your Service Mesh with K8s Ingress presented by Daniel Bryant, Ambassador Labs & Jason Morgan, Buoyant — RSVP

On-demand

August 12: Bringing your Terraform to Crossplane presented by Nic Cope, Upbound — RSVP
August 12: Choosing the right storage for stateful workloads on Kubernetes presented by Abhinivesh Jain, Wipro — RSVP

YouTube playlist submissions

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

What did you think of these articles? Did any of them catch your interest?

There is some content that I cannot fully digest at this stage, so I will also use this aide-memoire and these links to catch up myself.

Bye now!!

Yoshiki Fujiwara