SRE / DevOps / Kubernetes Weekly Collection #86 (Week 38, 2021)

Sep 27, 2021

In this blog post series, I go through the following three weekly mailing lists I subscribe to, adding brief comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a useful reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #560 September 19th, 2021
SRE Weekly Issue #288 September 19th, 2021
KubeWeekly #278 September 24th, 2021

DEVOPS WEEKLY ISSUE #560 September 19th, 2021

News

An interesting post discussing some of the edges of Terraform if you use it for cloud, Kubernetes and other resources using the same state.

The title is “Terraform is Not the Golden Hammer”.
The article looks back on the experience of the author's company to explain where, when, and how to use Terraform. It covers the following points.
○ How we used Terraform
○ Problems facing
○ Advises and suggestion
○ Conclusion

A post positing using SQL as the interface for cloud infrastructure. Some interesting ideas about a familiar interface and existing tooling.

The title is “Infrastructure as SQL”.
The article develops the idea in the title and the editor's comment above through the following points.
○ Relations and Types Matter for Infrastructure
○ New Powers: Explore, Query, and Automate Your Infrastructure
○ You Don’t Need to Learn a New API (Probably)
○ You Can Test, Too
○ Recover With Ease
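
To make the first two points more concrete, here is a minimal sketch of the general idea rather than the tool from the article: infrastructure is modeled as typed, related tables and then explored with plain SQL. The schema, table names, and rows are all hypothetical, and an in-memory SQLite database stands in for whatever backend a real tool would use.

```python
import sqlite3
from typing import Dict


def build_inventory() -> sqlite3.Connection:
    """Create a hypothetical infrastructure inventory in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE security_group (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE instance (
            id                INTEGER PRIMARY KEY,
            name              TEXT NOT NULL,
            instance_type     TEXT NOT NULL,
            security_group_id INTEGER NOT NULL REFERENCES security_group(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO security_group (id, name) VALUES (?, ?)",
        [(1, "web"), (2, "db")],
    )
    conn.executemany(
        "INSERT INTO instance (name, instance_type, security_group_id) VALUES (?, ?, ?)",
        [("web-1", "t3.small", 1), ("web-2", "t3.small", 1), ("db-1", "m5.large", 2)],
    )
    return conn


def instances_per_group(conn: sqlite3.Connection) -> Dict[str, int]:
    """Explore the inventory with an ordinary JOIN and GROUP BY."""
    rows = conn.execute(
        """
        SELECT sg.name, COUNT(i.id)
        FROM security_group AS sg
        LEFT JOIN instance AS i ON i.security_group_id = sg.id
        GROUP BY sg.name
        ORDER BY sg.name
        """
    ).fetchall()
    return dict(rows)


def rejects_dangling_instance(conn: sqlite3.Connection) -> bool:
    """Return True if the database refuses an instance that points at a missing group."""
    try:
        conn.execute(
            "INSERT INTO instance (name, instance_type, security_group_id) "
            "VALUES ('orphan', 't3.small', 99)"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

In this sketch the foreign key plays the role of the "relations and types" the article talks about: a resource that references something nonexistent is rejected by the database itself instead of being discovered later.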

A discussion of the role of SREs in enabling true self service platforms and empowering developers.

The title is “The Developer Experience and the Role of the SRE Are Changing, Here’s How”.
Its conclusion is that “Developers should take the opportunity to share their pain points and also learn about tooling and best practices from SRE teams, with the goal of ‘paving the path’ to developer autonomy, self-service, and full service ownership.” The article covers the following points.
○ Two worlds colliding: The monolith and service-oriented architecture
○ Enabling developers to own the full application lifecycle
○ Understand the changing developer experience to support developer ownership
○ Conclusion: Developers should work with SREs as collaborators, not first responders

The start of a series on API design, based around gRPC. The first post focused specifically on using Protobuf FieldMask.

The title is “Practical API Design at Netflix, Part 1: Using Protobuf Field Mask”.
Part 1 of the series. It explains how and why Netflix Studio Engineering uses Protobuf FieldMask in APIs that read data.
Part 2 will explain how to use FieldMask for update and delete operations.
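
The idea behind a field mask on a read API can be sketched without any Protobuf dependency. The following is a simplified pure-Python illustration, not the Protobuf FieldMask API and not Netflix's implementation: the caller passes dotted field paths and gets back only those parts of the response. The `apply_field_mask` helper and the sample `PRODUCTION` record are hypothetical.

```python
import copy
from typing import Any, Dict, Iterable


def apply_field_mask(message: Dict[str, Any], paths: Iterable[str]) -> Dict[str, Any]:
    """Return a new dict holding only the dotted `paths` that exist in `message`."""
    result: Dict[str, Any] = {}
    for path in paths:
        parts = path.split(".")

        # Walk the source first so that unknown paths leave no trace in the result.
        value: Any = message
        found = True
        for part in parts:
            if isinstance(value, dict) and part in value:
                value = value[part]
            else:
                found = False
                break
        if not found:
            continue  # unknown paths are silently ignored in this sketch

        # Recreate the intermediate dictionaries, then copy the selected value.
        target = result
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = copy.deepcopy(value)
    return result


# Hypothetical record that a read API might return in full.
PRODUCTION = {
    "title": "Example Show",
    "format": "series",
    "schedule": {"start": "2021-01-01", "end": "2021-06-30"},
}

# The caller asks for the title and the start date only.
masked = apply_field_mask(PRODUCTION, ["title", "schedule.start"])
```

A real service would typically also use the mask to avoid computing or fetching the fields that were not requested, which is where most of the benefit comes from.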

Another post on configuration management, focused on applying gitops practices with some good examples up to applying this approach to a multi-cluster federated service mesh setup.

The title is “Configuration as Data, GitOps, and Controllers: it’s not simple for multi-cluster”.
The topic in the title is explained through the following points, with figures drawn in a handwritten style.
○ A basic example of declarative configuration and controllers
○ Extreme examples
○ Case study: multi-cluster GitOps with Istio
○ Federating a service mesh has unique challenges
○ Takeaways
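
The declarative configuration and controller pattern from the first point can be sketched as a reconcile step that compares desired state with actual state and produces the actions needed to converge. This is a generic illustration, not code from the article; the `reconcile` and `apply_actions` functions and the replica counts are hypothetical.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (verb, resource name, replica count)


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Compute the actions that move `actual` toward `desired`."""
    actions: List[Action] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name, desired[name]))
        elif actual[name] != desired[name]:
            actions.append(("scale", name, desired[name]))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, 0))
    return actions


def apply_actions(actual: Dict[str, int], actions: List[Action]) -> Dict[str, int]:
    """Apply the actions to a copy of the actual state and return the new state."""
    new_state = dict(actual)
    for verb, name, replicas in actions:
        if verb == "delete":
            new_state.pop(name, None)
        else:
            new_state[name] = replicas
    return new_state


# Desired state as it might be declared in Git, and what is currently running.
desired_state = {"web": 3, "api": 2}
actual_state = {"web": 1, "legacy": 1}

plan = reconcile(desired_state, actual_state)
converged = apply_actions(actual_state, plan)
```

Running `reconcile` again on the converged state yields no actions, which is the property that makes it safe for a controller to run in a loop. The multi-cluster case discussed in the article is harder precisely because there is no single place where such desired and actual state live.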

A deep dive into Kubernetes ingress, with helpful diagrams showing how things work.

As the comment above says, this is a deep dive into “Kubernetes Ingress”.

Tools

Kim, or Kubernetes Image Manager, provides the classic Docker build, pull, push interface with the build infrastructure deployed to Kubernetes.

The GitHub page of “kim (Kubernetes Image Manager)” which is a CLI for Kubernetes. Images can be built locally on the k3s cluster.
As stated in “STATUS: EXPERIMENT — Let us know what you think”, it is still in the experimental stage.

Kubernetes is often described as a platform for building platforms. Kratix describes itself as a framework for delivering that platform, bring conventions and tools to something lots of organisations hand roll today.

The GitHub page of “Kratix”, a framework for providing a platform.

If you’ve been following this newsletter, you’ll know eBPF is powerful, but we’re only just starting to see use cases. BMC Cache is an in-kernel cache for memcached that claims to improve throughput by up to 18x.

The GitHub page for BMC (BPF Memory Cache), the in-kernel cache for memcached.

KinK is a CLI that helps you manage KinD clusters as Kubernetes pods. Designed to ease standing up clusters for fast testing.

The GitHub page of the CLI app “kink” that makes it easy to run KinD clusters on Kubernetes pods and manages the entire life cycle of these clusters, including listing and deleting clusters.

SRE Weekly Issue #288 September 19th, 2021

Articles

Tammy Bryant Butow on SRE Apprentices

Faced with a difficult hiring market for SREs, they embarked on a well-designed, carefully thought out program to hire and train entry-level folks as SREs — and it worked!

Thomas Betts — InfoQ

It discusses how to train new SREs.
The key takeaways are below.
○ Hiring new site reliability engineers can be challenging. Dropbox decided to create a program to teach a cohort of students the skills necessary to be successful SREs.
○ A non-traditional approach to find engineers will naturally lead to a more diverse set of applicants. Bringing in people with different backgrounds can lead to new ways of looking at common problems.
○ Training should start with small tasks, letting the engineer learn by doing. Gradually these build from one-day tasks to longer, one-week, or one-month projects.
○ If your company creates a formal training program, it needs to be communicated to everyone, so there is understanding and proper expectations when the apprentices work with other employees.
○ In any new role, there is a need for understanding how to communicate with other people. Inviting junior employees to meetings allows them to see how senior members of the team interact to solve problems.

The things we find hardest in incident response

No matter how good your tooling is, how experienced you are, or how much you’ve prepared, incidents can still be hard.

Five people share about what they find hardest during incident response.

Chris Evans — incident.io

In line with the title, five people each comment on what they find hardest, covering the following points. Each keyword is highlighted in the article.
○ Working out the most highly leveraged role to play
○ Getting up to speed without disrupting the flow
○ Making decisions quickly as an individual vs context sharing and consensus
○ Keeping track of threads (virtual, not Slack)
○ Striking a balance between trusting your gut and systematically gathering evidence
○ Recovering from bad assumptions

The Developer Experience and the Role of the SRE Are Changing, Here’s How

This one has a lot of ideas about how to guide developers toward full ownership of their services in production.

Ambassador

Since it is covered in DEVOPS WEEKLY ISSUE #560 above, I will skip it.

6 modes of system resilience

In this post, I will cover the following modes of system resilience:
* Adaptive Response
* Superior Monitoring
* Coordinated Resilience
* Heterogeneous Systems
* Dynamic Repositioning
* Requisite Availability

Ash P — Cruform

The post first confirms the definition of system resilience and then explains the six modes above.

Useful knowledge and improvisation

Root cause of success: unpatched security vulnerability

TMW a security vulnerability allows you to break into your infrastructure, averting disaster during an incident.

Lorin Hochstein, with incident story by Eric Dobbs

It considers the two elements named in the title, useful knowledge and improvisation, which play an important role in incident response.

Heroku Incident #2347 Follow-Up

A migration didn’t go as planned, and customer traffic lost its way.

Heroku

Follow-up information on the Heroku incident above, which occurred between 2021-08-24 00:00 UTC and 2021-08-26 19:10 UTC.

Transforming DevOps with Human-in-the-Loop Automation

I’m a big believer in human-in-the-loop automation. My favorite part of this article was this:

A further problem is that full automation — which aims to take the human out of the picture — requires a complete, nuanced understanding of a system and all potential outcomes, paradoxically resulting in heightened system complexity.

Tina Huang — Transposit

The article develops the theme in the title through the following points.
○ Debunking the myth of ‘automate everything’
○ Keeping humans in the loop is critical for effective automation
○ Human-in-the-loop automation in action

Outages

Google Voice

Assembled

For some users, Assembled’s styling was not rendering and caused the application to be unusable.

“Root cause”: CSS

Apple Store
United Airlines
TikTok
Slack
GCash
Solana (Cryptocurrency)
They posted details in later tweets:
* thread 1
* thread 2

KubeWeekly #278 September 24th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

What to expect from KubeCon + CloudNativeCon North America 2021

Adrian Bridgwater, Computer Weekly

Adrian Bridgwater of Computer Weekly outlines what to expect from KubeCon + CloudNativeCon North America 2021 happening October 11–15 in Los Angeles or virtually from anywhere in the world. Learn more about the 200+ sessions, 17 co-located events, and activities. Hope to see you there!

An introductory article for KubeCon + CloudNativeCon North America 2021.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Introduction to APIClarity — A Wireshark for APIs

Zohar Kaufman & Alexei Kravtsov, Cisco

An approximately 42-minute session explaining “APIClarity”, a new open source tool that acts like Wireshark for APIs.
The webinar agenda and key discussion points are below.
○ Understanding the need for, and benefits of, open API specification reconstruction
○ A survey of existing open source solutions for open API specification reconstruction
○ An API Clarity demo
○ Potential use cases of APIClarity for API security

Optimizing and securing Kubernetes workloads with Polaris and Goldilocks

Andy Suderman, Fairwinds

An approximately 55-minute session that demonstrates how to use the open source tools Polaris and Goldilocks to scan Kubernetes workloads to improve resource utilization and security.

Kong Ingress Controller - Kubernetes Ingress on steroids

Viktor Gamov, Kong

An approximately 45-minute session that explains how to enable security declaratively, API rate limiting, and how to add native gRPC support.

Enable stateful applications on AWS with persistent storage for Kubernetes

Ananth Vaidyanathan, AWS

An approximately 25-minute session discussing different use cases, architectural techniques, and best practices for sharing and persisting data between K8s clusters using Amazon EFS serverless storage.

Operationalizing 300+ K8 clusters across the cloud

Niraj Amin, Rajarajan Pudupatti SJ, & David Botelho, Fidelity

An approximately one-hour session explaining the challenges faced by the platform team during their journey and the approaches adopted to solve them.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

IAM roles for Kubernetes service accounts — deep dive

Maciej Jarosiewicz

It shows you the nuts and bolts of how IAM and Kubernetes work together in harmony to provide you with a great experience of calling AWS services from your pods with no hassle. It covers the following points.
○ Introduction
○ IAM doesn’t trust service accounts, do you?
○ Let’s jot it down
○ Issues on top of issues
○ Federated identities
○ Swap That Swiftly
○ Making this work in your cluster
○ OIDC Identity Provider setup
○ IAM role setup
○ Off the hook
○ Summing up
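
The trust relationship behind the “OIDC Identity Provider setup” and “IAM role setup” steps can be sketched as follows. The code only builds and checks a policy document locally and calls no AWS API. The account ID, OIDC issuer, namespace, and service account name are placeholders, and `token_allowed` is a hypothetical helper that mimics the `StringEquals` comparison IAM performs on the claims of the service account token.

```python
from typing import Any, Dict


def build_trust_policy(
    account_id: str, oidc_issuer: str, namespace: str, service_account: str
) -> Dict[str, Any]:
    """Build a trust policy that lets one Kubernetes service account assume a role."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{oidc_issuer}:aud": "sts.amazonaws.com",
                        f"{oidc_issuer}:sub": (
                            f"system:serviceaccount:{namespace}:{service_account}"
                        ),
                    }
                },
            }
        ],
    }


def token_allowed(policy: Dict[str, Any], claims: Dict[str, str]) -> bool:
    """Mimic the StringEquals check: every condition must match the token claims."""
    for statement in policy["Statement"]:
        conditions = statement["Condition"]["StringEquals"]
        # Condition keys look like "<issuer>:<claim>", so the claim name is the last part.
        if all(
            claims.get(key.rsplit(":", 1)[1]) == expected
            for key, expected in conditions.items()
        ):
            return True
    return False


# Placeholder values, not taken from the article.
ISSUER = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE"
POLICY = build_trust_policy("111122223333", ISSUER, "default", "my-app")
```

The `sub` condition is what ties the role to a single service account: a token issued for any other namespace or account name fails the check, even though it comes from the same cluster.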

StackRox office hours (E3): Kubernetes network policies

Mandar Darwatkar and Chris Short, Red Hat

An approximately 65-minute session that starts with simple and practical steps to protect Kubernetes and then answers live questions.

KubeMQ is now available under open source license

KubeMQ

The KubeMQ web page announcing that the community version of “KubeMQ” is now available as an open source project.
The community version supports all messaging patterns, connectors, and bridges, can be deployed anywhere, and can run in production. The project also has a GitHub page.

APM with Prometheus and Grafana on Kubernetes Ingress

Joseph Caudle, Kong

It explains how running a Kubernetes environment using the open source Kong Ingress Controller can simplify the seemingly difficult task of deploying a full application performance monitoring (APM) stack.
A YouTube video of about 15 minutes is also embedded.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

New Google cloud deploy automates deploys to GKE

Victor Szalvay and S. Bogdan, Google Cloud

It introduces the release of Google Cloud Deploy, a managed, opinionated, continuous delivery service that makes continuous delivery to GKE easier, faster, and more reliable. A YouTube video of about two and a half minutes is also embedded.

Top open source CI/CD tools for Kubernetes to know

Michael Foster & Ajmal Kohgadai, Red Hat

Here’s a list of CI/CD tools you should know about in a Kubernetes environment, in no particular order. The following tools are covered, with pros, cons, and resources given for each.
○ Tekton
○ Argo Project
○ GitHub Actions
○ Jenkins X
○ OpenShift Pipelines
○ Spinnaker
○ CircleCI
○ GitLab

Ask an OpenShift admin (Ep 44): Kubernetes API deprecations

Andrew Sullivan, Chris Short, Rob Szumski, Camila Macedo, & Frederic Giloux, Red Hat

In Kubernetes v1.22, several APIs that were previously marked as deprecated have been removed. This approximately 65-minute session goes into the details of what is no longer available and explains the steps required to stop using the removed API versions and upgrade to the new APIs.
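
As a rough illustration of the kind of check involved, the sketch below flags manifests that still use API versions removed in v1.22 and suggests the replacement. The mapping lists only a few of the removals and is not exhaustive, `find_removed_apis` is a hypothetical helper, and manifests are assumed to be already parsed into dictionaries. The official deprecated API migration guide remains the authoritative list.

```python
from typing import Any, Dict, List, Tuple

# A partial mapping of (apiVersion, kind) pairs removed in v1.22 to their replacements.
REMOVED_IN_1_22: Dict[Tuple[str, str], str] = {
    ("extensions/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("networking.k8s.io/v1beta1", "Ingress"): "networking.k8s.io/v1",
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"): "apiextensions.k8s.io/v1",
    ("admissionregistration.k8s.io/v1beta1", "ValidatingWebhookConfiguration"): "admissionregistration.k8s.io/v1",
    ("rbac.authorization.k8s.io/v1beta1", "ClusterRole"): "rbac.authorization.k8s.io/v1",
}


def find_removed_apis(manifests: List[Dict[str, Any]]) -> List[str]:
    """Return one finding per manifest that uses a removed API version."""
    findings: List[str] = []
    for manifest in manifests:
        api_version = manifest.get("apiVersion", "")
        kind = manifest.get("kind", "")
        replacement = REMOVED_IN_1_22.get((api_version, kind))
        if replacement is not None:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{kind}/{name}: {api_version} -> {replacement}")
    return findings


# Hypothetical manifests: one uses a removed API version, the other does not.
MANIFESTS = [
    {"apiVersion": "extensions/v1beta1", "kind": "Ingress", "metadata": {"name": "web"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]

findings = find_removed_apis(MANIFESTS)
```

Changing the `apiVersion` string is usually not the whole migration, since the newer versions can also change the schema, so a check like this is only a starting point.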

Macquarie Bank looks to break free of IaaS

Ry Crozier, iTnews

An article based on Macquarie Bank’s announcement at the Google Cloud summit. The company plans to move to a “No Ops” model for managing the public cloud, which will ultimately be the home of all its systems.

Bug Bash presented by CNCF + Sonatype

CNCF

The registration page for the event above, which is scheduled to run from October 13, 2021 at 8:00 to October 14, 2021 at 18:00 (PDT). If you are interested, you can register.

Upcoming CNCF Online Programs

Live Webinar

September 28 at 10am PT: Kanister — Application level data operations on Kubernetes presented by Michael Cade & Pavan Devaraj, Kasten by Veeam — RSVP

Cloud Native Live

September 29 at 9am PT: Trace-based testing with OpenTelemetry presented by Michael Haberman, Aspecto — RSVP

On-demand Webinars

September 30: Shifting security left-simplifying security for K8s & OpenShift environments presented by Jody Hunt, CyberArk — RSVP
September 30: Redefining cloud native debugging presented by Not Goldman, Rookout — RSVP
September 30: OpenEBS 3.0: What’s in it? presented by Kiran Mova, MayaData — RSVP
September 30: The thing about your software supply chain… presented by Eylam Milner, Argon Security — RSVP

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How did you find these articles? Did any of them catch your interest?

There is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara