SRE / DevOps / Kubernetes Weekly Collection #91 (Week 43, 2021)

Oct 31, 2021

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, adding some comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #565 October 24th, 2021

News

KubeCon finished up in LA a week and a bit ago, and we have several posts this week recapping the event, with lots of links, observations and some opinions.

This item covers three articles recapping KubeCon + CloudNativeCon NA 2021.
The title of the first article is “KubeCon NA 2021 Key Takeaways: DevX, Security, and Community”. The explanation is based on tweets by the author and other participants.
The title of the second article is “KubeCon 2021 Los Angeles Wrapup”. It looks back on the event through tweets.
The title of the third article is “KubeCon 2021 Top 3 Announcements: APIClarity, HashiCorp Waypoint, and Dell EMC CSM”. The author introduces three new products that caught his eye because of their high potential for pushing back the roadblocks that are currently slowing down application modernization.

An insightful post on the sometimes hard-to-define distinction between application and infrastructure. A static/dynamic linking analogy, how the Kubernetes API and Crossplane fit in, and the potential for a new type of marketplace for applications.

The title is “INFRASTRUCTURE IN YOUR SOFTWARE PACKAGES”.
By detailing the current situation and assessing how software delivery has evolved over time, it analyzes what the future of shipping infrastructure alongside software will look like.

Game servers are a super interesting scaling challenge. This post, about recent outages for a large game, goes into some great operational, data storage and architecture details.

The title is “Diablo II: Resurrected Outages: An explanation, how we’ve been working on it, and how we’re moving forward”.
To provide some transparency, the post explains the causes of the multiple server outages that have occurred since the launch of “Diablo II: Resurrected” and the steps the team in charge has taken to address them. It also describes how the team plans to move forward.

A look at how one team is evolving a large NFS file storage setup towards something that is easier to scale horizontally and automatically.

The title is “Iterating on how we do NFS at Wikimedia Cloud Services”.
As the Editor commented above, Wikimedia’s cloud services team reviewed how it runs NFS and shares the improvements it has made.

More deep internet networking insights, this time looking under the hood about what makes a valid hostname. It’s worse than you think.

The title is “What’s in a hostname?”.
It dives deep into what makes a valid hostname, checking each point against the relevant RFCs.
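
As a rough illustration of the kind of rules involved (this is not taken from the article), here is a minimal sketch of a conservative hostname check in the spirit of RFC 952 and RFC 1123: dot-separated labels of 1 to 63 characters, made of letters, digits and hyphens, with no leading or trailing hyphen, and a commonly cited overall limit of 253 characters. The function name `is_valid_hostname` is my own, and real-world software is often more permissive than this.

```python
import re

# One label: 1-63 characters, letters/digits/hyphens,
# and no hyphen at the start or the end.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Conservative check in the spirit of RFC 952 / RFC 1123 (a sketch)."""
    if name.endswith("."):
        name = name[:-1]  # accept a single trailing dot (fully qualified form)
    if not name or len(name) > 253:
        return False
    # Every dot-separated label must match the label pattern in full.
    return all(_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ["example.com", "3com.com", "-bad.example", "under_score.example"]:
        print(candidate, is_valid_hostname(candidate))
```

Note that underscores, which do appear in some DNS names, are rejected here because this sketch only covers the classic hostname rules.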

A good introduction to the extensibility benefits of Kubernetes, looking at the high-level API, custom resources and the operator pattern.

The title is “Exploring Kubernetes Operator Pattern”.
It takes a closer look at the Operator pattern, using as many diagrams as possible to show which Kubernetes components are involved in implementing an Operator and why Operators feel like “first-class Kubernetes citizens”.
It explains that the Kubernetes API is probably the main driver of Kubernetes extensibility.
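
To make the idea concrete, here is a minimal sketch of the reconcile loop that sits at the heart of the Operator pattern: compare the desired state declared in a custom resource with the observed state, and take one step toward closing the gap. Everything here is hypothetical and in-memory (`FakeCluster`, `reconcile`, `run_until_stable` are names I made up); a real Operator would watch the Kubernetes API through a client library instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeCluster:
    """In-memory stand-in for the Kubernetes API server (hypothetical)."""
    desired_replicas: Dict[str, int] = field(default_factory=dict)    # custom resource "spec"
    running_pods: Dict[str, List[str]] = field(default_factory=dict)  # observed state


def reconcile(cluster: FakeCluster, name: str) -> str:
    """Move the observed state one step closer to the desired state."""
    desired = cluster.desired_replicas.get(name, 0)
    pods = cluster.running_pods.setdefault(name, [])
    if len(pods) < desired:
        pods.append(f"{name}-{len(pods)}")  # "create" one pod
        return "scaled-up"
    if len(pods) > desired:
        pods.pop()  # "delete" the newest pod
        return "scaled-down"
    return "in-sync"


def run_until_stable(cluster: FakeCluster, name: str, max_steps: int = 100) -> int:
    """Call reconcile repeatedly until nothing is left to do; return the step count."""
    steps = 0
    while reconcile(cluster, name) != "in-sync":
        steps += 1
        if steps >= max_steps:
            raise RuntimeError("state did not converge")
    return steps


if __name__ == "__main__":
    cluster = FakeCluster(desired_replicas={"web": 3})
    print(run_until_stable(cluster, "web"), cluster.running_pods["web"])
```

The point of the design is that `reconcile` is idempotent: calling it again once the states match changes nothing, which is what lets a controller safely re-run it on every event.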

A post on introducing a production readiness review process, in particular in smaller teams.

The title is “How we’re building a production readiness review process at Grafana Labs”.
It describes the Production Readiness Review (PRR) process and some of the best practices developed while building it.
PRR is a process that started at Google and is described in the company’s well-known SRE book as the first step in SRE’s efforts.

Tools

hcltm is a tool for describing a threat model in HCL, and then generating various outputs from it including markdown documents and data flow diagrams.

This is the GitHub page of “hcltm”, which provides a DevOps-first approach to documenting system threat models, focusing on the following goals:
○ Simple text-file format
○ Simple cli-driven user experience
○ Integration into version control systems (VCS)

Snowcat is a tool that gathers and analyzes the configuration of an Istio cluster and audits it for potential violations of security best practices.

As mentioned above, this is “Snowcat”, a tool that collects and analyzes the configuration of Istio clusters and audits it for potential violations of security best practices.
Click here for the GitHub page.

SRE Weekly Issue #293 October 24th, 2021

Articles

The Downside of Hospitals Becoming “Highly Reliable”

It’s one thing to say you accept call-outs of unsafe situations — it’s another to actually do it. This cardiac surgeon shares what it’s like when high reliability organizations get it wrong.

Robert Poston, MD

On hospitals that aim to become high reliability organizations (HROs), the article says: “A lack of transparency and passion leaves them with a series of well packaged ideas that end up looking like high reliability but never able to operate like one.”
The article is dated November 6, 2019.

Diablo II: Resurrected Outages: An explanation, how we’ve been working on it, and how we’re moving forward

The game has been a victim of its own success, and the developers have had to put in quite a lot of work to deal with the load.

PezRadar — Blizzard

Since it is covered in DEVOPS WEEKLY ISSUE #565 above, I will skip it.

An Introduction to Incident Response Roles

This includes some lesser-known roles like Social Media Lead, Legal/Compliance Lead, and Partner Lead.

JJ Tang — Rootly

This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Through the following points, it explains how to define incident response roles in order to build a team that works as effectively and efficiently as possible.
○ What is an incident response team?
○ Structuring incident response roles
○ Other potential incident response roles
○ Conclusion: The best incident response team is a flexible team

Postmortem Pitfalls

There are a couple of great sections in this article, including “blameless” retrospectives that aren’t actually blameless, and being judicious in which remediation actions you take.

Chris Evans — incident.io

As the title suggests, it explains the pitfalls of postmortems through the following points.
○ When blameless postmortems actually aren’t
○ Incidents are always going to happen again
○ Take time before you commit to all the actions
○ Incidents as a process, not an artifact

The danger of hidden functional roles

I love the idea that chaos monkey could actually be propping your infrastructure up. Oops.

Lorin Hochstein

The opening story, about someone unintentionally playing the role of the family alarm clock, and the way it connects to Chaos Monkey in the latter half are both good. I had never considered the possibility that Chaos Monkey, by terminating instances, was replacing them before a problem could occur.

What’s in a hostname?

I have to say, I’m really liking this DNS series.

Jan Schaumann

Since it is covered in DEVOPS WEEKLY ISSUE #565 above, I will skip it.

Crew member yelled ‘cold gun’ as he handed Alec Baldwin prop weapon, court document shows

What? Why the heck am I including this here?

First, let’s all keep in mind that this situation is still very much unfolding, and not much is concretely known about what happened. It’s also emotionally fraught, especially for the victims and their families, and my heart goes out to them.

The thing that caught my eye about this article is that this looks like a classic complex system failure. There’s so much at play that led to this horrible accident, as outlined in this article and others, like this one (Julia Conley, Salon).

Aya Elamroussi, Chloe Melas and Claudia Dominguez — CNN

At first glance, I thought, “Why is this article here?” As mentioned in the Editor’s comment above, it is included because it looks like a classic complex system failure.

Outages

Google Search Alerts
I feel vindicated. I knew something was wrong with my search alert RSS feeds last week! Putting SRE Weekly together without Google search alerts can be… challenging.
GitHub
Netflix
Twitter
Twitch
TikTok

KubeWeekly #281 October 29th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Kubernetes Podcast from Google: Jasmine James, KubeCon + CloudNativeCon co-chair

Jasmine James is an Engineering Manager within the Engineering Effectiveness organization at Twitter, focused on their internal developer experience. She was also the co-chair of the recent KubeCon + CloudNativeCon. Jasmine talks about the events she’s led and the ones to come, and her feelings about being in a room in front of other people — up to 3,000 of them — for the first time in a long while.

This is the Kubernetes Podcast, hosted by Google employees. This time the host is Craig Box, with guest host Jimmy Moore.
The guest is Jasmine James, an Engineering Manager in Twitter’s Engineering Effectiveness organization and co-chair of KubeCon + CloudNativeCon NA 2021.
The topics that interested me in the news of the week are as follows.
○ Google Cloud Next:
● BigQuery Omni is GA
● Managed Service for Prometheus
○ KubeCon + CloudNativeCon
● Cilium joins the CNCF
● Cloud Native security microsurvey results
○ Kubernetes documentary trailer

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Securing your workload communications with Open Service Mesh

Phillip Gibson, Microsoft

An approximately 46-minute session that introduces the latest integrations and techniques for enhancing workload communication using Open Service Mesh.

Introducing Kubescape — open-source tool to test Kubernetes deployment

Amir Kaushansky, ARMO

An approximately 50-minute session that explains how to operate Kubescape, supported frameworks, key features, and CI/CD integration.

How to design a multi-cloud deployment

Dave Blakely, Snapt

An approximately 40-minute session that explains the purpose of migrating to multi-cloud, how to select a cloud provider, how to deploy to multi-cloud, and how to keep multi-cloud secure.

Project Calico network policies

Nigel Douglas, Tigera

An approximately 41-minute session that covers the topic in the title through the following points.
○ How does Project Calico enable network policies in K8s?
○ How to implement basics?
○ Creating and managing policies in your clusters

Understanding GitOps usecases

Abubakar Siddiq Ango, GitLab

An approximately 30-minute session explaining GitOps, its use cases, and if/when you need GitOps.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

What you need to know about Kubernetes Network Policy

Mike Calizo, Red Hat

It explains Kubernetes Network Policy through the following points, with example YAML manifests.
○ The NetworkPolicy concept
○ Applying a network policy
○ NetworkPolicy limitations
○ Summary
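
For reference, the general shape of a NetworkPolicy can be sketched as follows. This is not the article’s own example: the helper name `allow_ingress_policy`, the labels, the namespace and the port are all assumptions for illustration. It builds the manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts in the same way as YAML.

```python
import json
from typing import Dict


def allow_ingress_policy(
    name: str,
    namespace: str,
    target_labels: Dict[str, str],
    client_labels: Dict[str, str],
    port: int,
) -> dict:
    """Build a NetworkPolicy that lets only selected pods reach the target pods on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy applies to.
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Only pods carrying these labels (in the same namespace) may connect.
                    "from": [{"podSelector": {"matchLabels": client_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names and labels, purely for illustration.
    policy = allow_ingress_policy(
        "allow-frontend-to-api", "demo", {"app": "api"}, {"app": "frontend"}, 8080
    )
    print(json.dumps(policy, indent=2))
```

Keep in mind that a NetworkPolicy only takes effect when the cluster’s network plugin supports it.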

The life of an API gateway request (part 1)

Enrique García Cota, Kong

This is part 1 of a series that discusses how Kong Gateway handles requests, breaking the abstraction space into the following four layers. A video of about 13 minutes is embedded.
○ Infrastructure
○ Nodes
○ Phases
○ Plugins

Optimizing Kubernetes applications with Kubecost and Spinnaker

Alex Thilen, Kubecost

It explains the topic in the title with a processing flow and screenshots of the UI. The following two videos are embedded.
○ Demo of Kubecost + Spinnaker integration in action
○ Spinnaker Workshop: Cost Optimization with Kubecost’s founders

Announcing HAProxy Kubernetes Ingress Controller 1.7

Ivan Matmati & Zlatko Bratkovic, HAProxy

The changes introduced in version 1.7 of the HAProxy Kubernetes Ingress Controller are described in detail under the following headings.
○ Custom Resource Definitions
○ CRD Examples
○ Distribution of connections to services/pods
○ New ALPN option
○ Implementation specific path type in ingress rules
○ Multiarch Support
○ s6 Init system
○ Nightly builds
○ External mode
○ Contributions
○ Conclusion

Connecting services to Kubernetes clusters with inlets, VPC Peering and direct uplinks

Alex Ellis, OpenFaaS Ltd.

It explains how to connect services to Kubernetes clusters using Inlets, VPC Peering, and direct uplinks.

Transitioning from Monolith to Microservices

Michael Bogan, Dev Spotlight

As the title suggests, it introduces the transition to microservices, and I think it is very good that it does so only after listing the following points under “You might not need microservices architecture if …” at the beginning.
○ You’re not having trouble scaling.
○ Your monolithic architecture is already flexible enough to meet market demands.
○ You’re not having issues with deploying your application.

Securing a Kubernetes pod with Regula and Open Policy Agent

Becki Lee, Fugue

It shows how to run Regula against a Kubernetes manifest to detect insecure pods, and then explains how to secure them.

Structure testing for Docker containers

Tomas Fernandez, Semaphore CI

As a way to test Docker containers before deploying them, it introduces Google’s open source container testing tool “Container Structure Tests”.

Kustomize tutorial: Creating a Kubernetes app out of multiple pieces

Nick Chase, Mirantis

The content of the title is explained in the following items.
○ What is Kustomize?
○ Benefits of Using Kustomize
○ Installing Kustomize
○ Combining Specs
○ Managing Multiple Directories
○ Changing Parameters for a Component Using Kustomize Overlays
○ Creating a Kustomize Patch
○ Using Kubectl with Kustomize
○ Example: Kustomize Secret Generator
○ Conclusion

Kube-fledged: Cache container images in Kubernetes

Senthil Raja Chermapandian, Ericsson

It explains how to use the open source project “kube-fledged” to build and manage a cache of container images in a Kubernetes cluster.

Kubernetes logging in production

Kentaro Wakayama

The content of the title is explained in the following structure. The points are very well organized and easy to follow.
○ Logging Architectures
○ Logging Patterns
○ Pros and Cons
○ Putting Theory into Practice
○ Conclusion

How to develop a custom provider in Terraform

Saravanan Gnanaguru, InfraCloud Technologies

This article is intended for users who already have a basic knowledge of Terraform and how to use it, and who are likely to develop custom Terraform providers.

Database security best practices on Kubernetes

Jonathan S. Katz, Crunchy Data

The content of the title is explained in the following items.
○ Run as an Unprivileged User
○ Encrypt your Data
○ Credential Management
○ Keep Database Software Up-to-Date
○ Follow Configuration Best Practices
○ Limit Where You Can Write
○ Securing The “Weakest Link”
○ Conclusion

How Linkerd retries HTTP requests with bodies

Eliza Weisman, Linkerd

It describes how Linkerd proxies minimize copying and allocation to reduce the performance overhead of buffering request bodies, how the proxies determine which requests can be retried, and some edge cases to consider.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Kubernetes co-founder Joe Beda interview

euro interview

Since it was covered in last week’s DEVOPS WEEKLY ISSUE #564, I will skip it.

Kubernetes cost management and analysis guide

Kasper Siig, CloudForecast

It examines the main reasons why it’s so difficult to manage costs with Kubernetes. As a way to significantly improve cost management, it also shows how to use the AWS Pricing Calculator to estimate the cost of running a workload on a custom Kubernetes cluster compared with running it on an EKS cluster.

I attended Kubecon 2021 in-person, here are my top six takeaways

Amanda Mitchell, Chronosphere

The author, who attended KubeCon + CloudNativeCon NA 2021 in person, explains the following six takeaways.
1) A green light for more (safe) in-person events
2) Quantity isn’t everything
3) KubeCon 21 felt like old times (aka two years ago)
4) Love notes and theCube
5) Observability and other key themes
6) Inclusivity themes abound at KubeCon 21

KaaS, KPaaS & CaaS: Explained and compared

Lars Larsson, Elastisys