SRE / DevOps / Kubernetes Weekly Collection #85 (Week 37, 2021)

Sep 19, 2021

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #559 September 12th, 2021
SRE Weekly Issue #287 September 12th, 2021
KubeWeekly #277 September 17th, 2021

DEVOPS WEEKLY ISSUE #559 September 12th, 2021

News

A good post on the early decisions (in this case around data storage) that can lead to cost control discussions later. You can apply this to other systems as well.

The title is “(Over) Pay As You Go for Your Data store”.
It outlines the pitfalls of “pay-as-you-go” and the guidelines they have come up with to design their “next gen” data store solution.

Details on combining ttl.sh (which provides anonymous and ephemeral container registries) and Cosign to sign the images. A few interesting use cases for this sort of thing.

The title is “ttl.sh and cosign: Signing an anonymous & ephemeral Docker image registry.”
It explains what the title and the editor's comment above describe.

A critical review of the recently released Kubernetes security guidance from the NSA, including some up-to-date recommendations.

The title is “NSA & CISA Kubernetes Security Guidance — A Critical Review”.
The guidance contained in the Cybersecurity Technical Report (CTR) above is explained in three points: “The Good,” “The Bad,” and “The Complex.”

Authentication of the Docker socket is all or nothing, but you can always use a reverse proxy for finer-grained control. A good example using Caddy.

The title is “Restricting Docker Access With a Reverse Proxy”.
As the title and the editor's comment above suggest, it explains how to filter access to the Docker API by request path, using “Caddy” as a reverse proxy.
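
To make the idea of path-based filtering concrete, here is a minimal Python sketch of the kind of allowlist check a reverse proxy in front of the Docker socket might perform. It is not the Caddy configuration from the article: `ALLOWED` and `is_allowed` are hypothetical names chosen for illustration, and the three read-only Docker Engine API paths are only examples of what one might choose to expose.

```python
import re

# Hypothetical allowlist: (HTTP method, path pattern) pairs that are permitted.
# An optional API version prefix such as /v1.41 is accepted.
ALLOWED = [
    ("GET", re.compile(r"^(/v[0-9.]+)?/containers/json$")),
    ("GET", re.compile(r"^(/v[0-9.]+)?/version$")),
    ("GET", re.compile(r"^(/v[0-9.]+)?/_ping$")),
]


def is_allowed(method: str, path: str) -> bool:
    """Return True only if the request matches an allowlisted rule."""
    # Ignore any query string when matching the path.
    path_only = path.split("?", 1)[0]
    return any(
        method.upper() == allowed_method and pattern.match(path_only) is not None
        for allowed_method, pattern in ALLOWED
    )


if __name__ == "__main__":
    # A read-only listing is let through; container creation is rejected.
    print(is_allowed("GET", "/v1.41/containers/json?all=1"))
    print(is_allowed("POST", "/containers/create"))
```

Everything not explicitly listed is denied, which is the usual safe default for this kind of filter.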

An interesting observation about the relationship between observability and the needs of auditors for compliance.

The title is “Security + Observability = Compliance”.
It briefly explains the author's thinking behind the equation in the title.

Whenever you’re building a new API, or consuming an API of another system, you quickly build up opinions about what a good API feels like. This post has some good advice for both processes, practices and principles.

The title is “How We Design Our APIs at Slack”.
It describes the API design principles and the new API specification, review, and testing process.
The six design principles are as follows.

Do one thing and do it well
Make it fast and easy to get started
Strive for intuitive consistency
Return meaningful errors
Design for scale and performance
Avoid breaking changes

The four steps of the design process are as follows.

Write an API spec
Internal API review
Early partner feedback
Beta testing

Tools

SLO Tracker is a dashboard application for displaying SLO and error budget information, based on integration to gather SLI data from Prometheus, Grafana, Datadog and other monitoring tools.

The GitHub page of “SLO-Tracker”, which provides a simple way to track SLOs and error budgets. SLO-Tracker can be integrated with a few alerting tools via webhooks to receive SLO-violating incidents.
There is also a web page and a blog post introducing SLO-Tracker.
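
As a reminder of the arithmetic behind an error budget, here is a small Python sketch. It is not taken from SLO-Tracker; the function names and the 30-day window are assumptions made for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 21.6 minutes of downtime, about half of the budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))
```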

EKS Anywhere is an option to run AWS EKS (the AWS Kubernetes service) on your own infrastructure. The main use case is to standardise the management side of operating a service like this.

The GitHub page of “Amazon EKS Anywhere (EKS-A)”, which has become GA. It is a CLI tool that extends the consistent cluster management experience of Amazon EKS (eksctl) to your on-premises Kubernetes clusters.

Note that although the name is similar to ECS Anywhere, it is a completely different concept, and that EKS Distro (EKS-D) and Amazon EKS Connector are also part of it.

SRE Weekly Issue #287 September 12th, 2021

Articles

Industry Interviews: Colm Doyle, Incident Commander at Slack

Lots of details about how Slack does incident response in this one.

Stephen Whitworth — incident.io

As the title suggests, it details how Colm Doyle became an Incident Commander (IC) at Slack, how incidents are handled, and what happens in the first 5 minutes after getting paged.

Five Ways Developers Can Help SREs

This list also gives an interesting insight into the way this company does SRE.

Mayank Gupta and Merlyn Shelley — Squadcast

As the title suggests, it lists the following five best practices that developers can adopt to make SRE work easier.

Scaling The Platform With The Concept Of A 12-factor App Method
Sharing Performance Testing Data Insights
Significance of Documentation and Configuration files
AIOps Supported System Admin Functionalities
Increasing Observability Of The System

Incident Review — What Was Behind the September 7 Spectrum Outage: A Case of Dr. BGP Hijack or Mr. BGP Mistake?

Oh BGP, you rascally little routing protocol.

Alessandro Improta and Luca Sani — Catchpoint

The article analyzes and comments on the network outage that hit Spectrum cable customers in the Midwest on September 7, 2021, at 16:36 UTC, which was caused by BGP route advertisements, from the viewpoint posed in the title.

What is an SRE?

A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.

The article covers various facets of SRE and acknowledges that SREs can perform many roles.

JJ Tang — Rootly

It addresses questions about technical roles and positions, among others, to provide a complete definition of SRE. It also explains what SREs actually do and offers tips on how to help the SREs in your organization be the best they can be.

The Atlantic GLIDER, Air Transat flight 236! Explained by Mentour Pilot

Another really excellent air accident story with lots of great talk about mental models and confirmation bias. The crew saw lots of disparate indications that each didn’t point to anything in particular and each wasn’t a huge problem on its own. That, coupled with confirmation bias, helped them miss what might seem obvious in hindsight.

Mentour Pilot

A YouTube video that explains one of the most famous aviation accidents, “Air Transat flight 236”, covering the background to the incident, how the crew handled it, and the resulting safety recommendations. The final report is also available.

Outages

KubeWeekly #277 September 17th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Congratulations to Envoy on the 5 year anniversary of the project!

Matt Klein, Envoy

Congratulations to Envoy on their fifth anniversary of the project! Hear from Matt Klein (the project creator) on Envoy’s brief prehistory and history of the project, along with some of the lessons learned along the way.

As mentioned above, this is a post by the project creator Matt Klein commemorating the 5th anniversary of the Envoy project. It covers the lessons learned over time as this large-scale OSS project has grown.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Kata and Arm, a secure alternative in the 5G space

Kiel Faller, Arm

An approximately 45-minute session that demonstrates the 5G O-RAN components on Arm infrastructure and their importance in the 5G space, and discusses the potential impact of using open source components, their cost savings, and increased customizability.

Building an HA control plane for Tinkerbell with Kube-vip

Jason DeTiberus, Equinix

An approximately 1-hour session that reviews updates to the Tinkerbell project, explains how the control plane was built, and describes the role that kube-vip plays.

Moving from CLIs to control planes with Crossplane

Viktor Farcic, Upbound

An approximately 30-minute session explaining the benefits of managing infrastructure, services, and apps using the Universal Control Plane (Crossplane).

Using CSI snapshots to backup and restore your data in Kubernetes

Michael Courcy, Kasten by Veeam

A 20-minute session explaining the CSI snapshot feature and how it fits into the Kubernetes storage architecture.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

NSA & CISA Kubernetes security guidance — A critical review

Iain Smart, NCC Group

Since it is covered above under DEVOPS WEEKLY ISSUE #559, I will skip it.

Top 9 file integrity monitoring (FIM) best practices

Alejandro Villanueva, Sysdig

As the title suggests, it describes four types of FIM (File Integrity Monitoring) focusing on host and container security, and the following nine best practices.
Prepare an asset inventory
1: Scope which files and directories need to be monitored
2: Define appropriate permissions
3: Define a baseline
Detect drift
4: Shift left with image scanning policies
5: Detect real-time threats with runtime policies
Notify, investigate, and respond
6: Implement an automated alert and response mechanism
7: Gather forensics data for further investigation
Compliance and Benchmarks
8: Stick to compliance requirements
9: Run automated benchmarks
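
The “define a baseline” and “detect drift” steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Sysdig or any particular FIM product works; `file_digest`, `take_baseline`, and `detect_drift` are hypothetical helper names, and a real tool would also track permissions, ownership, and real-time events.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def take_baseline(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def detect_drift(baseline: dict[str, str],
                 current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots and report added, removed, and modified files."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "modified": sorted(
            name for name in set(baseline) & set(current)
            if baseline[name] != current[name]
        ),
    }
```

A typical use would be to call `take_baseline` once on a known-good directory, store the result, and later pass a fresh snapshot to `detect_drift` to decide whether to raise an alert.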

DataRoaster is now open-sourced, why I created it

Kidong Lee, ITNEXT

It announces the open-sourcing of “DataRoaster”, which provides a data platform that runs on Kubernetes.
A 12-minute demo video of DataRoaster is also available.

Why data scientists shouldn’t need to know Kubernetes

Chip Huyen

As the title suggests, the article argues that it is good for data scientists to own the entire tech stack, but that instead of wrestling with YAML files, they should be able to rely on good infrastructure abstraction tools that let them focus on actual data science without knowing Kubernetes.

Solving API authorization challenges in multi-cloud environments

Nima Moghadam, Kong

It explains the topic in the title using figures and code. The bottom line is that the use of OPA and declarative policies has become very popular, especially in API Ops, for the following reasons:
Easy to integrate
Declarative
Extremely powerful and flexible
Platform agnostic

Rate limiting with the HAProxy Kubernetes Ingress Controller

Jim O’Connell, HAProxy

This article describes how to use the overall rate limit to mitigate the effects of events such as DDoS.
However, HAProxy Kubernetes Ingress Controller offers even more fine-grained control to fend off DDoS attacks using several annotations that can help you build a powerful first line of defense on an IP-by-IP basis.
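
To illustrate what rate limiting on an IP-by-IP basis means, here is a minimal sliding-window limiter in Python. It is a conceptual sketch only and does not reflect how HAProxy implements rate limiting or how its annotations are configured; the class name `PerIPRateLimiter` and its parameters are assumptions made for this example.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerIPRateLimiter:
    """Sliding window: at most `limit` requests per `period` seconds per IP."""

    def __init__(self, limit: int, period: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.period = period
        self.clock = clock  # injectable so the limiter can be tested
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        """Record a request from `ip` and return whether it is within the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.period:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # a proxy would typically answer with HTTP 429 here
        hits.append(now)
        return True
```

Because each client IP has its own window, one noisy client exhausting its quota does not affect requests from other addresses.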

Deploy OpenFaaS to Linode with K3sup

Alex Ellis, OpenFaaS

As the title suggests, the following points explain how to deploy OpenFaaS to Linode using a virtual machine and K3sup.
○ Introduction
○ Tutorial
○ Create an account on Linode
○ Create a VM on Linode
○ Pre-reqs
○ Install K3s using K3sup
○ Install OpenFaaS
○ Configure an Ingress Controller and TLS certificate
○ Wrapping up
○ Getting in touch and supporting our work

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Prodfiler, with Thomas Dullien

Craig Box, Kubernetes Podcast from Google

A Kubernetes podcast by Google employees. This time the host is Craig Box, with guest host Jimmy Moore.
Their guest is Thomas Dullien, the co-creator of “Prodfiler”.
The topics that interested me in the news of the week are as follows.
○ Backup for GKE
○ Kubernetes multi-cluster panel on October 6
○ Tanzu Kubernetes Grid 1.4

Why we created the Prometheus Conformance Program

Richard Hartmann, Grafana Labs

As the title suggests, it introduces the reasons for creating the Prometheus Conformance Program.
You can learn more about the Conformance Program design, available test suites, current test results, and how to apply for the official Prometheus compatibility mark in the following session on October 14 at KubeCon + CloudNativeCon NA.
○ The Prometheus Conformance Program — Richard Hartmann, Grafana Labs

Crossplane is now a CNCF incubating project

Jared Watts, Crossplane blog

As the title suggests, Crossplane reports that its maturity level has been promoted from CNCF sandbox to incubation, looking back on the following points and touching on the future.
○ A Consistent Vision
○ The Community Keeps Growing
○ First Major Milestone Ready for Production
○ Strong Partnerships with the Ecosystem
○ Production Adoption
○ Conformance in the Ecosystem
○ The Road Ahead

Google’s Sqlcommenter now extending the vision of OpenTelemetry to databases

Nimesh Bhagat, Google Cloud

Since it was covered in last week's KubeWeekly #276, I will skip it.

Cloud Native Chaos and Telcos — Enforcing reliability and availability for telcos

W.Watson, Vulk Coop & Karthik S., LitmusChaos

The explanation is based on the keywords in the title. The conclusion is below.
○ Borrowing from the lessons learned when applying chaos testing to cloud native environments, we should use declarative chaos specifications to test telecommunication infrastructure in tandem with its development and deployment. The CI/CD tradition of “pull the pain forward” with a focus on MTTR will produce the type of highly available and reliable systems that cloud native telecommunication systems will need to be.

7 microservices best practices for developers

Michael Bogan, Kong

In line with the title, it explains the following 7 points.

Small Application Domain
Separation of Data Storage
Communication Channels
Compatibility
Orchestrating Microservices
Microservices Security
Metrics and Monitoring

NSA & CISA Kubernetes security guidance

Lars Larsson, Elastisys

It summarizes the main takeaways of the Kubernetes Hardening Guidance and provides additional insights based on the author's personal experience with cloud security.

KubeCon + CloudNativeCon North America preview with Constance Caramanolis and Stephen Augustus

The CUBE

As the title suggests, a 21-minute session in which the two co-chairs of KubeCon + CloudNativeCon North America are interviewed about the event and talk about its highlights.

Introducing the CNCF End User Journey Report: First up, Spotify

CNCF

The CNCF End User Community has published its first End User Journey report, which features Spotify, and this article gives an outline of it.
The End User Journey report focuses on active end user community members. It shows how these organizations have grown as technology leaders and have benefited from joining the CNCF end-user community.

Upcoming CNCF Online Programs

*edited as the Kubernetes 1.22 release webinar has been rescheduled

Live Webinar

September 21 at 10am PT: Introduction to APIClarity — A Wireshark for APIs presented by Zohar Kaufman & Alexei Kravtsov, Cisco

Cloud Native Live

September 22 at 9am PT: Optimizing and securing Kubernetes workloads with Polaris and Goldilocks presented by Andy Suderman, Fairwinds

On-demand

September 23: Kong Ingress controller — Kubernetes Ingress on steroids presented by Viktor Gamov, Kong
September 23: Enable stateful applications on AWS with persistent storage for Kubernetes presented by Ananth Vaidyanathan, AWS

CNCF End User Lounge Livestream

September 23 at 9am PT: Operationalizing 300+ K8 clusters across the cloud presented by Niraj Amin, Rajarajan Pudupatti SJ, & David Botelho, Fidelity

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How about those articles? Are you interested in any of them?

To be honest, there is some content I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara