SRE / DevOps / Kubernetes Weekly Collection #85 (Week 37, 2021)

Yoshiki Fujiwara
Sep 19, 2021
  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
  • Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #559 September 12th, 2021
SRE Weekly Issue #287 September 12th, 2021
KubeWeekly #277 September 17th, 2021

DEVOPS WEEKLY ISSUE #559 September 12th, 2021

News

A good post on the early decisions (in this case around data storage) that can lead to cost control discussions later. You can apply this to other systems as well.

  • The title is “(Over) Pay As You Go for Your Data store”.
  • It outlines the pitfalls of “pay-as-you-go” and the guidelines they have come up with to design their “next gen” data store solution.

Details on combining ttl.sh (which provides anonymous and ephemeral container registries) and Cosign to sign the images. A few interesting use cases for this sort of thing.

  • The title is “ttl.sh and cosign: Signing an anonymous & ephemeral Docker image registry.”
  • As the title and the editor's comment above indicate, it explains how to sign images pushed to ttl.sh with Cosign.

A critical review of the recently released Kubernetes security guidance from the NSA, including some up-to-date recommendations.

  • The title is “NSA & CISA Kubernetes Security Guidance — A Critical Review”.
  • The guidance contained in the Cybersecurity Technical Report (CTR) above is explained in three points: “The Good,” “The Bad,” and “The Complex.”

Authentication of the Docker socket is all or nothing, but you can always use a reverse proxy for finer-grained control. A good example using Caddy.

  • The title is “Restricting Docker Access With a Reverse Proxy”.
  • As the title and the editor's comment above indicate, it explains how to filter access to the Docker API by request path with a reverse proxy, using Caddy.
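
The path-filtering idea can be sketched in a few lines of Python. This is only an illustration of the allowlist decision a reverse proxy makes, not the Caddy configuration from the article. The `ALLOWED` rules and the `is_allowed` function are hypothetical names, and the two Docker Engine API paths are just examples of read-only endpoints one might expose.

```python
import re

# Hypothetical allowlist: (HTTP method, path pattern) pairs the proxy would forward.
# An optional API version prefix such as /v1.41 is tolerated.
ALLOWED = [
    ("GET", re.compile(r"^(/v[0-9.]+)?/_ping$")),
    ("GET", re.compile(r"^(/v[0-9.]+)?/containers/json$")),
]


def is_allowed(method: str, path: str) -> bool:
    """Return True only if the request matches an allowlisted rule."""
    return any(method == m and pattern.match(path) for m, pattern in ALLOWED)
```

With these assumed rules, listing containers is forwarded while any request that creates or deletes something is rejected, because everything not explicitly listed is denied.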

An interesting observation about the relationship between observability and the needs of auditors for compliance.

  • The title is “Security + Observability = Compliance”.
  • It briefly explains the author's view behind the equation in the title.

Whenever you’re building a new API, or consuming an API of another system, you quickly build up opinions about what a good API feels like. This post has some good advice for both processes, practices and principles.

  • The title is “How We Design Our APIs at Slack”.
  • It describes the API design principles and the new API specification, review, and testing process.
  • It lists the following six design principles.
  1. Do one thing and do it well
  2. Make it fast and easy to get started
  3. Strive for intuitive consistency
  4. Return meaningful errors
  5. Design for scale and performance
  6. Avoid breaking changes
  • It also lists the following four steps of the design process.
  1. Write an API spec
  2. Internal API review
  3. Early partner feedback
  4. Beta testing

Tools

SLO Tracker is a dashboard application for displaying SLO and error budget information, based on integration to gather SLI data from Prometheus, Grafana, Datadog and other monitoring tools.
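
As a reminder of the arithmetic behind an error budget, here is a minimal Python sketch. It is not taken from SLO Tracker; the function name `error_budget_remaining` and the request-count-based SLI are assumptions made for illustration.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is a ratio such as 0.999; total and failed are event counts
    observed in the SLO window.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("no error budget: SLO target is 100% or there were no events")
    return 1.0 - failed / allowed_failures
```

For example, with a 99.9% target, 1,000,000 requests, and 250 failures, about 1,000 failures are allowed, so roughly three quarters of the budget remains.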

EKS Anywhere is an option to run AWS EKS (the AWS Kubernetes service) on your own infrastructure. The main use case is to standardise the management side of operating a service like this.

  • The GitHub page of “Amazon EKS Anywhere (EKS-A)”, which is now generally available (GA). It is a CLI tool that extends the consistent cluster management experience of Amazon EKS (eksctl) to your on-premises Kubernetes clusters.

Note that although the name resembles ECS Anywhere, it is a completely different concept. EKS Distro (EKS-D) and Amazon EKS Connector are also included.

SRE Weekly Issue #287 September 12th, 2021

Articles

Industry Interviews: Colm Doyle, Incident Commander at Slack

Lots of details about how Slack does incident response in this one.

Stephen Whitworth — incident.io

  • As the title suggests, it details how Colm Doyle became an Incident Commander (IC) at Slack, how incidents are handled, and what happens in the first 5 minutes after getting paged.

Five Ways Developers Can Help SREs

This list also gives an interesting insight into the way this company does SRE.

Mayank Gupta and Merlyn Shelley — Squadcast

  • As the title suggests, it lists the following five best practices that developers can adopt to make SRE work easier.
  1. Scaling The Platform With The Concept Of A 12-factor App Method
  2. Sharing Performance Testing Data Insights
  3. Significance of Documentation and Configuration files
  4. AIOps Supported System Admin Functionalities
  5. Increasing Observability Of The System

Incident Review — What Was Behind the September 7 Spectrum Outage: A Case of Dr. BGP Hijack or Mr. BGP Mistake?

Oh BGP, you rascally little routing protocol.

Alessandro Improta and Luca Sani — Catchpoint

What is an SRE?

A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.

The article covers various facets of SRE and acknowledges that SREs can perform many roles.

JJ Tang — Rootly

  • It addresses questions about technical roles and positions and other questions to provide a complete definition of SRE. It also provides tips on what SRE actually does and how to help the SRE in your organization be the best they can be.

The Atlantic GLIDER, Air Transat flight 236! Explained by Mentour Pilot

Another really excellent air accident story with lots of great talk about mental models and confirmation bias. The crew saw lots of disparate indications that each didn’t point to anything in particular and each wasn’t a huge problem on its own. That, coupled with confirmation bias, helped them miss what might seem obvious in hindsight.

Mentour Pilot

  • A YouTube video that explains one of the most famous aviation accidents, “Air Transat flight 236”, covering the background to the incident, how the crew dealt with it, and the safety recommendations in the final report.

KubeWeekly #277 September 17th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Congratulations to Envoy on the 5 year anniversary of the project!

Matt Klein, Envoy

Congratulations to Envoy on their fifth anniversary of the project! Hear from Matt Klein (the project creator) on Envoy’s brief prehistory and history of the project, along with some of the lessons learned along the way.

  • As mentioned above, the project creator Matt Klein wrote this post to commemorate the 5th anniversary of the Envoy project. It covers the lessons learned as the project grew into a large-scale OSS project.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Kata and Arm, a secure alternative in the 5G space

Kiel Faller, Arm

  • An approximately 45-minute session that demonstrates the 5G O-RAN components on Arm infrastructure and their importance in the 5G space, and discusses the potential impact of using open source components, including cost savings and increased customizability.

Building an HA control plane for Tinkerbell with Kube-vip

Jason DeTiberus, Equinix

  • An approximately 1-hour session that reviews updates to the Tinkerbell project and explains how the control plane was built and the role that kube-vip plays.

Moving from CLIs to control planes with Crossplane

Viktor Farcic, Upbound

  • An approximately 30-minute session explaining the benefits of managing infrastructure, services, and apps using the Universal Control Plane (Crossplane).

Using CSI snapshots to backup and restore your data in Kubernetes

Michael Courcy, Kasten by Veeam

  • A 20-minute session explaining the CSI snapshot feature and how it fits into the Kubernetes storage architecture.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

NSA & CISA Kubernetes security guidance — A critical review

Iain Smart, NCC Group

  • Since it is covered in DEVOPS WEEKLY ISSUE #559 above, I will skip it.

Top 9 file integrity monitoring (FIM) best practices

Alejandro Villanueva, Sysdig

  • As the title suggests, it describes four types of FIM (File Integrity Monitoring) focusing on host and container security, and the following nine best practices.
  • Prepare an asset inventory
    1: Scope which files and directories need to be monitored
    2: Define appropriate permissions
    3: Define a baseline
  • Detect drift
    4: Shift left with image scanning policies
    5: Detect real-time threats with runtime policies
  • Notify, investigate, and respond
    6: Implement an automated alert and response mechanism
    7: Gather forensics data for further investigation
  • Compliance and Benchmarks
    8: Stick to compliance requirements
    9: Run automated benchmarks
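
The "define a baseline" and "detect drift" steps above can be sketched with nothing more than file hashes. This is a minimal illustration in Python, not how Sysdig or any particular FIM product works; the function names `build_baseline` and `detect_drift` are hypothetical, and it only detects changed or missing files, not permission changes.

```python
import hashlib
from pathlib import Path


def build_baseline(paths):
    """Map each monitored file to the SHA-256 digest of its current contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def detect_drift(baseline):
    """Return the monitored files that changed or disappeared since the baseline."""
    drifted = []
    for path, digest in baseline.items():
        p = Path(path)
        # A missing file counts as drift; otherwise compare content digests.
        if not p.is_file() or hashlib.sha256(p.read_bytes()).hexdigest() != digest:
            drifted.append(path)
    return drifted
```

In practice the baseline would be stored somewhere the monitored host cannot modify, and `detect_drift` would run on a schedule or be replaced by real-time event hooks.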

DataRoaster is now open-sourced, why I created it

Kidong Lee, ITNEXT

Why data scientists shouldn’t need to know Kubernetes

Chip Huyen

  • As the title suggests, it argues that while it is good for data scientists to own the entire tech stack, they should not have to write YAML files to do so. Good infrastructure abstraction tools let them focus on actual data science without knowing Kubernetes.

Solving API authorization challenges in multi-cloud environments

Nima Moghadam, Kong

  • It explains the topic in the title with figures and code. The bottom line is that the use of OPA and declarative policies has become very popular, especially in API Ops, for the following reasons:
  • Easy to integrate
  • Declarative
  • Extremely powerful and flexible
  • Platform agnostic

Rate limiting with the HAProxy Kubernetes Ingress Controller

Jim O’Connell, HAProxy

  • This article describes how to use an overall rate limit to mitigate the effects of events such as DDoS attacks.
  • In addition, the HAProxy Kubernetes Ingress Controller offers more fine-grained control through several annotations, which can help you build a powerful first line of defense on an IP-by-IP basis.
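
To make "rate limiting on an IP-by-IP basis" concrete, here is a minimal sliding-window sketch in Python. It only illustrates the general technique and is not how HAProxy or its ingress controller implements it; the class name `PerIPRateLimiter` and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class PerIPRateLimiter:
    """Allow at most `limit` requests per client IP within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit: the caller would typically answer 429
        q.append(now)
        return True
```

Because each IP has its own queue, one noisy client being throttled does not affect requests from other addresses.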

Deploy OpenFaaS to Linode with K3sup

Alex Ellis, OpenFaaS

  • As the title suggests, it explains how to deploy OpenFaaS to Linode using a virtual machine and K3sup, in the following sections.
    ○ Introduction
    ○ Tutorial
    ○ Create an account on Linode
    ○ Create a VM on Linode
    ○ Pre-reqs
    ○ Install K3s using K3sup
    ○ Install OpenFaaS
    ○ Configure an Ingress Controller and TLS certificate
    ○ Wrapping up
    ○ Getting in touch and supporting our work

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Prodfiler, with Thomas Dullien

Craig Box, Kubernetes Podcast from Google

Why we created the Prometheus Conformance Program

Richard Hartmann, Grafana Labs

Crossplane is now a CNCF incubating project

Jared Watts, Crossplane blog

  • As the title suggests, the post announces that Crossplane has been promoted from the CNCF sandbox to the incubating maturity level, looking back on the following points and touching on the future.
    ○ A Consistent Vision
    ○ The Community Keeps Growing
    ○ First Major Milestone Ready for Production
    ○ Strong Partnerships with the Ecosystem
    ○ Production Adoption
    ○ Conformance in the Ecosystem
    ○ The Road Ahead

Google’s Sqlcommenter now extending the vision of OpenTelemetry to databases

Nimesh Bhagat, Google Cloud

  • Since it was covered in last week’s KubeWeekly #276, I will skip it.

Cloud Native Chaos and Telcos — Enforcing reliability and availability for telcos

W.Watson, Vulk Coop & Karthik S., LitmusChaos

  • The explanation is based on the keywords in the title. The conclusion is below.
    ○ Borrowing from the lessons learned when applying chaos testing to cloud native environments, we should use declarative chaos specifications to test telecommunication infrastructure in tandem with its development and deployment. The CI/CD tradition of “pull the pain forward” with a focus on MTTR will produce the type of highly available and reliable systems that cloud native telecommunication systems will need to be.

7 microservices best practices for developers

Michael Bogan, Kong

  • As the title suggests, it explains the following seven points.
  1. Small Application Domain
  2. Separation of Data Storage
  3. Communication Channels
  4. Compatibility
  5. Orchestrating Microservices
  6. Microservices Security
  7. Metrics and Monitoring

NSA & CISA Kubernetes security guidance

Lars Larsson, Elastisys

  • It summarizes the main takeaways of the Kubernetes Hardening Guidance and provides additional insights based on the author's own experience with cloud security.

KubeCon + CloudNativeCon North America preview with Constance Caramanolis and Stephen Augustus

The CUBE

  • As the title suggests, a 21-minute session in which the two co-chairs of KubeCon + CloudNativeCon North America are interviewed about the event and talk about its highlights.

Introducing the CNCF End User Journey Report: First up, Spotify

CNCF

  • The CNCF End User Community has published its first End User Journey report, featuring Spotify, and this article outlines it.
  • The End User Journey report focuses on active end user community members. It shows how these organizations have grown as technology leaders and have benefited from joining the CNCF end-user community.

Upcoming CNCF Online Programs

*edited as the Kubernetes 1.22 release webinar has been rescheduled

Live Webinar

Cloud Native Live

On-demand

CNCF End User Lounge Livestream

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How did you find these articles? Did any of them interest you?

Actually, there is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself, too.

Bye now!!

Yoshiki Fujiwara

・Cloud Solutions Architect - AWS@NetApp in Tokyo, Japan. #AWS Certified Solution Architect&DevOps Professional, #Kubernetes, ・Opinions are my own.