SRE / DevOps / Kubernetes Weekly Collection #65 (Week 17, 2021)

May 3, 2021

In this blog post series, I go through the following three weekly mailing lists that I subscribe to, leaving brief comments and useful links as an aide-memoire.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #539 April 25th, 2021
SRE Weekly Issue #267 April 25th, 2021
KubeWeekly #261 April 30, 2021

DEVOPS WEEKLY ISSUE #539 April 25th, 2021

News

A strong argument for why you need a platform team to really benefit from running on Kubernetes.

The title is “Why you need a platform team for Kubernetes”.
The post argues the point made in the title and concludes as follows.
○ If your organization is large enough and you have a dedicated team to maintain Kubernetes, you can save a lot of time and effort compared to other options for managing your computing resources.
○ If you’re a small organization and you can’t justify a dedicated Kubernetes team, the quality and reliability of your platform can be sacrificed.

In a growing organisation, ownership of services will naturally move from team to team over time. This post contains some great tips on how to make those types of transitions more successful.

The title is “How to Successfully Hand Over Systems”.
As the title and the editor's comment above suggest, the post asks engineering managers, product managers, and teams to acknowledge that a change of system ownership is a process that should be well planned and carried out at a time that works best for everyone involved.

A post on using the role of incident commander to aid in addressing operational incidents smoothly.

The title is “Embrace your inner incident commander”.
Since it was previously covered in SRE Weekly Issue #259, I will skip it.

I’m a fan of light-weight community metadata standards. a8r.io is a set of annotations for Kubernetes objects for finding things like runbooks, issue tracking, log viewers, chat channels, etc.

The title is “Annotating Kubernetes Services for Humans”.
The page outlines a convention for using annotations to help developers manage Kubernetes Services.
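As a rough illustration of the convention, here is a minimal Python sketch that adds a8r.io annotations to a Service manifest held as a plain dictionary. The helper `annotate`, the Service name, and every URL, address, and channel below are hypothetical placeholders of mine; only the `a8r.io/` prefix and the kinds of links (runbook, issue tracker, logs, chat) come from the convention described above.

```python
import json

# Annotation prefix used by the a8r.io convention.
A8R_PREFIX = "a8r.io/"


def annotate(manifest: dict, **links: str) -> dict:
    """Return a copy of a manifest with a8r.io/<key> annotations added.

    Hypothetical helper for illustration; it is not part of any library.
    """
    annotated = json.loads(json.dumps(manifest))  # simple deep copy
    metadata = annotated.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    for key, value in links.items():
        annotations[A8R_PREFIX + key] = value
    return annotated


# A bare Service manifest; names and ports are placeholders.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "quote"},
    "spec": {
        "selector": {"app": "quote"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}

annotated_service = annotate(
    service,
    owner="platform-team@example.com",
    chat="#quote-service",
    bugs="https://example.com/quote/issues",
    logs="https://example.com/logs?service=quote",
    runbook="https://example.com/runbooks/quote",
)

print(json.dumps(annotated_service["metadata"], indent=2))
```

The same keys can of course be written directly under `metadata.annotations` in a YAML manifest; the dictionary form is used here only to keep the sketch self-contained.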

AsyncAPI is a project aiming to make building and working with event driven architectures easier. Open source tools and specifications similar to OpenAPI.

The title is “Building the future of event-driven architecture.”
The AsyncAPI project website.

A post on using preemptible nodes on GCP. These are unreliable by design, so implementing chaos engineering approaches is even more critical.

The title is “Migrating to GKE: Preemptible nodes and making space for the Chaos Monkeys”.
Last year, Expel’s SRE and DDT (Device Discovery and Tasks) teams moved from a statically provisioned virtual machine (VM) legacy environment to a more dynamically scalable and reliable device task infrastructure. The post describes that migration along the lines of the title.

Lots of teams are managing increasingly large Kafka clusters. This post introduces Cruise Control and some of its features for rebalancing and visualising cluster workloads.

The title is “Visualization in Kafka Cruise Control”.
From Teads’ engineering blog. The post walks through the topics named in the title and the editor's comment, showing the UI and graphs along the way.

Jobs

Do you love solving business problems? Are you driven by translating what you see into the design and implementation? Are you looking to automate and manage day-to-day operations of software and hardware infrastructure? Optiver are hiring Site Reliability Engineers!

As Site Reliability Engineer you will deploy, maintain, monitor and improve the reliability, scalability and performance of our in-house built trading software. You will sit on the trading floor together with the end-users and set standards for the production environment — it is an engineering role, not a support role. You will have a real, direct impact on our ability to trade and trading results. You will work with short feedback loops and flat hierarchy. No two days are the same!

SRE job listings.

Events

DevX Conf is coming up this week on the 28th and 29th of April. A virtual conference dedicated to developer experience. 20+ speakers covering everything from code editors to collaboration, and build and release tooling to monitoring and security. The focus throughout is on bringing back joy and speed to our workflows.

The event page for DevX Conf. Registration is via the GitHub link on the page.

This year’s Failover Conf won’t be like any other virtual conference, with panel discussions, lightning talks, fireside chats, dance parties, pet slideshows, tons of swag, and more. Join us for discussions on reliability, DevOps, and SRE on April 27th at 9am PDT.

The GitHub page for Failover Conf. I can feel their passion for making a virtual conference stand out.

Tools

ConsoleMe is a web service that aims to make AWS IAM permissions and credential management easier for end-users and cloud administrators.

A GitHub page of “ConsoleMe”, a web service that facilitates AWS IAM permissions and credential management for end users and cloud administrators.

Early, but very interesting. Zellij is on the surface just another terminal multiplexer. But its WebAssembly plugin system and plans for a browser-based interface look interesting for sharing reusable UIs.

A GitHub page of the terminal workspace and multiplexer “Zellij” written in Rust.
It aims to become a general purpose application development platform in the future.

Qovery is a high-level cloud application platform. It provides an interface based around Git and branches but deploys to your cloud environment, supporting AWS, Azure and GCP.

A GitHub page of “Qovery”, an open source abstraction layer library that makes it easy to deploy apps to AWS, GCP, Azure, and other cloud providers in just minutes.
Written in Rust, it leverages Terraform, Helm, kubectl, and Docker to manage resources.

SRE Weekly Issue #267 April 25th, 2021

Articles

SRE Case Study: Mysterious Traffic Imbalance

Yet more proof that DNS behavior varies way more than is obvious at first glance. Who the heck thought longest common prefix matching was a good idea?

Charles Li — eBay

The case study named in the title is described using a fictitious website.

Fast and flexible observability with canonical log lines

The application may log multiple lines during the lifecycle of a request. Stripe has found it invaluable to also log one final line with a full summary of the request.

Brandur Leach — Stripe

As the title suggests, it explains how to use “canonical log lines” to get lightweight yet powerful observability.
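To make the idea concrete, here is a minimal Python sketch of a canonical log line: the request collects fields as it runs and emits one summary line at the end. The class `RequestLog`, the function `format_canonical_line`, the `canonical-log-line` prefix, and all field names are my own assumptions for illustration, not Stripe's actual implementation.

```python
import time


def format_canonical_line(fields: dict) -> str:
    """Render fields as one key=value line (logfmt-style; assumed format)."""
    parts = []
    for key, value in fields.items():
        text = str(value)
        # Quote values that contain spaces or quotes so the line stays parseable.
        if " " in text or '"' in text:
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return "canonical-log-line " + " ".join(parts)


class RequestLog:
    """Collects fields during a request and emits one summary line at the end."""

    def __init__(self, method: str, path: str) -> None:
        self.fields = {"http_method": method, "http_path": path}
        self._start = time.monotonic()

    def add(self, **fields) -> None:
        # Any layer (auth, database, rate limiter) can attach its own fields.
        self.fields.update(fields)

    def finish(self, status: int) -> str:
        self.fields["http_status"] = status
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 1)
        return format_canonical_line(self.fields)


# Usage: hypothetical request with placeholder values.
log = RequestLog("POST", "/v1/charges")
log.add(user_id="usr_123", db_queries=3)
line = log.finish(200)
print(line)
```

Because everything about the request ends up on a single line, a log query can filter and aggregate on these fields without joining multiple log entries.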

Google Incident Report — April 12, 2021

This is a followup with more detail on the G-Suite outage I reported here last week. A database issue caused two separate outages.

Google

As noted above, this is a follow-up report on the Google incident.

The top 3 mistakes companies make with SLOs, SLAs, and SLIs

Really great advice about 3 common pitfalls in implementing SL*s.

Cortex

In line with the title and the editor's comment above, the following three mistakes are explained.

Unnecessary SLOs
Tracking vanity SLIs — instead of business goals
Lack of visibility and ownership around SLOs

Going solid: a model of system dynamics and consequences for patient safety — Resilience Roundup

This research paper explores the marginal boundary, a set of conditions beyond which a system enters a different operating mode and an accident is much more likely. It discusses the concept of coupling between seemingly unrelated parts of the system and shows how economic incentives can push a system toward this boundary.

Dr. Richard Cook and Jens Rasmussen (Original paper)
Thai Wood — Resilience Roundup (summary)

It describes the problems with the system transitioning from a loosely coupled state to a very tightly coupled state and the effects that can occur as a result.

Vodafone Idea BGP Leak — Global Routing System Must Implement MANRS

This is an analysis of a recent BGP leak with a discussion about how the impact from such events can be mitigated through emerging best practices.

Alessandro Improta and Luca Sani — Catchpoint

It explains the origin hijacking incident by Vodafone Idea (AS55410) that occurred on April 16, 2021.
The authors propose implementing “Mutually Agreed Norms for Routing Security (MANRS)” to address routing security threats.

How to Successfully Hand Over Systems

How do you hand over ownership of a system, transferring enough knowledge that the new owners can maintain its availability and reliability successfully?

Aleksandra Gavrilovska — SoundCloud

Since it is covered in DEVOPS WEEKLY ISSUE #539 above, I will skip it.

Resiliency Planning for High-Traffic Events

Shopify works toward Black Friday / Cyber Monday all year long, through a combination of load testing, failure mode analysis, game days, and incident analysis.

Ryan McIlmoyl — Shopify

It describes creating and maintaining resiliency plans for large development teams, testing and tools, developing incident strategies, and incorporating and improving feedback loops.
A roughly one-hour video of the session is embedded in the page.

Outages

Microsoft Azure web portal
Microsoft 365
Discord
google.com.ar
This one’s interesting. A random person was able to buy the domain name google.com.ar, despite the fact that its registration had not expired.

KubeWeekly #261 April 30, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Last chance to register for KubeCon + CloudNativeCon Europe 2021 — Virtual

KubeCon + CloudNativeCon Europe 2021 — Virtual kicks off next week on May 4–7, 2021! Join the CNCF global community for more than 100 interactive sessions and experiences.

If you haven’t registered yet, be sure to register now and begin planning your experience. Don’t forget that we have two different pass options — including a free Keynote pass. We hope to “see” you there!

Editor’s note: KubeWeekly will take a short break for KubeCon + CloudNativeCon Europe 2021 and will resume on May 21. Enjoy the show!

This is a last-call announcement ahead of KubeCon + CloudNativeCon Europe 2021. It also notes that KubeWeekly will take a two-week break for the event and resume on May 21.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Migrating from Flux v1 to Flux v2

Leigh Capili, Weaveworks

A roughly one-hour session with live demos, including how to bootstrap a cluster with Flux v1 and how to migrate to Flux v2.

Reduce the carbon footprint of your cloud-native workloads now

Eric Riedel & Jean-Jacques Chanut, ITRenew & Andy Randall, Kinvolk

It describes how to reduce carbon dioxide emissions in today’s cloud-native workloads and achieve greater computing economics.

It is time to talk about DataMesh

Fred Chian, Brobridge Co. Ltd.

It explains how to properly handle data supply issues in the microservices implementation process and aims to create an efficient data delivery platform for microservices through the DataMesh architecture.

Using machine learning on K8s logs to find root cause faster

Larry Lancaster & Gavin Cohen, Zebrium & Aran Khanna, Reserved.ai

The topic in the title is covered through the following points.

How the technology works
Live demonstration of the technology against a Kubernetes demo app
Case study: How Reserved.ai is using the technology to speed-up incident resolution time

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Developing a Kong Gateway Plugin with Go

It explains how to create a custom Kong Gateway plugin using Golang.
The post also links to a sample plugin the author created, which adds an extra security layer between consumers and producers.

AKS cost monitoring and governance with Kubecost

A blog post on the Kubecost website.
It introduces the open source “Kubecost”, which supports various self-managed and hosted Kubernetes environments including AKS (Azure Kubernetes Service), with a diagram and from the perspective of an AKS user.

Annotating Kubernetes services for humans

This is the same Service annotations article covered in DEVOPS WEEKLY ISSUE #539 above, so it seems well worth reading.

Automate service mesh observability with Kuma

It describes how to set up traffic metrics and traffic trace policies in “Kuma” so that they can be used right away.

Kubernetes deployment strategies | Day 37 of #100DaysOfKubernetes

A YouTube explainer video. As listed in its description, it covers Kubernetes deployment strategies along the following timeline.
○ 03:26 — Big-Bang
○ 05:22 — Rolling Updates
○ 07:28 — Blue-Green Deployment
○ 09:28 — A/B Testing
○ 10:56 — Canary Deployments
○ 13:03 — Progressive Delivery

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Results from the CNCF Cloud Native Survey China 2020

CNCF

The content is exactly what the title says. The write-up is tailored to its readers, for example with Chinese notation alongside the English text and names given in Chinese.

Announcing Vitess 10

Alkin Tezuysal, Vitess maintainer

An overview article by the Maintainer for the release of Vitess 10.
The following points are taken up.
○ Compatibility (MySQL, frameworks)
○ Migration
○ Schema Management
○ Performance Optimizations
○ User Interface
○ Benchmarking

Swallow, with Alex Palesandro

Craig Box, Kubernetes Podcast from Google

A Kubernetes podcast by Google employees. This time the host is Craig Box, with Patrick Flynn as guest host. His previous appearance is as follows.
○ Episode 64 with Sarah D’Angelo and Patrick Flynn
The guest is Alex Palesandro, a research assistant at the Polytechnic University of Turin and co-creator of Liqo.
The items that caught my interest in the news of the week are as follows.
○ Red Hat Virtual Summit announcements
○ Lens 5 Beta
○ Kubernetes moves to three releases per year

How Containers are helping IT catch up with the speed of business

Ziv Kedem, Zerto

A Forbes article that explains the content of the title for business people.

Turbocharge workloads with new multi-instance NVIDIA GPUs on GKE

Maulin Patel, Sr. Product Manager, Google Cloud and Pradeep Venkatachalam, Software Engineer, Google Cloud

The topic in the title is introduced through the following points.
○ What customers are saying
○ Creating multi-instance GPU partitions
○ Deploying containers on a multi-instance GPU node
○ Getting started

Podcast: Building a business around popular open source tools for Kubernetes with Richard Li

Justin Dorfman, Richard Littauer, & Tzury Bar Yochay, Curiefense

There are interesting stories behind the start of Datawire and various projects built from Datawire, such as Telepresence and the Ambassador API Gateway.

Reminder: Participate in CNCF microsurveys on Cloud Financial Management on Kubernetes and diversity

A reminder about surveys that have already been mentioned here several times.

Take the 2021 CNCF Cloud Native Survey — Part 1

It introduces the 2021 Cloud Native Survey. This year’s survey is divided into two parts. The theme of Part 1 is “cloud, containers, and Kubernetes”.
Part 2 will be held later this year. The theme is “CNCF projects and other cloud native technologies such as service mesh, serverless, and storage”.

Upcoming CNCF Online Programs

No Online Programs are scheduled for the week of KubeCon + CloudNativeCon Europe 2021 Virtual. We will resume the week of May 10!
Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How did you find these articles? Did any of them catch your interest?

To be honest, there is some content here that I cannot fully digest at this stage, so I will use this aide-memoire and its links to catch up myself.

Bye now!!

Yoshiki Fujiwara