SRE / DevOps / Kubernetes Weekly Collection #78 (Week 30, 2021)

Aug 1, 2021

In this blog post series, I collect the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #552 July 25th, 2021
SRE Weekly Issue #280 July 25th, 2021
KubeWeekly #270 July 30th, 2021

DEVOPS WEEKLY ISSUE #552 July 25th, 2021

News

The latest State of Devops report is out, with some interesting observations based on this year’s survey around cultural blockers holding up adoption of mature devops practices.

The title is “The 2021 State of DevOps Report is here!”.
The article excerpts and explains the contents of the “2021 State of DevOps Report” under the following points, but I strongly recommend downloading the report and reading it for yourself.
○ Cultural blockers are keeping mid-evolution firms stuck in the middle
○ Team identities and clear interaction paradigms matter
○ DevOps is not automation and DevOps is not the cloud
○ Read the report
○ Learn more

Discussion of the importance of team structures and communication when it comes to devops transformation. Promises, feedback and customer centricity.

The title is “After Team Topologies: Getting Past the DevOps Middle”.
The article opens by noting that most organizations are “stuck in the middle” of their DevOps evolution.
Citing the “2021 State of DevOps Report” covered in the article above, it gives the following three reasons.

○ Lack of well-defined team structures, responsibilities, and interactions, especially with respect to internal platform teams
○ Insufficient feedback loops
○ Risk avoidance

While agreeing with the recommendation of the Team Topologies strategy, the author raises the concern that “Too often Agile and Continuous Delivery optimize the flow of work, not necessarily the flow of customer value.” and offers ideas in response.

It’s easy to have an SRE team or function that feels siloed from other engineering teams. This post examines why, and what you can do about it.

The title is “De-Siloing Incident Management: How to Make Reliability Engineering Everyone’s Job”.
Since it was covered in “Last week’s SRE Weekly Issue #279”, I will skip it.

A look at solving the thundering herd problem after clearing a higher level cache.

The title is “Solving The Three Stooges Problem”.
It outlines how traffic to Reddit’s search infrastructure is reminiscent of the doorway sketch from “The Three Stooges”, and an approach to remediating these request patterns.
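
The article has the details of Reddit’s approach. As a rough illustration of the general idea only, one common remedy for a thundering herd is request coalescing (often called single-flight): when many identical requests miss the cache at once, one of them queries the backend and the rest wait for its result. The sketch below is my own assumption of how that can look in Python; the class name SingleFlightCache, the demo helper, and the cache key are hypothetical and are not taken from Reddit’s implementation.

```python
import threading
import time


class SingleFlightCache:
    """Cache in which only one caller per key runs the expensive loader."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._values = {}     # key -> cached value
        self._in_flight = {}  # key -> Event that is set when the load ends

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    # First caller for this key: register as the loader.
                    event = threading.Event()
                    self._in_flight[key] = event
            if leader:
                try:
                    value = self._loader(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    # Wake the waiters whether the load succeeded or failed.
                    with self._lock:
                        del self._in_flight[key]
                    event.set()
            # Follower: wait for the leader, then re-check the cache.
            # If the leader failed, one follower becomes the next leader.
            event.wait()


def demo(num_requests=8):
    """Fire identical concurrent requests and count the backend calls."""
    calls = []

    def slow_backend(key):
        calls.append(key)
        time.sleep(0.1)  # simulate a slow search backend
        return f"result-for-{key}"

    cache = SingleFlightCache(slow_backend)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(cache.get("search:cats")))
        for _ in range(num_requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results


if __name__ == "__main__":
    backend_calls, results = demo()
    print(backend_calls, len(results))
```

In this sketch, all concurrent requests for the same key share a single backend call, so emptying the cache does not send the whole herd through the doorway at once.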

A look at designing useful alerting systems, with a good list of dos and don’ts.

The title is “Monitoring Alerts That Don’t Suck”.
The author has spent the past year and a half on a project that involves on-call duty, and explains how they tried to improve the quality of its alerts.

A reasoned argument for not using Kubernetes.

The title is “No, we don’t use Kubernetes”.
As the title suggests, it carefully explains why Kubernetes is not used in their environment. Even if you read only the Conclusion, you can grasp the points they want to make.

A look at the Headlamp Kubernetes dashboard and why you may consider using it.

The title is “Kubernetes Dashboards: Headlamp”.
It explains the slick design of the Kubernetes dashboard “Headlamp”, the choice of execution environments, and the author’s main dissatisfaction with the recommended authentication path.

If you’re managing the nodes under a Kubernetes cluster you might find some nodes causing problems. This post looks at using the Kubernetes API and extension points to make the cluster self-healing.

The title is “Automatic Remediation of Kubernetes Nodes”.
Since it was covered in “Last week’s SRE Weekly Issue #279”, I will skip it.

Tools

SchemaHero is a Kubernetes-native implementation of declarative database schema management.

The web page of the Kubernetes Operator “SchemaHero”, which provides declarative schema management for various databases.
Click here for the GitHub page.

Ortelius is a platform for managing microservices. It provides a central catalog of services with their deployment specs, application teams can easily consume and deploy services across a cluster.

The web page of the microservice management platform “Ortelius”. It versions and tracks where microservices are deployed, along with their consuming applications, ownership, blast radius, and all important deployment metadata.
Click here for the GitHub page.

N8n is a tool for connecting services together, with a visual editor, command line tools, and commons clause self-hosted version available.

The web page of the extensible workflow automation tool “n8n”.
Click here for the GitHub page.

SRE Weekly Issue #280 July 25th, 2021

Articles

The Harmful Consequences of the Robustness Principle

The Robustness Principle (“be conservative in what you send, and liberal in what you accept”) has its uses, but it may not be best for the development of mature protocols, according to this IETF draft.

Martin Thomson

This Internet-Draft, which gives an opinion on the “Robustness Principle”, is revision 05. Since the first revision is numbered 00, this is the sixth edition. Internet-Drafts expire after six months.
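
To make the principle a little more concrete for myself, here is a minimal sketch that contrasts a liberal and a strict parser for a numeric length field. The field, the function names, and the acceptance rules are my own assumptions for illustration and do not come from the draft. The liberal version simply relies on Python’s built-in int(), which tolerates surrounding whitespace, a sign, and underscores between digits.

```python
import re

# Strict rule assumed for this sketch: one or more ASCII digits, nothing else.
_DIGITS_ONLY = re.compile(r"[0-9]+")


def parse_length_liberal(raw: str) -> int:
    """Accept whatever int() accepts: padding, a sign, underscores."""
    return int(raw)


def parse_length_strict(raw: str) -> int:
    """Accept only ASCII digits and reject every other form."""
    if not _DIGITS_ONLY.fullmatch(raw):
        raise ValueError(f"malformed length field: {raw!r}")
    return int(raw)


if __name__ == "__main__":
    sloppy = " +4_2 "
    print(parse_length_liberal(sloppy))  # the sloppy sender goes unnoticed
    try:
        parse_length_strict(sloppy)
    except ValueError as err:
        print(err)  # the sloppy sender is told about the problem
```

With the liberal parser, a sender that emits a malformed field keeps working, and other implementations may later have to accept the same malformed form to stay compatible. The strict parser surfaces the problem at once, which is the kind of trade-off the draft discusses.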

No, we don’t use Kubernetes

Docker without Kubernetes, does it make sense? These folks have a well-reasoned argument explaining why Kubernetes is not for them.

Maik Zumstrull — Ably

Since it is covered in DEVOPS WEEKLY ISSUE #552 above, I will skip it.

Personal data breach reporting for service outages (such as when your CDN is down)

Can a service outage unrelated to security count as a “personal data breach” in terms of GDPR and other regulations? If you agree with this article’s logic, then maybe it can.

Neil Brown

Guidance from both UK regulators and the European Data Protection Commission suggests that “loss of availability” is a “potential for personal data breaches,” and the article considers whether that is appropriate.

When You Do DevSecOps, Don’t Forget the SREs

The interactions between security and reliability incidents can be complex and hard to navigate. The example scenarios in this article really made me think.

Quentin Rousseau — Rootly

It warns that “For all that we talk about DevSecOps, we pay almost no heed to the importance of integrating security more centrally into the incident management work performed by SREs.” and explains how to deal with it.

Solving the Three Stooges Problem

To deal with thundering herds, reddit implements caching in front of each of its microservices.

Raj Shah — reddit

Since it is covered in DEVOPS WEEKLY ISSUE #552 above, I will skip it.

What’s allowed to count as a cause?

Incident causes are a social construct, and it may be that your organizational structure prevents something from being counted as a cause.

Lorin Hochstein

I was impressed by the idea that “the sorts of things that are allowed to be labelled as causes depends on the cultural norms of an organization”. This is not limited to incidents: you may feel the same thing in an organization when you propose something and it is rejected without clear feedback.

IC1 Reliability Engineer — Dropbox Engineering Career Framework

Check it out, Dropbox publicly released their SRE career ladder.

Dropbox

Dropbox has publicly released its Career Framework for each SRE level (IC1–8), and I found it very helpful.
Starting from the sentence just below each position title that describes what the role does, the text is consistently written in the first person (“I”). I think this is a good design that makes each member aware of what is expected of them and encourages action.

Incidents, Response, and the People With Tim Nicholas

There’s a moment halfway through this episode of Page It to the Limit where they talk about blamelessness. If you just tell people to “do blameless postmortems”, but you don’t tell them how, then they’ll be afraid to talk about anything people did, and that will hamper learning.

Julie Gunderson, with guest Tim Nicholas — Page It to the Limit

A podcast with the theme of “incident response and learnings, and really importantly, the culture required to be successful when we’ve got these systems that are so complex”. They talk about learning from incidents, communication patterns, psychological safety and accountability.

Migrating Facebook to MySQL 8.0

This was a monumental task, considering the 1000+(!!) internal code patches they had to port from MySQL 5.6 to 8.0.

Herman Lee, Pradeep Nayak — Facebook

As mentioned above, this is the story of Facebook’s migration from MySQL 5.6 to 8.0. Just looking at the number of code patches is daunting. I think it is wonderful that they share such valuable knowledge in a timely manner.

Outages

Akamai
Akamai had what they’re calling an “Edge DNS Service Incident”. It made headlines this week because it took down many of their customers, similar to the Akamai incident last month.
Let’s Encrypt
Disney park-related apps
Heroku

KubeWeekly #270 July 30th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Cloud Native Computing Foundation announces Linkerd graduation

Congratulations to Linkerd for hitting Graduated status! Linkerd was the first project to join the CNCF Sandbox, known as inception at the time, and is now the first service mesh project to achieve graduated status.

Linkerd is a service mesh that provides critical observability, security, and reliability features to cloud native applications without requiring code changes. The project was created in 2016 by Buoyant and joined CNCF in early 2017 as the foundation’s fifth project. It was the first service mesh project and the first CNCF project to adopt the Rust programming language to improve security and performance. Today, organizations like Microsoft, Nordstrom, Expedia, JPMC, Clover Health, Entain, H-E-B, and more rely on Linkerd to power mission-critical production systems.

An article announcing that “Linkerd” has reached the CNCF maturity level of Graduated.
As mentioned above, Linkerd was the first project to join the CNCF at the “Sandbox” maturity level and is the first service mesh project to reach Graduation.
Click here for an article on Graduation by the Maintainer.

KubeCon + CloudNativeCon North America Co-located events CFP reminder

Production Identity Day: SPIFFE + SPIRE North America hosted by CNCF

CFP Closes, Sunday, Aug 1 at 11:59 PM PST

Apparently, the link above points to the page of a past event, where the following information was posted.
○ This event has passed. View the upcoming KubeCon and other CNCF Events.
○ November 17, 2020
This is the correct one. As a co-located event of KubeCon + CloudNativeCon North America, it will be held on October 11, 2021 (local time). The CFP deadline is as above.

Cloud Native Wasm Day North America hosted by CNCF

CFP Closes, Monday, Aug 9 at 11:59 PM PST

Also as a Co-located event of KubeCon + CloudNativeCon North America, this is scheduled to be held on October 12, 2021 (local time). The CFP deadline is as above.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Building the Telegraf Kubernetes Operator

Wojciech Kocjan, InfluxData

An approximately one-hour session in which the Telegraf Kubernetes Operator maintainers explain how to leverage it in a Kubernetes environment, along with best practices for improving service processing.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

6 Ways to leverage Insomnia as a gRPC client

Alvin Lee, Kong

Starting with an explanation of the core technology concepts, it explains how to build a fun and simple gRPC server with Node.js and use “Insomnia” to make gRPC requests to the server. I had a hard time sleeping in the summer, so the name first made me think of insomnia as a physical symptom.

My Prometheus is overwhelmed! Help!

Ryan Dawson, ThoughtWorks

It describes some cases that may occur while using Prometheus and various options for extending Prometheus. Read the last “You Are Not Alone” section.

Verify container image signatures in Kubernetes using Notary or Cosign or both

Christoph Hamsen, SSE Blog

It introduces the version 2.0 release of the admission controller “Connaisseur”, which integrates container image signature verification and trust consolidation into a Kubernetes cluster.

Debugging apps in Kubernetes with Bridge

Thorsten Hans, Blog

An explanation and demo of how to debug an app running on Kubernetes using Bridge to Kubernetes (Bridge).
A Bridge extension is available on the Visual Studio Marketplace, and there is a Bridge to Kubernetes extension for VS Code. Both provide an environment in which developers can debug and test seamlessly.

Announcing Deckhouse, the Kubernetes platform from Flant is now generally available

Dmitry Shurupov, Flant

Flant announces the open source release of the Kubernetes platform “Deckhouse”. It highlights the following five features.

○ Infrastructure Agnostic
○ Providing everything you need to maintain your production cluster
○ Renders K8s usage more straightforward thanks to the NoOps approach
○ Deploys clusters in 8 minutes
○ Offers a 99.95% SLA guarantee

There are two editions, Deckhouse CE and Deckhouse EE. On the Deckhouse web page above, the SLA carries the proviso “* This applies to the Enterprise Edition only.”, so it applies to EE only.
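
As a rough aid for reading the 99.95% figure, the arithmetic behind an availability percentage can be sketched as follows. The 30-day and 365-day measurement windows and the function name are my own assumptions for illustration. They are not taken from the Deckhouse SLA terms.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int) -> float:
    """Return the downtime budget in minutes for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Assumed windows: a 30-day month and a 365-day year.
    per_month = allowed_downtime_minutes(99.95, 30)
    per_year = allowed_downtime_minutes(99.95, 365)
    print(f"per 30 days:  {per_month:.1f} minutes")
    print(f"per 365 days: {per_year:.1f} minutes")
```

Under these assumptions, 99.95% leaves a budget of about 21.6 minutes of downtime per 30 days, or about 262.8 minutes (roughly 4.4 hours) per 365 days.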

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Opstrace, with Sebastien Pahl

Craig Box, Kubernetes Podcast from Google

The Kubernetes Podcast, run by Google employees. This time the host is Craig Box, with guest host Jimmy Moore.
The guest is Sebastien Pahl, co-founder/CEO of “Opstrace” and co-founder of Docker’s predecessor, Dotcloud.
The topics that interested me in the news of the week are as follows.
○ Cloud Foundry Foundation releases v5
○ Chaos Mesh 2.0.0
○ Nominate yourself for the 1.23 Release Team

Cosign 1.0 is now GA

Dan Lorenc, Sigstore

As the title suggests, cosign 1.0 was released and became GA, and he talks about the future.

GKE best practices: Create a cost-optimized cluster in just a few clicks

Roman Arcea, Google Cloud

It presents a best-practices guide for running cost-optimized Kubernetes applications on GKE, aimed at users who are not yet ready to use GKE Autopilot, along with the cost-optimized cluster setup guide built into GKE.

How Riskfuel is using Inlets to build machine learning models at scale

Addison van den Hoeven, Riskfuel

It describes how to use “Inlets” to securely oversee fully remote and hybrid cloud deployments.
It also touches on how to use Inlets to train machine learning models in the client’s infrastructure and send millions of control messages.

Announcing Vitess 11

Alkin Tezuysal, Vitess maintainer

As the title suggests, Vitess 11 has been released, and the following “Major Themes” are picked up.
○ Schema Tracking
○ Schema Management
○ Performance Optimizations
○ VTAdmin
○ VReplication
○ Benchmarking
For details, refer to “Release Notes”.

Take the CNCF microsurvey on Cloud Native Security:

Cloud Native Security Microsurvey

Upcoming CNCF Online Programs

Cloud Native Live

August 4 at 9am PT: Humanising your cloud native platform by Lee Briggs, Pulumi — RSVP

On-demand Webinars

August 5: Securing your continuous everything strategy by Abubakar Siddiq Ango, GitLab — RSVP
August 5: Kubernetes clusters need persistent data by James Spurin, StorageOS — RSVP
Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

How did you find these articles? Did any of them catch your interest?

To be honest, there is some content I have not been able to digest at this stage, so I will use this aide-memoire and these links to catch up myself as well.

Bye now!!

Yoshiki Fujiwara