SRE / DevOps / Kubernetes Weekly Collection #83 (Week 35, 2021)

Sep 5, 2021

In this blog post series, I go through the following three weekly mailing lists I subscribe to, adding brief comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #557 August 29th, 2021
SRE Weekly Issue #285 August 29th, 2021
KubeWeekly #275 September 3rd, 2021

DEVOPS WEEKLY ISSUE #557 August 29th, 2021

News

Results from a recent survey of more than 3200 practitioners looking at the state of cloud adoption. Security, cost, skills shortages, multi-cloud, spend and lots of other interesting topics covered.

The title is “HashiCorp State of Cloud Strategy Survey: Welcome to the Multi-Cloud Era”.
As mentioned above, it shares the results of the HashiCorp survey of cloud adoption with numbers and diagrams.

An interesting post on instrumenting infrastructure deployment, in this case using Pulumi and Honeycomb and low-level libraries for each.

The title is “Observable Infrastructure as Code”.
It explains how to use Pulumi and Honeycomb to simplify the observability of IaC.

“The Domain Name System or DNS is a never-ending source of amusement and amazement.” The latest post in this series looks at TLDs.

The title is “TLDs — Putting the ‘.fun’ in the top of the DNS”.
It digs deeper into DNS top-level domains.

Another internet fundamentals post, this one looking at email authenticity with DKIM, SPF and DMARC.

The title is “Email Authenticity 101: DKIM, DMARC, and SPF”.
It explains the elements in the title and provides information and practices you need to keep your domain’s email authentic and less vulnerable to spoofing.

Establishing trust in the context of untrusted networks is an interesting, and increasingly relevant problem given how we build software today. This post looks at approaches based around trust on first use and the role of transparency logs.

The title is “Improving TOFU With Transparency”.
It explains when TOFU (Trust-On-First-Use) works and when it does not work, and mitigation measures using transparency logs.
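As a note to myself, the basic TOFU idea can be sketched in a few lines of Python. This is only a minimal illustration, not the approach from the article: the `TofuStore` class, its method names and the `example.test` host are all hypothetical, and the pins live in memory only.

```python
import hashlib


class TofuStore:
    """Hypothetical in-memory pin store mapping host -> key fingerprint."""

    def __init__(self):
        self._pins = {}

    @staticmethod
    def fingerprint(public_key: bytes) -> str:
        # SHA-256 of the raw key bytes, hex encoded.
        return hashlib.sha256(public_key).hexdigest()

    def check(self, host: str, public_key: bytes) -> str:
        fp = self.fingerprint(public_key)
        pinned = self._pins.get(host)
        if pinned is None:
            # First contact: nothing to compare against, so pin the key.
            self._pins[host] = fp
            return "trusted-on-first-use"
        if pinned == fp:
            return "match"
        # The key changed since first use: rotation or an attack.
        return "mismatch"


store = TofuStore()
first = store.check("example.test", b"key-a")
second = store.check("example.test", b"key-a")
changed = store.check("example.test", b"key-b")
```

The sketch does nothing to protect the very first contact, and it cannot tell a legitimate key rotation from an attack. As I understand it, those are the kinds of gaps the transparency-log discussion in the article is about.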

JSON Schema has been seeing quite a bit of development recently. This post mainly looks at the new bundling capabilities in the latest version of the specification, but touches on other recent improvements too.

The title is “JSON Schema bundling finally formalised”.
It covers the following points, in line with the title.
○ Bundling has renewed importance
○ Existing solutions? New solutions!
○ Bundling fundamentals
○ Bundling Simple External Resources
○ OpenAPI Specification Example
○ But what about…
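To remind myself what bundling means, here is a minimal Python sketch. It assumes that bundling embeds each external schema resource, with its own `$id`, under `$defs` of the root schema and leaves every `$ref` unchanged. The `bundle` helper, the `example.com` URIs, and the choice of using the `$id` as the `$defs` key are my own assumptions for illustration, not something taken from the article.

```python
import copy


def bundle(root, resources):
    """Return a copy of `root` with each external resource embedded in $defs."""
    bundled = copy.deepcopy(root)
    defs = bundled.setdefault("$defs", {})
    for res in resources:
        # Each embedded resource keeps its own $id, so existing $ref values
        # that point at that URI do not need to be rewritten.
        defs[res["$id"]] = copy.deepcopy(res)
    return bundled


address = {
    "$id": "https://example.com/schemas/address",
    "type": "object",
    "properties": {"city": {"type": "string"}},
}

customer = {
    "$id": "https://example.com/schemas/customer",
    "type": "object",
    "properties": {"address": {"$ref": "https://example.com/schemas/address"}},
}

bundled = bundle(customer, [address])
```

The original `customer` schema is left untouched, and the reference inside `bundled` still points at the same URI as before.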

A great story of debugging a production problem and seemingly innocuous changes having a large effect.

The title is “Computers are the easy part”.
It opens with the aviation-safety concept of “Controlled Flight Into Terrain (CFIT)”, an accident in which an aircraft with no mechanical failure, under the full control of its pilot, is unintentionally flown into the ground. The author then shares the lessons learned from an internal outage that lasted for multiple days.

Sometimes the thing that causes an outage is getting a calculation wrong. This post features several reliability problems caused by maths errors.

The title is “You Do the Math: Reliability Issues Triggered by Math Errors”.
As the title suggests, it describes the following four incidents or issues that were caused, at least in part, by mathematical errors.

○ NASA’s $125 million math mistake
○ Windows Calculator fails to calculate
○ The math bug that cost Intel $475 million
○ Y2K: The math bug that (mostly) wasn’t

Events

A new virtual event coming October 19th to 21st, PREVAIL is focused on all things non-functional requirements. The call for papers is open until 10th of September.

As mentioned above, this is the web page of the new virtual event “PREVAIL 2021”, which IBM will hold on October 19–21, 2021.

Tools

Octopilot is a CLI tool designed to help you automate your Git workflow, by automatically creating and merging GitHub Pull Requests to update specific content in Git repositories.

This is the GitHub page of the CLI tool “Octopilot”, which helps automate GitOps workflows by automatically creating and merging GitHub pull requests that update specific content in Git repositories.
An introductory article is also available.

SRE Weekly Issue #285 August 29th, 2021

Articles

Computers are the easy part

What’s so great about this incident write-up is the way that entrenched mental models hampered the incident response. There’s so much to learn here.

Ray Ashman — Mailchimp

Since it is covered in DEVOPS WEEKLY ISSUE #557 above, I will skip it.

Rethinking Best Practices

The parallels between this and the Mailchimp article are striking.

Will Gallego

It covers the following points, in line with the title.
○ Akin to Root Cause
○ When do we decide what’s best?
○ Best Practices lack flexibility
○ Best Practice: Don’t use “Best Practice”..?

How to Improve Upon Google’s Four Golden Signals of Monitoring

This includes a review of the four golden signals and presents three areas to go further.

JJ Tang — Rootly

Since it is covered in DEVOPS WEEKLY ISSUE #556 last week, I will skip it.

Root cause of failure, root cause of success

This one thoughtfully discusses why “root cause” is a flawed concept, approaching the idea from multiple directions.

Lorin Hochstein

Since it is covered in DEVOPS WEEKLY ISSUE #556 last week, I will skip it.

IBM PREVAIL Conference: October 19–21, 2021

Check it out, a new SRE conference! This one’s virtual and the CFP is open until October 1.

Robert Barron — IBM

An introductory article on the “IBM PREVAIL Conference” featured in DEVOPS WEEKLY ISSUE #557 above.
In that issue, the editor commented “The call for papers is open until 10th of September.”, but this article says “Submission deadline: October 1, 2021”. Was the deadline updated?

Notes on the Perfidy of Dashboards

To be clear, this article is about static dashboards that just contain pre-set graphs of specific metrics.

every dashboard is an answer to some long-forgotten question

Charity Majors

As the title suggests, it digs deep into the pitfalls of (static) dashboards and where they are still worth using.
○ STATIC VS DYNAMIC DASHBOARDS
○ DEBUGGING WITH DASHBOARDS: IT’S A TRAP
○ IF WE DID MATH LIKE WE DO DASHBOARDS
○ THE LIMITATIONS OF METRICS AND DASHBOARDS
○ OTHER COMPLAINTS ABOUT DASHBOARDS:
○ IN CONCLUSION

What makes public posts about incidents different from analysis write-ups

Public incident posts give us useful insight into how companies analyze their incidents, but it’s important to remember that they’re almost never the same as internal incident write-ups.

John Allspaw — Adaptive Capacity Labs

As the title suggests, it explains why public posts published by companies about incidents differ from internal incident write-ups that represent effective incident analysis, and why this difference is important.

Heroku Incident #2300 Follow-Up

In this incident from July 7, front-line routing hosts exceeded their file descriptor limits, causing requests to be delayed and dropped.

Heroku

As mentioned above, this is a follow-up article on the incident that occurred on Heroku on July 7, 2021.
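For my own reference, here is a small Python sketch of how a process can compare its open file descriptors with its limit. It is not how Heroku detected or fixed the incident. The helper names and the 0.8 threshold are hypothetical, the `resource` module is Unix-only, and reading `/proc/self/fd` assumes Linux.

```python
import os
import resource  # Unix-only standard library module


def is_near_fd_limit(open_fds, soft_limit, threshold=0.8):
    """Return True when open descriptors reach `threshold` of the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # no effective limit configured
    return open_fds >= threshold * soft_limit


def current_fd_usage():
    """Return (open_fds, soft_limit) for the current process.

    Assumption: Linux, where /proc/self/fd lists the open descriptors.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft_limit
```

On a Linux host, `is_near_fd_limit(*current_fd_usage())` gives a quick yes/no answer that could feed an alert before requests start being delayed or dropped.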

TLDs — Putting the ‘.fun’ in the top of the DNS

.io, assigned to the British Indian Ocean Territory is almost exclusively used by annoying startups for content completely unrelated to the islands.

Remember, it’s all fun and games until the random country you’ve attached your business to has an outage in their TLD DNS infrastructure.

Jan Schaumann

Since it is covered in DEVOPS WEEKLY ISSUE #557 above, I will skip it.

Why Observability Requires a Distributed Column Store

If you’re curious about just what a columnar data store is like I was, this article is a good introduction.

Alex Vondrak — Honeycomb

It explains what a Distributed Column Store is, its capabilities, and why a Distributed Column Store is a fundamental requirement for achieving observability.
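To make the "column store" part concrete for myself, here is a toy Python comparison of a row layout and a column layout. It is only a sketch of the storage idea, not of Honeycomb's implementation, and it ignores the "distributed" part entirely. The event fields and the `to_columns` helper are made up for illustration.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per field (column layout)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


# Row layout: each event is stored together with all of its fields.
rows = [
    {"service": "api", "status": 200, "duration_ms": 12},
    {"service": "api", "status": 500, "duration_ms": 340},
    {"service": "web", "status": 200, "duration_ms": 25},
    {"service": "web", "status": 500, "duration_ms": 410},
]

# Column layout: each field is stored contiguously.
columns = to_columns(rows)

# A query such as "average duration of failed requests" only needs to read
# two columns, no matter how many other fields each event carries.
failed = [
    duration
    for status, duration in zip(columns["status"], columns["duration_ms"])
    if status == 500
]
average_failed_ms = sum(failed) / len(failed)
```

With wide events that carry many fields, reading only the columns a query touches is the property that matters here.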

Outages

KubeWeekly #275 September 3rd, 2021

The Headlines

Editor’s pick of the highlights from the past week.

From incubation to augmentation, how software projects grow

It explains how software projects grow, using the CNCF OpenTelemetry project as an example.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Windows came second

Daniel Prizmant, Palo Alto Networks

An approximately 33-minute session sharing threat investigations on Windows containers for cloud native apps.

Composing your way to a control plane powered future

Dan Mangum, Upbound

An approximately 55-minute session that explains how to define your own cloud API just by writing YAML using Crossplane’s Composition.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Enable seccomp for all workloads with a new v1.22 alpha feature

Sascha Grunert, Red Hat

As the title suggests, it explains a new alpha feature introduced in Kubernetes v1.22 that enables seccomp for all workloads.

Managing Kubernetes seccomp profiles with security profiles operator

It covers the topic in the title under the following headings.
○ Security Profile Operator features
○ Installation
○ Creating a seccomp profile
○ Using a seccomp profile
○ Profile inheritance using base profiles
○ Using profile bindings
○ Recording Profiles
○ Metrics and Log enrichment
○ Wrap

Shipwright — building container images in Kubernetes

Viktor Farcic

A 21-minute video explaining “Shipwright”, an extensible framework for building container images on Kubernetes.

Distributed tracing with Knative, OpenTelemetry and Jaeger

Ben Moss, VMware

It explains how to set up distributed tracing using Knative Eventing, how it can help you better understand your program, and how it works internally.

A Kubernetes engineer’s guide to mTLS

William Morgan, Buoyant

It explains what mTLS is, how it relates to “normal” TLS, why it is relevant to Kubernetes, and the strengths and weaknesses of mTLS and its alternatives. It also shows how to use Linkerd to add mTLS to a Kubernetes cluster.
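As a reminder of how mTLS differs from “normal” TLS, here is a minimal sketch using Python's standard `ssl` module: the only change on the server side is that the client must also present a certificate. This is not what Linkerd does internally, and the `make_mtls_server_context` helper and its file path parameters are hypothetical.

```python
import ssl


def make_mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that requires a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # "Normal" TLS leaves this at CERT_NONE: only the server proves its identity.
    # With CERT_REQUIRED, the handshake fails unless the client presents a
    # certificate that the server can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The server's own identity (hypothetical paths supplied by the caller).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The CA used to verify client certificates.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


plain_tls = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
mutual_tls = make_mtls_server_context()
```

In a cluster, every service would need a certificate and a trusted CA for this to work, which is the operational burden a service mesh takes over.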

Service Mesh 101: The role of Envoy

Scott Lowe, Kong

It describes what a service mesh is, what it does, and where Envoy fits into it. If you want more detail on the basics of Envoy configuration in a service mesh, see the follow-up article, “Service Mesh 102: Envoy Configuration”.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Updates on Google’s continued collaboration with NIST to secure the software supply chain

Eric Brewer and Dan Lorenc, Google

A report on Google’s participation in, and announcements at, the White House Cybersecurity Summit hosted by President Biden.
It says that Google will collaborate with the U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) to support and develop a new framework that will help to improve the security and integrity of the technology supply chain.
Google also committed to invest $10 billion over the next five years to expand zero-trust programs, help secure the software supply chain, and enhance open-source security.
A lot of information is introduced with links, so I would like to read each one.

Unicron, with Daniel Megyesi

Craig Box, Kubernetes Podcast from Google

The Kubernetes Podcast, run by Google employees. This time the host is Craig Box, with guest host Jimmy Moore.
The guest is Daniel Megyesi, a DevOps engineer and maintainer of Unicron, the central platform for big data and machine learning at Adevinta. Below is his article introducing Unicron.
○ Introducing Unicron, our big data and Machine Learning platform
The topics that interested me in the news of the week are as follows.
○ Google commits $10 billion to advance cybersecurity
○ ingress-nginx 1.0.0
○ VMware announces Tanzu Application Platform

How FinOps changed the way businesses approach the cloud

A guest post for CNCF. The original article was published on the “Virtasant blog” under the same title.
FinOps keywords and reference materials such as “FinOps Foundation” and “State of FinOps Report 2021” are linked, so it seems to be a good introduction.

Docker is updating and extending our product subscriptions

An article about the updated and extended product subscriptions for large-scale and commercial use of Docker Desktop.
The conditions for commercial use are as follows. Personal use remains free. The article clearly states which components remain free, so check it if you are uncertain.
○ Organizations with more than 250 employees or more than $10 million in annual revenue will need a paid subscription, by the end of January 2022 at the latest
The new terms take effect on 2021/08/31, and the grace period runs until the end of January 2022.

A guide to spot-readiness in Kubernetes

Michael Dresser & Alex Thilen, Kubecost blog

As the title suggests, this is a guide to using Spot Instances in a Kubernetes environment. It explains why they are worth using and introduces “Kubecost’s Spot-Readiness Checklist” and “Kubecost” under the following headings.
○ What are spot instances and why use them?
○ What’s the customer challenge today?
○ Enter… Kubecost’s Spot-Readiness Checklist
○ Implement spot nodes in your cluster using Kubecost, for free!

September 2021 update

Daniel Holbach, Flux

The September update for Flux. It recaps August under the following headings.
○ Flux Project Facts
○ News in the Flux family
○ Upcoming events
○ In other news
○ Over and out

How Istio, Tempo, and Loki speed up debugging for microservices

Antonio Berben, Solo.io

Published on the Grafana Labs blog, it introduces the tools in the title and provides a hands-on walkthrough of them together with Grafana Cloud, from the perspective that “Having a diagram which displays all elements involved in a request through microservices increases the speed to find bugs or to understand what happened in your system when running a postmortem analysis.”

Why cloud native open source is critical for Twitter and Spotify

Alex Williams and B. Cameron Gain, The New Stack

An approximately 31-minute podcast sponsored by CNCF, with an accompanying overview article. In line with the title, it is interesting to hear about the technical design process and efforts of both companies.

Upcoming CNCF Online Programs

Live Webinar

September 7 at 10am PT: Kubernetes 1.22 release presented by Savitha Raghunathan, James Laverack & Jesse Butler, Kubernetes 1.22 Release Team — RSVP

Cloud Native Live

September 8 at 9am PT: Kubernetes clusters need persistent data presented by Alex Chircop, StorageOS — RSVP

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Program

How about those articles? Did any of them catch your interest?

To be honest, there is some content I have not been able to digest yet, so I will also use this aide-memoire and these links to catch up myself.

Bye now!!

Yoshiki Fujiwara