SRE / DevOps / Kubernetes Weekly Collection #87 (Week 39, 2021)

Oct 3, 2021

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #561 September 26th, 2021
SRE Weekly Issue #289 September 26th, 2021
KubeWeekly #279 October 1st, 2021

DEVOPS WEEKLY ISSUE #561 September 26th, 2021

News

The 2021 Accelerate State of DevOps Report is out, with advice on software delivery metrics, cloud adoption, the importance of documentation and more.

The title is “2021 Accelerate State of DevOps report addresses burnout, team performance”.
In it, the DORA (DevOps Research and Assessment) team at Google Cloud introduces the 2021 Accelerate State of DevOps Report and its new findings.

An interesting interview covering client library and SDK strategy, the importance of boring tools and adopting new technologies.

The title is “How Paul Osman thinks about long-term strategies, open telemetry, and the value of boring systems”.
This is an article in “Level-Up”, a series of interviews with standout engineering leaders to learn what’s top of mind for them. As the title suggests, the author, who is also the interviewer, has extracted the key points from a video interview with Paul Osman, a Staff Platform Engineer at Honeycomb.
The full interview video, about 30 minutes long, is embedded at the end.

A post on the evolution of distributed tracing over the past 5 years.

The title is “Five years evolution of open-source distributed tracing”.
It summarizes what the author has learned in the five years since starting work on an OSS distributed tracing project, shows the evolution of the OSS distributed tracing space in chronological order, and peeks into the future.
The explanation focuses on the following projects, which are listed in the article's References section.
○ Hawkular APM: https://hawkular.gitbooks.io/hawkular-apm-user-guide/content/
○ Jaeger Tracing: https://www.jaegertracing.io/
○ Zipkin: https://zipkin.io/
○ OpenTelemetry: https://opentelemetry.io/
○ OpenCensus: https://opencensus.io/
○ Hypertrace: https://www.hypertrace.org/
○ SigNoz: https://signoz.io/
○ Apache SkyWalking: https://skywalking.apache.org/
○ W3C Trace-Context: https://www.w3.org/TR/trace-context/
○ Grafana Tempo: https://github.com/grafana/tempo

As with most things, the DORA metrics can be used poorly or misrepresented. As the post states, when a measure becomes a target, it ceases to be a good measure.

The title is “How DevOps teams are using — and abusing — DORA metrics”.
As the title and the editor's comment above suggest, the article discusses the use and misuse of DORA (DevOps Research and Assessment) metrics under the following points.
○ DORA metrics can be a double-edged sword
○ Here’s the problem you really need to solve
○ It all starts with building the right culture
○ Keep learning

Pull requests as an attack vector. A well explained example of a potential attack, and some specific advice for others to help avoid this sort of supply chain attack.

The title is “Anatomy of a Cloud Infrastructure Attack via a Pull Request”.
It focuses on an April 2021 vulnerability that allowed malicious pull requests to a GitHub repository to gain access to production, and on the key areas where the authors are improving their CI/CD tools and practices. It covers the following points.
○ Context
○ Technical details
○ Response
○ Advice to others
○ Wrap-up

A deep dive, multi-page, look at Linux Page Cache. If you’re administering Linux machines then understanding this can help with debugging various IO issues.

The title is “SRE deep dive into Linux Page Cache”.
This is the Chapter 0 page of a series describing the Linux page cache. The chapter structure is as follows.
0. Linux Page Cache for SRE
1. Prepare environments
2. Essential theory
3. Basic file operations
4. Eviction and page reclaim
5. More about mmap()
6. Cgroup v2
7. Unique set and working set
8. Direct IO
9. Advanced tools

Kubernetes is a lot when it comes to operating a new system. This post is a good set of common beginner errors.

The title is “Common Kubernetes Errors Made by Beginners [2021]”.
From the common beginner errors the author has noticed over years of experience with Kubernetes and consulting for various clients, the article selects and explains the following six.

○ The selector of the labels on the service does not have a match with the pods
○ Wrong container port mapped to the service
○ CrashLoopBackOff
○ Liveness and readiness probes
○ Resources — Requests and Limits
○ Too many load balancer–type services
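
As an aide-memoire for the first error in the list, here is a minimal Python sketch of how an equality-based Service selector is compared with pod labels: a pod is selected only when its labels contain every key/value pair of the selector. The function names, pod names, and labels are all hypothetical and are not taken from the article; this is only an illustration of the matching rule, not how Kubernetes implements it.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Return True when every selector key/value pair appears in the pod labels."""
    if not selector:
        # Assumption for this sketch: an empty selector selects nothing,
        # mirroring the fact that a Service without a selector does not get
        # endpoints created for it automatically.
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: Dict[str, str], pods: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of the pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]


# Hypothetical Service selector and pod labels, for illustration only.
service_selector = {"app": "web", "tier": "frontend"}
pods = {
    "web-1": {"app": "web", "tier": "frontend", "version": "v2"},
    "web-2": {"app": "web", "tier": "front-end"},  # label value differs by a hyphen
    "db-1": {"app": "db"},
}

print(matching_pods(service_selector, pods))
```

In this made-up data, the pod whose label value differs by a single hyphen is not selected, which is the kind of silent mismatch the first error in the list describes.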

Events

The Data on Kubernetes community has an upcoming event on 12th October 2021. Lots of interesting talks for anyone running databases or stateful workloads on top of Kubernetes.

The Data on Kubernetes Community (DoK) will host the following event.
○ DoK Day North America 2021 @ KubeCon
○ Tuesday, October 12 9:00 AM — 5:00 PM PDT
○ Virtual + Los Angeles, California

Tools

Jspolicy is another Kubernetes policy agent, this time focused on supporting authoring policies in Javascript or Typescript.

This is the web page for “jsPolicy”, a simpler and faster Kubernetes policy engine that uses JavaScript or TypeScript.
Under “Why yet another policy engine for Kubernetes?”, there is a comparison table with OPA and Kyverno, which makes it easy to see where jsPolicy claims its advantages.
The project also has a GitHub page.

GitOops is a tool to help attackers and defenders identify lateral movement and privilege escalation paths in GitHub organizations by abusing CI/CD pipelines and GitHub access controls.

The GitHub page of “GitOops”, a tool that helps attackers and defenders identify lateral movements and privilege escalation paths in GitHub organizations by exploiting CI/CD pipelines and GitHub access controls.

SRE Weekly Issue #289 September 26th, 2021

Articles

How SREs are unique in their approach to work

Here are some things that make SREs a unique breed in software work:

The one about Scrum caught my eye, and I followed the links through to the Stack Overflow post about SRE and Scrum.

Ash P — Cruform

It explains the following points that make SREs a unique breed in software work.
○ SREs look at the broader picture
○ SREs thrive in ambiguity
○ SREs work beyond constraints like Scrum
○ SREs don’t stay in their lane
○ SREs don’t have a monolith job description
○ Comparison with software developers

Linux Page Cache for SRE

An in-depth explainer on the Linux page cache, full of details and experiments.

Viacheslav Biriukov

Since it is covered in DEVOPS WEEKLY ISSUE #561 above, I will skip it.

Just got my first SRE job. I start tomorrow, any advice?

There’s some great advice in this reddit thread… and maybe some tongue-in-cheek advice too.

Take production down the first day they give access — then it’s nothing but up from there!

Various — reddit

The thread is full of advice in response to a question from someone about to start their first SRE job.

Dark Side of Self-Service

Using two real-world case studies, this article explains how developer self-service can go wrong, and then discusses how to avoid these pitfalls.

Kaspar von Grünberg — humanitec

Using real-life examples, it discusses how to strike the right balance between golden cages and golden paths.

What is expected in the SRE role? We analyzed 30 job postings to find out.

What a great idea! I found it especially interesting that only 34% of SRE job postings mention defining SLIs/SLOs/error budgets.

Pruthvi — Spike.sh

Based on an analysis of 30 SRE job postings from major companies such as Google, Twitter, and Slack, the article summarizes the top responsibilities of the SRE role as of 2021.

10 questions teams should be asking for faster incident response

For the first time, we’ve created the State of Digital Operations Report which is based on PagerDuty platform data.
[…]
we will walk through some of these findings and share 10 questions teams can ask themselves to improve their incident response.

Hannah Culver — PagerDuty

As the title suggests, the article shares the following 10 questions that teams can ask themselves to improve incident response.

○ What is our organization’s incident classification scheme?
○ Which incidents require coordinated response?
○ How do you get the right people involved?
○ How can you respond faster through automation?
○ Can you track whether a change led to an issue?
○ How do you map the overall impact?
○ How do you collaborate when time is the most valuable resource?
○ How do you keep stakeholders informed?
○ Do you have an efficient retrospective process?
○ How do you incorporate learnings into your response process?

How to avoid bad assumptions during incidents

Incident response so often gets mired in assumptions that need to be re-evaluated. This article uses an incident as a case study.

Lawrence Jones — incident.io

As the title and the editor's comment above indicate, the article walks through an incident as a case study under the following points.
○ Trust, but verify!
○ How the incident started
○ Where the incident went wrong
○ 3 lessons we learned during this incident investigation

SRE vs. DevOps: What are the Differences?

This one lays out clear definitions of SRE and DevOps and compares and contrasts them.

Matthew Gurgel — Rootly

As the title suggests, it unpacks the complex relationship between DevOps and SRE, explaining the differences between them and how to adopt both concepts, under the following points.
○ What are SRE and DevOps?
○ What are the differences between SRE and DevOps?
○ Why do you hear more about DevOps than SRE?
○ Conclusion: Using SRE and DevOps together

Merlion: A Machine Learning Library for Time Series

This week, Salesforce released Merlion, a Python library for time series machine learning and anomaly detection. Linked is an in-depth research paper on Merlion, explaining its theory of operation and experimental results.

Bhatnagar et al. — Salesforce

As mentioned above, this is a research paper on “Merlion”, the Python library for time series intelligence released by Salesforce.

Outages

reddit
Atlassian Statuspage.io
Giant Pay
Trello
Trello had major outages on two consecutive days.

KubeWeekly #279 October 1st, 2021

The Headlines

Editor’s pick of the highlights from the past week.

What to expect at KubeCon + CloudNativeCon North America

The New Stack

Priyanka Sharma and Jasmine James sit down with Joab Jackson of The New Stack to discuss this year’s schedule and agenda, how it will all compare to KubeCon+CloudNativeCon of years past and general cloud native trends.

KubeCon + CloudNativeCon North America 2021 is finally coming this month, on October 11–15. As mentioned above, CNCF GM Priyanka Sharma and co-chair Jasmine James were interviewed about this year's highlights, why next year's KubeCon + CloudNativeCon Europe 2022 will be held in Valencia, Spain's third largest city, and so on.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Kanister — Application level data operations on Kubernetes

Michael Cade & Pavan Devaraj, Kasten by Veeam

An approximately 61-minute session explaining the framework “Kanister” that supports application level data management in Kubernetes.
It is aimed at developers and ops teams interested in stateful applications on Kubernetes, and it demonstrates protection operations on a live MongoDB cluster.

Trace-based testing with OpenTelemetry

Michael Haberman, Aspecto

An approximately 52-minute session introducing “Malabi”, a JavaScript framework based on OpenTelemetry that makes trace data easy to use for testing.

Shifting security left - simplifying security for K8s & OpenShift environments

Jody Hunt, CyberArk

An approximately 51-minute session explaining best practices for protecting sensitive information such as secrets and credentials in DevOps, GitOps, and CI/CD pipelines through centralized secret management, including self-service for developers.
Careful explanations with easy-to-understand images and demonstrations will help you learn.

Redefining cloud native debugging

Noa Goldman, Rookout

An approximately 18-minute session that discusses debugging challenges when migrating to the cloud and best practices and tools to make migrating to the cloud easier and more secure.

OpenEBS 3.0: What’s in it?

Kiran Mova, MayaData

As the title suggests, an approximately 33-minute session explaining the changes in OpenEBS 3.0, with glimpses into what is coming in 4.0 and beyond.

The thing about your software supply chain…

Eylam Milner, Argon Security

An approximately 33-minute session that covers various security risks and vulnerabilities within the software supply chain and recreates live how attackers executed a few of the recent supply chain breaches.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Giving your legacy applications an API facelift

David La Motta, Kong

It shows how to create a very simple but powerful Kong Gateway Lua plugin and how to put an API in front of legacy applications that can only be accessed via the CLI.

OpenEBS 3.0 release

Kiran Mova, OpenEBS

It opens with an outline of OpenEBS, summarizes the features of the major versions so far in “A quick summary”, and then covers “What’s new in OpenEBS 3.0?” and “What is Next after OpenEBS 3.0?”. I also recommend watching the webinar in the “ICYMI” section above.

Flux Server-side reconciliation is coming

Daniel Holbach, Flux

As the title suggests, it explains the background, precautions, countermeasures, future prospects, etc. of migrating to the new server-side reconciler with the release of Flux 0.18.
The tl;dr is as follows.
○ tl;dr: Server-side reconciliation will make Flux more performant, improve overall observability and going forward will allow us to add new capabilities, like being able to preview local changes to manifests without pushing to upstream.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Podman, with Daniel Walsh and Brent Baude

An episode of the Kubernetes Podcast, which is run by Google employees. This time the host is Craig Box, with guest host Jimmy Moore.
The guests are Daniel Walsh, who leads Red Hat’s containers team, and Brent Baude, an architect at the company and the primary maintainer of Podman.
The topics that interested me in the news of the week are as follows.
○ Mirantis Flow “reinvents the datacenter”
○ Deis Labs introduces Hippo
○ Accelerating new features in Docker Desktop
○ CNCF DevSecOps radar

Battlesnake: KubeCon Cup 2021

CNCF

The “Experiences” page for KubeCon + CloudNativeCon North America 2021 introduces “Battlesnake: KubeCon Cup 2021”. Information on the “Virtual Games Lounge” and other content is available on the same page.

Cloud Native Computing Foundation announces agenda for KubeCon + CloudNativeCon + Open Source Summit China Virtual 2021

CNCF

As the title suggests, KubeCon + CloudNativeCon + Open Source Summit China will be held virtually on December 9 and 10, 2021. There are 105 sessions, with speakers scheduled from companies such as Alibaba, GitLab, Huawei, and Intel.

Automation is the future of cloud cost optimization

Laurent Gil, CAST.AI

Under the theme of automated cloud cost optimization, the article explains, under the following points, how automation is already helping companies cut their cloud bills.
○ How to control cloud costs? 4 approaches
○ Manual vs. automated approach to optimization
○ Here’s an example of automated optimization
○ 4 reasons why manual cost optimization just doesn’t work in the cloud
○ Automated cost optimization — case study
○ Conclusion: Automation is becoming the new normal

Upcoming CNCF Online Programs

Live Webinar

October 5 at 10am PT: Kubernetes 1.22 release presented by Savitha Raghunathan, James Laverack, & Jesse Butler, Kubernetes 1.22 Release Team

Cloud Native Live

October 6 at 9am PT: Next generation observability using open source monitoring presented by Scott Fulton, OpsCruise

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

How did you find these articles? Did any of them catch your interest?

To be honest, there is some content here that I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself, too.

Bye now!!

Yoshiki Fujiwara