SRE / DevOps / Kubernetes Weekly Collection #57 (Week 9, 2021)

  • In this blog post series, I collect articles from the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
  • Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #531 February 28th, 2021
SRE Weekly Issue #259 February 28th, 2021
KubeWeekly #253 March 5th, 2021

DEVOPS WEEKLY ISSUE #531 February 28th, 2021


Moving to a new platform can have performance implications. This post goes over how one team designed experiments to work out what was going on after moving to Kubernetes and how to fix it.

  • The title is “How We Minimized the Overhead of Kubernetes in our Job System”.
  • I will skip it because it was covered in KubeWeekly#252 last week.

A detailed post on best practice for logging in AWS focused on security use cases.

  • The title is “Security Logging in Cloud Environments — AWS”.
  • Part of the blog post series “Continuous Visibility into Ephemeral Cloud Environments”, this article describes a design for a state-of-the-art, multi-account security logging platform in AWS.
  • Later posts of this series will cover a similar setup for both GCP and Kubernetes.

A post on how the software community came to appreciate systems administrators a little more with the hugops movement.

  • The title is “An oral history of #hugops: How tech’s first responders built a culture of empathy”
  • It explains how the engineers who keep the cloud running built their own culture of empathy, tracing how the Twitter hashtag #hugops spread out of stories about the hardships of operations engineers.

Some good tips for scaling infrastructure as code across teams and organizations. Observations about public modules, standards, reusable code, having a formal release/versioning process and more.

  • The title is “Infrastructure as Code at Enterprise Scale: Identify the Right Approach for Your Organization”.
  • It focuses on the two largest public clouds, AWS and Azure, and offers tools and detailed guidelines to help scale the IaC approach.
  • How to define “enterprise” in the title is up to the reader. The author puts it as follows.
    ○ How you define “enterprise” is up to you: whether you’re a Fortune 500 company or a garage-based upstart, this guide is for you.

JSON comes in a surprisingly large number of formats, with subtle differences. Throw in different JSON parsers in different languages and there is the potential for vulnerabilities caused by interoperability issues.

  • The title is “An Exploration of JSON Interoperability Vulnerabilities”.
  • The TL;DR is as follows; the link in it leads to the hands-on labs on GitHub.
    ○ TL;DR The same JSON document can be parsed with different values across microservices, leading to a variety of potential security risks. If you prefer a hands-on approach, try the labs and when they scare you, come back and read on.
  • The author explains JSON INTEROPERABILITY SECURITY RISKS in the following five categories.
  1. Inconsistent Duplicate Key Precedence
  2. Key Collision: Character truncation and Comments
  3. JSON Serialization Quirks
  4. Float and Integer Representation
  5. Permissive Parsing and Other Bugs
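
As a minimal sketch of the first category, the snippet below shows how Python's built-in json module handles a duplicate key (it silently keeps the last value), and one possible way to reject such documents instead. The payload and the helper names `reject_duplicates` and `strict_loads` are hypothetical and are not taken from the article; other parsers may behave differently, which is exactly the interoperability risk.

```python
import json

# Hypothetical payload: the key "qty" appears twice.
raw = '{"item": "apple", "qty": 1, "qty": -1}'

# Python's built-in json module silently keeps the last duplicate.
lenient = json.loads(raw)
print(lenient)


def reject_duplicates(pairs):
    """object_pairs_hook that raises if any key repeats within one object."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def strict_loads(text):
    """Parse JSON, refusing objects that contain duplicate keys."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


try:
    strict_loads(raw)
except ValueError as err:
    print("rejected:", err)
```

If two services in a pipeline disagree on which duplicate wins, rejecting the document at the boundary is one conservative option; whether it fits depends on the clients you need to support.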

A good roundup of Linux server monitoring. Looking quickly at sar, vmstat, nethogs and monitorix.

  • The title is “Linux System Monitoring Fundamentals”.
  • As the title says, it covers the fundamentals and introduces the following four Linux system monitoring tools as important and worth further investigation.
    ○ sar
    ○ vmstat
    ○ Monitorix
    ○ nethogs

A post on Kubernetes robustness, showing with examples how to bring up various Kubernetes services after failure.

  • The title is “Breaking down and fixing Kubernetes”.
  • First of all, I was scared by the “rm -rf /etc/kubernetes” illustration at the beginning. It introduces this command and explains how to destroy a Kubernetes cluster, delete a certificate, and recover from it.
  • There is also an etcd version of the article “Breaking down and fixing etcd cluster” by the same author, which is good for understanding the file structure and behavior of Kubernetes.

A comparison of System Manager Parameter Store and the newer Secrets Manager for managing secrets in AWS environments.

  • The title is “Parameter Store vs Secrets Manager”.
  • The illustration at the top of the web page is “Street Fighter II: Ryu vs Ken”!
  • As the title says, it compares the two services in the following structure.
    ○ Round 1: Key Value Store
    ○ Round 2: Storage Limitations
    ○ Round 3: Encryption
    ○ Round 4: Rotation
    ○ Round 5: Cost
    ○ The Verdict

A nice worked example of live debugging using VSCode when you have a monorepo application and multiple container-based applications.

  • The title is “Seamless Multi-Container Live Debugging in VSCode | DevContainers on Steroid”.
  • It explains remote live debugging of multi-container workspaces or monorepo-style workspaces for containerized apps.
  • The source code can be found on this GitHub page.


cloudquery transforms your cloud infrastructure into SQL or Graph database for easy monitoring, governance and security.

  • A GitHub page of “cloudquery”, a tool for pulling, normalizing, publishing and monitoring cloud infrastructure and SaaS apps as SQL or Graph (Neo4j) databases.

A new bash-like shell with a few interesting features. In-line spell checking, typed pipelines, built-in testing framework, user-friendly error handling and more.

  • A GitHub page of “murex”, a Shell like bash / zsh / fish / etc.
  • It follows the same syntax as a POSIX shell like Bash, but supports more advanced features than you would normally expect from a $SHELL.

SRE Weekly Issue #259 February 28th, 2021


Increment: Reliability

This quarter’s Increment issue is about Reliability, and I haven’t had this much fun since their first issue about on-call. I’ll include a few of the articles here and more in later issues as I have a chance to review them.


  • Issue 16 (February 2021) of “Increment”, a print and digital magazine about how teams build and operate software systems at scale, has the theme “Reliability”. The following three articles from this issue are picked up here.

[Increment: Reliability] Everything is broken, and it’s okay

Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.

Understanding that no system is without errors is critical to building resilient systems.

Heidi Waterhouse

  • As the subtitle states, “Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.” The article explains this through the following points.
    ○ Control is an illusion
    ○ Failure is inevitable
    ○ Responding to fragility
    ○ Designing against disasters
    ○ Accept imperfection, within limits

[Increment: Reliability] How to build organizational resilience

The very first sentence sets the tone, and I love it:

Resilience is a process: something you must actively perform, not something you check off a list once.

Ryn Daniels

  • As the subtitle states, “By encoding resilience into an organization’s culture, engineering teams can be better equipped to tackle the unknown and unexpected.” It explains how to build a growth-oriented culture that can keep learning, improving, and building resilience for years to come.

[Increment: Reliability] Embrace your inner incident commander

Most of all, having an incident commander only works if everyone believes in the role. Someone stepping in to address a crisis and saying “I’m Batman” doesn’t help unless people have bought into the idea of Batman.

The next time I’m incident commander, I am totally going to jump in and say, “I’m Batman!”.

This article is a great primer on what an IC is and how to adopt incident command at your organization.

Tanya Reilly

  • Through the following points, it explains that how you fight a fire affects how quickly an outage is resolved, that appointing an incident commander can help, and that the reader can be one.
    ○ Enter incident command
    ○ The incident commander’s role
    ○ Making it work
    ○ You’ve got to believe
    ○ It’s your turn

Retry pattern in microservices

After reading this blog post, you will have an understanding of the retry pattern used in microservices architecture, why it should be used, a few considerations while using the retry pattern, and how to use it in Python.

I love the W. C. Fields quote.

Anand Prashant

  • The content is as described above and follows the structure below. The figures and code are presented in an easy-to-understand way.
    ○ Microservices
    ○ Retry pattern
    ○ Considerations
    ○ Adding delays between retries
    ○ Retrying only on certain exceptions
    ○ Few other considerations
    ○ Conclusion
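
The ideas in the outline above (delays between retries, retrying only on certain exceptions) can be sketched as follows. This is a minimal illustration under my own assumptions, not the article's code: the `retry` helper, the `TransientError` class, and the parameter names are all hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type representing a failure worth retrying."""


def retry(func, retries=3, base_delay=0.1, retry_on=(TransientError,), sleep=time.sleep):
    """Call func(); on a retryable exception, wait and try again.

    The delay doubles after each failed attempt (exponential backoff), and a
    small random jitter is added so that many clients do not retry in lockstep.
    Exceptions not listed in retry_on propagate immediately.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


# Usage: a hypothetical flaky call that succeeds on the third attempt.
attempts = []


def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("temporary failure")
    return "ok"


print(retry(flaky_call))
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the delays instead of actually waiting.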

2021 Site Reliability Engineering (SRE) Survey Now Open

It’s that time again! Be sure to fill out the survey, not only so they can gather useful data, but also because Catchpoint will donate $5 to charity.

DevOps Institute, Catchpoint, and VMware Tanzu

  • An announcement of the above survey by DevOps Institute, which will compile the results into a report and publish it.
  • The deadline is April 1, 2021, and the charity donation applies as mentioned above. You can take the survey from “Take the survey now”.

QA Engineers, This is How SRE will Transform your Role

When considering the value of a QA test, SLIs can provide very valuable context.

SRE and QA can work hand in hand.

Emily Arnott — Blameless

  • Citing an illustration from Alex Hidalgo’s “Implementing Service Level Objectives”, it explains that “When implementing SRE, almost every role within your IT organization will change. One of the biggest transformations will be in your Quality Assurance teams.”

Silent data corruption: Mitigating effects at scale

This kind of thing keeps me up at night. Silent data corruption can destroy your reliability just as quickly as a backhoe on a non-redundant link.

Harish Dattatraya Dixit — Facebook

  • Drawing on the underlying paper, it describes best practices for detecting and remediating silent data corruptions at a scale of hundreds of thousands of machines.
  • Click the link for the full version of the paper “Silent data corruptions at scale”.

How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020

Etsy experienced years of growth practically overnight in 2020 as quarantines set in. Here’s how they handled it.

Mike Adler — Etsy

  • The story summarized by the editor above is told in the following structure. It shows an organization in which a blameless post-mortem culture works.
    ○ The Challenge
    ○ Modulating Our Pace of Change
    ○ Adapting Our “Macro” Load Testing
    ○ Modeling History To Inform Capacity Planning
    ○ Cresting The Peak
    ○ Gratitude


KubeWeekly #253 March 5th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Schedule for KubeCon + CloudNativeCon Europe 2021 — Virtual is now available!

KubeCon + CloudNativeCon Europe 2021 Virtual is happening May 4–7, 2021 and the schedule is now available. Experts from organizations including Adobe, Apple, CERN, NVIDIA, and OVHcloud will deliver 100+ sessions, keynotes, lightning talks, and breakout sessions. There will also be more than 60 sessions hosted by project maintainers — spanning beginner-level introductions, end user case studies, and technical deep dives.

  • As mentioned above, the schedule for KubeCon + CloudNativeCon Europe 2021 Virtual has been released. In Japan, it falls in the latter half of the Golden Week (GW) holidays, so I can take my time deciding which sessions to watch.
  • The article also introduces a community-curated schedule, and I plan to watch the session below.
    ○ The community-curated schedule will feature sessions from leading open source technologists, including:
    ■ “Your Path To Non-code Contribution In The Kubernetes Community” — Kaslin Fields, Google; Kat Cosgrove, JFrog; Matt Broberg, Red Hat; Kohei Ota, HPE

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Top Kubernetes Health Metrics You Must Monitor

Ajit Chelat, Logiq

  • It specifically describes Kubernetes health metrics that should be monitored.
  • The Table of Contents is below.
  1. Crash Loops
  2. Cluster State Metrics
  3. Disk and Memory Pressure
  4. Network Unavailable
  5. CPU Utilization
  6. Job Failures
  7. DaemonSets
  8. Monitoring Kubernetes Health Metrics

Troubleshooting Services on Google Kubernetes Engine by Example

Yuri Grinshteyn, Reliability Engineer, Google Cloud

  • The following two are explained.
    ○ We’ll walk through deploying a sample app to your cluster and configuring an alerting policy that will notify you if there are any container restarts observed.
    ○ From there, we’ll trigger the alert and explore how the new GKE dashboard makes it easy to identify the issue and determine exactly what’s going on with your workload or infrastructure that may be causing it.
  • The video with the above title from the “Stack Doctor (#stackdoctor)” series on the Google Cloud Tech YouTube channel is also embedded in the web page.

Protocol Detection and Opaque Ports in Linkerd

Charles Pretzer, Buoyant

  • The Linkerd 2.10 release adds a new feature, “Opaque Ports”. Because the Linkerd community on Slack and GitHub has asked quite a few questions about it, the article focuses on one of the most important underlying features that enables Linkerd to perform this feat: Protocol Detection.

Integrating Backpressure into the Infrastructure

Simone Busoli, NearForm

  • I could not reach the linked web page (as of 2021/03/06 12:35 JST). I also tried clicking the blog title from the top of the web page, but that did not work either. What happened?

Multi-Cluster Monitoring with Thanos

Kevin Lefevre, CTO, Particle

  • It explains the limitations of a Prometheus-only monitoring stack and why moving to a Thanos-based stack can improve metrics retention and also reduce overall infrastructure cost.

Securing Istio Workloads with mTLS Using Cert-Manager

Josh van Leeuwen, Jetstack

  • Based on the history and the current situation, it shares what the team has done and learned. Working with the Security WG in the Istio community, as well as a number of its customers, Jetstack’s cert-manager team has built an integration that enables cert-manager to sign workload certificates in an Istio service mesh.

Understanding the Kubernetes Event Horizon

Bryan Boreham, WeaveWorks

  • As the title suggests, it explains Kubernetes Events with example log output and includes the following warning.
    ○ Warning: ‘kubectl get events’ can spew out a lot of information, especially as your cluster gets busier. Sadly it does not list the events in timestamp order, so you either have to have some idea what you are looking for, or pipe the output to a file and analyze it with the Mk 1 eyeball.
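
One way to work around the ordering problem in the warning is to sort the events yourself. The sketch below assumes the events were exported with `kubectl get events -o json` and sorts them by `metadata.creationTimestamp`; the sample data and the helper name `events_in_order` are hypothetical. On the command line, `kubectl get events --sort-by=.metadata.creationTimestamp` is an alternative worth trying.

```python
import json

# Hypothetical, trimmed-down sample in the shape of `kubectl get events -o json`.
sample = json.dumps({
    "items": [
        {"metadata": {"creationTimestamp": "2021-03-05T10:02:00Z"},
         "type": "Warning", "reason": "BackOff"},
        {"metadata": {"creationTimestamp": "2021-03-05T10:00:00Z"},
         "type": "Normal", "reason": "Scheduled"},
        {"metadata": {"creationTimestamp": "2021-03-05T10:01:00Z"},
         "type": "Normal", "reason": "Pulled"},
    ]
})


def events_in_order(events_json):
    """Return the events sorted oldest-first by metadata.creationTimestamp."""
    items = json.loads(events_json)["items"]
    # Timestamps in this uniform UTC format sort correctly as plain strings.
    return sorted(items, key=lambda e: e["metadata"]["creationTimestamp"])


for event in events_in_order(sample):
    print(event["metadata"]["creationTimestamp"], event["type"], event["reason"])
```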

Introduction to Litmus Chaos | Rawkode Live

David McKay

  • A 90-minute Webinar video with the above title. There is also a demo, and you can jump to the part you want to see with the chapter function on the right side.

Canary Deployments using Ketch

Saiyam Pathak, Civo

  • A 7-minute Webinar video with the above title. I thought it was nice that the author kindly responds in the comments section to viewers asking for introductory content.

How to Manage Multi-Cluster Kubernetes with Operators

Sascha Haase, Kubermatic

  • It explains why you need multi-cluster management, how Kubermatic Kubernetes Platform leverages Kubernetes Operators to automate cluster lifecycle management across multiple clusters, clouds, and regions, and how to get started today.

Getting Started With Kubernetes: Clusters and Nodes

Sofia Parafina, Pulumi

  • It explains how to use infrastructure as code to create basic Kubernetes objects and high-level abstractions built on them.
  • Specifically, it describes how to use Pulumi to set up a Kubernetes cluster on AWS, Azure, and GCP. The details of creating a cluster depend on the cloud provider, but the process is generally the same.
  • It is the first article in a series on using infrastructure as code for Kubernetes. The next article will explain basic Kubernetes objects such as Pod, Service, and Volume.

Migrating Jenkins Freestyle Job to Multibranch Pipeline

Aman Bisht, Infracloud

  • It explains why one of their enterprise customers needed to switch to Jenkins’ multi-branch pipeline and how it made their lives easier.
    ○ Freestyle Vs Pipeline jobs
    ○ Why did we move to Multibranch Pipeline?
    ○ Sample Jenkinsfile Template
    ○ Benefits of Multi-branch Pipeline
    ○ Challenges
    ○ Conclusion

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Rethinking your Company’s Cloud Security in the Shadow of the SolarWinds Attack
Amir Kaushansky & Leonid Sandler @ARMO

  • It analyzes the SolarWinds attack to give a deeper understanding of vulnerabilities in cloud-native environments such as Kubernetes, and then lists effective measures to eliminate or mitigate the risks inherent in cloud environments.

Demystifying Kubernetes Network Policy

Thomas Graf @Isovalent

  • It covers everything from the basics of Kubernetes network policy to more advanced concepts.
  • It proceeds step by step, from setting simple policies to finding and avoiding conflicting rules, checking for common mistakes, and tackling harder topics such as advanced real-world policy examples similar to those implemented by key Kubernetes users.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

In the Clouds: DevSec + SecOps w/ Kirsten Newcomer

Chris Short and Kirsten Newcomer, Red Hat

  • An approximately one-hour session that explains and discusses themes such as the following, in line with the title.
    ○ Security isn’t just for Ops teams anymore — what do we need to do to make security a focal point of app dev as well? And why is security important for containers and Kubernetes?

How I Became a Kubernetes Maintainer in 4 hours a Week

Matthew Broberg, Red Hat

  • The author shares what he learned about contributing to Kubernetes and hopes it helps readers find the focus and time to join in.

7 Reasons to Adopt a Kubernetes Native Backup Solution

Gaurav Rishi, Kasten

  • Here are seven reasons why Kubernetes native backup solutions are the best way to protect your expanding Kubernetes environment.
  1. It accommodates Kubernetes deployment patterns.
  2. It aligns with “Shift-left” development.
  3. It simplifies operations.
  4. It accommodates multi-cluster scalability.
  5. It closes protection gaps.
  6. It bolsters security.
  7. Integration with the cloud native ecosystem.

How Fidelity Investments Built its Multi-cloud Strategy with Cloud Native Technologies


  • A case study of Fidelity Investments, organized into the following items.
    ○ Challenge
    ○ Solution
    ○ Impact
    ○ One issue that quickly arose was that Fidelity also had distributions of Kubernetes on-prem, as well as on other cloud providers. How could they introduce, for example, a new security process across 1,000 distributed applications?
  • The web page has an embedded video “End User Panel: GITOPS in the Enterprise -Real World Experiences — Cheryl Hung” that shares case studies.

Upcoming CNCF Online Programs

This Week in Cloud Native (Livestream): Kubernetes Community Days: Ask me Anything
Bill Mulligan @CNCF

March 10, 2021
Register Now

Deploying K3s at the Edge for Multiplayer Gaming
Marco Mancini @OpenNebula

March 11, 2021
Register Now

CNCF Online Programs Playlist on YouTube
Check out our playlist for more curated content you don’t want to miss! New content is added every Friday.

How about these articles? Are you interested in any of them?

Actually, there is some content I cannot digest at this stage, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #CKA, #CKAD, #Certified AWS SAP
