SRE / DevOps / Kubernetes Weekly Collection #36 (Week 41)

  • In this blog post series, I round up the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
  • I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people looking for this kind of information.

DEVOPS WEEKLY ISSUE #510 October 4th, 2020
SRE Weekly Issue #238 October 4th, 2020
KubeWeekly #236 October 9th, 2020

DEVOPS WEEKLY ISSUE #510 October 4th, 2020


Open source has led to companies sharing undifferentiated code, but work on processes and policies is rarely done in the open. Given lots of companies are subject to the same compliance regimes, this is definitely wasteful. So I’m super interested to see this set of HIPAA policies.

  • The website for HIPAA (Health Insurance Portability and Accountability Act) compliance policies, an open source version of the Datica HIPAA Compliance Policies. I think it’s great that these policies are shared, and I am interested in reading them as well.
  • Click here for the GitHub page.

This post looks at how to embrace devops while maintaining least privilege controls in cloud environments.

  • The title is “The Three Ways of DevOps”.
  • It describes how to apply the concept of least privilege security to cloud instances within a DevOps environment without adding bulk or latency to your pipeline. It explains the following three ways of DevOps.
    ○ The first way of DevOps is to emphasize the speed and efficiency of the entire system, instead of just your part.
    ○ The second way of DevOps is fast feedback.
    ○ The third way of DevOps is continuous learning.

A look at the concept of emergence in system design and management and how emergent behaviour can be both positive and negative.

  • The title is “Complex Adaptive Systems (ii): thinking about emergence and ITSM”.
  • It focuses on “emergence”, one particular characteristic of such systems, with the aim of better understanding its impact on IT service management.
  • I could picture what the author was describing, but I could not read it smoothly because it used many phrases and expressions I am not comfortable with. I think that problem is specific to me, so I will skip the details.

A post on why moving from a monolithic to microservices architecture is time consuming, due to assumptions and coupling that need to be broken down.

  • The title is “Monolith -> Services: Theory & Practice”.
  • In response to the question, “How can we get from a monolith to micro-services quickly?”, the author answers:
    ○ Can’t answer that question.
    ● First, “quickly” is right out the window. You didn’t make this mess in a month; you’re not going to fix it in a month.
    ● Second, you want some benefit you aren’t currently getting that you expect from micro-services. What is that benefit? Micro-services aren’t the point.
  • He explains the following points.
    ○ Managing Coupling
    ○ Not Quickly
    ○ Good News

A look at building a service registry from jsonnet files in a Git repository. Interesting ideas about the benefits of standard service descriptions and building tools on top of that to make managing infrastructure and applications easier.

  • The title is “Why you need a service registry”.
  • It describes how GoCardless built the registry and some of the use cases it found.
  • I remembered seeing Jsonnet used in this way in the SRE Workbook and elsewhere. It made sense to me once I found out that they are heavy users of GCP.
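
To make the idea of a service registry more concrete, here is a minimal sketch in Python. It is not how GoCardless built theirs (the article uses Jsonnet files in a Git repository). Every field name, service name, and helper here is hypothetical and only illustrates how a standard service description lets small tools answer questions about the whole estate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    # Hypothetical schema; a real registry would define its own fields.
    name: str
    team: str
    repository: str
    environments: tuple = ()


class ServiceRegistry:
    """In-memory registry built from a list of standard service descriptions."""

    def __init__(self, services):
        self._by_name = {}
        for svc in services:
            # A registry is only useful if names are unique, so fail loudly.
            if svc.name in self._by_name:
                raise ValueError(f"duplicate service name: {svc.name}")
            self._by_name[svc.name] = svc

    def get(self, name):
        return self._by_name[name]

    def owned_by(self, team):
        return sorted(s.name for s in self._by_name.values() if s.team == team)

    def deployed_in(self, environment):
        return sorted(
            s.name for s in self._by_name.values() if environment in s.environments
        )


# Example entries, invented for illustration.
REGISTRY = ServiceRegistry([
    Service("payments-api", "payments", "example/payments-api", ("staging", "production")),
    Service("ledger", "payments", "example/ledger", ("production",)),
    Service("docs-site", "developer-experience", "example/docs-site", ("staging",)),
])

print(REGISTRY.owned_by("payments"))
print(REGISTRY.deployed_in("staging"))
```

Once every service is described in the same shape, questions such as “who owns this?” or “what runs in staging?” become simple lookups, which is the kind of tooling benefit the article highlights.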

Managing permissions at scale is hard. This post looks at one proposed approach, using attribute based access control, and some of its problems.

  • The title is “Security September: Still Early Days for ABAC”.
  • It describes exploits that allow someone to tag S3 buckets without permission, even with an explicit deny, and what this means for attribute-based access control (ABAC). The following two points seem to be the author’s main message.
  1. I strongly believe that at this point tags are a tool for auditing, billing and automation — not for access control.
  2. With the amount of services utilizing tags for their own purposes and subsequently tagging permissions being seen as a low risk permission, I recommend cloud security practitioners stick to traditional resource-based access control.
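
The concern above can be illustrated with a toy model. The sketch below is not AWS IAM and does not reproduce the exploit from the article. It is a minimal, hypothetical tag-matching check (all names are invented) that shows why an ABAC rule is only as strong as the control over who may change tags.

```python
def abac_allows(principal_tags, resource_tags, key="team"):
    """Toy ABAC rule: allow when principal and resource share the same tag value."""
    value = principal_tags.get(key)
    return value is not None and resource_tags.get(key) == value


def retag(resource_tags, key, value):
    """Return a copy of the resource tags with one tag overwritten."""
    updated = dict(resource_tags)
    updated[key] = value
    return updated


# Hypothetical principal and bucket tags; not real AWS objects.
outsider = {"team": "marketing"}
finance_bucket = {"team": "finance"}

# With the original tags, the outsider is denied.
before = abac_allows(outsider, finance_bucket)

# If the outsider can change the bucket's tags, the same rule now grants access.
after = abac_allows(outsider, retag(finance_bucket, "team", "marketing"))

print(before, after)
```

In this model, the permission to tag is effectively the permission to grant access, which is why treating tagging as a low-risk permission undermines tag-based rules.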

Balancing defaults between getting started quickly and scaling is often tricky. A good post on configuring sidecars to limit the memory consumed by large Istio clusters.

  • The title is “Istio at Scale: Sidecar”.
  • The following behavior is fine if the mesh has only a handful of services, but as the mesh begins to grow, it will consume resources (CPU, network I/O, memory) and take longer to propagate changes.
    ○ By default; all proxies on your mesh will receive all the config required in order to reach any other proxy.
    ○ It’s important to note that by “Config” — that isn’t just what you put in your DestinationRule or VirtualService, but also the state of all of the Endpoints for your services too, therefore whenever you do a Deployment, or a Pod becomes unready, or your service scales, or anything else that changes state — config is pushed to all the proxies.
  • As solutions to the above problem, it explains the following three topics, including the Sidecar resource provided by Istio.
    ○ Configuring the Sidecar
    ○ Monitoring Pilot
    ○ Detecting Misconfiguration
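
As a rough illustration of the first item, the sketch below builds an Istio Sidecar resource as a Python dictionary and prints it as JSON. The `apiVersion`, `kind`, and `spec.egress[].hosts` fields follow the Istio Sidecar resource. The namespace names and the helper function are assumptions made for this example and are not taken from the article.

```python
import json


def sidecar_manifest(namespace, extra_namespaces=()):
    """Build a Sidecar resource that limits which hosts the proxies in a namespace learn about."""
    # "./*" means services in the same namespace; "istio-system/*" keeps control-plane services reachable.
    hosts = ["./*", "istio-system/*"]
    # Any other namespace the workloads really depend on has to be listed explicitly.
    hosts += [f"{ns}/*" for ns in extra_namespaces]
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Sidecar",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"egress": [{"hosts": hosts}]},
    }


# Hypothetical namespaces, used only for illustration.
manifest = sidecar_manifest("payments", ["billing"])
print(json.dumps(manifest, indent=2))
```

With a resource like this in place, proxies in the namespace should receive config only for the listed hosts instead of the whole mesh, which is the memory and propagation saving the post is after.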

A walkthrough of configuring a high-availability Kubernetes cluster on Azure using Terraform, looking at the various services needed for a production setup, like Calico and Active Directory.

  • The title is “Use Terraform to Create and Manage a HA AKS Kubernetes Cluster in Azure”.
  • I will skip it because it was covered in KubeWeekly #235 last week.


Couler provides a DSL for creating and managing workflows on different workflow engines, including Argo Workflows, Tekton Pipelines, and Apache Airflow.

Pinot is a real-time distributed OLAP datastore, built to deliver scalable real-time analytics with low latency. It can ingest from batch data sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, Google Cloud Storage) as well as stream data sources (such as Apache Kafka).

  • The project website of Apache Pinot, a real-time distributed OLAP datastore that provides low-latency, scalable real-time analytics.
  • Click here for the GitHub page.

SRE Weekly Issue #238 October 4th, 2020


On Call Shouldn’t Suck: A Guide For Managers

Lots of really great advice in here. And really, with a title like that, I couldn’t resist reading it!

Charity Majors

  • A blog post for managers on how to make sure on-call does not suck and does not lead to the worst outcomes. Its key point is as follows.

It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.

Follow-up for Google Cloud Infrastructure Components Incident #20010

Last week, I mentioned a Google Cloud Platform outage that affected multiple services. Here’s the detailed post-analysis by Google.


  • A follow-up on the GCP outage that occurred on 9/24 from 18:00 to 18:33 US/Pacific.
  • The explanation starts from the roles of the Global Service Load Balancer (GSLB) and the Google Front End (GFE), so if you are not familiar with them, I recommend getting a grasp of that part first and then reading the items that follow (I would not say that I understand all of it).

Team Play with a Powerful and Independent Agent: A Full-Mission Simulation Study

This one is along the lines of the classic Ironies of Automation paper by Bainbridge. How can automation be a team player, and what happens when it isn’t?

Nadine Sarter and David Woods (original paper)
Thai Wood — Resilience Roundup (summary)

  • An article summarizing the paper. The blog author’s main points are as follows.
    ○ While this paper focuses on the A320 aircraft specifically, the authors are very clear that this is not an issue with this one aircraft or even just aviation in general, but is a problem that can occur when automation doesn’t act as a team player.
    ○ Automation acts as more of a “team player” when it acts in a way that allows humans to keep track of what it’s doing and anticipate what it will do in the future. The authors even suggest that this measure, how much of a team player a given technology is, can even help predict how effective and successful that it is.
    ○ Conversely, when automation isn’t a team player, automation can be clumsy and confusing.

Applying Chaos Engineering in Healthcare: Getting Started with Sensitive Workloads

How can you use chaos engineering when failures in the system can be critical and even life-threatening?

Carl Chesser — InfoQ

  • Chaos engineering has attracted growing interest in recent years, but teams often hesitate to adopt it when a system is viewed as “critical” or too important to fail.
    ○ A team at Cerner Corporation, a healthcare information technology company, shares what they found to be effective in introducing this practice to their systems.
  • The key takeaways are below.
    ○ While chaos engineering is a proven technique for improving the resilience of systems, there is often a reluctance amongst stakeholders to introduce the practice when a system is viewed as critical.
    ○ With critical systems, it can be a good idea to first run experiments in your dev/test type environments to minimize both actual and perceived risk. As you learn new things from these early experiments, you can explain to stakeholders that production is a larger and more complex environment which would further benefit from this practice.
    ○ Using real production traffic, as opposed to a synthetic workload, can improve the usefulness of the experiments in these early stages.
    ○ A good chaos engineering practice helps you to improve both the resilience of the system, and its observability when incidents do occur.

This is your Guide for Implementing SRE in NOCs

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Emily Arnot — Blameless

  • The following four points explain how to practice SRE in NOCs (Network Operations Centers) to improve their functions. I used to work on the NOC side, so it’s an interesting theme for me.
    ○ Monitor smarter by focusing on complex metrics
    ○ Escalate and triage with classification and on-call
    ○ Get the most from incidents with meaningful response
    ○ Manage ticketing from a customer-focused perspective

Is your microservice a distributed monolith?

This article suggests using chaos engineering to tell if your microservice-based architecture is secretly a monolith in disguise.

Andre Newman — Gremlin

  • It describes what the “distributed monolith” microservices anti-pattern is, why it should be avoided, and how to use chaos engineering to validate whether your application falls under this anti-pattern.


Failure information of each of the above companies

KubeWeekly #236 October 9th, 2020

The Headlines

Editor’s pick of the highlights from the past week.

Cloud Native Computing Foundation announces Rook graduation

Congratulations to Rook on hitting graduated status! Rook is an open source cloud native storage orchestrator for Kubernetes, providing the platform, framework, and support for a diverse set of storage solutions to natively integrate with cloud native environments. Rook delivers its services via a Kubernetes Operator for each storage provider. Originally accepted into CNCF in 2018, it is the 13th CNCF project — and the first based on block, file, or object storage — to graduate.

  • CNCF announced that Rook has become the 13th CNCF project to reach “Graduated” status.
  • Rook is an open source cloud native storage orchestrator for Kubernetes. It was accepted as a project by the CNCF in 2018.

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Project webinar: Transactional microservices with Vitess — coordination without state

Daniel Kozlowski, Minister of Engineering @PlanetScale

  • It shows how to use Cloud Native Database “Vitess” to create secure ACID distributed transactions that behave exactly like transactions on a monolithic system without increasing complexity.

CNCF Member webinar: Multi-cluster & multi-cloud service mesh with CNCF’s Kuma and Envoy

Marco Palladino, CTO & Co-Founder @Kong

  • It explains how to launch Kubernetes clusters in multiple regions and secure, route, connect, and monitor service connections with a distributed service mesh with the following points.
    ○ Use Kuma’s multi-zone deployment to spin up a multi-cluster and multi-region service mesh.
    ○ Leverage the global/remote control separation to scale reliability with HA.
    ○ Use the built-in service discovery and ingress capability for out of the box service connectivity across multiple zones, clusters and regions.
    ○ Use Kuma’s policy to determine the behavior of traffic across different clusters, like Traffic Route, mTLS, Traffic Permission and so on.

CNCF Member webinar: Kubernetes native two-level resource management for AI/ML workloads

Diana Arroyo Software Engineer @IBM Research and Alaa Youssef, Manager, Container Cloud Platform @IBM Research

  • It explains the challenges of resource management for AI/ML workloads in Kubernetes-based environments, and how two-level resource management using Multi-Cluster App Dispatcher addresses them.

CNCF Member webinar: The evolution of cloud orchestration systems from ephemeral to persistent storage

Boyan Krosnov, CPO @StorPool

  • It explains how various orchestration systems, such as public clouds, OpenStack, and Kubernetes, have followed a very similar path in evolving their storage capabilities, and why persistent storage is important.

CNCF Member webinar: Building dynamic machine learning pipelines with KubeDirector

Tom Phelan, Fellow, Software Organization @Hewlett Packard Enterprise, Kartik Mathur, Master Technologist @Hewlett Packard Enterprise and Donald Wake, Technical Marketing Engineer @Hewlett Packard Enterprise

  • Machine learning (ML) pipelines are complex to set up and even harder to maintain when the data model changes constantly, which calls for a dynamic ML pipeline. The webinar covers this through the following points.
    ○ Discussing an example ML Pipeline centered around supporting an application that must predict travel times based upon a large data set of taxi ride data.
    ○ Walking through the development of the full ML pipeline using Kubernetes and another Open Source application called KubeDirector. You will learn how to train, register, and finally, query your model for answers.
    ○ Learning how a new capability of KubeDirector called “Connections” enables a dynamic, always up-to-date ML model.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Identity federation for multi-cluster Kubernetes and service mesh

Dennis Jannot,

  • A blog series article detailing specific challenge areas of multi-cluster Kubernetes and service mesh architectures, as well as considerations and approaches to solving them.
  • This installment explains how to federate identities across multiple clusters for authentication between services.
  • Two types of authentication are required in this environment, and this article focuses on “authentication between services”. The other type, “end-user authentication”, is covered in a separate article.

Call an existing REST service with Apache Camel K

Mary Cochran, Red Hat

  • It shows how to call an existing REST service and create a Camel K integration that uses an existing data format.

An architect’s guide to APIs: REST, GraphQL, and gRPC

Bob Reselman, Code District

  • It describes the history of data exchange and compares REST, GraphQL, and gRPC APIs with each other.

Self-hosted Github Actions runners in Kubernetes

Vito Botta, Dynablogger

  • As the title says, it explains how to run self-hosted GitHub Actions runners on Kubernetes using the simple official Docker in Docker image.

Kubernetes integration and more in odo 2.0

Serena Chechile Nichols and Steve Speicher, Red Hat

  • It introduces the highlights of the v2.0 release of “odo”, the developer CLI for OpenShift and Kubernetes.

Provisioning Kubernetes clusters on AWS with Terraform and EKS

Kristijan Mitevski, LiveDOOH

  • The content is as described in the title and the TL;DR below. The careful, step-by-step explanations make you want to try it yourself. I bookmarked it.
    ○ TL;DR: In this guide, you will learn how to create clusters on the AWS Elastic Kubernetes Service (EKS) with eksctl and Terraform. By the end of the tutorial, you will automate creating three clusters (dev, staging, prod) complete with the ALB Ingress Controller in a single click.

Running Kubeflow Pipelines

Saiyam Pathak, Civo

  • It explains how to install standalone Kubeflow Pipelines on a Civo managed k3s cluster. I got the impression that there are more articles about ML pipelines this week.

How to perform a CNI live migration from Flannel+Calico to Cilium

Josh Van Leeuwen, Sky Betting and Gaming

  • It explains why they needed to change their CNI, what the author learned while developing a live migration solution, and how everything works.
  • I thought I had seen one of the diagrams somewhere before, so I checked the previous article and found that I had covered it in my blog last month.

How to set up PostgreSQL monitoring in Kubernetes

Jonathan S. Katz

  • The content is as the title says. It seems that they will publish follow-up articles on how to interpret some of the visualizations that come with the PostgreSQL operator monitoring stack.

storax/kubedoom: The next level of chaos engineering is here! Kill pods inside your Kubernetes cluster by shooting them in Doom!

  • The GitHub page of “Kube DOOM”, a tool for shooting and killing pods in a Kubernetes cluster from within the game “DOOM”.
  • Just as I was wondering whether anyone would really do this in 3D, I found that some people are actually playing it, as shown in a YouTube video.

What’s new in Goldilocks 3.0.0

Andy Suderman, Fairwinds

  • It explains the changes in v3.0.0 of Goldilocks, a tool for reviewing recommendations for setting resource requests and limits on Kubernetes deployments, and shares the story of how they got there.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Kubecost, with Webb Brown

Adam Glick and Craig Box, Kubernetes Podcast from Google

Helm Hub is moving to Artifact Hub

Matt Farina, Rancher Labs and Helm

  • The announcement of the migration from Helm Hub to Artifact Hub.
  • The reasons given for the migration are as follows.
    ○ Helm Hub was built to handle a limited number of Helm repositories, and was designed for use cases slightly different from publicly listing as many chart repositories as possible.
    ○ As the number of repositories increased, it ran into some limitations.

Introducing DigitalOcean App Platform

Apurva Joshi, Digital Ocean

  • An introductory article about DigitalOcean’s new PaaS platform, “DigitalOcean App Platform”. A video explaining the steps up to deploying an application is also embedded.
  • Someone has already tried it and written a blog post in Japanese, which I think will be helpful as well: “DigitalOcean App Platformを試してみる”.

What’s New in OpenShift 4.6

Red Hat OpenShift Product Management Team

  • A YouTube video in which members of Red Hat’s OpenShift Product Management Team take turns explaining the changes in OpenShift 4.6 in detail for nearly two hours.

Call for questions! sig-HONK AMA KubeCon NA keynote panel

Ian Coldwater, Salesforce

Carbon-aware Kubernetes

Bill Johnson, Microsoft

  • It shows an example of how to extend the Kubernetes scheduler to take advantage of the natural variation in the carbon intensity of existing power grids, in order to minimize the amount of carbon in the atmosphere that Kubernetes clusters are responsible for. I was so surprised by the theme that I read the title twice.
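
The core ranking idea can be sketched in a few lines of Python. This is not the scheduler extension described in the article, and it does not touch the Kubernetes API. The region names and carbon intensity figures are made-up sample values, and the function is a hypothetical helper that simply prefers the candidate whose grid currently has the lowest carbon intensity.

```python
def pick_greenest(candidates, carbon_intensity):
    """Return the candidate with the lowest known carbon intensity (gCO2eq/kWh)."""
    # Ignore candidates without data rather than guessing a value for them.
    known = [c for c in candidates if c in carbon_intensity]
    if not known:
        raise ValueError("no candidate has carbon intensity data")
    # Break ties by name so the result is deterministic.
    return min(known, key=lambda c: (carbon_intensity[c], c))


# Hypothetical regions and sample readings, for illustration only.
sample_intensity = {
    "region-north": 120.0,
    "region-west": 310.5,
    "region-south": 95.2,
}

choice = pick_greenest(
    ["region-north", "region-west", "region-south", "region-unknown"],
    sample_intensity,
)
print(choice)
```

A real scheduler would combine a score like this with the usual constraints such as capacity, latency, and data locality. The sketch only shows the carbon-based preference on its own.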

Service meshes, Envoy, and WebAssembly with Christian Posta

Christian Posta, and Chris Short, Red Hat

  • Christian Posta, Global Field CTO of and former Chief Architect of Red Hat, joined as a guest and answered various questions on the theme, such as “Why are there so many service mesh products?” and “Why do so many products use Envoy as their data plane?”.

Open Service Mesh (OSM) is hosting first community call on Oct 13

OSM Community

  • The GitHub page of OSM (Open Service Mesh), a lightweight and extensible cloud-native service mesh.
  • The OSM project is based on the ideas and implementations of many cloud-native ecosystem projects such as Linkerd, Istio, Consul, Envoy, Kuma, Helm, and SMI specifications.

Upcoming CNCF webinars

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Member Webinar: You can be a Kubernetes contributor too!
Jeremy L. Morris, Software Engineer @DigitalOcean
Oct 13, 2020 10:00 AM Pacific Time

Member Webinar: A full application environment for every PR–before you merge to master!
Vishal Biyani, CTO @InfraCloud
Jono Spiro, Staff Software Engineer, Engineering Operations @OpenGov
Oct 14, 2020 10:00 AM Pacific Time

Member Webinar: GitOps at scale for a multicloud, multi-region stateful application
Rick Spencer, Head of Platform @InfluxData
Oct 14, 2020 1:00 PM Pacific Time

Member Webinar: S&P experience report: multicloud serverless on Knative
Evan Anderson, Software Engineer @VMware
Mark Wang, Head of Cloud Engineering @S&P Global Ratings
Oct 15, 2020 10:00 AM Pacific Time

Member Webinar: Delivering cloud native apps to Kubernetes using werf
Dmitry Stolyarov CTO, @Flant
Oct 16, 2020 10:00 AM Pacific Time

Member Webinar: How to migrate NF or VNF to CNF without vendor lock-in
Grzegorz Sikora, VP Business Development @OVOO
Oct 20, 2020 10:00 AM Pacific Time

Member Webinar: Deploying Kubernetes to bare metal using cluster API
Seán McCord, Principal Senior Software Engineer @Talos Systems, Inc.
Oct 21, 2020 1:00 PM Pacific Time

Member Webinar: K8s audit logging deep dive
Randy Abernethy, Managing Partner @RX-M
Oct 22, 2020 10:00 AM Pacific Time

Member Webinar: Building 12 factor streaming data apps on Kubernetes
Stelios Charmpalis, Frontend Engineer
Francisco Perez, Senior Backend Engineer
Oct 23, 2020 10:00 AM Pacific Time

Member Webinar: Admission controllers: one part of your Kubernetes security and governance toolkit
Gunjan Patel, Cloud Architect @Palo Alto Networks
Robert Haynes, Cloud Security Evangelist @Palo Alto Networks
Oct 28, 2020 7:00 AM Pacific Time

Member Webinar: Security in the world of service meshes
John A. Joyce, Principal Engineer @Cisco
Nov 4, 2020 1:00 PM Pacific Time

Member Webinar: Developer-friendly platforms with Kubernetes and infrastructure as code
Lee Briggs, Staff Software Engineer @Pulumi
Nov 6, 2020 10:00 AM Pacific Time

Member Webinar: Metal³: Kubernetes-native bare metal host management
Maël Kimmerlin, Senior Software Engineer @Ericsson Software Technology
Dec 10, 2020 10:00 AM Pacific Time

What did you think of these articles? Did any of them catch your interest?

There is some content here that I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece (1986–1992). #Network, #Kubernetes, #CKA, #CKAD, #Certified AWS SAP
