SRE / DevOps / Kubernetes Weekly Collection#24(Week 29)

  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, with some comments as an aide-memoire and useful links.
  • I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #498 July 12th, 2020
SRE Weekly Issue #227 July 12th, 2020
KubeWeekly #225 July 17th, 2020

DEVOPS WEEKLY ISSUE #498 July 12th, 2020

A good discussion of antipatterns when trying to implement continuous delivery practices, looking at under-investing in build automation.

  • The title is “Addressing Build & Deploy Anti-patterns for Continuous Delivery Success”.
  • The author explains that “Everyone needs to be involved and actively listening in on this feedback loop” in order to address anti-patterns in continuous delivery.
  • In conclusion, he writes “Software needs to be delivered. Software needs to be used.” and “Real users will tell us if the software fulfills their requirements.”

A post on Porter, a load balancer for Kubernetes designed for bare metal environments. Good introduction to load balancing in Kubernetes and BGP.

  • The title is “Porter: An Open Source Cloud Native Load Balancer in CNCF Landscape”.
  • An article that describes the OSS load balancer “Porter” for bare metal Kubernetes clusters.
  • Porter entered the CNCF Landscape last week; the article also touches on its parent project, KubeSphere.

I mentioned Vector last week in the tools section. Anyone interested in a new monitoring tool should find this post interesting. Basic installation and usage instructions, and some observations about design choices.

A detailed walk through of the various options and moving parts related to networking in Kubernetes.

  • The title is “Kubernetes and Networks-why is this so dang hard?”.
  • Tim Hockin of Google explains why Kubernetes networking is so difficult, in slides published on Speaker Deck.
  • Each slide adds information on top of the same base figure, which makes it easy to follow. The trade-offs of each structure are explained carefully.

An argument for the importance of operators and sidecars in the distribution of software.

  • The title is “Operators and Sidecars Are the New Model for Software Delivery”.
  • The original article was published on The New Stack. It describes the Operator model, the sidecar model, and the transition from a single runtime to a multi-runtime application architecture.

A look at the btrfs file system, specifically at features, reliability and performance and a large scale case study.

  • The title is “Btrfs at Facebook”.
  • LWN was a site I had not known about, but it describes itself as follows: “Over the years LWN has grown with Linux and become one of the definitive Linux news sites.” It was interesting that Facebook occasionally takes entire data centers down to test how well its disaster-recovery mechanism works and to identify and fix problems.

An interesting look at the low-level kernel details when it comes to accepting a TCP connection, exploring ways of scaling services before the connection is made.

  • The title is “Scaling Linux Services: Before accepting connections”.
  • An article summarizing what the author found while investigating errors in a GitHub production system.
  • It focuses on some standard resource limits that apply before the client socket is passed to the application. I will read it again.
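
As a note to myself on what "before the client socket is passed to the application" means, here is a minimal sketch using Python's standard `socket` module. The function names `open_listener` and `demo` are my own, and this is not code from the article. It only shows that a client's `connect()` can complete while the server has not yet called `accept()`, because the kernel queues the connection; the `backlog` argument is a hint for the size of that queue, and on Linux the kernel may cap it (for example via `net.core.somaxconn`).

```python
import socket


def open_listener(host: str = "127.0.0.1", port: int = 0, backlog: int = 128) -> socket.socket:
    """Create a TCP listening socket with an explicit accept-queue hint."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))  # port 0 lets the OS pick a free port
    # backlog is only a hint; the kernel may silently cap it.
    srv.listen(backlog)
    srv.settimeout(2)  # avoid blocking forever in accept()
    return srv


def demo() -> bool:
    """Connect first, accept later: the kernel holds the connection in between."""
    srv = open_listener(backlog=4)
    try:
        # connect() returns once the kernel completes the handshake,
        # even though the application has not called accept() yet.
        client = socket.create_connection(srv.getsockname(), timeout=2)
        conn, _addr = srv.accept()  # only now does the application see the socket
        conn.close()
        client.close()
        return True
    finally:
        srv.close()
```

If the queue is full, new connections wait or fail before the application ever sees them, which is the kind of limit the article looks at.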

A new Choria subproject called Scout, focused on components for building system health monitoring systems.

  • The title is “Introducing Choria Scout”.
  • Click here for the article introducing Scout components.
  • The new project “Choria Scout” is “a highly scalable system health monitoring framework and monitoring data pipeline released under the Apache 2.0 license”.

Dockle is a Docker image scanning tool that helps enforce best practices when building images.

  • The GitHub page of the OSS tool “Dockle” with the following three features.
  1. Container image linter for security
  2. Help build Docker images based on best practices
  3. Easy to get started
  • I encountered some images of contributors I’ve seen somewhere.

Krane is a simple Kubernetes RBAC static analysis tool. It identifies potential security risks in K8s RBAC design and makes suggestions on how to mitigate them.

  • The simple Kubernetes RBAC static analysis tool “Krane” GitHub page.
  • It identifies potential security risks in the K8s RBAC design and suggests the ways to mitigate them.
  • The Krane dashboard shows the current RBAC security status and guides you through its definition.

SRE Weekly Issue #227 July 12th, 2020

A Terrible, Horrible, No-Good, Very Bad Day at Slack

This is the first of a pair of articles this week on a major Slack outage in May. This one explores the technical side, with a lot of juicy details on what happened and how.

Laura Nolan — Slack

  • An article that details Slack’s first serious outage on May 12, 2020 from the technical side.
  • The top highlight says, “The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time, and didn’t require any change.” It is a terrifying sentence from the perspective of managing infrastructure. However, the author’s sincere attitude is conveyed throughout the article, including the closing words.
  • It also touches on the ongoing migration from HAProxy to Envoy Proxy, so I hope that another article will be published in the future.

All Hands on Deck

This is the companion article that describes Slack’s incident response process, using the same incident as a case study.

Ryan Katkov — Slack

  • An article that describes “the response process” for the same outage as the Slack article above. I’m impressed with Slack’s established culture and philosophy that puts lessons from outages and response processes into feedback loops.

Improving Incident Retrospectives at Indeed

The author saw room for improvement in the retrospective process at Indeed. The article explains the recommendations they made and why, including de-emphasizing generating remediation items in favor of learning.

Alex Elman

  • Drawing on the “Incident Retrospectives” conducted at Indeed, where the author works, this article describes how he set out to improve the process, noting that “It became apparent to me that we were not using every incident to realize our full potential to learn.”

Google Cloud Networking Incident #20005 Follow-Up

The datacenter was purposefully switched to generator power during planned power maintenance, but unfortunately the fuel delivery system failed.

  • The Google Compute Engine (GCE)-based components in the us-east1-c and us-east1-d zones were affected, due to a fuel supply system failure that occurred during maintenance by a power equipment provider in the us-east1 region.

Towards More Effective Incident Postmortems

This is a good primer on the ins and outs of running a post-incident analysis.

Anusuya Kannabiran — Squadcast

  • An article summarizing the importance of effective incident postmortems, the perspectives to cover, and six steps for writing them:
  1. Start with an incident timeline
  2. Conduct a postmortem meeting with anyone internal to the team who was affected by the incident
  3. Define roles and owners along with having a moderator
  4. Determine the urgency of an incident by setting the right thresholds
  5. Devil’s in the Details — incident metrics and other key information captured
  6. Publish and track post mortems promptly

Setting SLOs: observability using custom metrics

This article goes through an interesting technique for setting up SLO metrics and alerts in GCP using Terraform and OpenCensus.

Cindy Quach — Google

  • An article by Google Cloud that explains “How to create SLOs out of the box with service monitoring, and how to create SLOs for custom metrics to get better observability for your customer-focused metrics.”, with a worked example using Terraform and OpenCensus.
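
To remind myself of the arithmetic behind a request-based SLO, here is a small sketch in plain Python. It is not the Terraform or OpenCensus code from the article and does not use any Google Cloud API; the function names `sli` and `error_budget_remaining` and the sample numbers are my own assumptions.

```python
def sli(good: int, total: int) -> float:
    """Service level indicator: the fraction of requests that were good."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no bad requests so far, 0.0 means the budget is exactly
    used up, and a negative value means the SLO has been missed.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1 - slo_target) * total  # bad requests the SLO tolerates
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 requests, 5 of them bad, 99.9% target.
    print(sli(9995, 10000))
    print(error_budget_remaining(9995, 10000, 0.999))
```

With a 99.9% target over 10,000 requests, about 10 bad requests are tolerated, so 5 bad requests leave roughly half of the budget.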

Introducing the GitHub Availability Report

GitHub is committing to publishing a report on their availability each month with detail on incidents. This intro includes the reports for May and June with a description of 4 incidents.

Keith Ballinger — GitHub

  • An article in which GitHub introduces the “Availability Report”, which it is publishing as part of its new investments.
  • They issue a GitHub availability report (including a description of any incidents that may have occurred and update you on how they are evolving their engineering systems and practices in response) on the first Wednesday of each month.
  • They also said that “You should expect these updates to include a summary of what happened, as well as a technical explanation for incidents where we believe the occurrence was novel and contains information that helps engineers around the world learn how to improve product operations at scale.” and I appreciate it and will use it as a reference in the future.

Blameless’ SRE Journey

This is neat: Blameless transitioned from “startup mode” toward an SRE methodology, becoming customer 0 of their own product in the process.


  • An article explaining Blameless’s SRE efforts and how they implemented important best practices.
  • I had thought that Blameless’s articles are persuasive and realistic even when explaining principles and theories. Now I see that this is because they overcame difficult situations by implementing and improving SRE best practices.


KubeWeekly #225 July 17th, 2020

Editor’s pick of the highlights from the past week.

Announcing Fluent Bit v1.5: Lightweight and High-performance Log Processor

Fluent Bit, a sub-project under the umbrella of CNCF graduated project Fluentd, has reached its version v1.5.

One of the biggest highlights of this major release is the joint work of different companies contributing with Fluent Bit core maintainers to bring improved and new connectors for observability cloud services provided by Google, Amazon, LogDNA, New Relic and Sumo Logic within others.

Learn more about the latest release here or by joining the webinar on July 17 at 1pm PT.

  • An article announcing that Fluent Bit, a subproject of Fluentd, has reached version 1.5.
  • The highlight was the joint work of different companies contributing with Fluent Bit core maintainers to bring improved and new connectors for observability cloud services provided by Google, Amazon, LogDNA, New Relic and Sumo Logic, among others. The following three new output connectors were added.
    ○ Amazon Cloudwatch Logs
    ○ LogDNA
    ○ New Relic

KubeCon + CloudNativeCon EU Virtual — Complimentary pass option!

This week, we launched a complimentary pass option for KubeCon + CloudNativeCon EU Virtual. Unsure which pass is the best option for you? We’ve outlined the benefits of the full pass and complimentary option below.

A reminder that the complimentary pass offer is only available through August 7, 23:59 CEST, so act fast!

  • Complimentary pass option announcement for KubeCon + CloudNativeCon EU Virtual.
  • You can join All Keynote Sessions, Sponsor Showcase, Sponsor Demo Theater, and Engage with Project Maintainers + Leads with this option.
  • The Full Event Pass includes 50% off the training + exam bundle for CKA or CKAD, so it is beneficial for those who plan to take the exam in the future.

KubeCon + CloudNativeCon EU Virtual Session Spotlight

The countdown to KubeCon + CloudNativeCon EU Virtual on August 17–20, 2020 is on! As we approach the event, we curated a few recommended sessions that we don’t want you to miss. Please see the feature for this week and be sure to register today!

DayZero Co-located Event: NSMCon

Hosted by the Network Service Mesh Community

August 17, 2020

Why Attend NSMCon? Are you running workloads in multiple clusters? Across multiple clouds: on-premises, hybrid, multicloud, or public cloud? Do they need to interact with legacy workloads running in less “cloudy” environments? Network Service Mesh (NSM) ties them all together, at the granularity of individual workloads, not cluster/VPCs/data centers. NSM is a community-driven, CNCF Sandbox project that is rapidly gaining momentum because of its ability to simplify connectivity between workloads, independent of where they are running. It extends an IP reachability domain to workloads running in multiple clusters, legacy environments, on-premises, or in a public cloud, communicating with the protocols they are currently using.

Join the people building and using NSM at NSMCon for a day of tutorials, deep dives, and use cases to learn how NSM works, what it can do for you, and, most importantly, what’s coming next.

Register now!

  • The KubeCon + CloudNativeCon EU Virtual spotlight is on the “Day Zero Co-located Event: NSMCon” session. Schedule: starts 8/17 (Monday) at 13:00 CEST (Central European Summer Time).
  • A one-day event from the Network Service Mesh (NSM) community with tutorials, deep dives, and use cases to learn how NSM works, what it can do, and, most importantly, what is coming next.

Weekly recap of CNCF member and project webinars that you might have missed.

You can view all CNCF recorded and upcoming webinars here.

CNCF Member Webinar: Securing and Accelerating the Kubernetes CNI Data Plane with Project Antrea and NVIDIA Mellanox ConnectX SmartNICs

Antonin Bas, Maintainer of Project Antrea and Staff Engineer @VMware Moshe Levi, Sr. Staff Engineer @NVIDIA and Itay Ozery, Director, Product Management for Networking @NVIDIA

CNCF Member Webinar: How Alibaba Extends K8s scheduler to support AI and big data workloads

Zhang Kai, Staff Engineer @Alibaba & Wang Qingcan, Senior Engineer @Alibaba

  • The language for this webinar is Chinese.
  • The engineers from Alibaba explained the Kubernetes scheduling framework, how to support complex workloads such as AI/ML and mixed resource environments such as GPUs, how to implement job-level scheduling policies, and so on.

CNCF Member Webinar: Advancing image security and compliance through Container Image Encryption!

Brandon Lum, Senior Software Engineer @IBM

  • It explained Container Image Encryption, a recently introduced feature.

CNCF Member Webinar: Serving Millions of Customers with Cloud Native and DevSecOps

Chris Hollies, CTO, Oracle Practice @Capgemini Akshai Parthasarathy, Principal Director, Cloud Native and DevOps @Oracle Cloud

  • It explained how they re-architected the system of their client (a leading financial organization with millions of customers) by applying Cloud Native and DevSecOps.

CNCF Member Webinar: Kubernetes and storage. Kubernetes for storage. An overview.

Kiran Mova, Chief Architect at MayaData and core maintainer of OpenEBS @MayaData

  • Kiran Mova, chief architect and community leader for the CNCF project OpenEBS, discussed “Container Attached Storage”, an approach that enables storage on a per-workload basis with Kubernetes itself.
  • There is also a brief demo explaining how to operate OpenEBS’s Container Attached Storage with other open source software and services.

Tutorials, tools, and more that take you on a deep dive into the code.

A visual guide to Lens: A new way to see Kubernetes

Jessica Cherry,

  • An article that describes Lens as an IDE (Integrated Development Environment) tool that manages Kubernetes clusters with a GUI instead of a CLI.

Runtime Code Profile for Kubernetes Operators

Dominique Vernier, Red Hat

  • An article that describes how to create operators with profile reporting and deploy them in a Kubernetes environment to check code coverage.

Continuous Deployment to Kubernetes with Gitea and Drone

Christine Dodrill

Kubernetes NodePort and iptables rules

Ronak Nathani, LinkedIn

  • The author digs a little deeper into the iptables rules for NodePort type Services, sharing what he learned and answering the following questions he encountered while working with kube-proxy.
  • “What happens when a non-kubernetes process starts using a port that’s allocated as a NodePort to a service?”
  • “Does the service endpoint continue to route traffic to pods if kube-proxy (configured in iptables mode) process dies on the node?”

OIDC issuer discovery for Kubernetes service accounts

Nandor Kracser, Banzai Cloud

  • As the author prefaced, “Today’s post is going to be rather technical, since we’ll be discussing authenticating Kubernetes applications with external systems through OIDC issuer discovery.”, he explains the topic with easy-to-read diagrams and code blocks.
  • He uses Vault on Kubernetes as an OIDC (OpenID Connect) consumer, with a simple client application running in the cluster that accesses a Vault instance using its ServiceAccount token.
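
For my own reference, the discovery step can be sketched as below. OpenID Connect Discovery publishes provider metadata at the issuer URL plus `/.well-known/openid-configuration`, and the metadata contains an `issuer` field and a `jwks_uri` field pointing at the signing keys. The helper names `discovery_url` and `jwks_uri_for` and the `issuer.example.com` URLs are hypothetical, and this sketch does no network calls or token verification; it is not the code from the article.

```python
def discovery_url(issuer: str) -> str:
    """Build the OIDC discovery document URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def jwks_uri_for(issuer: str, discovery_doc: dict) -> str:
    """Return the JWKS URL from a discovery document after a basic sanity check.

    The issuer inside the document must match the issuer we asked for;
    otherwise the keys could belong to somebody else.
    """
    if discovery_doc.get("issuer") != issuer:
        raise ValueError("issuer mismatch in discovery document")
    return discovery_doc["jwks_uri"]


if __name__ == "__main__":
    issuer = "https://issuer.example.com"  # hypothetical issuer URL
    print(discovery_url(issuer))
    # A hand-written stand-in for the JSON a real issuer would serve.
    doc = {"issuer": issuer, "jwks_uri": issuer + "/keys"}
    print(jwks_uri_for(issuer, doc))
```

A consumer such as Vault would fetch the keys from the returned URL and use them to verify the signature of the ServiceAccount token.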

5G Core Deployment Using Kubermatic KubeOne


  • An article that describes how to deploy a 5G core PoC using the KubeOne project, implemented in an AWS environment.

[Tutorial] How to Set Multiple Rate Limits per Client ID with Envoy Proxy

Betty Junod,

  • It is a tutorial as the title says, with a sample YAML configuration file and a demo video.
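
To get a feel for what "multiple rate limits per client ID" means, here is a toy sketch in Python. It is not Envoy configuration and not how Envoy implements rate limiting; I swapped in a simple fixed-window counter, and the class name `MultiLimit`, the limits, and the client IDs are all my own assumptions. A request is allowed only if it fits under every limit for that client.

```python
from collections import defaultdict


class MultiLimit:
    """Fixed-window rate limiter that enforces several limits per client."""

    def __init__(self, limits: dict[int, int]) -> None:
        # limits maps window length in seconds -> max requests in that window
        self.limits = dict(limits)
        self.counts: defaultdict[tuple[str, int, int], int] = defaultdict(int)

    def allow(self, client_id: str, now: float) -> bool:
        # One counter per (client, window length, window index).
        keys = [(client_id, window, int(now // window)) for window in self.limits]
        if any(self.counts[key] >= self.limits[key[1]] for key in keys):
            return False  # at least one limit is already exhausted
        for key in keys:
            self.counts[key] += 1
        return True


if __name__ == "__main__":
    # Hypothetical limits: 2 requests per second and 3 requests per minute.
    limiter = MultiLimit({1: 2, 60: 3})
    for t in (0.0, 0.1, 0.2, 1.0, 2.0):
        print(t, limiter.allow("client-a", now=t))
```

The clock is passed in as `now` so the behavior is easy to reason about; old counters are never cleaned up here, which a real limiter would need to do.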

Multi-Cluster Service Discovery in Kubernetes and Service Mesh

Engineering Team

  • The first post in the blog series. They started with “We will dig into specific challenge areas for multi-cluster Kubernetes and service mesh architecture, considerations and approaches in solving them. For our first post, we’ll focus on service discovery and how we need to approach it for multi-cluster environments.”

Building Kubernetes Operators with Ansible Workshop

Chris Short, Matt Dorn, and Michael Hrivnak, Red Hat

  • Twitch video at Red Hat Summit titled “Building Kubernetes Operators with Ansible Workshop”.

Building images using Podman and cron

Tom Sweeney (Red Hat)

  • The author explains a problem he recently experienced and its solution: “I recently experienced a problem where image version information wasn’t updating after a build trigger event. This issue left us with out-of-date images. Clearly, not acceptable.”

About Kubernetes Authentication

Sudip Sengupta, Hackernoon

  • The three steps Kubernetes uses to enforce security access and permissions are Authentication, Authorization, and Admission, and this article focuses on Authentication.

WireGuard on Kubernetes with Adblocking

Ameya Shenoy, BrowserStack

  • The story of the author trying to block ads at the DNS level using WireGuard VPN to improve the web browsing experience and privacy of the family.
  • He uses the uBlock Origin extension in his browser to block ads.

Articles, announcements, and more that give you a high-level overview of challenges and features.

5 Problems with Kubernetes Cost Estimation Strategies

Robert Brennan, Fairwinds

  • They discussed the 5 key issues that impact cost estimates, and looked at the strategies used to overcome these issues in Fairwinds Insights cost dashboards.
  1. Bin Packing
  2. Resource Asymmetry
  3. Cost Attribution
  4. Resource Ranges
  5. Noisy Neighbors
  • They said that “We’ll present a more in-depth review of right sizing in a forthcoming blog.”

Scaling the Hottest Apps in Tech on AWS and Kubernetes

Forrest Brazeal, A Cloud Guru

  • The story of the EKS migration of Basecamp’s mail application “HEY”. Topics such as the migration from GKE to EKS and the use of Spot Instances were interesting.
  • HEY is getting talked about more and more, but I haven’t tried it yet, so I decided to give it a try.

Application Configuration Management in Kubernetes

Puja Abbassi, Giant Swarm

  • The first article in the blog series. It takes a closer look at some of the most popular and interesting technologies aimed at addressing configuration management issues.
  • It explores the following tools to see what they offer and to understand their pros and cons. The Helm Package Manager and Kustomize installments had already been published (as of August 7th, 2020), so if you are interested, you can click the link below.
    ○ Helm Package Manager
    ○ Kapitan
    ○ Tanka
    ○ Burmese
    ○ Capt

Chaos Engineering for a More Secure Kubernetes

Mike Elsmore,

  • It analyzes the principles of chaos engineering, explains how and why it works, and describes how to use chaos engineering to protect your Kubernetes environment.

What Are the Hardest Parts of Kubernetes to Learn?

Gedalyah Reback,

  • Through explaining some of the core concepts behind Kubernetes, it focuses on providing five resources to help you get it right.
  1. Cluster Administration
  2. Networking
  3. Kubernetes Storage
  4. Learn Kubernetes Security
  5. Kubernetes Observability

SIG-Windows Spotlight

Kaslin Fields, Kubernetes Upstream Marketing Team

  • With a spotlight on “SIG-Windows”, it tells the story of how Kubernetes contributors work together to provide a container orchestrator that works for both Linux and Windows.

LOTE #13: Dave Sudia on Kubernetes Local Dev, Building a PaaS, and Platform Personas

Ambassador Podcast

  • A transcript of the “Livin’ on the Edge” podcast. This week’s guest is Dave Sudia (Senior DevOps Engineer at GoSpotCheck).
  • They talked about creating an effective local developer experience for Kubernetes, migrating from Heroku, building a Kubernetes-based platform as a service (PaaS), and more.
  • The story of “How his team developed an understanding of all of the personas involved with creating a platform” was interesting.

API Security: best practices to protect your digital channels

Mia Platform Blog

  • Starting with “We are living in the API economy.”, they talked about “API Security, what it is and how to approach it” and explained best practices with infrastructure strategies.

Kubernetes Disaster Recovery using Kasten K10 platform

Hrishikesh Deodhar, InfraCloud Technologies

  • It describes the “K10 Data Management Platform” by Kasten.
  • The platform functions as a way to perform backup/restore of Kubernetes applications and their volumes.

Announcing the New Special Interest Group on Contributor Strategy


  • CNCF announced the start of a new SIG (Special Interest Group), “CNCF SIG Contributor Strategy”.
  • Check Charter and Contributing Guide for details.
  • Chairs are Paris Pittman, Josh Berkus, and Stephen Augustus.
  • They meet bi-weekly on Thursdays for a one-hour regular meeting starting at 10:30 a.m. PT (17:30 UTC).

Upcoming CNCF webinars

You can check some Recorded Webinars and Upcoming Webinars here. The following are posted as Upcoming CNCF webinars at that moment.

Member Webinar: Kubernetes Security Anatomy and the Recently Disclosed CVEs
Gadi Naor, CTO & Co-Founder @Alcide
July 21, 2020 10:00 AM Pacific Time

Ambassador Webinar: Building application management platform with Open Application Model
Lei Zhang (Harry), Staff Engineer @Alibaba
This webinar will be delivered in Chinese.
July 22, 2020 10:00 AM China Standard Time

Member Webinar: Kubernetes Secrets Management: Build Secure Apps Faster Without Secrets
Jody Hunt, Director of DevOps Security @CyberArk
July 22, 2020 7:00 AM Pacific Time

Member Webinar: Implementing Canary Releases on Kubernetes w/ Spinnaker, Istio, and Prometheus
Oleg Chunikhin, CTO @Kublr
July 22, 2020 1:00 PM Pacific Time

Member Webinar: Observability of multi-party computation with OpenTelemetry
Antoine Toulme, Engineering Manager @Splunk
Dave McAllister, Sr. Technical Evangelist @Splunk
July 23, 2020 10:00 AM Pacific Time

Member Webinar: One large cluster or lots of small ones? Pros, cons and when to apply each approach
Flavio Castelli, Distinguished Engineer @SUSE
July 24, 2020 10:00 AM Pacific Time

Member Webinar: Kubernetes Policies 101
Eran Leib, Founder, VP Product Management @Apolicy
Spenser Paul, Director of Sales, North America @DoiT International
July 28, 2020 10:00 AM Pacific Time

Member Webinar: GitOps Continuous Delivery with Argo and Codefresh
Dan Garfield, Chief Technology Evangelist @Codefresh
July 29, 2020 1:00 PM Pacific Time

Member Webinar: Cluster API — Yesterday, Today, Tomorrow
Saad Malik CTO & Co-Founder @Spectro Cloud
Jun Zhou Chief Architect @Spectro Cloud
July 30, 2020 10:00 AM Pacific Time

Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
TiKV team
July 31, 2020 10:00 AM Pacific Time

Member Webinar: Comparing eBPF and Istio/Envoy for Monitoring Microservice Interactions
Roko Kruze, Solutions Engineer @Flowmill
Mike Cohen, Co-Founder and COO @Flowmill
Aug 4, 2020 10:00 AM Pacific Time

Member Webinar: Debugging your debugging tools; What to do when your service mesh goes down in production?
Neeraj Poddar, Co-founder and Chief Architect @Aspen Mesh
Aug 5, 2020 7:00 AM Pacific Time

Member Webinar: Making Data Work for Developers with Kubernetes & Cassandra
Chris Splinter, Sr. Product Manager — Developer Solutions @DataStax
Patrick McFadin, VP of Developer Relations @DataStax
Aug 5, 2020 1:00 PM Pacific Time

Member Webinar: Maximizing M3 — Pushing performance boundaries at scale in a cloud-native distributed metrics engine
Ryan Allen, Senior Software Engineer @Chronosphere
Aug 6, 2020 10:00 AM Pacific Time

Member Webinar: Hardware for Kubernetes, Peeling Back the Layers
Erik Reidel, SVP Compute & Storage Solutions @ITRenew
Aug 11, 2020 10:00 AM Pacific Time

Project Webinar: Kubernetes 1.19
Kubernetes release team
Aug 28, 2020 10:00 AM Pacific Time

Member Webinar: Getting started with container runtime security using Falco
Loris Degioanni, CTO and Founder @Sysdig
Sept 2, 2020 1:00 PM Pacific Time

How about those articles? Did any of them catch your interest?

To be honest, there is some content that I cannot digest at this stage, so I’ll use this aide-memoire and the links to catch up myself too.

Bye now!!

Yoshiki Fujiwara

Written by

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #GCP, #AWS SAP, #National Tour Guide for English
