SRE / DevOps / Kubernetes Weekly Collection #32 (Week 37)

  • In this blog post series, I go through the following three weekly mailing lists I subscribe to and leave some comments and useful links as an aide-memoire.
  • I have already published the same content on my Japanese blog, and I am catching up in English in this series.
  • I hope it serves as a reference for people looking for this kind of information.

DEVOPS WEEKLY ISSUE #506 September 6th, 2020
SRE Weekly Issue #234 September 6th, 2020
KubeWeekly # 232 September 11th, 2020

DEVOPS WEEKLY ISSUE #506 September 6th, 2020

A good inside view of a series of incidents. Load balancer firmware, network connectivity issues and how it feels to be on the other side of an emerging incident.

  • The title is “Inside a CODE RED: Network Edition”.
  • A blog post by Basecamp that looks back on three outages.
  • It offers a deeper, more personal behind-the-scenes perspective that complements his colleague Jeremy’s post.
  • This post is meant for both people who want a deeper, technical understanding of the outage, as well as some insight into the human side of incident management at Basecamp.

A set of principles for managing feature toggles in teams, from making them visible to ensuring they are short lived.

  1. Feature toggles should be flippable
  2. Feature toggles should be used by default
  3. Feature toggles should be added per story
  4. Feature toggles should be visible
  5. Feature toggles should be short-lived
  6. Feature toggles should be tested
  • I’m in the GitFlow camp, and I couldn’t dig deeper into the content because I don’t see any major issues with GitFlow or significant benefits to trunk-based development (TBD) at the moment. If anyone is keen on TBD or is familiar with it, please let me know.
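
As a rough illustration of principles 1 and 5 above (flippable and short-lived), here is a minimal Python sketch. The `FeatureToggle` class, the toggle name `new-checkout`, and the removal date are all hypothetical and are not taken from the linked article.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FeatureToggle:
    """Hypothetical toggle: flippable at runtime, with a removal deadline."""
    name: str
    enabled: bool
    remove_by: date  # date by which the toggle should be deleted from the code

    def flip(self) -> None:
        # Principle 1: the toggle can be switched without changing the code path.
        self.enabled = not self.enabled

    def is_stale(self, today: date) -> bool:
        # Principle 5: flag toggles that have outlived their planned lifetime.
        return today > self.remove_by


def checkout(toggle: FeatureToggle) -> str:
    # The new code path ships dark and is only used when the toggle is on.
    return "new checkout flow" if toggle.enabled else "old checkout flow"


toggle = FeatureToggle("new-checkout", enabled=False, remove_by=date(2020, 10, 1))
```

A CI job could call something like `is_stale` for every toggle and fail the build when one is overdue, which is one possible way to keep toggles short-lived.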

A discussion of complex adaptive systems in relation to IT service management. Interesting points about the importance of constraint to limit negative emergent behaviour.

  • The title is “Complex Adaptive Systems (ii): thinking about emergence and ITSM”.
  • Part 2 of a series that explores the concept of complex adaptive systems in the context of Complexity Science and ITSM under the theme of “Complex Adaptive Systems”.
  • Part 1 is an introduction to the concepts that appear in this post.

DevOpsDays made a comeback in Chicago, with an online version. Talks on resilience engineering, growing a local devops community, chaos engineering as well as ignites and breakouts.

Tips for using third party software packages and images from public repositories, including considering availability, rebuilding from source and local caches.

  • The title is “Consuming Upstream Content in Your Software or Service”.
  • An article that touches on how developers are contributing to and using more and more upstream content, and explains the need to protect the entire ecosystem as well as your own projects, products, or services.

Open Policy Agent is powerful, but like any new tool has a learning curve. This 30 minute tutorial takes you through learning the basics of the Rego language.

  • The title is “Courses on Unified Policy”.
  • Free OPA course by Styra. It includes 30 video lessons and quizzes. Once you register for an account, you can get started right away.
  • Tim Hinrichs, OPA’s co-creator and Styra’s CTO & co-founder, explains the material using easy-to-understand slides.

An excellent introduction to the basics of Kubernetes, covering core components, the general architecture and deploying your first applications.

  • The title is “Kubernetes 101”.
  • An article that summarizes the introductory content of Kubernetes.
  • Personally, I found the advertisements that appeared multiple times in the first half very annoying while reading. That was disappointing, but the content itself is explained carefully.

oso is an open source policy engine for authorization that you can embed in your Java, Python, Ruby or Node application. It provides a consistent DSL and some good getting started documentation.

  • The GitHub page of “oso”, an open source policy engine for authorization embedded into your application.
  • It provides a declarative policy language for expressing authorization logic.
  • Using oso consists of two parts:
  1. Writing oso policies in a declarative policy language called Polar
  2. Embedding oso in your application using the appropriate language-specific authorization library

Gitleaks is a handy tool for detecting secrets in Git repositories, with integration with GitHub Actions and the ability to scan all repos in an organisation.

  • The GitHub page of Gitleaks, a SAST tool for detecting hard-coded secrets such as passwords, API keys, and tokens in Git repositories.
  • It aims to be the easy-to-use, all-in-one solution for finding secrets, past or present, in your code.
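
To give a feel for what detecting hard-coded secrets means, here is a minimal Python sketch of regex-based scanning. This is not how Gitleaks itself is implemented, and it only scans a string rather than Git history; the two patterns, the rule names, and the sample text are assumptions for illustration.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real scanners ship far larger rule sets.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic-password": re.compile(r"(?i)password\s*[:=]\s*['\"][^'\"]+['\"]"),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


# Made-up sample content containing a fake key and a fake password.
sample = 'region = "us-east-1"\naws_key = "AKIAABCDEFGHIJKLMNOP"\npassword = "hunter2"\n'
findings = scan_text(sample)
```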

Continuous Machine Learning (CML) is an open-source library for CI/CD in machine learning projects. Automate model training and evaluation, comparing ML experiments across your project history, and monitoring changing datasets.

  • The GitHub page of “CML (Continuous Machine Learning)”, an open source library for implementing CI/CD in machine learning projects.
  • It automates parts of your development workflow, including model training and evaluation, comparison of ML experiments across project history, and monitoring of changing datasets.

SRE Weekly Issue #234 September 6th, 2020

How to Build Your SRE Team

I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.

Emily Arnott — Blameless

  • It explains the following points as some of the many roles that SREs can play and how to find people with those skill sets.
    ○ Common pathways to becoming an SRE
    ○ SREs as engineers of reliability
    ○ SREs as stewards of reliability
    ○ SREs as leaders who align reliability with business needs
    ○ SREs as ambassadors of reliability culture
    ○ Common team structures

Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability

Incident review is an important part of the organizational learning process, but it can be practiced in a way where the focus shifts away from learning to fixing.

John Carroll (original paper)

Thai Wood — Resilience Roundup (summary)

  • It reviews a paper on nuclear power plants with the above title. The summary opens with the following passage: “While software is unique in some ways, we still are subject to the same constraints as other complex, socio-technical systems and run into similar problems when attempting to learn from incidents. Incident review is an important part of the organizational learning process, but it can be practiced in a way where the focus shifts away from learning to fixing. This creates a few issues such as:”
    ○ Root cause seduction
    ○ Sharp end focus
    ○ Solution driven searches
    ○ Account acceptability

AD 0001

My latest adventures in (negligently) running It started with a surprise AWS bill, and then it got kinda weird…

Lex Neva

  • A blog post by Lex Neva, the editor of SRE Weekly.
  • Starting with the irony of his own site below, his AWS bill was double what it usually is. Twice a pittance is still a pittance for him, but it was curious, so he dug in a little. Hilarity ensued and he wrote this one.
    ○ The not-so-subtle irony of SRE Weekly is that itself is really not very reliable at all[1].
    ○ [1] Please don’t DDoS Please! It’s not funny and you’ll just make me sad. ♥

Inside a CODE RED: Network Edition

Deep technical details on a series of recent incidents involving Basecamp.

Troy Toman — Basecamp

  • Since it is covered in DEVOPS WEEKLY ISSUE #506 above, I will skip it.

Questionable Advice: War Rooms? Really?!?

Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.

Charity Majors

  • An article summarizing the author’s thoughts in response to an anonymous question.
  • The questioner’s company recently began pushing to build and staff what the questioner can only describe as “command centers”. They’re picturing graphs, dashboards, and people sitting around watching their monitors all day just to find out which apps or teams are having issues.
  • The author is critical of the idea and has the following opinions:
  1. That extra human layer is worse than useless; it is actively harmful. By insulating developers from the consequences of their actions, you are concealing from them the information they need to understand the consequences of their actions. You are interfering with the most basic of feedback loops and causing it to malfunction.
  2. The best time to find a bug is as soon as possible after writing it, while it’s all fresh in your head. If you let it fester for days, weeks, or months, it will be exponentially more challenging to find and solve. And the best people to find those bugs are the people who wrote them.

GitHub Availability Report: August 2020

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Keith Ballinger — GitHub

  • This is the August 2020 edition of the “GitHub Availability Report”, which GitHub publishes every month.
  • As mentioned above, they experienced no incidents resulting in service downtime, so the post covers updates to the GitHub Status Page and follow-up details on the incident mentioned in July’s report.

Analysis of Today’s CenturyLink/Level(3) Outage

Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.

Matthew Prince — Cloudflare

  • An article by Cloudflare analyzing the CenturyLink/Level(3) outage that occurred on August 30, 2020.
  • Cloudflare users were also affected. The article includes a timeline of the outage, the mitigations taken by Cloudflare, and possible root causes.

CenturyLink / Level 3 Outage Analysis

This is ThousandEyes’s analysis of the outage, which goes along similar lines to Cloudflare’s and includes a lot more detail.

Angelique Medina and Archana Kesavan — ThousandEyes

  • An article by ThousandEyes analyzing the CenturyLink/Level(3) outage that occurred on August 30, 2020.
  • It covers the root cause of the outage, interactive visualizations and various monitors, service impact, mitigation measures taken by each company, lessons learned from the outage, and more.

KubeWeekly # 232 September 11th, 2020

Editor’s pick of the highlights from the past week.

CNCF launches End User Technology Radar: Observability, September 2020

Today, CNCF is publishing the second of our quarterly CNCF End User Technology Radars; the topic for this Technology Radar is observability.

In June, we launched the CNCF End User Technology Radar, a new initiative from the CNCF End User Community. This is a group of more than 140 top companies and startups who meet regularly to discuss challenges and best practices when adopting cloud native technologies. The goal of the CNCF End User Technology Radar is to share what tools are actively being used by end users, the tools they would recommend, and their patterns of usage. Read the blog post and full report here.

  • It introduces the September 2020 edition of CNCF End User Technology Radar.
  • The first edition came out in June, and this is the second one. The post embeds a webinar video of about one hour in which the Radar Team explains this edition.

Register for KubeCon + CloudNativeCon North America 2020 Virtual and save $50!

Registration is now open! Don’t miss out on THE event of the fall — KubeCon + CloudNativeCon North America 2020 Virtual, November 17–20! The CFP is now closed, and we are eagerly putting together a schedule that will fit our at-home, online-event lifestyles. Stay tuned for more details!

  • Registration for KubeCon + CloudNativeCon North America 2020 Virtual (11/17 ~ 11/20) is now open. Free registration gives access to the keynotes and the Slack Network Lounge only.
  • Paid registration is priced as follows depending on when you register. Don't forget to register while the discount is available. I have already done so.
    ○ Early-Bird(Sep 9–30, 2020): $50 USD
    ○ Standard(Oct 1–31, 2020): $75 USD
    ○ Late(Nov 1–20, 2020): $100 USD

You can view all CNCF recorded and upcoming webinars here.

CNCF Member Webinar: Arm Developer experience spanning cloud, 5G and IoT

Darragh Grealish, Co-Founder of 56K.Cloud & Marc Meunier Sr. Manager, SW Ecosystem Development @Arm

  • It describes how the developer experience is shaped and how the CNCF project and the Arm initiative are making this change possible.
  • With reference to actual examples, they touch on various layers where applications exist, from input devices to edge nodes to the cloud.

CNCF Member Webinar: Building a cloud-native technology stack that supports full cycle development

Daniel Bryant, Product Architect @Datawire

  • It describes four essential features for full-cycle development: container management, progressive delivery, edge management, and observability.
  • It talks about the technology requirements above and describes common anti-patterns and how to avoid them.

CNCF Member Webinar: Highly scalable SaaS apps on Kubernetes: Real life case studies

Ram Kailasanathan, Senior Director Product Management @Oracle & Richard Bair, Senior Director Engineering @Oracle

  • It explains how to build a global cloud-native SaaS application at DevOps speed.
  • Through real-world case studies, covering all foundations from a security perspective, it describes scaling to hundreds of clusters across regions, achieving high availability, and providing the necessary monitoring and tracing.

CNCF Member Webinar: Kubernetes and Networks: why is this so dang hard?

Tim Hockin, Principal Software Engineer @Google

  • It presents different models for integrating Kubernetes into your network in both single-cluster and multi-cluster environments.
  • It describes IP, gateway configurations, how to navigate security boundaries, and the strengths and weaknesses of each solution to help developers make the best choice for their particular environment.

Tutorials, tools, and more that take you on a deep dive into the code.

Introduction to Tekton and Argo CD for multicluster development

Ryan Cook, Red Hat

  • A simple description of the author’s development process for catalogs and process tools.
  • It introduces the components involved, explains a bit about how Tekton Pipelines works, and introduces tools that you can share with your organization or team.

Scaleable multiplayer game design with OpenShift

Erik Jacobs, Roddie Kieley, Michael Clayton, Jared Sprague, and Derek Reese, Red Hat

  • A monthly series published on Red Hat’s YouTube.
  • The series uses containers and OpenShift to explore what it takes to design a scale-out multiplayer video game architecture. Live coding, philosophical design discussions, and everything in between.
  • This is the first episode. It selects the genre of the game and discusses how game design (rules and internal game system) informs the architecture.

Deploy a deep learning model on Kubernetes

Chaimaa Zyani, Kubermatic

  • It explains how to use the Kubermatic Kubernetes Platform to deploy, scale, and manage deep learning models that provide image recognition prediction.

Preventing malicious use of Weave Scope

Steve George, Weaveworks

  • An article by Weaveworks written in response to a report by Intezer and Microsoft that “ TeamTNT hackers are using Weave Scope to aid their intrusions”.
  • Since Weave Scope is a management tool, it has powerful features and it is important to secure the installation.
  • It explains how to use Weave Scope and how to secure it when installing on Kubernetes to prevent misuse.

Cert-manager hits version 1.0.0


  • Cert-manager version 1.0.0 GitHub release page. With the release of v1.0, cert-manager has officially stated that it is a mature project and promises compatibility with the v1 API.
  • There are some “Urgent Upgrade Notes”, so be sure to check them before upgrading.

Service proxy, pod, sidecar, oh my!

Ivan Velichko

  • A hands-on article that works through demos to understand the components in the title. It closes with the following phrase.
    ○ Make code, not war!

The death of Kubernetes AuditSink

Omri Cohen, Palo Alto Networks

  • As the title suggests, this is about the end of AuditSink. The author introduced Kubernetes auditing in his previous blog post, “Kubernetes Audits Introduction”, and was preparing a follow-up article on the dynamic backend.
  • But it was announced that the dynamic backend (also known as the AuditSink API object) would be removed in Kubernetes v1.19, so he wrote this article instead.
  • He asked a question on Kubernetes’ slack channel and got the following answers:
    ○ TL;DR: the feature did not progress for the last 1.5 years since the group responsible for it couldn’t agree on it’s future.

Interview with Honeycomb engineer Chris Toshok: Dogfooding OpenTelemetry

Shelby Spees, Honeycomb

Warning: Helpful warnings ahead

Jordan Liggitt, Google

  • A blog post by a Kubernetes maintainer about a feature added in Kubernetes v1.19.
  • Until now, sharing the information that maintainers accumulate through feature development, bug triage, and support questions was limited to out-of-band methods such as release notes, announcement emails, documents, and blog posts.
  • Kubernetes v1.19 adds the ability for the Kubernetes API server to send warnings to API clients.
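
According to the linked post, these warnings are delivered in the standard HTTP `Warning` response header with warn-code 299. Below is a minimal Python sketch of how a client might pull the warning text out of such a header. The function name `parse_warnings` and the sample header value are assumptions for illustration; clients such as kubectl already handle this for you.

```python
import re
from typing import List

# Matches warning values of the form: 299 - "some text"
# (warn-code 299, a warn-agent such as "-", then the quoted warn-text).
_WARNING_RE = re.compile(r'299\s+\S+\s+"((?:[^"\\]|\\.)*)"')


def parse_warnings(header_value: str) -> List[str]:
    """Extract the quoted warn-text of every 299 warning in a header value."""
    texts = []
    for match in _WARNING_RE.finditer(header_value):
        # Undo backslash escaping inside the quoted string.
        texts.append(re.sub(r"\\(.)", r"\1", match.group(1)))
    return texts


# Illustrative header value, not copied from a real API server response.
header = '299 - "extensions/v1beta1 Ingress is deprecated; use networking.k8s.io/v1 Ingress"'
warnings_found = parse_warnings(header)
```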

Continuous blue-green deployments with Kubernetes

Tomas Fernandez, Semaphore

  • It explains how to create a CI/CD pipeline that deploys apps on Kubernetes using the Blue-Green method.
  • The general concept of blue-green deployment was explained in a previous post.
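
The core idea of blue-green deployment can be sketched in a few lines of Python. This is not the pipeline from the article: the `Router` class is a hypothetical stand-in for whatever decides which deployment receives live traffic (for example, a Kubernetes Service selector), and the version strings are made up.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Router:
    """Hypothetical stand-in for the component that routes live traffic."""
    live: str = "blue"
    versions: Dict[str, str] = field(default_factory=dict)  # color -> app version

    def idle(self) -> str:
        # The color that is currently NOT receiving traffic.
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> str:
        # New versions always go to the idle color, so live traffic is untouched.
        color = self.idle()
        self.versions[color] = version
        return color

    def switch(self, healthy: bool) -> bool:
        # Cut traffic over only if the idle deployment passed its checks.
        if not healthy:
            return False
        self.live = self.idle()
        return True


router = Router(versions={"blue": "v1"})
target = router.deploy("v2")
```

After `router.switch(healthy=True)`, green becomes live and blue stays around with the old version, which is what makes a quick rollback possible.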

Kubernetes YAML generator


  • Personally I love it and highly recommend it.
  • You can create a Kubernetes YAML file using drop-downs and optional inputs. Each resource and option comes with considerations and links.
  • Very convenient and educational UI.
  • Currently selectable resources are Deployment, StatefulSet, DaemonSet.

Articles, announcements, and more that give you a high-level overview of challenges and features.

Airbnb, with Melanie Cebula

Kubernetes Podcast from Google

Advancing the future of CI/CD together

Tracy Miranda, CD Foundation

  • It introduces the Continuous Delivery Foundation (CDF).
  • The CI/CD landscape was new to me (the link says “Continuous Delivery Landscape”) and I didn’t know that the CNCF landscape was called “hellscape”.

New, free training course teaches fundamentals of serverless on Kubernetes

CNCF and LF Training

Balancing open source sacrifice and success

Alex Ellis, The ReadME Project

  • An article by Alex Ellis (Founder of OpenFaaS, CNCF Ambassador) on GitHub’s ReadME Project about OSS and his business so far.
  • The ReadME Project amplifies the voices of the open source community, where maintainers, developers, and teams contribute to move the world forward every day.

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Member Webinar: ChubaoFS Best Practices
Wei Ding, Staff Engineer
Sept 15, 2020 10:00 AM Pacific Time

Member Webinar: How To Run Kubernetes Securely and Efficiently
Joe Pelletier, VP, Products @Fairwinds
Robert Brennan, Director, Open Source @Fairwinds
Sept 16, 2020 7:00 AM Pacific Time

Member Webinar: Effective Kubernetes Onboarding
Kathleen Juell, Developer, DODX @DigitalOcean
Sept 16, 2020 1:00 PM Pacific Time

Member Webinar: Declaratively managing apps in a multi-cluster world
Fernando Ripoll, Solution Engineer @Giant Swarm
Sept 17, 2020 10:00 AM Pacific Time

Member Webinar: Critical DevSecOps considerations for multicloud Kubernetes
Nutanix and Sysdig
Sept 18, 2020 10:00 AM Pacific Time

Member Webinar: Using KubeVirt in telcos
Abhinivesh Jain, Distinguished Member of Technical Staff @Wipro
Sept 23, 2020 7:00 AM Pacific Time

Member Webinar: Mitigating Kubernetes attacks
Wei Lien Dang, Head of Strategy @StackRox
Sept 23, 2020 1:00 PM Pacific Time

Member Webinar: AWS controllers for Kubernetes — AWS services, now Kubified!
Jay Pipes, Principal Open Source Engineer @Amazon Web Services
Sept 24, 2020 10:00 AM Pacific Time

Project Webinar: Kubernetes 1.19
Kubernetes Release Team
Sept 25, 2020 8:00 AM Pacific Time

Member Webinar: VanillaStack as a platform for a truly vendor-agnostic open-source ecosystem
Karsten Samaschke, CEO @Cloudical
Sept 29, 2020 10:00 AM Pacific Time

Member Webinar: Self service Kubernetes for enterprises
Jim Bugwadia, Founder and CEO @Nirmata
Sept 30, 2020 10:00 AM Pacific Time

Member Webinar: Dapr, Lego for microservices
Mark Chmarny, Principal Program Manager @Microsoft
Oct 1, 2020 10:00 AM Pacific Time

Member Webinar: Transactional microservices — The final frontier
Daniel Kozlowski, Minister of Engineering @PlanetScale
Oct 2, 2020 10:00 AM Pacific Time

Member Webinar: Multi-Cluster & multi-cloud service mesh with CNCF’s Kuma and Envoy
Marco Palladino, CTO & Co-Founder @Kong
Oct 6, 2020 10:00 AM Pacific Time

Member Webinar: The evolution of cloud orchestration systems from ephemeral to persistent storage
Boyan Krosnov, CPO @StorPool
Oct 7, 2020 8:00 AM Pacific Time

Member Webinar: Kubernetes native two-level resource management for AI/ML workloads
Diana Arroyo, Software Engineer @IBM Research
Alaa Youssef, Manager, Container Cloud Platform @IBM Research
Oct 7, 2020 10:00 AM Pacific Time

Member Webinar: Building dynamic machine learning pipelines with KubeDirector
Tom Phelan, Fellow, Software Organization @Hewlett Packard Enterprise
Oct 8, 2020 10:00 AM Pacific Time

How was it? Did you find any articles or information that interest you?

To be honest, there is some content I cannot fully digest at this stage, so I’ll use this aide-memoire and these links to catch up myself too.

Bye now!!

Written by Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #GCP, #AWS SAP, #National Tour Guide for English
