SRE / DevOps / Kubernetes Weekly Collection #26 (Week 31)

  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add comments and useful links as an aide-memoire.
  • I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #500 July 26th, 2020
SRE Weekly Issue #229 July 26th, 2020
KubeWeekly #227 August 1st, 2020

DEVOPS WEEKLY ISSUE #500 July 26th, 2020

500 issues

Here are 5 posts still worth reading from Devops Weekly issues through the years.

Issue #1

But what is Devops? I know a number of people have signed up to this newsletter having only recently come across the term. It’s safe to say Devops means different things to different people at this early stage, but I’m going to start out by pointing everyone to James Turnbull’s WHAT DEVOPS MEANS TO ME

  • The title is “What DevOps Means to Me.”
  • First, it explains the background of the need for DevOps, “Why should we merge or bring together the two realms?” And to the question in the title “What DevOps means to me?”, the author closed the article with this phrase “The best thing about the movement for me is that it is trying to foster behaviours and environments where people work together towards joint goals rather than at cross-purposes or at odds. That’s a world I’d much rather use my skills in.”

Issue #100

Interesting set of blog posts, describing protocols or patterns for Devops adoption. The first two talk about the advantages of starting small and fixing a real problem quickly and about configuration management and limiting manual changes.

  • The titles are “Devops Protocols: Start Small” (linked above) and “Devops Protocol: No Manual Changes”.
  • A blog series explaining protocols and patterns for adopting DevOps. “Start Small” and “No Manual Changes” are each divided into three points. The explanations make it easy to picture the actual situation.

Issue #200

If you’re just getting to grips with monitoring it can be difficult to know where to start. This presentation gives you a quick overview of the last 5 years. Lots of ideas for things to improve.

  • The title is “5 Years of Metrics & Monitoring”.
  • Slides uploaded to Speaker Deck. There is a video link, but the link is broken.
  • Looking back on five years of metrics and monitoring, the speaker poses questions and answers them himself. The chart and dashboard examples make it easy to see what is wrong.

Issue #300

I attended the recent Operability conference in London and these two posts nicely summarise the various talks and provide lots of links to the slides.

Issue #400

Good tips for ensuring the software release process is robust, emphasising that this means clear ownership and treating deployment software like the critical production system it is.

  • It says first that “Rolling out new software is the proximate cause for the overwhelming majority of incidents, at companies of all sizes.” and “most issues are still caused by humans and our pesky need for “improvements”.” and explains what can be done.
  • It explains based on the following five points.
  1. Get someone to own the deploy software
  2. Value the work
  3. Create a culture of software ownership
  4. LOOK at what you’ve done after you do it
  5. Be suspicious of new versions until they prove themselves


It’s that time of year again. The regular Puppet State of Devops survey is open. The focus this year is the relationship between change management, continuous delivery and self-service platforms.

  • The “Puppet’s 2020 State of DevOps survey” web page by Puppet. The survey seems to be carried out around this time every year.

Documentation and design serve a critical role in building robust systems. This post looks at why design documents are useful and what sort of thing they should include.

  • The title is “Design Docs at Google”.
  • The author opens with “One of the key elements of Google’s software engineering culture is defining software design through design documents” and explains the practice. I bookmarked it because I want to reread it later.
  • These are relatively informal documents that the primary author or authors of a software system or application create before they embark on the coding project.
  • The design doc documents the high level implementation strategy and key design decisions with emphasis on the trade-offs that were considered during those decisions.

A new report on the state of public Terraform code security. Some useful data and some good tips for anyone using Terraform to configure services.

  • The title is “Introducing the State of Open Source Terraform Security Report”.
  • An introductory article for the “State of Open Source Terraform Security Report”.
  • They analyze the public Terraform registry, which contains thousands of open source modules used to provision cloud resources.
  • It uses the OSS static analysis tool “Checkov” to scan the registry and measure module compliance across categories and cloud providers.
  • I was shocked by the finding “Nearly 1 in 2 modules used to build resources for AWS, Azure, and Google Cloud is misconfigured.”

A look at using Azure Pipelines to validate a sysmon configuration automatically.

  • The title is “Using Azure Pipelines to validate my Sysmon configuration”.
  • The author has been maintaining a Sysmon repository for two years with manual generation from the included script.
  • A pull request by Ján Trenčanský that used GitHub Actions sparked the idea to take this a step further.
  • It describes the solution that was easiest for him to run, its configuration, and tips for getting it working with Azure Pipelines.

A good story of migrating a low-level component at scale, in this case an application server. Canary rollouts, upstream contributions, performance and other interesting topics.

  • The title is “How we migrated application servers from Unicorn to Puma”.
  • GitLab has migrated its application server from Unicorn to Puma. It has been running on Puma since GitLab 12.9, and Puma has been the default since 13.0.
  • Both are application servers for Ruby on Rails, but the big difference is that Unicorn uses a single-threaded process model and Puma uses a multi-threaded model.
  • They investigated and carried out the migration in order to solve memory and scalability problems.

Embracing cloud native technologies and ways of working comes with challenges, some of which this post documents, including security, lack of expertise, slow release cycles and more.

  • The title is “Top 7 Challenges to Becoming Cloud Native”.
  • It describes the 7 most common issues enterprise companies face in the cloud native journey with the following introduction.
  1. While great in theory, the problem with cloud native computing is that it isn’t always easy or straightforward to implement — especially if you’re an enterprise with long-standing, legacy applications.
  2. Not only must you adopt cloud native tools that suit your unique requirements, you must nurture their use with cultural shifts.
  3. Change should be implemented incrementally but holistically.
  • The 7 most common issues are as follows.
  1. Slow release cycles and accelerated pace of change
  2. Outdated technologies
  3. Service provider lock-in and limited flexibility for growth
  4. Lack of technical expertise to handle data
  5. Security
  6. High operational and technology costs
  7. Cloud native concepts are difficult to communicate


kube-iptables-tailer does just what you expect. It exposes the underlying iptables data to kubectl. Handy for spotting services trying and failing to communicate with one another in Kubernetes.

  • Kubernetes GitHub page for the OSS tool “kube-iptables-tailer” that gives you more visibility into network issues in your cluster.
  • It detects traffic denied by iptables and surfaces corresponding information to the affected Pods via Kubernetes events.

Kconmon is a Kubernetes connectivity monitoring tool that runs frequent tests (tcp, udp and dns), and exposes Prometheus metrics that are enriched with the node name, and the locality information (such as zone), enabling you to correlate issues between availability zones or nodes.

  • Kubernetes GitHub page of the OSS tool “Kconmon” that monitors connections between nodes.
  • It runs frequent tests (tcp, udp, dns) and exposes Prometheus metrics enriched with node names and locality information (zones, etc.).

SRE Weekly Issue #229 July 26th, 2020


“How could they be so stupid?”

More details have emerged about the Twitter break-in last week, leading some to utter the quote above. Here’s a take on how to see it as not being about “stupidity”.

Lorin Hochstein

  • An article about a case where Twitter accounts were hijacked one after another on July 15th (US time) and abused for bitcoin remittance fraud.
  • The comment “How could they be so stupid?” in the title was directed at Twitter’s engineers on the Internet after it was reported that an attacker obtained a credential from a message in Twitter’s internal Slack.
  • The author said that “I don’t personally know any engineers at Twitter, but I have confidence that they have excellent engineers over there, including excellent security folks. So, how do we explain this seemingly obvious security lapse?”.
  • He carried out the following analysis and explained that it is necessary to deepen the understanding of the system to eliminate the problem that motivated the workaround.
    ○ There are countless possibilities for why people employ workarounds. Maybe some system that’s required for doing it the “right” way is down for some reason, or maybe it simply takes too long or is too hard to do things the “right” way. Combine that with production pressures, and a workaround is born.
    ○ Some workarounds, like the Twitter example, are dangerous. But simply observing “they shouldn’t have done that” does nothing to address the problems in the system that motivated the workaround in the first place.
    ○ When you see a workaround, don’t ask “how could they be so stupid to do things the obviously wrong way?” Instead, ask “what are the properties of our system that contributed to the development of this workaround?”

Data Consistency Checks

The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have a framework and a set of tools to deal with it.

Paul Hammond and Samantha Stoller — Slack

  • A blog post by Slack engineers. The authors begin by saying: “An entire ecosystem of monitoring and administrative tools exist for operating our databases, making sure they replicate, scale and are generally performant. Similarly, a number of tools accompany the databases’ query language from linters and beautifiers to query builders and object mappers. But after our application has written data, there is very little tooling to verify that the data is as expected and remains as such.” The post then mainly explains the “Consistency Check Pattern”.
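
To make the idea of a consistency check more concrete, here is a minimal sketch. It is not Slack's framework: the tables (`channels`, `members`), the rule being checked (every member row must point at an existing channel), and the function names are all assumptions invented for illustration. The point is only the shape of the pattern: scan data that has already been written and report rows that violate an expectation.

```python
import sqlite3


def find_orphaned_members(conn):
    """Return (member_id, channel_id) rows that point at a missing channel."""
    query = """
        SELECT m.id, m.channel_id
        FROM members AS m
        LEFT JOIN channels AS c ON c.id = m.channel_id
        WHERE c.id IS NULL
        ORDER BY m.id
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Create an in-memory database with one deliberate inconsistency."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE channels (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE members (id INTEGER PRIMARY KEY, channel_id INTEGER);
        INSERT INTO channels VALUES (1, 'general');
        INSERT INTO members VALUES (10, 1), (11, 2);  -- channel 2 is missing
        """
    )
    return conn


if __name__ == "__main__":
    # Report each violation instead of fixing it; repair is a separate step.
    for member_id, channel_id in find_orphaned_members(build_demo_db()):
        print(f"member {member_id} references missing channel {channel_id}")
```

A check like this only reports problems. Whether to repair them automatically, and how, is a separate decision that depends on the data involved.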

Obstacles to Learning from Incidents

I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.

Thai Wood — Learning from Incidents

  • The article explains the following “Obstacles to Learning from Incidents” named in the title. Like the editor of SRE Weekly, I found “Distancing through differencing” the most memorable.
    ○ Distancing through differencing
    ○ Overconfidence
    ○ Root cause
    ○ Only trying to learn from “bad” things
    ○ High pressure reporting requirements
    ○ Making sure this never happens again
    ○ Confusing writing, distribution, or meetings with learning

You don’t need SRE. What you need is SRE.

[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.

Sanjeev Sharma

  • Here are the points the author wanted to convey in the title and the first section.
    ○ You do not need SRE. Don’t get me wrong, you need (service/system) Reliability Engineering.
    ○ You still need to automate repetitive, typical tasks in operations.
    ○ You just don’t need to, and really should not do it the Google way.
    ○ You are not Google. Very few organizations are.
  • The points the author wanted to convey in the second section “SRE for the Enterprise” are as follows.
    ○ You are not replacing your current Ops team, your sys admins with software Engineers. You need your ops team.
    ○ Renaming your DevOps teams as SRE is a no-no. DevOps is DevOps. SRE is SRE.
  • The article is dated February 20, 2020. At that time, readers who completed a survey and sent a screenshot were apparently offered a 20-minute “free consultation between yourself and the team regarding SRE”. The survey is now closed.

Questionable Advice: “What’s the critical path?”

Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).

Charity Majors

  • The author was asked “Any advice/reading on how to establish a team’s critical path?” and wrote down her thoughts.
  • Her answer is “What makes you money?” This leads to the following actions.
    ○ The idea here is to draw up a list of the things that are absolutely worth waking someone up to fix immediately, night or day, rain or shine.
    ○ That list should be as compact and well-defined as possible.
    ○ This allows you to be explicit about the fact that anything else can wait til morning, or some other less-demanding service level agreement.
  • “What makes us money?” is a stand-in for the actual questions below.
    ○ “What actions allow us to survive as a business?”
    ○ “What do our customers care the absolute most about?”
    ○ “What makes us us?”

Thinking About Your Humans With J. Paul Reed

This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at […]. I love how they refer to the pandemic as a months-long incident, and point out that if you’re always in an incident then you’re never in an incident.

Julie Gunderson and Mandi Walls — Page it to the Limit

Rebuilding messaging: How we bootstrapped our platform

I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.

Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn

  • Part 2 of the “Rebuilding Messaging” series describes a major migration of existing data to a new database, commonly referred to as bootstrapping data from a legacy system into a new system.
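
For readers unfamiliar with the dual-write approach mentioned in the newsletter blurb, the sketch below shows the general idea. It is not LinkedIn's implementation: the class `DualWriteStore`, its methods, and the use of plain dictionaries as stand-ins for the legacy and new databases are assumptions made for illustration. New writes go to both stores, older data is backfilled, and reads are switched only after the two stores agree.

```python
class DualWriteStore:
    """Writes go to both stores; reads stay on the legacy store until cutover."""

    def __init__(self, legacy, new):
        self.legacy = legacy  # dict standing in for the legacy database
        self.new = new        # dict standing in for the new database
        self.read_from_new = False

    def put(self, key, value):
        # Dual write: every new write lands in both stores.
        self.legacy[key] = value
        self.new[key] = value

    def get(self, key):
        source = self.new if self.read_from_new else self.legacy
        return source[key]

    def backfill(self):
        """Copy rows that predate dual-writing without touching newer writes."""
        copied = 0
        for key, value in self.legacy.items():
            if key not in self.new:
                self.new[key] = value
                copied += 1
        return copied

    def cut_over(self):
        """Switch reads to the new store only once both stores agree."""
        if self.legacy != self.new:
            raise RuntimeError("stores differ; backfill before cutting over")
        self.read_from_new = True


if __name__ == "__main__":
    store = DualWriteStore({"msg-1": "hello"}, {})
    store.put("msg-2", "world")   # written to both stores
    print(store.backfill())       # number of older rows copied across
    store.cut_over()
    print(store.get("msg-1"))     # now served from the new store
```

A real migration also has to handle partial failures between the two writes and concurrent updates during the backfill, which this sketch leaves out.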


KubeWeekly #227 August 1st, 2020

The Headlines

Editor’s pick of the highlights from the past week.

Scheduling, with David Oppenheimer

David’s work with Borg, Omega and now Kubernetes over the past 13 years, puts him, in the words of Tim Hockin, “among the world’s experts in scheduling systems”. On this week’s episode of the Kubernetes Podcast from Google, David talks about his experience with scheduling systems, how learnings from Omega became Kubernetes features, and what he thinks the biggest challenges facing the cluster management space are today.

Jaeger Project Journey Report: A 917% increase in companies contributing code

CNCF staff

This week, CNCF published a project journey report for Jaeger. This is the sixth such report compiled for CNCF graduated projects. The report assesses the state of the Jaeger project and how CNCF has impacted its progress and growth.

Jaeger is an open source, end-to-end distributed tracing platform built to help companies of all sizes monitor and troubleshoot their cloud native architectures. Contributors to Jaeger include many of the world’s largest tech companies, such as Uber, Red Hat, Ryanair, IBM, and Ticketmaster as well as fast-growing mid-size companies like Cloudbees. Read the full report.

  • CNCF released the Jaeger Project Journey Report, one of a series of reports on CNCF graduated projects.
  • The report tries to evaluate objectively the current state of Jaeger and how CNCF has impacted its development and growth.
  • Unsurprisingly, Uber, which developed and donated the project, is the top contributing company. Red Hat’s share is growing, but so is that of “Others”, and contributors come from a diverse range of countries and companies. This looks like healthy growth.

Register by August 3, 2020 at 23:59 PDT and you will be entered into a drawing* to win one of the below gift boxes.

Keep Cloud Native Delighted Swag Box (500 available) which includes:

KubeCon + CloudNativeCon Europe t-shirt
Keep Cloud Native Connected patch
Project logo face mask
Diamond sponsor surprise
Kubernetes fidget spinner
CNCF socks

Grand Prize! Keep Cloud Native Delighted Deluxe Swag Box (10 available) which includes:

All the above items PLUS
$150 gift card to the CNCF online store

*The drawing is open to both pass types, Full Event and Keynote + Expo Hall only, whether already registered or registering between now and August 3. Winners must be registered by the August 3 deadline AND attend the conference. Drawing will be held and winners notified by email on August 24, 2020. Limit of (1) box per participant.

Not only do you have the opportunity to win swag but by registering now, time is blocked on your calendar so you won’t miss a thing. It’s a win-win!

Register now!

  • Information on the KubeCon + CloudNativeCon Europe drawing. Anyone registered by August 3, 2020 at 23:59 PDT is entered automatically. 500 swag boxes will be drawn, and winners will be notified by email on August 24. It seems that both ticket types are eligible: the $75 full event ticket and the free keynote + sponsor session ticket.

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Member Webinar: One large cluster or lots of small ones? Pros, cons and when to apply each approach

Flavio Castelli, Distinguished Engineer @SUSE

  • It explains the pros and cons of running Kubernetes as “one large cluster” versus “lots of small clusters”, and which solutions can alleviate some of the drawbacks of each.
  • The aim is to understand the trade-offs of both approaches, and to help evaluate which path to follow based on your requirements.

CNCF Member Webinar: Kubernetes Policies 101

Eran Leib, Founder & VP Product Management @Apolicy and Spenser Paul, Director of Sales, North America @DoiT International

  • Kubernetes policy is explained focusing on the following topics.
    ○ What type of Policies exist?
    ○ How do we define and enforce Policies?
    ○ What best practices are available?

CNCF Member Webinar: GitOps Continuous Delivery with Argo and Codefresh

Brandon Phillips, Solutions Architect @Codefresh

  • It uses Argo and Codefresh as examples of how GitOps can deliver reliable, fast, and repeatable releases.

CNCF Member Webinar: Event-Driven Cloud Native Workflows Use Cases and Patterns

Sebastien Goasguen, CTO @TriggerMesh

  • The explanation focuses on the following points. The talk also mentions TriggerMesh’s products.
    ○ Discuss a set of serverless use cases, from LEGO to HSBC, and highlight common patterns.
    ○ Show how these patterns can be reproduced with technologies like Knative and the CloudEvents specification.
    ○ Finish by weighing the pros and cons of going serverless directly in the cloud vs. running some of the backing infrastructure yourself.

CNCF Member Webinar: Cluster API — Yesterday, Today, Tomorrow

Saad Malik, CTO & Co-Founder @Spectro Cloud, and Jun Zhou, Chief Architect @Spectro Cloud

  • It describes Cluster API and common Kubernetes lifecycle management options.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Domain-Oriented Microservice Architecture

Adam Gluck, Uber

  • It introduces a general approach to microservice architecture called “DOMA (Domain-Oriented Microservice Architecture)” which Uber is working on.
  • The following goal is highlighted early in the text.
    ○ “Our goal with DOMA is to provide a way forward for organizations that want to reduce overall system complexity while maintaining the flexibility associated with microservice architectures.”
  • I would like to deepen my understanding and mental picture of architecture, so I bookmarked this article as well.

12 Container image scanning best practices to adopt in production

Pawan Shankar, Sysdig

  • The post opens with “In this blog, we will cover many image scanning best practices and tips that will help you adopt an effective container image scanning strategy.” and explains the following 12 best practices.
  1. Bake image scanning into your CI/CD pipelines
  2. Adopt inline scanning to keep control of your privacy
  3. Perform image scanning at registries
  4. Leverage Kubernetes admission controllers
  5. Pin your image versions
  6. Scan for OS vulnerabilities
  7. Make use of distroless images
  8. Scan for vulnerabilities in third-party libraries
  9. Optimize layer ordering
  10. Scan for misconfigurations in your Dockerfile
  11. Flag vulnerabilities quickly across Kubernetes deployments
  12. Choose a SaaS-based scanning solution

Certificate management on Istio

Szabolcs Berecz, Banzai Cloud

  • Following on from a recent post on the same blog titled “Certificate management on Kubernetes”, it focuses on the differences that the Istio service mesh makes.
  • The hands-on explanations and diagrams are easy to follow and thorough.

Kubernetes Secrets: A Secure Credential Store for Jenkins

Vasumathy Seenuvasan and Ravi Bukka, eBay

  • A story about using Kubernetes Secrets to manage eBay’s Jenkins credentials.
  • The company is containerizing Jenkins to provide a continuous build infrastructure for Kubernetes clusters, enhancing the e-commerce marketplace experience.

Conftest joins the Open Policy Agent project

Gareth Rushgrove, Snyk

  • I’ll skip it because it’s an article I covered last week.

GitOps Continued: Using Tekton for CI and Argo for CD

Mario Vazquez, Ryan Cook, Chris Short, Red Hat

  • A nearly 100-minute Twitch video by Red Hat members featuring Tekton for CI and Argo for CD.

The Seccomp Notifier — New Frontiers in Unprivileged Container Development

Christian Brauner

  • It details the new seccomp notification feature they have developed in both the kernel and user space. The explanation is very detailed, and I haven’t read all of it yet.

Introduction to instrumenting applications with Prometheus

Brian Brazil, Sysdig

  • It mainly covers the following two points.
  1. Analyzing your service to find the most useful places to add metrics, how to add that instrumentation, and getting it exposed and scraped.
  2. Basic queries for using these metrics in a graph once they are exposed and scraped.
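
As a rough illustration of what “adding instrumentation and exposing it” means, here is a hand-rolled sketch that uses only the Python standard library. It is not the code from the article, and a real service would normally use an official Prometheus client library instead. The `Counter` class, the metric name `http_requests_total`, and the `path` label are assumptions chosen for illustration; the output follows the Prometheus text exposition format that a scrape of `/metrics` would return.

```python
class Counter:
    """A tiny counter that renders itself in the Prometheus text format."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.values = {}  # maps a label value (request path) to its count

    def inc(self, path, amount=1):
        # Instrumentation point: call this wherever a request is handled.
        self.values[path] = self.values.get(path, 0) + amount

    def render(self):
        # Exposition: HELP and TYPE comment lines, then one sample per label.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for path, value in sorted(self.values.items()):
            lines.append(f'{self.name}{{path="{path}"}} {value}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests_total = Counter("http_requests_total", "Total HTTP requests.")
    for path in ["/", "/", "/login"]:
        requests_total.inc(path)
    print(requests_total.render(), end="")
```

Once Prometheus scrapes a counter like this, a typical graph query is a per-second rate over a time window, for example `rate(http_requests_total[5m])`.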

Deny Rules! Fine-Grained Kubernetes Access Controls with Kyverno

Shuting Zhao

  • It introduces how to easily manage fine-grained access control as a custom policy using Kyverno.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

The Kubernetes CPU Manager: a Ghost Story

Michael Vigilante

  • A story of investigating a “ghost” living in a Kubernetes cluster that was affecting CPU performance, and tracking down the cause.
  • The following two countermeasures were implemented.
  1. Set the --reserved-cpus option on the kubelet
  2. Enable the static CPU management policy

An Introduction to the Cloud Native Landscape

Catherine Paganini, Kublr

  • It breaks down the CNCF Cloud Native Landscape map and provides an overview of the entire landscape, its layers, columns, and categories.
  • This is the first article in the series. Subsequent articles will expand on each layer and column to explain in detail what each category is, the problem it solves, and how.
  • Personally, I have only ever looked at this landscape vaguely, so I will take this opportunity to review it in a structured way.

A Kubernetes Ghost Story

Guinevere Saenger, GitHub

  • A YouTube video. The speaker talks against a dark background, which, together with the title, made it feel a little creepy.
  • The question it explores: how can a contribution get past the CLA (Contributor License Agreement) check?

An Interview with CNCF GM & Heavybit Advisor, Priyanka Sharma

Mina Benothman, Heavybit Industries

  • An interview with Priyanka Sharma, who became the new GM of CNCF.
  • It covers the background that shaped her career, marketing, OSS, and the people who paved the way for her.

KUDO or how to simply create your Kubernetes operator with Denis Jannot (in French)

Electro Monkeys podcast

  • A podcast delivered in French, which this blog covers from time to time. This time, the guest is Denis Jannot (Sales Engineer at D2iQ).
  • He explains various framework issues and why D2iQ chose to create KUDO.

How Policy Engines Make Day 2 Easier

Emily Omier, Nirmata

  • It demonstrates some specific ways to streamline Day 2 operations by automating configuration using an intelligent policy engine.
  • Kyverno is mentioned as a specific example.

Announcing Vitess 7

Deepthi Sigireddi, Vitess maintainer

  • The Vitess 7 release article. The original article is by Deepthi Sigireddi, a Vitess maintainer.
  • The four main themes are as follows.
  1. Improved SQL Support
  2. Stability
  3. Innovation
  4. Tutorials

Upcoming CNCF webinars

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
TiKV team
July 31, 2020 10:00 AM Pacific Time

Member Webinar: Comparing eBPF and Istio/Envoy for Monitoring Microservice Interactions
Roko Kruze, Solutions Engineer @Flowmill
Mike Cohen, Co-Founder and COO @Flowmill
Aug 4, 2020 10:00 AM Pacific Time

Member Webinar: Debugging your debugging tools; What to do when your service mesh goes down in production?
Neeraj Poddar, Co-founder and Chief Architect @Aspen Mesh
Aug 5, 2020 7:00 AM Pacific Time

Member Webinar: Making Data Work for Developers with Kubernetes & Cassandra
Chris Splinter, Sr. Product Manager — Developer Solutions @DataStax
Patrick McFadin, VP of Developer Relations @DataStax
Aug 5, 2020 1:00 PM Pacific Time

Member Webinar: Maximizing M3 — Pushing performance boundaries at scale in a cloud-native distributed metrics engine
Ryan Allen, Senior Software Engineer @Chronosphere
Aug 6, 2020 10:00 AM Pacific Time

Ambassador Webinar: GitOps, DSL and App Model — Getting Started Building Developer Centric Kubernetes
Lei Zhang, Staff Engineer @Alibaba
Aug 7, 2020 10:00 AM Pacific Time

Member Webinar: Hardware for Kubernetes, Peeling Back the Layers
Erik Riedel, SVP Compute & Storage Solutions @ITRenew
Aug 11, 2020 10:00 AM Pacific Time

Member Webinar: The Open-Source Observability Playbook
Hen Peretz, Head of Solutions Engineering @Epsagon
Aug 12, 2020 7:00 AM Pacific Time

Member Webinar: Migrating Real-Time Communication Applications to Kubernetes at Scale: Learnings from 8×8’s Experience
Michael Laws, Sr. Site Reliability Engineer/DevOps at 8×8
Pankaj Gupta, Sr. Director at Citrix
Aug 12, 2020 1:00 PM Pacific Time

Member Webinar: MLOps automation with Git Based CI/CD for ML
Yaron Haviv, Co-Founder and CTO, Iguazio
Aug 26, 2020 1:00 PM Pacific Time

Project Webinar: Kubernetes 1.19
Kubernetes release team
Aug 28, 2020 10:00 AM Pacific Time

Member Webinar: Getting started with container runtime security using Falco
Loris Degioanni, CTO and Founder @Sysdig
Sept 2, 2020 1:00 PM Pacific Time

What did you think of these articles? Did any of them catch your interest?

To be honest, there is some content I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself, too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece (1986–1992). #Network, #Kubernetes, #CKA, #CKAD, #Certified AWS SAP
