SRE / DevOps / Kubernetes Weekly Collection #42 (Week 47)

Nov 27, 2020

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #516 November 15th, 2020
SRE Weekly Issue #244 November 15th, 2020
KubeWeekly : ←No Updates

A series of videos on building a modern CI/CD pipeline for a typical Java application using ArgoCD and Tekton.

The title is “Video course on cloud-native CI/CD with Tekton & Argo CD”.
It explains how to use Tekton and ArgoCD to implement a suitable continuous delivery pipeline for modern enterprise Java projects. The web page embeds nine YouTube videos in which the author walks through the material.

A talk (video and slides) about measuring continuous delivery through lead time, deployment frequency, change failure rate and time to recovery.

The title is “Measuring DevOps”.
Research by Google has developed and validated four metrics that provide a high-level system view of software delivery and performance and predict the ability of an organization to reach its goals. This video shows how to automate the generation and collection of these four metrics with GCP and Tekton.
○ Deployment Frequency
○ Lead Time to Change
○ Change Fail Rate
○ Time to Restore
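
To make the four metrics concrete for myself, here is a minimal sketch in Python of how they could be computed from a list of deployment records. This is not the GCP and Tekton automation shown in the talk; the `Deployment` record, its fields and the `four_metrics` function are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                 # when the change was committed
    deployed_at: datetime                  # when it reached production
    failed: bool = False                   # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_metrics(deployments, period_days):
    """Compute the four metrics over a period of `period_days` days."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    failures = [d for d in deployments if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    return {
        # Deployment Frequency: deployments per day in the period
        "deployment_frequency_per_day": len(deployments) / period_days,
        # Lead Time to Change: median time from commit to production
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        # Change Fail Rate: share of deployments that caused a failure
        "change_fail_rate": len(failures) / len(deployments),
        # Time to Restore: median time from failed deployment to restoration
        "median_time_to_restore": (
            median(d.restored_at - d.deployed_at for d in restored) if restored else None
        ),
    }
```

The medians are my own choice here; averages or percentiles would work just as well depending on what a team wants to track.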

A look at building on top of the new Pulumi automation API, and some thoughts about the emergence of platform teams.

The title is “Self-Service Platform Development Made Easy with Pulumi”.
Drawing on his experience consulting with dozens of internal platform teams, the author comments on Puppet’s “The State of DevOps Report 2020”. He was surprised that all sections were about “Scaling DevOps practices with internal platforms”.
The author introduces “cloud platform”, a prototype whose code is open source. It gives platform teams an easy way to build self-service capabilities.

A post talking about a specific example of helping development teams address security problems (in this case leaking sensitive data in log files) and how to embed a culture of security in engineering organizations.

The title is “Fixing leaky logs: how to find a bug and ensure it never returns”.
It presents a case of putting security into the hands of developers. It explains how the author and fellow developers identified sensitive data leaking into the logs, fixed the problem, and put a mechanism in place to prevent it from happening again.

A good post on the different challenges posed by edge environments around hardware discovery, manageability, provisioning and more.

The title is “Facing Challenges at the Edge”.
It details some of the platform and application manageability challenges at the edge, along with potential designs and solutions.

The latest Puppet State of Devops Report is available, with some interesting industry stats and analysis, in particular around platform teams and change management.

This item covers two articles about DevOps. The first, at the link above, is titled “2020 State of DevOps Report is here!”. The research reveals a strong link between DevOps evolution and the use of internal platforms.
The analysis identifies the following four approaches to change management, based on approval processes (orthodox and adaptive), automated testing and deployment, and advanced risk mitigation techniques.
○ Operationally mature
○ Engineering driven
○ Governance focused
○ Ad hoc
The second article is titled “Two secret weapons DevOps can use to take over the entire enterprise”. It also touches on the report above and shares ideas about internal platform teams.

Events

WTF is Platform as a Product? Companies are going full speed ahead into treating their platforms as products. But WTF does that mean? And WTF are the advantages? In this free 90-minute event on 19 November, you’ll get insight from Matthew Skelton, co-author of Team Topologies, and Jamie Dobson, CEO of Container Solutions, with a special appearance by Dave Farley! Register now.

This week again features a Container Solutions event, this time with Matthew Skelton, co-author of “Team Topologies”, as a guest. As mentioned above, it was a 90-minute event, held on 11/19 (Thursday) at 14:15 CET (Central European Time).

Books

SRE: The Cloud Native Approach to Operations explains how SRE, or Site Reliability Engineering, can help your organisation balance innovation with reliability. In this new e-book from Michael Mueller, a managing director at Container Solutions, you’ll learn what SRE is, and why you might need it; the differences between SRE and DevOps; best practices, and more. Get your free e-book.

It introduces the free e-book “SRE: The Cloud Native Approach to Operations” provided by Container Solutions.
Enter your full name and email address, and a link to the download page is sent to you by email. The e-book is 30 pages long.

Tools

ctlptl aims to make it easier to grab an ephemeral, local, Kubernetes cluster for development purposes. Rather than competing with Docker Desktop, KIND, Minikube or similar tools it provides a higher-level user interface.

This is the GitHub page of “ctlptl”, a CLI tool for declaratively setting up a local Kubernetes cluster.
The Goals and Non-Goals are clearly stated, and I like that it also mentions and introduces other tools.

Athenz is a platform for X.509 certificate based service authentication and fine grained access control in dynamic infrastructures. It supports provisioning and configuration (centralized authorization) use cases as well as serving/runtime (decentralized authorization) use cases.

This is the GitHub page of “Athenz”, an open source platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructure.
It supports provisioning and configuration (centralized authorization) use cases as well as serving / runtime (decentralized authorization) use cases. The Athenz authentication system uses X.509 certificates and industry-standard mutual TLS-bound OAuth2 access tokens.
The name “Athenz” comes from “AuthNZ” (N for authentication, Z for authorization).

PowerfulSeal is a chaos testing tool for Kubernetes. Describe scenarios in YAML and PowerfulSeal can kill running resources and check services are still running, and export results to Prometheus and other monitoring tools.

I will skip it because it was covered in KubeWeekly last week.

K0s is a new small Kubernetes distribution intended for anything from local development usage to large-scale edge deployments.

It introduces “K0s”, a new Kubernetes distribution developed by the Lens team.
The project also has its own .io page.
The answer to “Why is it zero?” is below.
○ The “zero” in k0s really captures our aspiration to not compromise as we build the ultimate Kubernetes distribution:
● Zero Friction
● Zero Dependencies
● Zero Overhead
● Zero Cost
● Zero Downtime

SRE Weekly Issue #244 November 15th, 2020

Articles

Type in the exact number of machines to proceed

If you’re gonna operate on a pile of computers all at once that numbers 6+ figures, making you type that number in is a way to make you pause and think about what you’re doing.

Rachel by the bay

When you operate on many machines from the CLI, a confirmation prompt often serves as a sanity check. Instead of asking for a Y/N answer, the article proposes making the operator read the number of machines and type it in, as follows.
○ Blah blah blah 123,456 machines will be affected. Proceed?
Enter number of machines to confirm: 123456
OK! Continuing.
Indeed, when you are simply asked to confirm, you tend to just press Yes. If you have to read the displayed target count and type it in, you are far less likely to misjudge the scope of impact or to unknowingly involve a large number of machines.
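
As a note to self, a prompt like the one above could be sketched in Python roughly as follows. The function name `confirm_machine_count` and its parameters are my own assumptions, not code from the article; `read_input` and `write` are injectable only so that the behavior is easy to try out.

```python
def confirm_machine_count(count, read_input=input, write=print):
    """Ask the operator to type the exact number of affected machines."""
    write(f"{count:,} machines will be affected. Proceed?")
    answer = read_input("Enter number of machines to confirm: ")
    # Accept "123456" or "123,456"; a plain "y" or "yes" is never enough.
    normalized = answer.strip().replace(",", "")
    if normalized.isascii() and normalized.isdigit() and int(normalized) == count:
        write("OK! Continuing.")
        return True
    write("Count does not match. Aborting.")
    return False
```

Calling `confirm_machine_count(123456)` interactively proceeds only if the operator types the number that was shown.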

IT metrics: Why the five 9s must go

Find out why they decided to focus less on nines, and what they did instead.

Robert Sullivan

It points out the problems with chasing “nines” as an uptime percentage target, and introduces the following approach developed by the company’s SRE, L1 / L2 support, and operations teams.

○ Count all the minutes that affect business performance. These are “impacted” minutes.
○ Not all incidents are the same, so it’s important to agree on definitions. Here’s what we decided:
○ Count all impacted minutes (global, partial, and degraded) against the total number of minutes in a month.
○ Meet with business leadership regularly (we do this weekly) to discuss the numbers and the impact of service interruptions on the business.
○ Track instances in which your monitoring leads to action that avoids impacted time. (We refer to these as “mitigated events.”)
○ Count the minutes that high-availability services were not fully redundant.
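
Counting impacted minutes against the total minutes in a month comes down to simple arithmetic. Below is a minimal sketch in Python; the function names and the incident figures are made up by me for illustration and are not taken from the article.

```python
import calendar


def minutes_in_month(year, month):
    """Total number of minutes in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def unimpacted_share(impacted_minutes, year, month):
    """Share of the month without business impact, given minutes per impact type."""
    total = minutes_in_month(year, month)
    impacted = sum(impacted_minutes.values())
    return 1 - impacted / total


# Hypothetical figures for November 2020 (30 days = 43,200 minutes).
november = {"global": 12, "partial": 30, "degraded": 90}
print(f"{unimpacted_share(november, 2020, 11):.3%}")
```

The point of the article is that the conversation with the business should be about these impacted minutes and their effect, not about the resulting percentage alone.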

Rule 1: It’s ALWAYS DNS

Reminds me of the classic:

It’s not DNS
There’s no way it’s DNS
It was DNS

— (ssbroski on reddit)

Mike S.

It’s a 2017 article, so it is fairly old, but DNS is still difficult, and I laughed when I saw the link to the “network solutions haiku”.

Moving OkCupid from REST to GraphQL

Their front-end made duplicate calls to the new API to test load and response time prior to cutting over.

Michael P. Geraci — OkCupid

A case study from OkCupid, which moved from a REST API to a GraphQL API on a site with millions of users without compromising performance.
The author also shares the following four steps and the “what I should have done” lessons learned through the release.

○ Pick an appropriate page to convert
○ Build the schema
○ Add a shadow request to call the new API while still fetching data via the REST API
○ Do an A/B test with real users that changes the data source
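
To remember what a shadow request looks like, here is a minimal sketch in Python. It is not OkCupid’s front-end code; `fetch_with_shadow`, `fetch_rest`, `fetch_graphql` and `record` are hypothetical names, and for simplicity the shadow call runs sequentially here, whereas a real front-end would likely fire it in the background.

```python
import time


def fetch_with_shadow(fetch_rest, fetch_graphql, record):
    """Serve data from the REST API while also exercising the new API."""
    result = fetch_rest()  # users still get the REST response

    started = time.perf_counter()
    try:
        fetch_graphql()  # the shadow response itself is discarded
        record({"ok": True, "seconds": time.perf_counter() - started})
    except Exception as exc:
        # A failing shadow call must never affect what the user sees.
        record({"ok": False, "error": repr(exc),
                "seconds": time.perf_counter() - started})

    return result
```

The recorded timings and failures are what let you judge load and response time of the new API before switching the data source in the A/B test.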

New Arctic Air Crash Aftermath Role-Play Simulation Orchestrating a Fundamental Surprise

This is really cool. The researchers created a role-play scenario based on a real plane crash. They tried to get participants to blame “human error”, so that they could then surprise them with all of the (many) contributing factors that were involved.

Emily S. Patterson, Richard I. Cook, David D. Woods, Marta L. Render

The “fundamental surprise”, generated by a deliberate discrepancy between the lessons learned during the role-play and the potential lessons, allows oversimplified assumptions about how complex systems fail to be challenged.

From Sysadmin to SRE

Tips from one Sysadmin’s journey to becoming an SRE.

Josh Duffney — Octopus Deploy

In line with the title, the author shares his thoughts on the following items for those who are new to the industry.
○ It isn’t about tools, but…
○ Learn to code from the command-line
○ Start at the source
○ Pull requests mean deployments

Outages

YouTube
Macs
Mac users had issues launching applications, owing to an outage of ocsp.apple.com. Apple confirmed the issue.
PrometheusKube
The link points to their awesome writeup of what went wrong and the on-the-fly reworking they had to do to fix it.
Instagram
Hotmail
Various stock trading platforms
There’s some speculation that this was a result of increased trading volume following Pfizer’s announcement about vaccine trial results.
Robinhood
Increased Error Rates

KubeWeekly : ←No Updates

How did you find these articles? Did any of them interest you?

To be honest, there is some content I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself, too.
Bye now!!

Yoshiki Fujiwara