SRE / DevOps / Kubernetes Weekly Collection #45 (Week 50)

Dec 15, 2020

In this blog post series, I collect articles from the following three weekly mailing lists I subscribe to, adding some comments as an aide-memoire along with useful links.
Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #519 December 6th, 2020
SRE Weekly Issue #247 December 6th, 2020
KubeWeekly #243 December 11th, 2020

A post on moving from systems administration to information security, with good observations of the advantage of being new to a topic and how a variety of skills help when moving to different roles.

The title is “A SYSADMIN MOVES TO INFOSEC”.
The author explained how he joined the infosec team after working as a system administrator and software developer for a while, and recommended information security to those who have similar experience.
The four questions below help you ask yourself whether a role change like the author’s is right for you.

DO YOU LIKE INVESTIGATING AND TROUBLESHOOTING?
DO YOU LIKE LEARNING?
MAYBE YOU LIKE MAKING CONTENT AND GIVING PRESENTATIONS AND TRAINING?
DO YOU LIKE TECHNICAL WRITING?

Hindsight is an interesting design tool for future systems. This post, from someone very familiar with its workings, looks at what applying hindsight and some opinions to Kubernetes might mean if you build a new container orchestrator.

The title is “A better Kubernetes, from the ground up”.
The author chatted with Vallery Lancey, a Senior Site Reliability Engineer at Apple, and found the conversation so stimulating that he felt the need to write things down.
It is a masterpiece that lays out its ideas under the following headings.
○ Guiding principles
○ Mutable pods
○ Version control all the things
○ Replace Deployment with PinnedDeployment
○ Explicit orchestration workflows
○ Explicit field ownership
○ IPv6 only, mostly
○ … Or just don’t
○ Security is yes
○ gVisor? Firecracker?
○ Very distributed clusters
○ VMs as primitives
○ How to storage?
○ The end

An interesting post on scaling CI/CD pipelines across development teams, with a focus on self-service.

The title is “How to Build a CI / CD Process That Deploys on Kubernetes and Focuses on Developer Independence”.
I covered it in KubeWeekly last week, so I will skip it.

A breakdown of the recent large AWS outage, based on the public information but with a useful diagram to understand the apparent architecture and a handy list of the proposed plans to avoid similar issues in the future.

The title is “Kinesis Outage”.
The author read the summary of the AWS outage that originated in Kinesis, revisited his own summary page of Donella Meadows’ “Systems Thinking in Practice”, and explains the outage by linking it to related points.
See the schematic diagram in the article for the dependencies between Kinesis and other services.

Software is always rooted in when it was first written. This post touches on GitHub’s architecture and technology choices and also how, and why, that’s evolving.

The title is “GitHub’s journey towards microservices and more: ‘We actually have our own version of Ruby that we maintain’”.
The article is based on an interview with Sha Ma, GitHub’s VP of Software Engineering, at the QCon Plus virtual developer event.
The following question is the part I was personally most interested in, given the acquisition by Microsoft in 2018.
○ GitHub was largely hosted on its own data centres. Is that moving to Azure?
● “We’re exploring things potentially to move,” Ma said. “We actually still have things hosted on AWS. For example, a lot of our data analytics is on AWS and we’ve started a project to look at migration into Azure, especially since we get internal pricing which is more favourable for us. But a large part, I would say 80–90 per cent of our stuff is hosted in data centres that we physically maintain.”
I have excerpted the following conclusions because I think they are the core of the discussion of microservices versus monolithic architecture.
The company has strong in-house experts on MySQL and on Ruby, which appears in the title, so it will continue to use both. Performance-critical code is written outside the monolith in Golang.
“Microservices is not your solution to technical debt and bad architecture,” Ma told us. “I think there’s been a trend of people who went down the microservices path and are now going back into monolith because microservices became too unwieldy for them. Microservices doesn’t replace good architecture. Going through things like, what should be grouped together? How should we look for things that cross domain boundaries? How should we set up teams and on-call? pushed us towards better architectural practices that benefit us both in the monolithic and microservice world. A lot of the preparatory work we’re doing, we’re actually doing in the monolith before extracting it.”

Details of scaling datastores, from active/active MySQL clusters to using Vitess.

The title is “Scaling data stores at Slack with Vitess”.
I covered it in KubeWeekly last week, so I will skip it.

The Kubernetes API is designed to be extended with new resources. This post looks at a more flexible Deployment resource which supports more fine grained control around rollout and running multiple versions of a service at the same time.

The title is “Introducing Kubernetes Pinned Deployments”.
The author, who is working on the “PinnedDeployment” Kubernetes CRD project, currently at API version v1alpha1, introduces it as the title suggests. Feedback on which features would be interesting is being solicited, so if you have comments, please send them to the author. The post says that “Once some of the rough edges of the implementation are sorted, it will be upgraded to a beta API”.
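To make the idea of running two versions side by side more concrete, here is a minimal Python sketch of how a pinned rollout could split replicas between a previous and a next version. The function name, the percentage parameter, and the round-up rule are my own assumptions for illustration; the actual PinnedDeployment fields and behaviour are defined by the project itself.

```python
from typing import Tuple


def split_replicas(total: int, next_percent: int) -> Tuple[int, int]:
    """Return (previous_replicas, next_replicas) for a pinned rollout.

    Hypothetical helper: names and rounding are illustrative assumptions,
    not the PinnedDeployment implementation.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if not 0 <= next_percent <= 100:
        raise ValueError("next_percent must be between 0 and 100")
    # Integer ceiling: any non-zero percentage gets at least one replica.
    next_replicas = (total * next_percent + 99) // 100
    return total - next_replicas, next_replicas


if __name__ == "__main__":
    # Keep 10 replicas in total and pin 25% of them to the next version.
    previous, nxt = split_replicas(10, 25)
    print(f"previous={previous} next={nxt}")
```

Unlike a plain rolling update, the split stays where it is until someone changes the percentage, which is what allows two versions to run at the same time.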

Tools

Opstrace is a new horizontally-scalable metrics and logs platform, optimised for installation on cloud platforms. It exposes a prometheus-compatible API, as well as working with a variety of agents like those from Fluentd and Datadog.

As described above, this is the GitHub page of “Opstrace”, a horizontally scalable metrics and logs platform.
It combines open APIs with the simple user experience of large service providers, deploying secure, horizontally scalable open source observability into your own cloud account.

Replicate is a new tool that aims to solve version control problems for machine learning. It’s a python library that allows for snapshots to be saved in S3 or Google Cloud Storage and tools for retrieving and reusing those versions.

This is the web page of “Replicate”, a version control tool for machine learning.
Click here for the GitHub page.

Nydus is a set of tools that aims to improve over the current OCI image specification in terms of container launching speed, image space and network bandwidth efficiency. The tutorial is a nice way of understanding how it works.

This is the GitHub page of the “Nydus” project, which implements a userspace file system on top of a container image format that improves on the current OCI image specification.

SRE Weekly Issue #247 December 6th, 2020

Articles

2020 09 25 Incident: Infrastructure connectivity issue impacting multiple systems

This incident report from a September Datadog outage has an interesting tidbit about scaling external incident response in tandem with internal.

Alexis Lê-Quôc — Datadog

An article dated October 6, 2020. It reports on the Datadog incident that occurred in the US region between September 24, 2020, 14:27 UTC and September 25, 00:40.

Google Cloud Issue Summary — Google Drive — 2020–11–16

This is Google’s write-up for an interesting issue that involved repeated re-sending of invitations to edit a Google Drive document.

Google

As the editor’s comment above also notes, the first notification email sent when sharing Drive resources with “users or groups whose profile email address contains uppercase letters” through Google Drive’s sharing web UI was duplicated repeatedly.

What I Wish I Knew About Incident Management

I basically want to immediately absorb any article with this title, unless it’s just clickbait spam. This one definitely isn’t.

Ronak Nathani

He shares the incident management practices he has learned over the years as an SRE at LinkedIn.
Although the theme is on-call, it is also a useful reference for responding to alerts in general.
The article includes the note “In this post, I am not going to talk about how to debug linux or distributed systems or the various debugging tools. (For stories from the frontlines, check out Software Misadventures Podcast!)”. When I checked, I found that Kelsey Hightower was the guest in episode #1.
○ Click here for the YouTube video of Software Misadventures Podcast #1.

Scaling Datastores at Slack with Vitess

Lots of juicy details in this one about the difficulty Slack has had in scaling their DB layer and how Vitess solved their problems.

Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón — Slack

I covered it in KubeWeekly last week, so I will skip it.

Mitigate Connection Leaks in Production via Proxies

Hitting file descriptor limits is such an annoying kind of outage. Some good tips here, clearly coming from hard-won experience.

Utsav Shah

It describes several approaches that can be used together to mitigate connection leaks, and the trade-offs between them.
○ Singletons/Dependency Injection
○ Client Count Metrics
○ File Descriptor Count Alerts
○ Sidecar Processes
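Two of the approaches above, a shared singleton client and a file descriptor count alert, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `Client`, `get_client`, `should_alert`, and the 80% threshold are hypothetical names and values, not taken from the article, and reading `/proc/self/fd` works on Linux only.

```python
import functools
import os
from typing import Tuple


class Client:
    """Stand-in for a real network client that holds a socket (hypothetical)."""


@functools.lru_cache(maxsize=None)
def get_client() -> Client:
    # Singleton: every caller shares one instance, so call sites cannot
    # accidentally open a new connection each time they need a client.
    return Client()


def should_alert(open_fds: int, soft_limit: int, threshold: float = 0.8) -> bool:
    """True when descriptor usage reaches the given fraction of the limit."""
    return soft_limit > 0 and open_fds >= soft_limit * threshold


def current_fd_usage() -> Tuple[int, int]:
    """Return (open_fds, soft_limit) for this process (Linux only)."""
    import resource  # Unix-only module, so it is imported where it is used

    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = current_fd_usage()
    print(f"{used}/{limit} descriptors in use, alert={should_alert(used, limit)}")
```

Alerting at a fraction of the limit rather than at the limit itself leaves time to react before the process starts failing to open sockets.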

Improving the Resiliency of Our Infrastructure DNS Zone

They used two providers synced with OctoDNS.

Ryan Timken and Kiran Naidoo — Cloudflare

It describes how to increase the reliability of your infrastructure’s DNS zones by combining Cloudflare’s own DNS products running at the edge with third-party DNS providers, using multiple primary name servers.
OctoDNS is used to manage the zones at multiple providers independently and simultaneously.

Root Cause Analysis For Reliability: A Case Study

This is all about understanding the whole system (people and technology) and building learning, rather than finding a superficial “root cause”.

Piyush Verma — Last9

Starting from the question “Why RCA (Root Cause Analysis) is necessary for reliability?”, the author explains the topic in the title based on his own experience.

Outages

Solana
Poloniex
New Zealand Reserve Bank
OneDayOnly
Local e-commerce site OneDayOnly is running Black Friday discount deals again today, after the shopping site was down for a few hours last Friday.
Infura
MobileCause
This outage occurred on Giving Tuesday, a very important day for nonprofits to raise funds.

KubeWeekly #243 December 11th, 2020

The Headlines

Editor’s pick of the highlights from the past week.

Kubernetes 1.20: The Raddest Release

Kubernetes 1.20 Release Team

The final Kubernetes release of 2020 is “the raddest release”, bringing 42 enhancements to the project as well as bug fixes and performance improvements. Check out the release notes or listen to the Kubernetes Podcast interview with the release team lead Jeremy Rickard.

This is the article of that title on the Kubernetes Blog at Kubernetes.io. It summarizes Kubernetes 1.20 under the headings “Major Themes / Major Changes / Other Updates / Release notes / Availability of release / Release Team / Release Logo / User Highlights / Project Velocity / Ecosystem Updates / Event Updates / Upcoming release webinar / Get Involved”, with explanations and links to related information for each.
○ 42 enhancements: 11 enhancements have graduated to stable, 15 enhancements are moving to beta, and 16 enhancements are entering alpha.
The Kubernetes Podcast episode “#131 Kubernetes 1.20, with Jeremy Rickard” by a Google employee is also linked.
The topics that interested me in the News of the week are as follows. AWS re:Invent related items are omitted here, but personally I check everything.
○ CNCF launches Cloud Native Security Whitepaper
○ Kuma 1.0
○ Anthos on bare metal is now GA
○ The “Release Logo” of Kubernetes 1.20 is a photo of a cat.

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Member webinar: Fundamentals of OpenTelemetry

Ted Young, Director of Developer Education @Lightstep

It tries to explain everything you need to know to get started with OpenTelemetry.

CNCF Member webinar: A look at how Hackers exploit Prometheus, Grafana, Fluentd, Jaeger & more (hacking monitoring for fun and profit)

Omer Levi Hevroni, Application Security Engineer @Snyk

They perform threat modeling of the tools in the title (Prometheus, Grafana, Fluentd, Jaeger & more) and check what risks they pose.
You will come away with ideas on how to better protect your monitoring infrastructure and how to perform threat modeling on your own systems.

CNCF Member webinar: Preventing Kubernetes misconfiguration: static analysis and beyond

Matt Johnson, Developer Advocate Lead @Bridgecrew

It describes best practices for creating, testing, and maintaining large-scale infrastructure using policy-as-code, both in CI/CD and on Kubernetes clusters at runtime, covering the following points.
○ Compare methods for securing infrastructure using open-source tools including Checkov
○ Review sample use cases that showcase the benefits of different approaches
○ Cover the current state of open source repositories and Kubernetes manifests found in the wild

CNCF Member webinar: SPIFFE and SPIRE in practice

Dan Feldman, Principal Software Engineer @Hewlett Packard Enterprise
Umair Khan, Product Marketing Lead @Hewlett Packard Enterprise

It introduces the key use cases of two projects (SPIFFE and SPIRE) and how they are used to build secure systems in some of the world’s largest and most security-conscious organizations.
You can learn if SPIFFE and SPIRE are right for your needs and how to use them to improve your organization’s security posture.

CNCF Member webinar: Metal³: Kubernetes-native bare metal host management

Maël Kimmerlin, Senior Software Engineer @Ericsson Software Technology
Feruzjon Muyassarov, Experienced Developer @Ericsson Software Technology
Pep Turro Mauri, Senior Software Engineer @Red Hat

It introduces the Metal³ (pronounced “metal kubed”) project and its motivation, and outlines what has been achieved so far.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Using Snyk and Podman to scan container images from development to deployment

Matt Jarvis, Red Hat

It focuses on the container scanning capabilities available via the “Snyk CLI” and how to integrate them with the new Podman API available in Podman 2.x and in Red Hat Enterprise Linux 8.3.
Together, Snyk and the Podman API provide container image scanning directly from the local command line, helping developers and administrators check images for vulnerabilities from the very start of the image development process.

Kubernetes: Efficient multi-zone networking with topology aware routing

Bob Killen, Google Cloud

I covered it in this section last week, so I will skip it.

OpenShift/Kubernetes failure stories at scale — Lessons learned from large and dense deployments

Naga Ravi Chaitanya Elluri, Red Hat

The author, a member of Red Hat’s Performance and Scalability team, describes what happened while the team was building tools, workloads, and automation to simulate real-world production environments and monitor cluster health. The following three scenarios are covered, including how each was debugged and how to prevent such situations.
○ Scenario 1: Rogue DaemonSet Took Down the 2000 Node Cluster
○ Scenario 2: Too Many Objects in Etcd Lead to Writes Being Blocked
○ Scenario 3: Hosting Etcd on Slower Disks Created Havoc on the Cluster

Automating volume expansion management — an operator-based approach

Raffaele Spazzoli, Red Hat

A blog post filed under the “Storage, How-tos, Operators, Prometheus, automation” categories on the Red Hat blog. A previous blog post described some best practices for metrics used when monitoring applications.
This time, they used the same metrics to construct dashboards and predict when something might break. They also discussed how, in certain cases, it is possible to preemptively intervene and prevent the failure from occurring (preventive maintenance).
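As a rough illustration of predicting when something might break, the sketch below fits a straight line to volume usage samples and estimates how long it will take until the volume is full. It is a minimal example under my own assumptions: the function name, the sample format, and the use of plain least squares are hypothetical and are not taken from the Red Hat post.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate seconds from the last sample until usage reaches capacity.

    samples are (timestamp_seconds, used_bytes) pairs. Returns None when
    there are too few samples or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: growth in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


if __name__ == "__main__":
    # Usage grows by 10 units per minute on a volume with capacity 100.
    history = [(0.0, 10.0), (60.0, 20.0), (120.0, 30.0)]
    print(seconds_until_full(history, capacity=100.0))
```

A preventive action, such as expanding the volume, could then be triggered whenever the estimate drops below a chosen lead time.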

OPA the easy way feat. Styra DAS!

Amey Deshmukh, InfraCloud Technologies

It is a hands-on walkthrough focused on configuring OPA as an admission controller for Kubernetes clusters using Styra DAS.

How to use a policy engine to improve your security posture

Nirmata

It touches on the fact that most of the recent security breaches are due to “misconfiguration” and explains the necessity and the role of the policy engine.
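To show what catching a “misconfiguration” with a policy can look like, here is a minimal Python sketch that checks a Pod-shaped dictionary against two rules. The rules and the function name are my own illustrative assumptions; real policy engines express such rules in their own policy languages and evaluate them at admission time or in CI.

```python
from typing import Any, Dict, List


def find_violations(pod: Dict[str, Any]) -> List[str]:
    """Return human-readable violations for a Pod-shaped dictionary.

    Hypothetical rules for illustration: no privileged containers, and
    every image must carry a tag other than "latest".
    """
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext") or {}
        if security.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        # Look only at the last path segment so a registry port is not
        # mistaken for an image tag.
        last_segment = container.get("image", "").rsplit("/", 1)[-1]
        if ":" not in last_segment or last_segment.endswith(":latest"):
            violations.append(f"{name}: image must be pinned to a specific tag")
    return violations


if __name__ == "__main__":
    pod = {
        "spec": {
            "containers": [
                {
                    "name": "web",
                    "image": "nginx:latest",
                    "securityContext": {"privileged": True},
                }
            ]
        }
    }
    for violation in find_violations(pod):
        print(violation)
```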

Service discovery in Kubernetes — combining the best of two worlds

Ivan Velichko, Booking.com

As the title suggests, it describes Kubernetes service discovery.
The disclaimer below lists the topics the article does not cover and the reasons why.
○ Disclaimer: This article intentionally omits the questions of external service (Service type ExternalName) discovering and discovering of the Kubernetes services from the outside world (Ingress Controller). These two deserve a dedicated article each.
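As a small companion to the in-cluster case the article focuses on, the sketch below builds the conventional DNS name of a Kubernetes Service and resolves it with the standard library. The helper names are my own, the default cluster domain `cluster.local` is an assumption that clusters can override, and the lookup only succeeds from inside a cluster where that Service exists.

```python
import socket
from typing import List


def service_dns_name(
    service: str, namespace: str = "default", cluster_domain: str = "cluster.local"
) -> str:
    """Build the fully qualified in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve_service(service: str, namespace: str = "default", port: int = 80) -> List[str]:
    """Resolve a Service name to its IP addresses (hypothetical helper)."""
    infos = socket.getaddrinfo(
        service_dns_name(service, namespace), port, proto=socket.IPPROTO_TCP
    )
    # Each entry ends with a socket address whose first field is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(service_dns_name("web", "shop"))
```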

Kernel privilege escalation: how Kubernetes container isolation impacts privilege escalation attacks

Kamil Potrec, Snyk

As the title says, it explains how Kubernetes container isolation impacts privilege escalation attacks.

GSoD 2020: Improving the API reference experience

Philippe Martin

As part of the “GSoD (Google Season of Docs) 2020” project, the author announces the results of improving the documentation of the Kubernetes API Reference.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Why Linkerd doesn’t use Envoy

William Morgan, Linkerd

While paying homage to Envoy, he explains the title “Why Linkerd doesn’t use Envoy” as follows.
○ Let me also state upfront that this is not an “Envoy sucks” blog post. Envoy is a great project, is clearly a popular choice for many, and we have nothing but respect for the fine folks who work on it. We recommend Envoy to Linkerd users every day in the form of ingress controllers like Ambassador, and there are production systems around the world today where you can find Envoy and Linkerd working side by side.
○ So in this article I’m going to do my best to lay out the reasons why in a frank and engineering-focused way. After all, Linkerd is built by engineers and for engineers, and if there’s one thing I’m proud of, it’s that we’ve made decisions on the basis of engineering tradeoffs rather than marketing pressure.
○ In short: Linkerd doesn’t use Envoy because using Envoy wouldn’t allow us to build the lightest, simplest, and most secure Kubernetes service mesh in the world.
The FAQ is also carefully written and I think it is a very conscientious article.

2021 Predictions: The year that cloud native transforms the IT core

Bill Mann, Styra

Articles with titles and contents like this make you feel that the year-end and New Year holidays are approaching. Styra’s top 5 predictions for 2021 are as follows. I can’t help but wonder about the numbers in the various data cited.

Kubernetes in production will continue to skyrocket, creating new challenges for security and compliance
We will see significant open source consolidation
Service mesh will become critical as enterprises scale microservices
Security and DevSecOps will see expanded responsibilities as new attack
There will be a complete transformation of the IT core

Upcoming CNCF webinars

You can check recorded webinars and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Member Webinar: Power to the people — making root/Docker a reality inside a Gitpod Container
Christian Weichel, Chief Architect @Gitpod
Alban Crequy, Director of Kinvolk Labs @Kinvolk
Dec 11, 2020 10:00 AM Pacific Time

Member Webinar: Implementing automated managed k8s service
Mason Choi, Senior Engineer @Samsung SDS
Kangsub Song, Senior Engineer @Samsung SDS
Dec 15, 2020 10:00 AM Korea Time

Member Webinar: Reducing your Kubernetes cloud spend
Webb Brown, CEO @Kubecost
Dec 15, 2020 10:00 AM Pacific Time

Member Webinar: Argo: Real Enterprise-scale with K8s
Al Kemner, Principal Software Engineer and Architect @New Relic
Daniel Jimbel, Staff Engineer @New Relic
Caleb Troughton, Product Manager, Telemetry Data Platform @NewRelic
Dec 16, 2020 7:00 AM Pacific Time

Member Webinar: Machine learning for K8s logs and metrics
Larry Lancaster, Founder and CTO @Zebrium
Dec 16, 2020 1:00 PM Pacific Time

Member Webinar: Kubernetes configuration — Auditing for enterprise best practices through open source tooling
Kendall Miller, President @Fairwinds
John Wynkoop, Cloud CTO @IGNW
Dec 18, 2020 10:00 AM Pacific Time

What did you think of these articles? Did any of them interest you?

Actually, there is some content I have not been able to digest at this stage, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara