SRE / DevOps / Kubernetes Weekly Collection #44 (Week 49)

Dec 8, 2020

In this blog post series, I go through the following three weekly mailing lists I subscribe to, leaving some comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #518 November 29th, 2020
SRE Weekly Issue #246 November 29th, 2020
KubeWeekly #242 December 4th, 2020

DEVOPS WEEKLY ISSUE #518 November 29th, 2020

News

A look at a number of different security incidents involving AWS customers, inferring patterns and making some recommendations.

The title is “Learning from AWS (Customer) Security Incidents”.
You can learn from security incidents at a large number of AWS customer companies.
These are not problems on the service provider's side, but misconfigurations on the user side or data taken out by insiders.
There is a lot to learn because the cases, trends, recommended settings, etc. are well summarized.

An up-to-date list of books to read on the various facets of devops. Most interesting is the choose-your-own-adventure tool to help with deciding what you should read next.

The title is “The DevOps Reading List: Choosing your next DevOps book”.
An article for choosing the next DevOps book to read. It has a Decision Tree.
I have already read some of them, but I would like to read all of the ones listed here.

Two good, detailed, writeups from the recent KubeCon virtual event. Key takeaways and reviews of a selection of the sessions on Kubernetes and related technologies.

Two detailed summary articles of KubeCon NA.
The first title is “KubeCon NA 2020 Key Takeaways: Platforms, Safety, and End Users”. It is organized around the following five takeaways.

Platforms as product: Self-service is essential
Safety (and speed): Move fast and don’t break things
DevSecOps: Security is a day 0 concern
Multi-cluster: The K8s future is connected
End user focus: Learning and telling stories is important

The second title is “KubeCon North America 2020 Wrapup”. It covers the event in chronological order. It includes tweets from the author and other attendees, which conveys a sense of being there.

A writeup of the recent AWS US-East-1 outage. Reading incident reports is always interesting, but few are at quite this scale.

The title is “Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region”.
AWS’s recent outage report in Northern Virginia.
I was watching the outage from a distance, but the following statements made it click for me and I understood how the services are linked.
○ CloudWatch uses Kinesis Data Streams for the processing of metric and log data.
○ Amazon Cognito uses Kinesis Data Streams to collect and analyze API access patterns.

How do we know that databases really behave as advertised at scale? The Jepsen testing tool is helping here, and the new checker described in a recent paper helps identify even more issues.

The paper featured this time is “Elle: inferring isolation anomalies from experimental observations”.

A KubeCon session on disaster recovery. Discussion of outbreaks, being on-call, emergency response and eventual prevention, with the interesting twist being a comparison of site reliability engineering and the ongoing pandemic healthcare response.

The title is “KubeCon: Lessons in Disaster Recovery from COVID-19 and Site Reliability Engineering”.
It discusses these themes along two axes: medicine and software.

Fire Drills: A Guide to Preparing for Your Next Incident shows you how to design and run a fire drill for your SRE team. In this Whitepaper from Elieser Pereira — Customer Reliability Engineer at Container Solutions, you will see the roles of the game master, the on-call engineer, a sample scenario and what questions to expect when all seems bleak during an outage. Download the free PDF.

This introduces a free guide by Container Solutions to running incident simulations (fire drills) for SRE teams.

Jobs

I’m on the lookout for an experienced product leader for a Director of Product Management role at Snyk. The job would involve owning the Snyk Container product, so a strong focus on Kubernetes and container security and on helping developers build secure applications. I’m mainly looking for candidates in the UK or Israel in order to be close to the engineering teams. More details on the Snyk jobs site.

This is a job posting from the company the editor works for. It is looking for a Director of Product Management in the UK or Israel. I would like to live and work in both countries for a while, but unfortunately neither the current situation nor my skill set is a match. Rather than relocating somewhere, I want to grow and broaden what I can accomplish with my own skills.

Events

The ThoughtWorks Technology Radar has been around for 10 years and covers emerging techniques, tools, platforms, languages and frameworks. Next week, on the 7th of December, join 2 of the people involved in putting together this year’s edition to hear about what’s new.

This introduces the webinar described above. I would like to attend, but I will skip it this time because it runs late at night in JST.

Tools

Lots of interest in user interfaces for Kubernetes at the moment, the latest being Headlamp. Interestingly you can run the headlamp dashboard locally or in-cluster. It also features a plugin system for extending its core functionality.

The web page for “Headlamp”, an easy-to-use and extensible OSS Kubernetes web UI.
A GitHub page is also available.
I would like to try it and compare it with IDE-style tools.

Cloudquery is a tool that queries the state of a public cloud infrastructure and saves a snapshot of that information in a SQL database for offline querying.

An OSS tool “cloudquery” that exposes cloud configurations and metadata as SQL tables and provides powerful analysis and monitoring without writing code.
Click here for GitHub page.

Setting up dashboards for every service can end up being a manual and repetitive process. Legend builds and publishes Grafana dashboards for your services with prefilled metrics and alerts, integrating with cloudwatch, prometheus, influxdb and more.

The GitHub page of “Legend”, a tool for creating and publishing Grafana dashboards for services with pre-filled metrics and alerts.

K6 is an interesting new load testing tool, written in Go but configured using a powerful Javascript DSL.

The GitHub page of the load testing tool “k6”, which is based on Load Impact’s many years of experience in the load and performance testing industry.

SRE Weekly Issue #246 November 29th, 2020

Articles

One Year of Load Balancing

DNS-based load balancing is a nice simple solution, but unfortunately it doesn’t work well in certain circumstances. Read to find out how Algolia evolved their load balancing system in response.

Paul Berthaux — Algolia

As the title and the summary above suggest, this article summarizes how the company's load balancing has kept evolving, starting from a setup with no load balancer at all.

Your Percentiles are incorrect P99 of the times.

We use percentiles all the time, so it’s really important to actually understand what they say (and what they don’t).

Piyush Verma — Last9

Thanks to an anonymous reader for this one.

The theme is latency, another SLI that is almost impossible to measure accurately.
An earlier post, “SLOs that lie”, covered how SLOs can mislead when the SLI was uptime.
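
As a note to self, here is a minimal Python sketch of one well-known percentile pitfall: averaging per-host p99 values does not give the p99 of the combined traffic. This is my own illustration rather than an example taken from the article. The host names and latency numbers are made up, and `percentile` is a simple nearest-rank helper defined here, not a library function.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
host_a = [10] * 980 + [50] * 20   # busy host: 1000 requests, mostly fast
host_b = [10] * 5 + [900] * 5     # quiet host: 10 requests, half very slow

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)

# Naive aggregation: average the per-host p99 values.
averaged_p99 = (p99_a + p99_b) / 2

# Correct aggregation: compute p99 over all requests together.
true_p99 = percentile(host_a + host_b, 99)

print(p99_a, p99_b, averaged_p99, true_p99)
```

The quiet host contributes only 10 of 1010 requests, yet it drags the averaged figure far away from what users actually experienced. That is why percentiles should be computed from the raw samples (or from mergeable histograms) rather than averaged.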

My journey to SRE into 2020 and beyond

The author started out as an embedded systems developer and moved into SRE. Here’s what they learned.

Eric Uriostigue — effx

It explains the author’s thoughts on the growth cycle as an SRE.

How to apologize for server outages and keep users happy

Some great tips here. It’s hard to sound sincere in a public incident report, especially if you post a lot of them.

Adam Fowler

It touches on the following three elements of apology defined by psychologists and explains them according to the theme.
○ Acknowledgement
○ Empathy
○ Resolution

Democratizing Fare Storage at scale using Event Sourcing

In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.

Sourabh Suman — Grab

The content is as described above. Figures are used effectively, making it easy to read.

Simple streaming telemetry

This article covers Netflix’s gNMI-gateway, their open source tool for collecting metrics from network devices in a highly available and fault-tolerant manner.

Colin McIntosh and Michael Costello — Netflix

In the context of content distribution, it touches on the CDN “Open Connect” at the beginning, then explains the background of the OSS project “gNMI Gateway”, why it was created, and how readers can use it to monitor their own networks.

A guide to the reliability talks at AWS re:Invent

This year, re:Invent is online only, so you still have a chance to attend if you’re interested.

Ana M Medina — Gremlin

As the title suggests, this is a guide to AWS re:Invent, with each talk covered under the following points.
○ Talk
○ Why I’m Excited
○ Options to attend

A Byzantine failure in the real world

Cloudflare’s API service was impaired early this month. This is their incident report that describes a grey failure in a switch and downstream impact to etcd and their database system.

Tom Lianza and Chris Snook — Cloudflare

As the title and the summary above indicate, it walks through the failure that occurred at Cloudflare on 11/2 in chronological order, starting from the premise “Redundancy at each layer is something we review and require” and asking how things could still go wrong.

Outages

Slack
Giphy
Spotify
Currys PC World
By Dash
Amazon Prime Video
AWS
This link points to Amazon’s detailed report on the outage.

KubeWeekly #242 December 4th, 2020

The Headlines

Editor’s pick of the highlights from the past week.

Don’t Panic: Kubernetes and Docker

Kubernetes is deprecating Docker as a container runtime after v1.20.

tl;dr Docker as an underlying runtime is being deprecated in favor of runtimes that use the Container Runtime Interface (CRI) created for Kubernetes. Docker-produced images will continue to work in your cluster with all runtimes, as they always have.

The explanation above is sufficient, but if this news is new to you, I recommend reading the following article as a supplement. You can read it as it is, but the keyword is dockershim: “Docker (dockershim) used as a runtime” and “Docker used in a workflow” are different things.
○ Kubernetes Docker Deprecated Wait, Docker is deprecated in Kubernetes now? What do I do?
In response to this turmoil, Docker and Mirantis, which acquired Docker’s enterprise division, announced that they will maintain dockershim. The Mirantis Container Runtime (MCR), a runtime for Mirantis users, is CRI compliant.
Outside Kubernetes, vendors are already moving to replace their runtimes as well.
For example, ECS Fargate improved its container runtime by switching from Docker to containerd in platform version (PV) 1.4.0. There are quite a few AWS and ECS Fargate users, but I have not observed any noticeable commotion or impact. I also just went ahead and verified the differences myself. That release included major changes besides the runtime, such as EFS support, jumbo frame support, and the replacement of the ECS agent with the Fargate agent.
○ Containerd is replacing Docker as the container runtime

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Member webinar: Next-gen data protection for next-gen applications

Deepak Verma, Director of Product Strategy @Zerto

It shares the key findings that shaped the cloud-native data protection architecture “data protection as code” and points to consider when designing.

CNCF Member webinar: Discover, analyze, and secure your APIs…anywhere

Pranav Dharwadkar, VP of Products @Volterra.io
Jakub Pavlik, Director of Engineering @Volterra.io

It explains real-world security attacks on applications and APIs along with four key tenets, and shares the following information.
○ Real-world examples of App/API security attacks that are not blocked by network-level or WAF security tools.
○ Four key tenets that users should look for in any App/API security solution
○ How users can utilize the four key tenets to solve their App/API security needs.

CNCF Member webinar: Take the pain out of multi-cluster Kubernetes with Lens

Edward Ionel, Developer Relations @Mirantis

Centered on demos, it covers tips, tricks, and best practices for using Lens in a multi-cluster environment, along the following points. At the beginning the presenter struggles because the slides do not advance smoothly, so it seems better to skip that part.
○ Navigating cluster-sprawl and switching between KUBECONFIGs and clusters
○ Using the “smart console” to seamlessly move from cluster to cluster, even where different client versions are required
○ Quickly searching or browsing specific Helm charts for kubernetes-deployable services, with one click install functionality that deploys the Helm chart directly into the cluster

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Kubernetes: Efficient multi-zone networking with topology-aware routing

Google Open Source Blog

It explains the issues that Topology Aware Routing of Services is meant to solve, how to use it, and future plans.

Geographically distributed stateful workloads Part Two: CockroachDB

Raffaele Spazzoli, Red Hat OpenShift

Part 2 of the series. It describes the deployment of CockroachDB on OpenShift across multiple regions.
It also load-tests the deployment to validate it and ensure it is suitable for enterprise workloads.

Deployments without YAML using Ketch

Saiyam Pathak, Civo

It explains the concept of “Ketch” and, as the title suggests, uses it to deploy to Civo’s managed k3s cluster without YAML.

Kubernetes self remediation (AKA Poison Pill)

Nir Yehia, Red Hat OpenShift

Since everyone wants their applications to run in a highly available manner on Kubernetes, “Poison Pill k8s Node Remediation” is introduced as a recovery measure for stateful workloads that are affected when the underlying infrastructure fails.

k0s — Yet another Kubernetes distro

Saiyam Pathak, Civo

I had already seen this article about “k0s”, so I assumed I had covered it in this blog, but it seems I had not yet. It briefly explains the architecture, installation method, and a comparison with k3s, as of 11/16.

Shining a light on the Kubernetes user experience with Headlamp

Kinvolk blog

Since it is covered in DEVOPS WEEKLY ISSUE #518 above, I will skip this.

Scaling datastores at Slack with Vitess

Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón, Slack

It provides an overview of the design considerations and technical challenges associated with selecting and adopting Vitess, as well as current usage of Vitess.

Kubernetes distros with flavour of fleet

Saiyam Pathak, Civo
Darren Shepherd, SUSE

For more than two hours, they talk with Darren, who moved from Rancher to SUSE through the acquisition, as a guest.
The YouTube video has timestamps for each topic, so you can watch just the topics that interest you.
To give a small sample, the topics include kine, k3s debugging, k3c, k3v, and k0s.

YAML tips for Kubernetes

Salman Iqbal

The content of the title is explained in the YouTube video.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

GitOps decisions

Ian Miell

It follows a company that adopts GitOps as a small, scrappy startup and grows into a regulated multinational, and discusses some of the challenges along the way.

How Kubernetes accelerates multiplayer game development

Emily Omier, The New Stack

As the title suggests, it explains how Kubernetes shortened the development period of multiplayer games, what problems came up, and how they were solved by using “Agones”.

Kubernetes will deliver the app store experience for enterprise software, says Weaveworks CEO

Matt Asay, TechRepublic

Regarding Kubernetes, the following analogies were easy to understand, but I cannot picture how simple it will actually become. Will it come with a heavily abstracted, app-like UI that guides users through the key points, and will it be able to show recommended settings and the like?
○ “Kubernetes for the last few years has been a bit like buying a mobile phone in 1993 when it was too heavy to pick up”
○ “Pretty soon it will be like iPhones and wristwatches where everything is slick, easy, and simple because the industry is pushing it in that direction.”
Finally, the author discloses his position and notes that the views are his own, as follows.
○ Disclosure: I work for AWS, but the views expressed herein are mine. I am not part of the AWS team responsible for Amazon EKS Distro.

Understanding Kubernetes Operators

Agustin Romano, Caylent

It describes the “Kubernetes Operator” from the perspective of a tool designed to drive automation.

Kubernetes Operators: Automating complex application lifecycles

Kubermatic

Another article for “Kubernetes Operator”. The life cycle of the application is explained from the viewpoint of automation with the following points.
○ Why are Kubernetes Operators Important?
○ How Does a Kubernetes Operator Work?
○ Kubernetes Operator Example
○ Best Practices for Writing Kubernetes Operators
■ Use the Operator SDK
■ Keep Reconcile Functions Clean
■ Modify One Custom Resource at a Time
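
As a memo for the “How Does a Kubernetes Operator Work?” point, the core idea can be sketched as a reconcile function that compares desired state with observed state and returns the actions needed to converge. This is a toy in plain Python under my own assumptions, not the Operator SDK or any real Kubernetes client API. `Desired`, `reconcile`, and the pod names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Desired:
    """Hypothetical stand-in for the spec of a custom resource."""
    replicas: int


def reconcile(desired: Desired, observed: List[str]) -> List[Tuple[str, str]]:
    """Return the actions that move the observed pods toward the desired count.

    The function is pure: it only decides what to do and leaves the side
    effects to the caller, which keeps the reconcile logic easy to test.
    """
    missing = desired.replicas - len(observed)
    if missing > 0:
        # Too few pods: create the difference.
        return [("create", f"new-{i}") for i in range(missing)]
    if missing < 0:
        # Too many pods: delete the extras in a deterministic order.
        return [("delete", name) for name in sorted(observed)[desired.replicas:]]
    # Already converged: nothing to do.
    return []


print(reconcile(Desired(replicas=3), ["a"]))
print(reconcile(Desired(replicas=1), ["b", "a", "c"]))
print(reconcile(Desired(replicas=2), ["a", "b"]))
```

Calling `reconcile` repeatedly with the latest observed state is idempotent once the states match, which is the property a real operator's control loop relies on.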

How to build a CI/CD process that deploys on Kubernetes and focuses on developer independence

Liron Cohen, Riskified

It elaborates on the title through the following questions.
○ So how can you build a proper CI/CD process that focuses on developer independence?
○ How can you, as a DevOps Engineer, give the proper tools to the developers in your company, so they can build and operate their CI/CD processes on their own?

Upcoming CNCF webinars

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time.

Member Webinar: A look at how hackers exploit Prometheus, Grafana, Fluentd, Jaeger & more
Omer Levi Hevroni, Application Security Engineer @Snyk
Dec 8, 2020 10:00 AM Pacific Time

Member Webinar: Preventing Kubernetes misconfiguration: Static analysis and beyond
Matt Johnson, Developer Advocate Lead @Bridgecrew
Jakub Pavlik, Director of Engineering @Volterra.io
Dec 9, 2020 7:00 AM Pacific Time

Dan Feldman, Principal Software Engineer @Hewlett Packard Enterprise
Umair Khan, Product Marketing Lead @Hewlett Packard Enterprise
Dec 9, 2020 1:00 PM Pacific Time

Member Webinar: Metal³: Kubernetes-native bare metal host management
Maël Kimmerlin, Senior Software Engineer @Ericsson Software Technology
Dec 10, 2020 10:00 AM Pacific Time

Member Webinar: Power to the people — making root/Docker a reality inside a Gitpod Container
Christian Weichel, Chief Architect @Gitpod
Alban Crequy, Director of Kinvolk Labs @Kinvolk
Dec 11, 2020 10:00 AM Pacific Time

Member Webinar: Implementing automated managed k8s service
Mason Choi, Senior Engineer @Samsung SDS
Kangsub Song, Senior Engineer @Samsung SDS
Dec 15, 2020 10:00 AM Korea Time

Member Webinar: Reducing your Kubernetes cloud spend
Webb Brown, CEO @Kubecost
Dec 15, 2020 10:00 AM Pacific Time

Member Webinar: Argo: Real Enterprise-scale with K8s
Al Kemner, Principal Software Engineer and Architect @New Relic
Daniel Jimbel, Staff Engineer @New Relic
Caleb Troughton, Product Manager, Telemetry Data Platform @New Relic
Dec 16, 2020 7:00 AM Pacific Time

What did you think of these articles? Did any of them catch your interest?

To be honest, there is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself as well.

Bye now!!

Yoshiki Fujiwara