SRE / DevOps / Kubernetes Weekly Collection#25(Week 30)

Aug 11, 2020

In this blog post series, I collect the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #499 July 19th, 2020
SRE Weekly Issue #228 July 19th, 2020
KubeWeekly #226 July 24th, 2020

DEVOPS WEEKLY ISSUE #499 July 19th, 2020

News

The fundamental importance of learning from incidents in building resilient systems is often hard to fully understand when fighting to fix issues. This presentation neatly summarises a bunch of recent research.

The title is “Presentation: “Findings From The Field””.
Slides from a talk the author gave a few weeks ago at the virtual event “DevOps Enterprise Summit London”. The video has not been released yet.

I recognized many of the phrases, and indeed it’s from John Allspaw, whom I’ve covered in this blog before. It summarizes research findings from the two years he has spent working closely with incidents in the field.

A good post on migrating to the cloud via a lift and shift strategy. When is this the best choice, and what do you need to do to make it work in the long term.

The title is “BRAZEAL: THE LIFT-AND-SHIFT SHOT CLOCK”.
For migrating to the cloud with “lift-and-shift”, it explains how the balance between value and danger shifts over time, why the true cloud transformation should not be put off, and the following three points.

You need SRE
You need governance
You need a culture of comprehensive learning

A presentation on the impact of uncertainty on operating systems. Elasticity, scalability, devops practices, playbooks; lots of good operations topics covered.

The title is “OPERATING UNDER UNCERTAINTY”.
A page that publishes the video and slides from the webinar of the same title.
The presenter, a certified AWS/Kubernetes partner, explains AWS best practices and architecture. The topics are very well organized, and I recommend it.

What is SRE? This post discusses the evolution of SRE, the different component parts, popular team structures and how to get started.

The title is “What is SRE?”
An article summarizing the elements of SRE and how to get started. You can also check useful sources of information.

A tutorial on using Conftest and Open Policy Agent to test Dockerfiles for common security problems.

The title is “Dockerfile Security Checks using OPA Rego Policies with Conftest”.
It also introduces a Katacoda scenario where you can learn and practice hands-on.

A set of patterns for designing continuous delivery pipelines. Configuration in code, separating build and release, the importance of audit trails and more useful tips.

The title is “7 Pipeline Design Patterns for Continuous Delivery”.
As the title says, the article describes seven design patterns for CD (Continuous Delivery) that help an organization make a huge leap in speed and stability and help a team perform at an elite level.

The question of whether you should directly call one Lambda function from another comes up regularly in Serverless architecture conversations. This post has some tips why this isn’t always a good idea and when to avoid.

The title is “Are Lambda-to-Lambda calls really so bad?”
The author examined the question in the title and concluded “It depends!”. The post provides a decision tree to help readers decide.

A discussion of the benefits of structured logging, with a good Python and Elasticsearch example.

The title is “Logging — let’s do it right!”
After moving to a startup, the author noticed that his past logging method was wrong, and he explains the past and present methods with concrete examples.
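To illustrate what structured logging means in general, here is a minimal sketch using only Python's standard library. It is not the code from the linked article; the `JsonFormatter` class, the `build_logger` helper, and the `context` key are hypothetical names chosen for this example, and shipping the output to Elasticsearch is left out.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes record.context.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.propagate = False     # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", sys.stdout)
    # Fields are logged as data instead of being baked into the message text.
    log.info("payment failed", extra={"context": {"order_id": 42, "reason": "card_declined"}})
```

Because every event is a JSON object with named fields, a log store can filter on `order_id` or `reason` directly instead of parsing free-form message strings.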

Tools

Terraform CDK allows for writing Terraform code using Python or TypeScript, rather than HCL.

The GitHub page of the CDK (Cloud Development Kit) for Terraform, an OSS tool that allows developers to define cloud resources in a programming language using Terraform.

SRE Weekly Issue #228 July 19th, 2020

Articles

Change Advisory Boards Don’t Work

They don’t. They just don’t.

[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.

Alex Yates — Octopus Deploy

The author explains his idea under the title “Change Advisory Boards Don’t Work”. A CAB (Change Advisory Board / Change Approval Board) reviews changes to the production environment.
The following sentence is painful, but it could be true.
○ It’s a noble goal but, unfortunately, CABs mostly do more harm than good.
Finally, he gives advice on "What to do if your organization is thinking about forming a CAB" and "What to do if you already have a CAB".

Google Cloud Issue Summary: Gmail 2020-06-30

Whoops, forgot to include this one last week.

On June 30, Google’s email delivery service was targeted in what we believe was an attempt to bypass spam classification. The result was delayed message processing and increased message queuing.

A follow-up article on Google Cloud’s Gmail issue.
The issue caused delays in sending and receiving emails and an increase in the number of emails judged as spam.

Postmortems and More With J. Paul Reed

My favorite part is the focus on blame awareness:

But it’s not enough to just be blameless — it’s also important to be blame-aware. Being blame-aware means that we are aware of our biases and how they may impact our ability to view an incident impartially.

Isabella Pontecorvo — PagerDuty

An interview in which J. Paul Reed (Senior Applied Resilience Engineer at Netflix) talks about postmortems, best practices, and the steps you can take to succeed.
It’s a good idea to vote for the top five follow-up actions after a postmortem and look back six weeks later. This gives a sense of accomplishment and keeps actions from being missed. Before that stage, it is essential to focus on “how an incident was triggered” instead of “who caused it” in order to list the elements of the outage.

Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix

Netflix has a team dedicated to the overall reliability of their service.

Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.

Hank Jacobs – Netflix

A tech blog post in which Hank Jacobs (Senior Site Reliability Engineer) explains the company’s centralized SRE practice. He belongs to the CORE (Critical Operations and Reliability Engineering) team, which is responsible for the reliability of the entire Netflix service.
The “Service Ownership Model” and the composition and roles of the CORE team are very interesting.

The highlight of this article for me is that it says “Incident management at Netflix doesn’t follow common management practices like the ITIL model.” and then describes their own model.

What is SRE?

Another good reference if you’re looking to bootstrap SRE at your organization.

Rich Burroughs — FireHydrant

I will skip it because I mentioned it in DEVOPS WEEKLY ISSUE #499 above.

The Tail at Scale Approximation

Bill Duncan’s back with an easy and very close approximation for the “Tail at Scale” formula. The question it answers is: how many nines do you need on all of your backend microservices for X nines on the frontend?

Bill Duncan

It gives a quick and easy approximation to the probabilistic formulas for measuring customer experience described in the author’s two articles, which I’ve previously covered in this blog.
○ The Tail at Scale
○ The Tail at Scale Revisited
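
To make the question concrete, here is a minimal sketch of the kind of calculation involved. It is not necessarily Bill Duncan's exact formula. It assumes that a frontend request succeeds only if all `n_backends` independent backends succeed, so that frontend availability equals `p ** n_backends`. All function names are hypothetical.

```python
import math


def backend_availability_exact(frontend_target: float, n_backends: int) -> float:
    """Per-backend availability p that satisfies p ** n_backends == frontend_target."""
    return frontend_target ** (1.0 / n_backends)


def backend_availability_approx(frontend_target: float, n_backends: int) -> float:
    """First-order approximation: split the frontend failure budget evenly."""
    return 1.0 - (1.0 - frontend_target) / n_backends


def nines(availability: float) -> float:
    """Convert an availability such as 0.999 into its number of nines."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    # Assumed inputs: three nines on the frontend, 100 backends per request.
    target, n = 0.999, 100
    exact = backend_availability_exact(target, n)
    approx = backend_availability_approx(target, n)
    print(f"exact:  {exact:.9f} ({nines(exact):.3f} nines)")
    print(f"approx: {approx:.9f} ({nines(approx):.3f} nines)")
```

Under these assumptions, each backend needs roughly `log10(n_backends)` more nines than the frontend target, so the approximation is useful as a quick rule of thumb.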

The Essential List of Top SRE Resources

Tons of great links in here with enticing descriptions to make you want to read them. Includes books, tools, hiring, certification, and general SRE goodness.

Emily Arnot — Blameless

This blog post lists important SRE resources for those who want to “look to get up to speed on SRE fundamentals with the best SRE books and best DevOps books” or hope to “expand its SRE knowledge into new domains”.
I appreciate this comprehensive post including tools and adoption as topics.

Advocating for a Product Mindset within Platform Teams and How We Do it at HelloTech (Part 1)

SRE is all about keeping the user experience working, and working with product-focused folks can really help. For more on this, check out my former coworker Jen Wohlner’s awesome SRECon19 talk on SRE & product management.

Samantha Coffman — HelloFresh

A two-part article that describes HelloTech’s efforts to advocate a product mindset within their platform team.
As background, it first notes that “Having mastered Product Management in their end products and realised the benefit of improving the customer experience, more companies are shifting their attention to how they can apply the same techniques to the rest of their organisation.”
Part 1 covers the following three points.

What is Wrong with Traditional Platform Teams Without Product Representation
The Benefits of Adopting a Product First Mindset for Platform at HelloTech
Best Practices from HelloTech for Adopting a Product Platform Mentality

Outages

Cloudflare
○ Cloudflare had a 50% drop in traffic served by their network subsequent to a BGP issue. Linked is their analysis including snippets of router configurations. Lots of services suffered contemporaneous outages possibly stemming from Cloudflare’s, including Discord, Postmates, Hosted Graphite, and DownDetector. John Graham-Cumming — Cloudflare
Twitter
○ Twitter had a major security breach, and as part of their response, they temporarily cut off large parts of their service. Click for their post about what happened.
GitHub
WhatsApp
Hulu
Snapchat
Microsoft Outlook
○ Notably, the outage involved the Outlook application that people run on their computer, not the cloud version.
Fastly
○ Also a control plane incident later that day. Full disclosure: Fastly is my employer.

KubeWeekly #226 July 24th, 2020

The Headlines

Editor’s pick of the highlights from the past week.

Advanced Cloud Engineer Bootcamp makes it simple for IT pros to learn cloud

CNCF staff

Following the successful launch of the Cloud Engineer Bootcamp last month, The Linux Foundation and CNCF heard from many sysadmins, developers, engineers, and others who wanted a similarly structured program to help them learn the skills necessary to move into cloud engineering but with more advanced training. These individuals did not need the beginner training courses included in the Cloud Engineer Bootcamp.

That led us to launch a new Advanced Cloud Engineer Bootcamp. The program includes six training courses and registration for the Certified Kubernetes Administrator (CKA) exam, along with dedicated online support forums and weekday live video chat with instructors. Designed for working professionals, the program can be completed in as little as six months with around 10 hours per week of study time.

Learn more about the program and sign up today!

Following the Cloud Engineer Bootcamp in June, CNCF launched the Advanced Cloud Engineer Bootcamp with its own set of training courses. The regular price of $999 was discounted to $599 until July 31st (at that moment).

KubeCon + CloudNativeCon EU Virtual Session Spotlight

The countdown to KubeCon + CloudNativeCon EU Virtual on August 17–20, 2020 is on! As we approach the event, we curated a few recommended sessions that we don’t want you to miss. Please see the feature for this week and be sure to register today!

Tutorial: KubeEdge Hands on Workshop — Build Your Edge AI App on Real Edge Devices
Presented by: Zefeng Wang, Huawei & Zhang Jie, China Unicom

This workshop is intended to invite participants to get hands on experience building a real edge computing solution with KubeEdge, end-to-end.

Starting from deploying and provisioning an edge node(e.g. Raspberry Pi), followed with device modeling and connectivity setup, then building a video stream machine learning based solution.

Through this exercise, participants will get first hand experience to understand the orchestration engine build on top of Kubernetes, understand the edge computing node setup mechanism, learn the device modeling concept for IoT Edge scenarios. And develop a state-of-art AI based video stream processing flow, all in a 30 minutes session.

KubeCon + CloudNativeCon EU Virtual highlights the “Tutorial: KubeEdge Hands on Workshop — Build Your Edge AI App on Real Edge Devices” session. Schedule: 8/17 (Monday) 16:55–18:15 CEST (Central European Summer Time).
The session has the following three purposes.

Understanding the orchestration engine built on top of Kubernetes
Understanding the edge computing node setup mechanism
Learning the device modeling concepts for IoT Edge scenarios

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Project Webinar: Fluent Bit v1.5

Eduardo Silva, Principal Engineer @Treasure Data, Masoud Koleini, Staff Research Software Engineer @Arm & Wesley Pettit, Software Developer Engineer @AWS

It introduces the Kubernetes logging and performance improvements and the new features included in the new major v1.5 release of Fluent Bit.
It dives into the new features of this major release, which include performance improvements and new connectors for Google Stackdriver, Amazon CloudWatch, LogDNA, New Relic, and PostgreSQL.

CNCF Member Webinar: Learn how to clean up your cloud-native “DevOps Dumping Ground”

Melissa Sussmann, Product Marketing Lead @Puppet & Kenaz Kwa, Principal Product Manager @Puppet

The explanation is based on the following points.
For a lot of organizations, home-grown glue logic is inconsistent, not repeatable, and expensive to maintain across hundreds of event-based workflows and thousands of combinations.
They believe that the answer lies in automation workflows. In particular, workflows-as-code that can be triggered by events.
They want to replace engineers’ home-grown digital duct tape with reusable, event-driven workflows.

CNCF Member Webinar: Kubernetes Security Anatomy and the Recently Disclosed CVEs

Gadi Naor, CTO & Co-Founder @Alcide

He touched on two CVE disclosures recently made by the community (CVE-2020-8555, CVE-2020-8552) and reviewed a holistic, preventative approach to Kubernetes security and how it can be used to detect and prevent exploits of this kind and others.
I’ll take a closer look at this later.

CNCF Member Webinar: Implementing Canary Releases on Kubernetes w/ Spinnaker, Istio, and Prometheus

Oleg Chunikhin, CTO @Kublr

It describes implementing a canary release on Kubernetes using Spinnaker, Istio, and Prometheus. It’s a one-hour session with plenty of demos, which start around the 23-minute mark.

CNCF Member Webinar: Kubernetes Secrets Management: Build Secure Apps Faster Without Secrets

Jody Hunt, Director of DevOps Security @CyberArk

It describes best practices and secret management challenges for securing application access within Kubernetes. It’s easy to understand through the explanation and the demo.

CNCF Ambassador Webinar: Building application management platform with Open Application Model

Lei Zhang (Harry), Staff Engineer @Alibaba

The language for this webinar is Chinese.
It introduces the practices and principles for building an application management platform using Kubernetes.

CNCF Member Webinar: Observability of multi-party computation with OpenTelemetry

Antoine Toulme, Engineering Manager @Splunk & Dave McAllister, Sr. Technical Evangelist @Splunk

It introduces reference architectures for integrating technologies such as Kubernetes, OpenTelemetry Collector, and Hyperledger Fabric.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Forwarding Logs to Splunk Using the OpenShift Log Forwarding API

Andrew Block, Red Hat

Introducing the “OpenShift Log Forwarding API,” which makes it easier to integrate OpenShift and Splunk.
It manages the life cycle of Fluentd transfer instances in combination with the Helm chart. Instead of managing complex configurations, both OpenShift administrators and end users can easily deploy the entire solution so they can focus on their business-critical tasks.

Best practices for alerting on Kubernetes

Jorge Salamero Sanz, Sysdig

Step-by-step instructions on best practices for alerting on the Kubernetes platform and orchestration. It contains examples of PromQL alerts.
If you’re new to Kubernetes and monitoring, the article recommends the earlier article, Monitoring Kubernetes in Production.

Introduction to Istio access control

Marton Sereg, Banzai Cloud

Banzai Cloud has written about Istio’s connectivity, monitoring, and safety, but not much about access control. To fill that gap, this article describes Istio’s access control model, AuthorizationPolicies.

CRD is just a table in Kubernetes

Hiro Osaki, IT Next

As titled, it says that “CRD (Custom Resource Definition) is just a Kubernetes table”, and explains carefully using database diagrams, YAML, and CLI screens.

Shell-operator

Shell-operator is a tool for running event-driven scripts in a Kubernetes cluster

The GitHub page for the “Shell-operator”, an OSS tool for event-driven scripting on Kubernetes clusters. It provides an integration layer between Kubernetes cluster events and shell scripts, treating the scripts as hooks triggered by events.

Diving Into Istio 1.6 Certificate Rotation

Christian Posta, Solo.io

It explains some common use cases for Istio and some good practices to help with things like rotating root certificates, intermediates, and, if necessary, the other certificates for your Certificate Authority.
Commentary videos for Parts 0 to 4 are embedded. I would like to review this along with the videos while getting my hands dirty.

Hidden Gems of Kops — Kubernetes Deployment Tool

Rafael Nunes

It briefly describes hidden gems of kops that may help in provisioning Kubernetes clusters.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

GKE best practices: Exposing applications through Ingress and Services

Harrison Sweeney and Mark Church, Google Cloud

It walks through the various factors to consider when publishing an application on GKE.
It describes how each factor affects the exposure of your application and highlights which networking solution each requirement leads to.
Assuming you’re familiar with Kubernetes concepts such as deployments, services, and ingress resources, it distinguishes between different exposing methods, from inside to outside, to multi-cluster.

Instrumentation and cAdvisor, with David Ashpole

Adam Glick and Craig Box, Kubernetes Podcast from Google

The Kubernetes Podcast, hosted by Google employees. The current co-hosts are Craig Box and Adam Glick.
The guest is Google Cloud’s David Ashpole (TL of Kubernetes SIG Instrumentation and maintainer of cAdvisor).
The topics that interested me in the News of the week are:
○ Spring Cloud Data Flow for Kubernetes from VMware; part of the Spring Runtime package
○ Custom Pod Autoscaler (and docs) by Jamie Thompson
○ Ingress support added to AWS App Mesh
○ Threat Alert: Attacker Building Malicious Images Directly on Your Host from Aqua Security

Traffic Director and gRPC — proxyless services for your service mesh

Stewart Reichling and Srini Polavarapu, Google Cloud

It introduces Traffic Director, a control plane managed by Google Cloud to solve the first barrier to service mesh adoption.

Implementing a GitOps UI with Spotify’s Backstage

Chanwit Kaewkasi, WeaveWorks

It introduces how to use Backstage, Spotify’s open source framework, to create a GitOps plugin with a UI that you can provide through the developer portal.

Explaining Kubernetes in 10 minutes using an analogy

Satyajit Das, Red Hat

To explain how Kubernetes works, it uses an analogy, the story of “using a rental house”.

Introducing Gloo Federation for Multi-Cluster API Gateway Management

Idit Levine, Solo.io

An introductory article on “Gloo Federation”, a new management feature for the multi-cluster API gateway.
It has two goals: to simplify and globalize the management of multiple Gloo instances across multiple clusters, and to provide advanced multi-cluster control capabilities, with the following characteristics:
Global Dashboard
Federated Configuration
Multi-Cluster Failover Routing
Location-Based Routing
Role-Based Access Control

Cloud Native Computing Foundation Takes Charge of Red Hat’s Operator Framework

Mike Melanson, The New Stack

An article about the history and future of Red Hat’s “Operator Framework”, which was donated to the CNCF.
Operator Hub is separate from the Operator Framework, and there are no plans to donate it to the CNCF at this time.

Announcing Clutch, the Open-source Platform for Infrastructure Tooling

Daniel Hochman and Derek Shaller, Lyft

An announcement for open sourcing “Clutch,” an extensible UI and API platform for Lyft’s infrastructure tools.

6 Kubernetes workflows and processes you can automate

Kevin Casey, Enterprisers Project

Introducing the following six workflows and processes that can be automated with Kubernetes.

App setup/Installation
Pod and node scaling
Persistent storage management
Chaos testing
Deployment and versioning of Custom Resource Definitions (CRDs)
Container and Kubernetes security

Managing Day 2 in a Hybrid Cloud Environment

Emily Omier, Nirmata

It explains why appropriate tools are needed to avoid the “unmanageable Day 2 nightmare”, where Day 2 means the operation phase after containers and Kubernetes have been introduced.
Lastly, the company ties this to its own dashboard, introducing it as easy to understand and easy to control.

Linkerd Case Studies

William Morgan, Buoyant

It introduces case studies of the adoption of Linkerd by four companies: Nordstrom, finleap connect, Paybase, and Subspace. The case studies highlight three main themes:

Security is a driving force behind service mesh adoption.
Istio is the starting point, but Linkerd is the final destination.
Latency matters, and the service mesh can help.

Conftest joins the Open Policy Agent project

Gareth Rushgrove, Open Policy Agent maintainer

Gareth Rushgrove, a maintainer of OPA (Open Policy Agent), announced that Conftest has joined the OPA project.

The Future of Instrumentation Is Open: Introducing Open Source Agents and Projects at New Relic

Ramon Guiu, New Relic

New Relic’s announcement to make their agents, integrations, and SDKs available under an open source license.
It says that “Starting today, our agents for C, Go, .NET, Node, Python, and Ruby, as well as the Infrastructure agent, Infrastructure integrations, the Infrastructure Integrations SDK, and Telemetry SDKs are available and open to contributions in New Relic’s GitHub organization”.
The remaining agents will be available soon — Java in September, PHP in October, and Browser and Mobile in 2021.

Upcoming CNCF webinars

You can check some Recorded Webinars and Upcoming Webinars here. The following are posted as Upcoming CNCF webinars at that moment.

Member Webinar: One large cluster or lots of small ones? Pros, cons and when to apply each approach
Flavio Castelli, Distinguished Engineer @SUSE
July 24, 2020 10:00 AM Pacific Time

Member Webinar: Kubernetes Policies 101
Eran Leib, Founder, VP Product Management @Apolicy
Spenser Paul, Director of Sales, North America @DoiT International
July 28, 2020 10:00 AM Pacific Time

Member Webinar: GitOps Continuous Delivery with Argo and Codefresh
Dan Garfield, Chief Technology Evangelist @Codefresh
July 29, 2020 1:00 PM Pacific Time

Member Webinar: Cluster API — Yesterday, Today, Tomorrow
Saad Malik, CTO & Co-Founder @Spectro Cloud
Jun Zhou, Chief Architect @Spectro Cloud
July 30, 2020 10:00 AM Pacific Time

Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
TiKV team
July 31, 2020 10:00 AM Pacific Time

Member Webinar: Comparing eBPF and Istio/Envoy for Monitoring Microservice Interactions
Roko Kruze, Solutions Engineer @Flowmill
Mike Cohen, Co-Founder and COO @Flowmill
Aug 4, 2020 10:00 AM Pacific Time

Member Webinar: Debugging your debugging tools; What to do when your service mesh goes down in production?
Neeraj Poddar, Co-founder and Chief Architect @Aspen Mesh
Aug 5, 2020 7:00 AM Pacific Time

Member Webinar: Making Data Work for Developers with Kubernetes & Cassandra
Chris Splinter, Sr. Product Manager — Developer Solutions @DataStax
Patrick McFadin, VP of Developer Relations @DataStax
Aug 5, 2020 1:00 PM Pacific Time

Member Webinar: Maximizing M3 — Pushing performance boundaries at scale in a cloud-native distributed metrics engine
Ryan Allen, Senior Software Engineer @Chronosphere
Aug 6, 2020 10:00 AM Pacific Time

Member Webinar: Hardware for Kubernetes, Peeling Back the Layers
Erik Reidel, SVP Compute & Storage Solutions @ITRenew
Aug 11, 2020 10:00 AM Pacific Time

Member Webinar: The Open-Source Observability Playbook
Hen Peretz, Head of Solutions Engineering @Epsagon
Aug 12, 2020 7:00 AM Pacific Time

Member Webinar: Migrating Real-Time Communication Applications to Kubernetes at Scale: Learnings from 8×8’s Experience
Michael Laws, Sr. Site Reliability Engineer/DevOps at 8×8
Pankaj Gupta, Sr. Director at Citrix
Aug 12, 2020 1:00 PM Pacific Time

Project Webinar: Kubernetes 1.19
Kubernetes release team
Aug 28, 2020 10:00 AM Pacific Time

Member Webinar: Getting started with container runtime security using Falco
Loris Degioanni, CTO and Founder @Sysdig
Sept 2, 2020 1:00 PM Pacific Time

How about those articles? Did any of them interest you?

Actually, there is some content I cannot fully digest at this stage, so I’ll use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara