SRE / DevOps / Kubernetes Weekly Collection #54 (Week 6, 2021)

Feb 15, 2021

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, adding some comments as an aide-memoire along with useful links.
Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #528 February 7th, 2021
SRE Weekly Issue #256 February 7th, 2021
KubeWeekly #250 February 12th, 2021

DEVOPS WEEKLY ISSUE #528 February 7th, 2021

News

A detailed post on the Bottlerocket build system. You may not have quite as complex a project, but lots of interesting tricks in here for using Cargo for much more than just building Rust code.

The title is “How the Bottlerocket build system works” from the AWS Open Source Blog.
It explains in detail the build system of Bottlerocket, an open source OS from AWS created for running containers on VMs and bare metal.
I felt nostalgic when I jumped to the Cargo page and saw the photo of the palletized cardboard boxes. It reminds me of my days in the logistics industry, when I wrapped and unwrapped pallets.

A post on how to best defend your software build pipeline from targeted supply chain attacks.

The title is “Defending software build pipelines from malicious attack”.
Following last week’s article “Securing the NCSC’s web platform,” this is another article from the UK’s National Cyber Security Centre (NCSC).
It explains why the build pipeline is one of the foundations of system security and why you should give it particular attention, along with the following:
○ The benefits of automation
○ Defend the pipeline
○ Protect builds from each other
○ Establish a chain of custody
○ Consider a managed service for your build pipelines
○ Hard work, but worth the effort

Architecture diagrams often feature lots of boxes and arrows, but how do you overlay more useful information without visual overload? This post provides a handy visual language.

The title is “A visual language for digital integration”.
It explains how to visually capture the right information (and only the right information) on one page when designing the integration for digital systems.
In a future post, it will explore the process of determining components in more detail.

Dockerfiles are ubiquitous for building container images. But if you’re looking for something that provides a higher level interface and stronger opinions then buildpacks are worth a look. This post compares the two.

The title is “Build packs vs Docker files”.
It tells the story of the author’s development team migrating from Dockerfiles to buildpacks. The following six perspectives are explained as the factors that drove the transition.
○ Developer Productivity
○ Security
○ Performance
○ Customizability
○ Community
○ Kubernetes Support

Threat modelling is a useful tool for getting people thinking about the security of their systems. It’s also a great way of encouraging collaboration between development and security teams. This new manifesto is a good starting point.

The web page of “Threat Modeling Manifesto”.
Threat Modeling is explained according to the following items.
○ What is threat modeling?
○ Values
○ Principles
○ About

Ever wanted to understand how Kubernetes allocates IP addresses when you run a high-level command like kubectl expose? This post has you covered.

The title is “Putting a VIP in your Kubernetes Clusters”.
It touches on Tim Hockin’s “Bringing Traffic Into Your Kubernetes Cluster” and discusses Type: LoadBalancer (or, in most cases, a virtual IP address) from a different perspective.

gRPC is optimised for fast, secure over-the-wire transfer. But that makes it harder to debug than something like JSON over HTTP. Here’s how to use Wireshark for analyzing gRPC messages.

The title is “Analyzing gRPC messages using Wireshark”.
It explains how to configure and use the protocol-specific components “Wireshark gRPC dissector” and “Protocol Buffers (Protobuf) dissector” that allow Wireshark to analyze gRPC messages.

A case study for building a Kubernetes-powered CI/CD pipeline using GitLab and Helm.

The title is “Building a Kubernetes CI / CD Pipeline with GitLab and Helm”.
Since it was covered in KubeWeekly #249 last week, I will skip it.

Events

WTF is SRE? Container Solutions presents a new WTFinar that tackles the beginning of understanding SRE. Join Nathen Harvey from Google to learn about service level indicators (SLIs) and service level objectives (SLOs) — components of error budgets. 9th February, 15:00 CET

An introductory article on the event “WTF is SRE?” featured in last week’s SRE Weekly Issue #255.
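
As a rough illustration of how an SLO turns into an error budget, here is a minimal Python sketch. It is not taken from the webinar; the function names and the 30-day window are assumptions made for this example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, using an event-based SLI."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction in (0, 1)")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = good_events / total_events   # measured success ratio (the SLI)
    allowed_failure = 1.0 - slo        # the error budget as a ratio
    actual_failure = 1.0 - sli         # the share of the budget consumed so far
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    print(error_budget_minutes(0.999))
    print(budget_remaining(0.999, 999_500, 1_000_000))
```

With a 99.9% SLO over 30 days, the budget works out to roughly 43.2 minutes of downtime, and 500 failed requests out of 1,000,000 would consume about half of it.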

Tools

Vorteil provides a super interesting toolkit for building and running fast micro-VMs. You can even convert an OCI-compliant container image directly to a VM and run it using Vorteil.

A GitHub page of the operating system “Vorteil” for running cloud applications in micro virtual machines.

Kubenav provides desktop, web and mobile apps for monitoring the status of a Kubernetes cluster.

A GitHub page of the mobile, desktop, and web app “kubenav” for managing Kubernetes clusters and getting an overview of resource status.

SRE Weekly Issue #256 February 7th, 2021

Articles

Slack’s Outage on January 4th 2021

Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.

Laura Nolan — Slack

Regarding the outage, the article I covered in SRE Weekly Issue #254 the other day was a Slack report, but this one is in Slack’s engineering blog.
As corrective actions, AWS has assured Slack that it is reviewing the AWS Transit Gateway (TGW) scaling algorithms for large packet-per-second increases as part of its post-incident process, and Slack has set a reminder to request preemptive upscaling of its TGWs for the next holiday season, among other measures.

Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website

This academic paper from Facebook explains how they release code without disrupting active connections, even for a small number of users.

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook

The abstract page of the Facebook paper. You can download the paper from the link.
It’s about Zero Downtime Release, a framework that leverages various components of the end-to-end network infrastructure to prevent or mask interruptions in the face of a release.

Brand SRES

Another lesson we can learn from aviation: have one place where engineers can find out about temporary infrastructure changes that are important.

Bill Duncan

It explores ways to communicate the temporary state of each environment effectively to the entire team in a many-to-many situation, where SREs deal with many environments alongside many other team members. It uses the aviation term “NOTAM (Notices to Airmen)” as the keyword to explain the idea.

Incident Post Mortem: January 29, 2021 [Coinbase]

Coinbase posted this detailed analysis of their January 29th incident.

Coinbase

It details the outage, explains what caused it, and describes changes to prevent similar failures in the future.
They are working on changes such as reviewing monitoring, utilizing read-only replicas, and breaking down monolithic app servers into individual services.

Council Post: How Cloud Services Platform Teams Can Drive The Adoption Of Effective SRE Practices

Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.

Tina Huang (CTO, transpose) — Forbes

In line with the title, it explains that legacy software companies can adopt two contrasting approaches in their cloud strategies, with significantly different SRE outcomes:
○ Adopt Cloud Services For Individual Services And Teams
○ Build A Cloud Platform Team

“I’m Just Doing my Job,” An SRE Myth

We need to push past surface-level mitigation of an incident and really dig in and learn.

Darrell Pappa — Blameless

The author, who heard the line in the title from a person at a customer service desk, suggests that SREs should take the customer’s perspective, view problems as systemic, and apply SRE best practices, as follows:
Customers deserve better, and we should always be their biggest advocate. So, next time you find yourself saying, “Sorry, but I’m just doing my job,” try to shift your perspective to the customer. View these problems as systemic, use SRE best practices like SLOs and error budgets, and embrace a blameless culture to help make a change.

GitHub Availability Report: January 2021

GitHub’s database failed in a manner that wasn’t detected by their automated failover system.

Keith Ballinger — GitHub

It describes one incident in January that had a significant impact and reduced the availability of the GitHub Actions service, along with the countermeasures taken.

Open source update: School of SRE

LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.

Akbar KM and Kalyanasundaram Somasundaram — LinkedIn

Introducing the School of SRE, a curriculum curated for ambitious SREs published by LinkedIn on GitHub.
It would be nice to have this kind of information in order to improve your skills as a team.

Push some big numbers through your system and look for bugs

Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?

rachelbythebay

It introduces a way to hunt for bugs by pushing very large numbers through your system, using JSON decoding as an example of where they can get mangled.
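
To make the 64-bit integer problem concrete, here is a small Python sketch. It is not code from the article. Python's own `json` module decodes integers exactly, so the sketch passes `parse_int=float` to emulate a decoder that turns every JSON number into a double; the function names are made up for this example.

```python
import json

# 2**53 + 1 is the smallest positive integer that a 64-bit float
# cannot represent exactly.
BIG = 2**53 + 1


def roundtrip_as_float(value: int) -> int:
    """Encode an integer as JSON, then decode it like a float-only decoder."""
    text = json.dumps({"id": value})
    # parse_int=float emulates a library that maps all JSON numbers to doubles.
    decoded = json.loads(text, parse_int=float)
    return int(decoded["id"])


def survives_float_decoding(value: int) -> bool:
    """True if the integer comes back unchanged after the lossy round trip."""
    return roundtrip_as_float(value) == value


if __name__ == "__main__":
    for candidate in (2**53, BIG, 2**63 - 1):
        print(candidate, survives_float_decoding(candidate))
```

Feeding values like these through every hop of a system (API, queue, database, UI) is one way to find the place where an ID silently changes.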

Outages

KubeWeekly #250 February 12th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Last chance to register for KubeCon + CloudNativeCon Europe 2021 — Virtual for $10!

KubeCon + CloudNativeCon Europe 2021 Virtual is happening on May 4–7, 2021. Be sure to register for a full All Access Pass for just $10 through February 14 at 23:59 CEST! The price will increase to $75 on February 15, so act fast to take advantage of this great deal.

Don’t forget — the CFP deadline for KubeCon + CloudNativeCon Europe 2021 Virtual co-located events closes on February 19!

See the full list of co-located events below:

Cloud Native Rust Day– hosted by CNCF — May 3
Cloud Native Security Day Europe — May 4
Cloud Native Wasm Day — May 4
FluentCon Cloud Native Logging day with Fluent Bit & Fluentd — May 4
Kubernetes AI Day — May 4
Kubernetes on Edge Day — May 4
ServiceMeshCon Europe — May 4

KubeCon + CloudNativeCon Europe 2021 Virtual is scheduled for May 4–7, 2021, along with its co-located events. The $10 All Access Pass is available through February 14. I have already registered, and I’m considering which co-located events to sign up for.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Getting started with Kubernetes audit logs and Falco

Pawan Shankar, Sysdig

It describes what Kubernetes audit logs are, the information they provide, and how to integrate them with the open source runtime security tool “Falco” to detect suspicious activity in your cluster.

Building container images in Go

Ahmet Alp Balkan

It explains how to build an OCI container image without using Docker by programmatically building the layers and image manifests using the go-containerregistry module.
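
The article itself uses Go and the go-containerregistry module. As a language-neutral illustration of the same idea, the sketch below uses only the Python standard library to assemble one layer, an image config, and an image manifest in memory. It does not use or mirror the go-containerregistry API, it does not push anything to a registry, and the function names and the `amd64`/`linux` platform values are assumptions for this example.

```python
import gzip
import hashlib
import io
import json
import tarfile
from typing import Dict, Tuple


def sha256_digest(data: bytes) -> str:
    """Return a content digest in the 'sha256:<hex>' form used by OCI."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_layer(files: Dict[str, bytes]) -> Tuple[bytes, bytes]:
    """Return (uncompressed_tar, gzipped_tar) for a single filesystem layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, content in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(content))
    raw = buf.getvalue()
    # mtime=0 keeps the gzip header stable across runs.
    return raw, gzip.compress(raw, mtime=0)


def build_image(files: Dict[str, bytes]) -> Tuple[dict, bytes, bytes]:
    """Return (manifest, config_bytes, layer_blob) for a one-layer image."""
    raw, blob = build_layer(files)
    config = {
        "architecture": "amd64",  # assumed platform for this sketch
        "os": "linux",
        # diff_ids are digests of the *uncompressed* layer tarballs.
        "rootfs": {"type": "layers", "diff_ids": [sha256_digest(raw)]},
    }
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    manifest = {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": {
            "mediaType": "application/vnd.oci.image.config.v1+json",
            "digest": sha256_digest(config_bytes),
            "size": len(config_bytes),
        },
        "layers": [
            {
                "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
                # Layer descriptors reference the *compressed* blob.
                "digest": sha256_digest(blob),
                "size": len(blob),
            }
        ],
    }
    return manifest, config_bytes, blob


if __name__ == "__main__":
    manifest, _, _ = build_image({"hello.txt": b"hi\n"})
    print(json.dumps(manifest, indent=2))
```

The point of the sketch is that an image is just content-addressed blobs plus JSON documents that reference them by digest, which is why it can be built without a Docker daemon.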

Cloud Development Environments: Using Skaffold and Telepresence on Kubernetes for fast dev loops

Peter O’Neill, Ambassador Labs

It explains how to use Skaffold to build and deploy a local environment, launch Telepresence to project the local services you are building into a remote cluster, and iterate through fast development loops.

Achieving Cloud Native Security and Compliance with Teleport

Ninad Desai, InfraCloud

It touches on the need for Zero Trust Architecture and introduces “Teleport” as a product that fits into the area of “Zero Trust Network” for cloud-native apps.

Kubernetes Liveness Probes — Examples & Common Pitfalls

Levent Ogut, Loft

It notes that readiness probes, described in a previous post, and liveness probes behave differently, and it explains the components, configuration, and troubleshooting of liveness probes.

Let’s Learn Harvester

Saiyam Pathak, Civo

It introduces Harvester, open source hyper-converged infrastructure (HCI) software that runs on Kubernetes. It is presented as an open source alternative to products such as vSphere and Nutanix.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

How to Manage Kubernetes Application Lifecycle Using Carvel

Helen George and Joao Pereira @VMware

It introduces Carvel, an open source project that provides a reliable, single-purpose, configurable set of tools to help you build, configure, and deploy your apps to Kubernetes.
It shows how to take advantage of Carvel and explains how to use each tool individually or together.

Debugging Kubernetes On The Fly

Josh Hendrick @Rookout

It describes the traditional challenges of debugging Kubernetes-based apps and how real-time debugging of production workloads can help solve them.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

The State of Cloud Native Application Security survey — 2021

Matt Jarvis, Snyk

It introduces Snyk’s Cloud native application security (CNAS) 2021 survey and shares plans for its report.
The first 500 people will get free coffee, and the survey results will be released to the community free of charge, so if you are interested, please check it out.

Garden: The Configure-Once Kubernetes Platform for Seamless Dev/Prod Integration

Thor Sigurdsson & Mike Winters, Garden

It argues that “Most of the problems developers run into CI are caused by a) discrepancies between dev and CI environments and b) insufficient, slow integration testing.” and that one possible approach to solving these problems is to use a consistent configuration for every pre-production environment, from development to testing to CI.
In this context, it introduces “Garden”, an open source project that describes the entire stack (all services, tests, dependencies) and allows you to launch an on-demand full-stack environment at every step of your development pipeline.
At the beginning, it explains how developer experience suffers when development environments use a completely separate (and often pared-down) configuration compared to CI, which builds, tests, and deploys in a more production-like setting:
○ First, the discrepancy between development and CI environments leads to hard-to-predict errors in CI.
○ Second, developers have no idea if integration tests will pass when they push to CI.
○ Third, the process of troubleshooting CI is slow and tedious.
○ And fourth, even just writing integration tests takes a lot of time and effort.

Upcoming CNCF Online Programs

CNCF Live webinar: Toward Hybrid Cloud Serverless Transparency with Lithops Framework
presented by IBM
February 16, 2021 at 10:00 am PT
Register Now

This Week in Cloud Native (Livestream): KCD El Salvador
February 17, 2021 at 12:00 pm PT
Register Now

CNCF Online Programs Playlist on YouTube
Check out our playlist for more curated content you don’t want to miss! New content is added every Friday.

For more information, please visit our updated Online Programs page.

How about those articles? Did any of them catch your interest?

Actually, there is some content I cannot digest at this stage, so I will use this aide-memoire and these links to catch up myself as well.

Bye now!!

Yoshiki Fujiwara