SRE / DevOps / Kubernetes Weekly Collection #89 (Week 41, 2021)

Oct 17, 2021

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #563 October 10th, 2021
SRE Weekly Issue #291 October 10th, 2021
KubeWeekly #281 October 29th, 2021 ← 2 weeks off due to KubeCon + CloudNativeCon NA 2021

DEVOPS WEEKLY ISSUE #563 October 10th, 2021

News

Supply chain security is an increasingly important topic. This presentation explains why it’s important, with well explained recent examples and a breakdown of the SLSA framework for categorizing threats and related open source projects.

The title is “The Insecure Software Supply Chain”.
As the Editor commented above, it provides recent examples, threat categorization, and related open source projects. The slides use a consistent font and are very easy to read.

Slides with a great update on the state of devops in 2021, featuring recent research and looking at SPACE, a framework for measuring productivity.

The title is “Why Even DevOp”.
It explains that DevOps technology transformations not only help deliver software with speed and stability but also reduce burnout and improve culture and communication. It also analyzes productivity.

Anything that helps get away from credentials in CI pipelines is a good thing. This post looks at federated ID, and specifically OpenID Connect usage in GitHub Actions.

The title is “Deploy without credentials with GitHub Actions and OIDC”.
GitHub Actions has a new feature that issues OIDC (OpenID Connect) tokens to workflows. The post explains how to integrate this with OpenFaaS so that a set list of GitHub users can deploy to an OpenFaaS cluster without credentials.
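The idea of authorizing a deployment from token claims instead of stored credentials can be sketched roughly as follows. This is only an illustration, not the code from the post: the helper names (`make_demo_token`, `decode_claims`, `may_deploy`) and the sample users are hypothetical, and the sketch deliberately skips signature verification, which a real service must perform against the issuer's published keys before trusting any claim.

```python
import base64
import json

# Issuer of GitHub Actions OIDC tokens.
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_demo_token(claims: dict) -> str:
    """Build an UNSIGNED demo token (hypothetical helper, for illustration only)."""
    header = b64url(json.dumps({"alg": "none"}).encode())
    payload = b64url(json.dumps(claims).encode())
    return f"{header}.{payload}."


def decode_claims(token: str) -> dict:
    """Decode the payload segment. No signature check is done here."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))


def may_deploy(token: str, allowed_actors: set) -> bool:
    """Allow a deploy only for tokens from GitHub whose actor is on the list."""
    claims = decode_claims(token)
    return claims.get("iss") == GITHUB_ISSUER and claims.get("actor") in allowed_actors


demo = make_demo_token({"iss": GITHUB_ISSUER, "actor": "alice", "repository": "alice/fn"})
print(may_deploy(demo, {"alice"}))  # True
print(may_deploy(demo, {"bob"}))    # False
```

The point of the design is that the workflow never holds a long-lived secret; the receiving side decides based on who the token says is deploying.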

Kubernetes now has dual stack networking support, for IPv4 and IPv6. This post is a nice behind-the-scenes look at the hard work behind the scenes to implement such a large feature.

The title is “Dual-Stack Networking in Kubernetes”.
As the Editor commented above, it describes the behind-the-scenes efforts to support IPv4/IPv6 dual-stack networking in Kubernetes.

A look at the 2021 Accelerate State of Devops Report, with observations about the relationship between Devops and SRE, risk taking, multi-cloud and security.

The title is “Google’s State of DevOps 2021 Report: What SREs Need to Know”.
The full report spans 45 pages, and its official focus is DevOps rather than SRE. The article extracts and explains the points from the survey results that matter most to SREs.

A post on the new OpenEBS release, which also acts as a good introduction to container attached storage in Kubernetes.

The title is “Why OpenEBS 3.0 for Kubernetes and Storage?”.
An article marking the release of OpenEBS 3.0. Starting with the background to using Kubernetes for your data, it discusses the various enhancements and new features of OpenEBS 3.0 and the benefits of using OpenEBS for Kubernetes storage.

Tools

CloudGraph is a GraphQL powered search engine for your cloud infrastructure. Handy for a number of use cases. The docs have some nice query examples too.

The GitHub page of “CloudGraph”, a tool for querying cloud infrastructure and configurations.

SRE Weekly Issue #291 October 10th, 2021

Articles

Understanding How Facebook Disappeared from the Internet

Facebook’s outage caused significantly increased load on DNS resolvers, among other effects. Cloudflare also published this followup article with more findings.

Celso Martinho and Sabina Zejnilovic — Cloudflare

It analyzes the Facebook outage of October 4, 2021 from Cloudflare’s perspective.
Facebook has published two blog posts, “Update about the October 4th outage” and “More details about the October 4 outage”, with updates and details on the failure. The detailed post is picked up individually below, so I will touch on it there.

The New Norm

Shell (the oil company) reduced accidents by 84% by teaching roughnecks to cry. Listen to this podcast (or check it out in article form) to find out how. Can we apply this to SRE?

Alix Spiegel and Hanna Rosin — NPR’s Invisibilia

A podcast and its companion article that explore social norms, how they change, and their effects through two stories. The Editor asks, “Can we apply this to SRE?” The article is dated June 17, 2016.

Google’s State of DevOps 2021 Report: What SREs Need to Know

Don’t have time to read Google’s entire report? Here are the highlights.

Quentin Rousseau — Rootly

Since it is covered in DEVOPS WEEKLY ISSUE #563 above, I will skip it.

More details about the October 4 outage

I really like how open Facebook engineering has been about what went wrong on Monday. This article is an update on their initial post.

Santosh Janardhan — Facebook

A blog post in which Facebook reports the details of the outage that occurred on October 4, 2021.
The connections between Facebook’s data centers and the Internet were completely cut, and the DNS servers became unreachable because their BGP route advertisements were withdrawn.
They sent engineers onsite to debug the issue and restart the systems, but this took time for the following reasons:
○ These facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.
○ So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers.

Tools to explore BGP

Want to learn about BGP? Ride along as Julia Evans learns. I especially like how she whipped out strace to figure out how traceroute was determining ASNs.

Julia Evans

Because BGP was involved in the Facebook outage mentioned above, the author wanted to know how to look up BGP-related information from her own computer but couldn’t find an article summarizing it. She asked for tools on Twitter and received a lot of suggestions, and the post introduces some of the tools that can be used to look up BGP information.

Announcing the VOID

The Verica Open Incident Database is an exciting new project that seeks to create a catalog of public incident postings. Click through to check out the VOID and read the inaugural paper with initial findings. I’m really excited to see what this project brings!

Courtney Nash — Verica

Announcing the release of VOID (Verica Open Incident Database), a project that makes public software-related incident reports available to everyone in one place to generate new, better questions and community discussions about incidents.

‘date -d’ vs. ‘date -s’, and ‘show foo’ vs. ‘clear foo’

Printing versus setting a date — they’re only separated by a typo. Perhaps something similar happened with Facebook’s outage.

rachelbythebay

It gives examples of how a small CLI typo can trigger an unintended operation and mentions a rumor about Facebook’s outage.
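One common way to reduce this kind of risk is to make the mutating variant of a command require an extra, explicit confirmation. The following is a minimal sketch of that idea, not something taken from the article: the `safedate` wrapper, the `--yes-really-set` flag, and the `plan` function are all hypothetical, and the code only reports what it would do rather than touching the system clock.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for a hypothetical 'safedate' wrapper around date-like operations."""
    parser = argparse.ArgumentParser(prog="safedate")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-d", "--date", help="print the given date (read-only)")
    group.add_argument("-s", "--set", help="set the system clock (mutating)")
    parser.add_argument(
        "--yes-really-set",
        action="store_true",
        help="extra confirmation required for the mutating -s option",
    )
    return parser


def plan(argv: list) -> str:
    """Return a description of the action; nothing is actually executed."""
    args = build_parser().parse_args(argv)
    if args.set is not None:
        if not args.yes_really_set:
            return "refused: -s changes the clock; add --yes-really-set to confirm"
        return f"would set clock to {args.set}"
    return f"would print {args.date}"


print(plan(["-d", "tomorrow"]))
print(plan(["-s", "tomorrow"]))
print(plan(["-s", "tomorrow", "--yes-really-set"]))
```

With a guard like this, mistyping `-s` for `-d` produces a refusal instead of a changed clock, because one wrong keystroke is no longer enough to reach the destructive path.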

SRE Doesn’t Scale

Adopting a microservice architecture can strain your SRE. This article highlights an oft-missed section of the SRE book about scaling SRE.

Tyler Treat

In line with the title, the author gives his opinion in this blog post, focusing on Chapter 32 of the SRE book, which is often overlooked.

Outages

Facebook, Instagram, WhatsApp, and Oculus
Well, that sure was a big one. Facebook and related services were totally down for 6+ hours — even their DNS servers. They also had another, smaller outage later in the week.
Slack
Not a real outage, but Slack reported that users were having a hard time connecting to Slack, because resolvers were overloaded by DNS lookups for facebook.com.
NordVPN
PayPal

KubeWeekly #281 October 29th, 2021 ← 2 weeks off due to KubeCon + CloudNativeCon NA 2021

How about those articles? Did any of them catch your interest?

Actually, there is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara