SRE / DevOps / Kubernetes Weekly Collection #89 (Week 41, 2021)
--
- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
- Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a useful reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #563 October 10th, 2021
SRE Weekly Issue #291 October 10th, 2021
KubeWeekly #281 October 29th, 2021 ← 2 weeks off due to KubeCon + CloudNativeCon NA 2021
DEVOPS WEEKLY ISSUE #563 October 10th, 2021
News
- The title is “The Insecure Software Supply Chain”.
- As the Editor commented above, it provides recent examples, threat categorization, and related open source projects. The slides use a consistent font and are very easy to read.
- The title is “Why Even DevOp”.
- It explains that DevOps technology transformations not only help deliver software with speed and stability, but also reduce burnout, improve culture, and improve communication. It also analyzes productivity.
- The title is “Deploy without credentials with GitHub Actions and OIDC”.
- A new feature in GitHub Actions lets workflows obtain OIDC (OpenID Connect) tokens. The article explains how to integrate this with OpenFaaS so that a set list of GitHub users can deploy to an OpenFaaS cluster without credentials.
- The title is “Dual-Stack Networking in Kubernetes”.
- As the Editor commented above, it describes the behind-the-scenes efforts to support IPv4/IPv6 dual-stack networking in Kubernetes.
- The title is “Google’s State of DevOps 2021 Report: What SREs Need to Know”.
- The full report spans 45 pages, and its official focus is DevOps rather than SRE. The article extracts and explains the points from the survey results that matter most to SREs.
- The title is “Why OpenEBS 3.0 for Kubernetes and Storage?”.
- An article marking the release of OpenEBS 3.0. Starting with the background to using Kubernetes for your data, it discusses the various enhancements and new features of OpenEBS 3.0 and the benefits of using OpenEBS for Kubernetes storage.
Tools
- The GitHub page of “CloudGraph”, a tool for querying cloud infrastructure and configurations.
SRE Weekly Issue #291 October 10th, 2021
Articles
Understanding How Facebook Disappeared from the Internet
Facebook’s outage caused significantly increased load on DNS resolvers, among other effects. Cloudflare also published this followup article with more findings.
Celso Martinho and Sabina Zejnilovic — Cloudflare
- It analyzes Facebook’s outage of October 4, 2021 from Cloudflare’s perspective.
- Facebook has posted two blogs, “Update about the October 4th outage” and “More details about the October 4 outage”, with updates and details on the outage. The detailed article is picked up individually below, so I will touch on it there.
Shell (the oil company) reduced accidents by 84% by teaching roughnecks to cry. Listen to this podcast (or check it out in article form) to find out how. Can we apply this to SRE?
Alix Spiegel and Hanna Rosin — NPR’s Invisibilia
- A podcast and its accompanying article that explore social norms, how they change, and the effects of those changes through two stories. The Editor asks, “Can we apply this to SRE?” The article is dated June 17, 2016.
Google’s State of DevOps 2021 Report: What SREs Need to Know
Don’t have time to read Google’s entire report? Here are the highlights.
Quentin Rousseau — Rootly
- Since it is covered in DEVOPS WEEKLY ISSUE #563 above, I will skip it.
More details about the October 4 outage
I really like how open Facebook engineering has been about what went wrong on Monday. This article is an update on their initial post.
Santosh Janardhan — Facebook
- A blog post in which Facebook reports the details of the outage that occurred on October 4, 2021.
- The connections between the data centers and the Internet were completely cut, and the DNS servers became unreachable because their BGP route advertisements were stopped.
- They sent engineers onsite to debug the issue and restart the systems, but this took time for the following reasons.
○ These facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.
○ So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers.
Want to learn about BGP? Ride along as Julia Evans learns. I especially like how she whipped out strace to figure out how traceroute was determining ASNs.
Julia Evans
- Prompted by the BGP-related Facebook outage mentioned above, the author looked for an article summarizing how to look up BGP information from one’s own computer and could not find one. After the author asked on Twitter for tools, a lot of information came in. The post introduces some of those tools that can be used to look up BGP information.
The Verica Open Incident Database is an exciting new project that seeks to create a catalog of public incident postings. Click through to check out the VOID and read the inaugural paper with initial findings. I’m really excited to see what this project brings!
Courtney Nash — Verica
- Announcing the release of VOID (Verica Open Incident Database), a project that makes public software-related incident reports available to everyone in one place to generate new, better questions and community discussions about incidents.
‘date -d’ vs. ‘date -s’, and ‘show foo’ vs. ‘clear foo’
Printing versus setting a date — they’re only separated by a typo. Perhaps something similar happened with Facebook’s outage.
rachelbythebay
- It gives examples of how a small typo in a CLI command can perform an unintended operation, and mentions a rumor about Facebook’s outage.
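As a personal aide-memoire, here is a minimal Python sketch of the "printing versus setting" point. It is not from the article: the `MUTATING_FLAGS` table and the `is_mutating` helper are hypothetical names of my own, and the only fact assumed is that GNU `date` uses `-d`/`--date` to display a given time and `-s`/`--set` to set the system clock. The idea is simply that a wrapper could flag state-changing options before a command is run.

```python
import shlex

# Hypothetical table (my own assumption, not from the article):
# options that change system state, keyed by command name.
MUTATING_FLAGS = {"date": {"-s", "--set"}}


def is_mutating(command_line: str) -> bool:
    """Return True if the command line uses an option listed as state-changing."""
    argv = shlex.split(command_line)
    if not argv:
        return False
    flags = MUTATING_FLAGS.get(argv[0], set())
    # Compare only the option name, so "--set=STRING" also matches "--set".
    return any(arg.split("=", 1)[0] in flags for arg in argv[1:])


if __name__ == "__main__":
    for cmd in ('date -d "2021-10-04 12:00"', 'date -s "2021-10-04 12:00"'):
        label = "needs confirmation" if is_mutating(cmd) else "read-only"
        print(f"{cmd!r}: {label}")
```

This sketch only matches options written as separate arguments. It would miss a fused form such as `-s2021-10-04`, so a real guard would need proper option parsing for each command.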
Adopting a microservice architecture can strain your SRE. This article highlights an oft-missed section of the SRE book about scaling SRE.
Tyler Treat
- In line with the title, the author gives an opinion that focuses on chapter 32 of the SRE book, which is often overlooked.
Outages
- Facebook, Instagram, WhatsApp, and Oculus
Well, that sure was a big one. Facebook and related services were totally down for 6+ hours, even their DNS servers. They also had another, smaller outage later in the week.
- Slack
Not a real outage, but Slack reported that users were having a hard time connecting to Slack, because resolvers were overloaded by DNS lookups for facebook.com.
- NordVPN
- PayPal
KubeWeekly #281 October 29th, 2021 ← 2 weeks off due to KubeCon + CloudNativeCon NA 2021
How about those articles? Did any of them catch your interest?
To be honest, there is some content I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself too.
Bye now!!