SRE / DevOps / Kubernetes Weekly Collection #93 (Week 45, 2021)
- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
- I have already published the same content on my Japanese blog, and I am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #567 November 7th, 2021
SRE Weekly Issue #295 November 7th, 2021
KubeWeekly #283 November 12th, 2021
DEVOPS WEEKLY ISSUE #567 November 7th, 2021
News
- The title is “In our systems we trust”.
- It explains, through a story covering the following points, how they maintain user and stakeholder trust in their systems and products in an environment where failure is part of the journey, and how team members build trust internally by feeling confident in the quality of each other’s work.
○ Health of code
○ Health of relation
○ Let’s talk about trust.
○ A word of advice
○ Final thoughts
- The title is “Infrastructure Observability for Changing the Spend Curve”.
- It is a deep dive into how they achieved an order-of-magnitude change in their spend (a 10x reduction compared to baseline growth) over the last two years through iterative understanding of, and changes to, Slack’s Continuous Integration (CI) infrastructure.
- The following three ideas described in Takeaways and Hacklang were interesting.
- Adaptive capacity to decrease the cost of each test by changing the infrastructure runtime.
- Circuit breakers to decrease the number of tests by changing the infrastructure workflow.
- Pipeline changes to decrease the number of tests by changing our user workflows.
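To make the circuit-breaker idea above concrete for myself, here is a minimal sketch in Python. It is not Slack’s implementation: the class name `CircuitBreaker`, the function `schedule_tests`, the health-check callback, and the threshold of 3 are all hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    """Hypothetical breaker that opens after repeated dependency failures."""

    failure_threshold: int = 3  # assumed value, not taken from the article
    consecutive_failures: int = 0

    @property
    def is_open(self) -> bool:
        # "Open" means new work is blocked until the breaker is reset.
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        # A success clears the failure streak; a failure extends it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    def reset(self) -> None:
        self.consecutive_failures = 0


def schedule_tests(breaker, test_jobs, dependency_is_healthy):
    """Return the jobs that were run; stop scheduling once the breaker opens."""
    executed = []
    for job in test_jobs:
        if breaker.is_open:
            break  # avoid spending CI capacity on jobs that are likely to fail
        healthy = dependency_is_healthy()
        breaker.record(healthy)
        if healthy:
            executed.append(job)
    return executed
```

The point of the sketch is only that an open breaker reduces the number of tests that run while a dependency is unhealthy, which is how I read the article’s second idea.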
- An introductory article on “Luau”, a fast, small, secure, and gradually typed embeddable scripting language derived from Lua. Roblox game developers use it to write game code, and the company’s engineers use it to implement most of the user-facing application code and part of the editor (Roblox Studio) as plugins. It has now been open-sourced.
- Click here for the GitHub page.
- The title is “Developing Databricks’ Runbot CI Solution”.
- It explores the motivations behind developing their own CI solution “Runbot”, the core design decisions that went into it, and how they used it to greatly improve the experience of all the developers within the Databricks engineering organization.
- A playlist “KubeCon 2021 o11y Talks” that collects sessions on Observability of KubeCon + CloudNativeCon NA 2021.
- The title is “NGINX Monitoring: Best Tools and Key Metrics You Should Know About”.
- The explanation focuses on NGINX’s Key Metrics and “The Top 7 NGINX Monitoring Tools” below.
- Sematext
- Prometheus and Grafana
- New Relic
- Datadog
- AppDynamics
- SolarWinds Server & Application Monitor
- Dynatrace
- The title is “Golden Signals — Monitoring from first principles”.
- The first article in a three-part blog series. It describes the four major SRE golden signals for metric-driven measurements and their role in the overall context of monitoring.
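As a memo for myself, the four golden signals (latency, traffic, errors, saturation) can be computed from a window of request records roughly as follows. This is a minimal sketch, not code from the article: the `Request` record, the `golden_signals` function, the nearest-rank p99, and the use of CPU utilization as the saturation measure are all my own assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one observation window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile (an assumed choice of percentile method).
    rank = max(1, math.ceil(0.99 * len(latencies)))
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": latencies[rank - 1],        # latency
        "traffic_rps": len(requests) / window_seconds,  # traffic
        "error_rate": failed / len(requests),          # errors
        "saturation": cpu_utilization,                 # saturation (assumed proxy)
    }
```

In a real system these numbers would come from a metrics backend rather than from raw request lists, so this only illustrates what each signal measures.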
Tools
- An article on “Boilerplate for a basic AWS infrastructure with EKS cluster” using Terraform.
- Click here for the GitHub page.
SRE Weekly Issue #295 November 7th, 2021
Articles
MTTR is a Misleading Metric — Now What?
I love this crystal clear argument based on statistics and research. MTTR as a metric is simply meaningless.
Courtney Nash — Verica
- Part 2 of a series that focuses on each of the key findings of the VOID Report 2021. It makes the argument in the title and the Editor’s comment above, and considers what to use instead of MTTx metrics.
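One way to see the kind of statistical problem with a mean-based metric is a small sketch like the one below. The durations are made-up numbers, not data from the VOID Report, and the function name `summarize` is hypothetical. The assumption is that incident durations are long-tailed, so a single long incident dominates the mean.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes; one long incident in the tail.
durations = [12, 15, 18, 20, 25, 30, 480]


def summarize(values):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(values), median(values)


with_outlier = summarize(durations)
without_outlier = summarize(durations[:-1])

# The mean moves a lot when the single long incident is included,
# while the median barely changes.
print("with outlier   :", with_outlier)
print("without outlier:", without_outlier)
```

With these sample values the mean is roughly four times larger when the long incident is included, while the median moves by only one minute, which is why a "mean time to X" says little about a typical incident.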
Five steps to better customer communication
Their steps for better communication during an outage:
* Provide context to minimise speculation
* Explain what you’re doing to demonstrate you’re ‘on it’
* Set some expectations for when things will return to normal
* Tell people what they should do
* Let folks know when you’ll be updating them next
Chris Evans — incident.io
- The explanation focuses on the five steps in the title, as excerpted by the Editor above.
Heroku Incident 2365 Follow-Up
Despite checking in advance to be sure their systems would support the new Let’s Encrypt certificate chain, they ran into trouble.
[…] we discovered that several HTTP client libraries our systems use were using their own vendored root certificates.
Heroku
- A look back at the incident that occurred on September 30, 2021. It was triggered by the expiration of the old root certificate issued by Let’s Encrypt.
Multicloud failover is almost always a terrible idea
This is the best case I’ve seen yet against multi-cloud infrastructure. I really like the airline analogy.
Lydia Leong
- As the title suggests, it explains that the huge cost and complexity of a multi-cloud implementation are effectively a negative distraction from what you should actually be doing to improve your uptime and reduce your risks.
An Update on Our Outage — Roblox
Roblox had a major, several-day outage starting on October 28. I don’t usually include game outages in the Outages section since they’re so common and there’s not usually much information to learn from, but I sure do like a good post-incident report. Thanks, folks!
David Baszucki — Roblox
- In line with Roblox’s key value of “Respect the Community”, the outage update states that they will continue to be transparent in their post-mortem.
When you’re sending small TCP packets, two optimizations can conspire to introduce an artificial 40 millisecond (not megasecond…) delay.
Vorner
- It shares the story of tracking down a bug that occurred in the production environment of an application written in Rust, covering the following points.
○ A bit of backstory
○ The increased latencies
○ It was acting really weird
○ The benchmarks
○ Configuration options
○ Overriding the defaults of http1_writev
○ Splitting vectored writes
○ Nagle’s algorithm
○ Ok, but why 40ms?
○ Conclusion
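The article itself is about a Rust application, but as a note to myself, the general knob related to Nagle’s algorithm is the standard `TCP_NODELAY` socket option. The sketch below shows it in Python only as an illustration. It is not the author’s fix, and the function name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With TCP_NODELAY set, small writes are sent immediately instead of
    # being held back while earlier data is still unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_low_latency_socket()
    # A non-zero value means the option is enabled on this socket.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```

Disabling Nagle’s algorithm trades fewer, larger packets for lower latency on small writes, so whether it is the right choice depends on the workload.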
Here’s Google’s follow-up report for their October 25–26 Meet outage.
- Click here for additional information in Japanese or other languages if you are guided to the English page.
- Improvements identified are as follows.
○ Increased resource allocation for the backend message delivery system in the short term, and automatically detect message delivery overload in the long term.
○ Enhancements to monitoring systems to capture real time data on quality of livestreams for all volumes of traffic.
○ Alert logic updates to capture spikes in rebuffering rate proactively to help mitigate before any customer impact.
/r/sre — How to deal with retries in SLIs
Should you count failed requests toward your SLI if the client retries and succeeds? A good argument can be made on either side.
u/Sufficient_Tree4275 and other Reddit users
- As the Editor comments above, it is interesting that the discussion includes opinions on both sides of how to treat retries in an SLI.
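To see how much the two views can differ, here is a minimal sketch that computes an availability SLI both ways from the same data. The sample data and the function names `attempt_level_sli` and `request_level_sli` are hypothetical and do not come from the Reddit thread. Each inner list holds the outcomes of one logical request’s attempts, in order.

```python
# Each inner list is one logical client request; each bool is one attempt.
requests = [
    [True],                 # succeeded on the first try
    [False, True],          # failed once, then the retry succeeded
    [False, False, False],  # every attempt failed
    [True],
]


def attempt_level_sli(reqs):
    """Count every attempt, so a failed attempt hurts even if a retry succeeds."""
    attempts = [outcome for req in reqs for outcome in req]
    return sum(attempts) / len(attempts)


def request_level_sli(reqs):
    """Count a logical request as good if any of its attempts succeeded."""
    return sum(1 for req in reqs if any(req)) / len(reqs)


print("attempt-level SLI:", attempt_level_sli(requests))
print("request-level SLI:", request_level_sli(requests))
```

With this sample, the attempt-level view gives 3 good events out of 7 and the request-level view gives 3 out of 4, so the choice changes the number considerably.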
What the SRE team wants to achieve with the development team
Mercari restructured its SRE team, moving toward an embedded model to adapt to their growing microservice architecture.
ShibuyaMitsuhiro — Mercari
- This is the English translation of a blog post published on January 29, 2021. About 10 months have passed since it was written, so I think they have moved on to the next stage, and I am curious about the current situation.
○ This article is a translation of the Japanese article written on Jan. 29th, 2021.
Episode 1: Honeycomb and the Kafka Migration — The VOID
There’s a really great discussion in this episode about leaving slack in the system in the form of bits of capacity and inefficiency that can be drawn upon to buy time during an outage.
Courtney Nash, with guests Liz Fong-Jones and Fred Hebert — Verica
- An approximately 31-minute podcast episode that gives behind-the-scenes coverage of Honeycomb’s May meta-analysis of a series of outages related to its Kafka architecture migration. Beyond the Editor’s comments above, it also covers the topics below.
○ Complex socio-technical systems and the kinds of failures that can happen in them (they’re always surprises)
○ Transparency and the benefits of companies sharing these outage reports
○ Safety margins, performance envelopes, and the role of expertise in developing a sense for them
○ Honeycomb’s incident response philosophy and process
○ The cognitive costs of responding to incidents
○ What we can (and can’t) learn from incident reports
Why a ‘Reliability Mindset’ Must Be Adopted Beyond SRE
Here’s how non-SREs can use SRE principles to improve their systems.
Laurel Frazier — Transposit
- The explanation focuses on the following six ways in which non-SREs can begin to adopt a “reliability mindset”.
- Be prepared
- Embrace automation
- Let the data do the talking
- Debrief without blame
- Close the feedback loop
- Be customer-centric
Outages
- Facebook, Messenger and Instagram
Or Meta or whatever.
- Google Nest
KubeWeekly #283 November 12th, 2021
The Headlines
Editor’s pick of the highlights from the past week.
How Pokemon Go creator builds on Kubernetes for developers
B. Cameron Gain, The New Stack
In this latest episode of The New Stack Makers podcast, Ria Bhatia, senior product manager of Niantic, discusses why the Pokemon Go platform remains relevant today to developer customers and why Kubernetes will remain an integral part of the platform.
- It explains why the Pokemon Go platform remains relevant and why Kubernetes will remain an integral part of the platform as the company hopes to bring in more “developer customers”. The podcast and a YouTube video are both embedded.
ICYMI: CNCF online programs this week
A weekly summary of CNCF online programs from this week.
Breaking tradition: The future of package management with Kubernetes
Aaron Hurley & Dmitriy Kalinin, VMware
- A one-hour session that explores in depth the idea from the session of the same name introduced in the keynote of KubeCon + CloudNativeCon NA 2021.
- It explains in detail how the Carvel project team is re-imagining package management for Kubernetes to bring you a modern, declarative way to automate end-to-end lifecycle management of packaged applications and their dependencies.
Improve core-to-edge mobility and resiliency for cloud native applications
Ben Morrison, Trilio
- A 53-minute session that introduces how to implement the mobility and resiliency of cloud-native apps at the edge with the following points.
○ How to further simplify deployment and management of cloud-native applications to improve resiliency and availability for edge clouds and help customers better curate their data for competitive advantage.
○ How to protect and migrate workloads between core and edge using enterprise and lightweight Kubernetes with data management tooling.
The Technical
Tutorials, tools, and more that take you on a deep dive into the code.
How Krateo PlatformOps integrates Backstage
Diego Brag, Kiratech
- As the title suggests, it introduces the value and importance of integrating “Krateo PlatformOps”, an open source project based on a comprehensive and modular architecture for centrally creating and managing all types of resources, with “Backstage”, an open platform created by the Spotify IT team. Backstage is a CNCF sandbox project.
Kube-lineage: A CLI tool for visualizing Kubernetes object relationships
Justin Toh
- It introduces the CLI tool “kube-lineage” that visualizes the relationships between Kubernetes objects.
Multifactor SSO authentication for Postgres on Kubernetes
Jonathan Katz, Crunchy Data
- PostgreSQL 12 made it possible to provide multi-factor authentication for databases, and this article explains how to set it up.
Flux security audit has concluded
Daniel Holbach, Weaveworks
- The security audit of “Flux”, a CNCF Incubation project, by CNCF and OSTIF (the Open Source Technology Improvement Fund) has been completed, and the results are shared.
- The main purpose of the audit is to assess Flux’s basic security posture and identify the next steps in the security story.
Horizontal pod autoscaling with custom metrics in Kubernetes
Natalie Serrino, Pixie
- As the title suggests, it explains how to autoscale a Kubernetes Deployment using kind: HorizontalPodAutoscaler with custom application metrics.
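As a reminder of how the autoscaler arrives at a replica count, the Kubernetes documentation describes the core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below is my own simplified Python model of that rule, not code from the article or from Kubernetes. The function name `desired_replicas` and the min/max defaults are hypothetical, and the real controller also handles things this sketch ignores, such as stabilization windows and pods that are not ready.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified model of the HorizontalPodAutoscaler scaling rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band (0.1 is the documented default), do not scale.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas serving a custom metric of 150 per pod against a target of 100 would scale to 6 under this model.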
The Editorial
Articles, announcements, and more that give you a high-level overview of challenges and features.
Announcing the 2021 Steering Committee election results
Kaslin Fields, Google
- As the title suggests, it announces the results of the 2021 election for the Steering Committee, whose members are elected from the Kubernetes community for a two-year term. It introduces the four new and reappointed members from this election and the three continuing members who were not subject to it, and thanks the people involved.
Kubernetes podcast from Google: Knative 1.0, with Ville Aikas
Craig Box & Jimmy Moore
- The Kubernetes Podcast by Google employees. This time the host is Craig Box and the guest host is Jimmy Moore.
- To mark the release of Knative 1.0, they welcome Ville Aikas, a member of Knative’s Steering Committee and co-founder of Chainguard Inc.
- This time they go straight into the guest interview without the News of the week.
Security microservices, configuration and observability take the stage at KubeCon NA 2021
Patrick Nelson, SiliconANGLE
- It explains the following five important points from the keynote and security sessions of KubeCon + CloudNativeCon NA 2021.
- Modern security practices take hold
- Configuration is more important in elaborate environments than cyberattack prevention
- Supply chain hacks are escalating, and in the spotlight
- Streamlining app deployment to Kubernetes
- Costs getting reined in
Key takeaways from KubeCon: deeper focus on FinOps, GitOps
Charlotte Dunlap, GlobalData
- Key Takeaways related to FinOps and GitOps of KubeCon + CloudNativeCon NA 2021 are excerpted and explained.
- There are two Summary Bullets:
○ The Open Source Security Foundation (OpenSSF), a new group focused on software security supply chain problems, added $10 million in vendor funding.
○ Google Cloud recently joined the FinOps Foundation, representing the first major cloud provider to commit.
Kubernetes operators: Cruise control for managing cloud native apps
Senthil Raja Chermapandian, Ericsson
- It explains the benefits of Kubernetes Operators and gives an overall picture of them.
Upcoming CNCF Online Programs
Live Webinar
- November 16 @10amPT: Intro to open source observability with Grafana, Prometheus, Loki, and Tempo presented by Richard Hartmann, Grafana Labs — RSVP
Cloud Native Live
- November 17 @9amPT: Building, analyzing, optimizing, and securing containerized apps presented by Martin Wimpress, Slim.AI — RSVP
On-demand
- November 18: InfluxDB + Telegraf Operator: Easy Kubernetes Monitoring presented by Pat Gaughen & Wojciech Kocjan, InfluxData — RSVP
YouTube playlist submissions
- Looking for more great curated content? Visit our Online Programs playlist on YouTube.
Learn more about CNCF Online Programs
What did you think of those articles? Did any of them interest you?
Actually, there is some content I cannot digest at this stage, so I will also use this aide-memoire and these links to catch up myself.
Bye now!!