SRE / DevOps / Kubernetes Weekly Collection #65 (Week 17, 2021)
- In this blog post series, I collect the following three weekly mailing lists I subscribe to and add some comments as an aide-memoire, along with useful links.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a useful reference for people who browse this kind of information.
DEVOPS WEEKLY ISSUE #539 April 25th, 2021
SRE Weekly Issue #267 April 25th, 2021
KubeWeekly #261 April 30, 2021
DEVOPS WEEKLY ISSUE #539 April 25th, 2021
News
A strong argument for why you need a platform team to really benefit from running on Kubernetes.
- The title is “Why you need a platform team for Kubernetes”.
- The article makes the case stated in its title and concludes as follows.
○ If your organization is large enough and you have a dedicated team to maintain Kubernetes, you can save a lot of time and effort compared to other options for managing your computing resources.
○ If you’re a small organization and you can’t justify a dedicated Kubernetes team, the quality and reliability of your platform can be sacrificed.
- The title is “How to Successfully Hand Over Systems”.
- As the title suggests, it argues that engineering managers, product managers, and teams should acknowledge that a change of system ownership is a process that needs careful planning and should happen at a time that works best for everyone involved.
A post on using the role of incident commander to aid in addressing operational incidents smoothly.
- The title is “Embrace your inner incident commander”.
- Since it was previously covered in SRE Weekly Issue #259, I will skip it.
- The title is “Annotating Kubernetes Services for Humans”.
- A page outlining a convention for using annotations to help developers manage Kubernetes Services.
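The convention comes down to putting human-facing metadata (owner, runbook, chat channel and so on) into a Service's `metadata.annotations`. Here is a minimal sketch. It assumes a Service manifest has already been parsed into a Python dict, and it lists which keys from a chosen set are missing. The `a8r.io/...` key names and the `missing_human_annotations` helper are my own assumptions for illustration, so check the convention page for the actual list.

```python
from __future__ import annotations

# Hypothetical set of "human" annotation keys; the exact key names are an
# assumption for illustration, not copied from the convention page.
REQUIRED_KEYS = ("a8r.io/owner", "a8r.io/runbook", "a8r.io/chat")


def missing_human_annotations(service: dict, required=REQUIRED_KEYS) -> list[str]:
    """Return the required annotation keys that are absent or empty on a Service."""
    annotations = (service.get("metadata") or {}).get("annotations") or {}
    return [key for key in required if not annotations.get(key)]


# A Service manifest, as it would look after loading the YAML into a dict.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "name": "checkout",
        "annotations": {
            "a8r.io/owner": "team-payments",
            "a8r.io/runbook": "https://example.com/runbooks/checkout",
        },
    },
    "spec": {"selector": {"app": "checkout"}, "ports": [{"port": 80}]},
}

print(missing_human_annotations(service))  # ['a8r.io/chat']
```

Unlike labels, annotations are not used to select objects, so free-form values such as URLs are fine there. A check like this could run in CI before manifests are applied.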
- The title is “Building the future of event-driven architecture.”
- The AsyncAPI website.
- The title is “Migrating to GKE: Preemptible nodes and making space for the Chaos Monkeys”.
- Last year, Expel’s SRE and DDT (Device Discovery and Tasks) teams moved from a statically provisioned virtual machine (VM) legacy environment to a more dynamically scalable and reliable device task infrastructure. The article explains this migration along the lines of its title.
- The title is “Visualization in Kafka Cruise Control”.
- From Teads’ engineering blog. It walks through the topic of the title with screenshots of the UI and graphs.
Jobs
As Site Reliability Engineer you will deploy, maintain, monitor and improve the reliability, scalability and performance of our in-house built trading software. You will sit on the trading floor together with the end-users and set standards for the production environment — it is an engineering role, not a support role. You will have a real, direct impact on our ability to trade and trading results. You will work with short feedback loops and flat hierarchy. No two days are the same!
- SRE job listings.
Events
- The event page for “Dev X Conf”. Registration is via the GitHub link on the page.
- The GitHub page for “Fail over Conf”. I can feel their passion to make a difference with a virtual conference.
Tools
- The GitHub page for “ConsoleMe”, a web service that facilitates AWS IAM permissions and credential management for end users and cloud administrators.
- The GitHub page for “Zellij”, a terminal workspace and multiplexer written in Rust.
- It aims to become a general-purpose application development platform in the future.
- The GitHub page for “Qovery”, an open source abstraction layer library that makes it easy to deploy apps to AWS, GCP, Azure, and other cloud providers in just minutes.
- Written in Rust, it leverages Terraform, Helm, Kubectl, and Docker to manage resources.
SRE Weekly Issue #267 April 25th, 2021
Articles
SRE Case Study: Mysterious Traffic Imbalance
Yet more proof that DNS behavior varies way more than is obvious at first glance. Who the heck thought longest common prefix matching was a good idea?
Charles Li — eBay
- The case study is walked through using a fictitious website.
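The Editor's comment points at longest common prefix matching. A well-known place this rule appears is "use longest matching prefix" in the RFC 3484 destination address selection rules, which, as I understand it, some resolver libraries have also applied to IPv4 answers. The sketch below illustrates that assumption; it is not the article's analysis, and the addresses are made up. Round-robin DNS rotates the answer order, yet clients that re-sort the answers by longest matching prefix all pick the same address.

```python
from __future__ import annotations

import ipaddress
from collections import Counter


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


def sort_by_longest_prefix(client: str, answers: list[str]) -> list[str]:
    """Order DNS answers by prefix length shared with the client, longest first.

    sorted() is stable, so ties keep the order the DNS server returned.
    """
    return sorted(answers, key=lambda ip: -common_prefix_len(client, ip))


answers = ["10.0.1.5", "10.0.2.5", "10.0.3.5"]  # hypothetical backends
picks = Counter()
for i in range(300):
    # Round-robin DNS: the server rotates the answer order on every query.
    rotated = answers[i % 3:] + answers[:i % 3]
    client = f"10.0.1.{i % 250 + 1}"  # all clients sit in 10.0.1.0/24
    picks[sort_by_longest_prefix(client, rotated)[0]] += 1

print(picks)  # every connection goes to 10.0.1.5
```

Without the re-sort, rotation alone would spread these 300 queries evenly across the three addresses.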
Fast and flexible observability with canonical log lines
The application may log multiple lines during the lifecycle of a request. Stripe has found it invaluable to also log one final line with a full summary of the request.
Brandur Leach — Stripe
- As the title suggests, it explains how to use the “canonical log line” to get observability that is both lightweight and powerful.
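The idea is that each request ends with exactly one line carrying all its key facts as key=value pairs, in addition to whatever the application logs along the way. That line is cheap to write and easy to aggregate. Below is a minimal sketch that assumes a logfmt-style format. The `CanonicalLogLine` class, the `canonical-log-line` tag, and the field names are my own illustration, not Stripe's code.

```python
from __future__ import annotations

import time


class CanonicalLogLine:
    """Collects attributes during one request and renders them as a single line."""

    def __init__(self) -> None:
        self.fields: dict[str, object] = {}
        self._start = time.monotonic()

    def set(self, **fields: object) -> None:
        # Later calls overwrite earlier values for the same key.
        self.fields.update(fields)

    def render(self) -> str:
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 1)
        parts = []
        for key, value in self.fields.items():
            text = str(value)
            if " " in text or '"' in text or "=" in text:
                # Quote values that would otherwise break key=value parsing.
                text = '"' + text.replace('"', '\\"') + '"'
            parts.append(f"{key}={text}")
        return "canonical-log-line " + " ".join(parts)


line = CanonicalLogLine()
line.set(http_method="POST", http_path="/v1/charges")
# ... handle the request, adding facts as they become known ...
line.set(http_status=402, user_id="usr_123", error="card declined")
print(line.render())
```

Because every request produces one line with the same keys, questions such as "which users saw the most 402s yesterday" become a single query over one line type rather than a join across many log lines.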
Google Incident Report — April 12, 2021
This is a followup with more detail on the G-Suite outage I reported here last week. A database issue caused two separate outages.
- As noted above, this is a follow-up report on the Google incident.
The top 3 mistakes companies make with SLOs, SLAs, and SLIs
Really great advice about 3 common pitfalls in implementing SL*s.
Cortex
- As the title and the Editor's comment above indicate, the article explains the following three mistakes.
- Unnecessary SLOs
- Tracking vanity SLIs — instead of business goals
- Lack of visibility and ownership around SLOs
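Here is how the SLI/SLO vocabulary fits together. An SLI is a measured ratio of good events to total events, and the SLO is the target for that ratio. The error budget is the amount of failure the target still allows. The sketch below uses made-up numbers. Counting something users actually care about as the "good" events (for example, completed checkouts) is what keeps the SLI from becoming a vanity metric.

```python
def sli(good: int, total: int) -> float:
    """Fraction of events that were good (e.g. successful checkouts)."""
    return 1.0 if total == 0 else good / total


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Share of the error budget still unspent in the window (negative = overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# Hypothetical 30-day window: 1,000,000 checkout attempts, 600 of them failed.
good, total = 999_400, 1_000_000
print(f"SLI: {sli(good, total):.4%}")  # SLI: 99.9400%
print(f"budget left: {error_budget_remaining(good, total, 0.999):.0%}")  # budget left: 40%
```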
Going solid: a model of system dynamics and consequences for patient safety — Resilience Roundup
This research paper explores the marginal boundary, a set of conditions beyond which a system enters a different operating mode and an accident is much more likely. It discusses the concept of coupling between seemingly unrelated parts of the system and shows how economic incentives can push a system toward this boundary.
Dr. Richard Cook and Jens Rasmussen (Original paper)
Thai Wood — Resilience Roundup (summary)
- It describes the problems that arise when a system moves from a loosely coupled state to a very tightly coupled one, and the effects that can follow.
Vodafone Idea BGP Leak — Global Routing System Must Implement MANRS
This is an analysis of a recent BGP leak with a discussion about how the impact from such events can be mitigated through emerging best practices.
Alessandro Improta and Luca Sani — Catchpoint
- It explains the origin hijacking incident by Vodafone Idea (AS55410) that occurred on April 16, 2021.
- The authors propose implementing “Mutually Agreed Norms for Routing Security (MANRS)” to address threats to routing security.
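Among other things, MANRS asks network operators to filter incorrect routing announcements. It also asks them to publish their routing information (for example as RPKI ROAs) so that others can validate it. The sketch below is my own illustration of RPKI route origin validation as described in RFC 6811, not code from the article. It shows how a published ROA lets neighbours reject an origin hijack. The prefixes and ASNs are documentation values, and a real deployment would rely on an RPKI validator rather than a hand-written check.

```python
from __future__ import annotations

import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    """A Route Origin Authorization: who may originate a prefix, and how specific."""
    prefix: str
    max_length: int
    origin_asn: int


def validate_origin(announced: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify a BGP announcement as 'valid', 'invalid' or 'not-found' (RFC 6811 style)."""
    route = ipaddress.ip_network(announced)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue  # this ROA says nothing about the announced prefix
        covered = True
        if roa.origin_asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


# Documentation prefixes and documentation ASNs; purely illustrative values.
roas = [Roa("203.0.113.0/24", 24, 64500)]
print(validate_origin("203.0.113.0/24", 64500, roas))    # valid
print(validate_origin("203.0.113.0/24", 64501, roas))    # invalid: wrong origin
print(validate_origin("203.0.113.128/25", 64500, roas))  # invalid: too specific
print(validate_origin("198.51.100.0/24", 64501, roas))   # not-found: no covering ROA
```

The "too specific" case matters because, without it, a hijacker announcing a more-specific prefix would win by longest-prefix match in forwarding.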
How to Successfully Hand Over Systems
How do you hand over ownership of a system, transferring enough knowledge that the new owners can maintain its availability and reliability successfully?
Aleksandra Gavrilovska — SoundCloud
- Since it is covered in DEVOPS WEEKLY ISSUE #539 above, I will skip it.
Resiliency Planning for High-Traffic Events
Shopify works toward Black Friday / Cyber Monday all year long, through a combination of load testing, failure mode analysis, game days, and incident analysis.
Ryan McIlmoyl — Shopify
- It describes creating and maintaining resiliency plans for large development teams, testing and tools, developing incident strategies, and incorporating and improving feedback loops.
- A video of this session, about an hour long, is embedded in the page.
Outages
- Microsoft Azure web portal
- Microsoft 365
- Discord
- google.com.ar
This one’s interesting. A random person was able to buy the domain name google.com.ar, despite the fact that its registration had not expired.
KubeWeekly #261 April 30, 2021
The Headlines
Editor’s pick of the highlights from the past week.
Last chance to register for KubeCon + CloudNativeCon Europe 2021 — Virtual
KubeCon + CloudNativeCon Europe 2021 — Virtual kicks off next week on May 4–7, 2021! Join the CNCF global community for more than 100 interactive sessions and experiences.
If you haven’t registered yet, be sure to register now and begin planning your experience. Don’t forget that we have two different pass options — including a free Keynote pass. We hope to “see” you there!
Editor’s note: KubeWeekly will take a short break for KubeCon + CloudNativeCon Europe 2021 and will resume on May 21. Enjoy the show!
- It is an announcement made just before KubeCon + CloudNativeCon Europe 2021, noting that KubeWeekly will take a two-week break for the event and resume on May 21st.
ICYMI: CNCF online programs this week
A weekly summary of CNCF online programs from this week.
Migrating from Flux v1 to Flux v2
Leigh Capili, Weaveworks
- An approximately one-hour session with live demos, including how to bootstrap a cluster with Flux v1 and how to migrate to Flux v2.
Reduce the carbon footprint of your cloud-native workloads now
Eric Riedel & Jean-Jacques Chanut, ITRenew & Andy Randall, Kinvolk
- It describes how to reduce the carbon dioxide emissions of today’s cloud-native workloads while improving the economics of computing.
It is time to talk about DataMesh
Fred Chian, Brobridge Co. Ltd.
- It explains how to handle data supply issues that come up when implementing microservices, with the aim of building an efficient data delivery platform for microservices on the DataMesh architecture.
Using machine learning on K8s logs to find root cause faster
Larry Lancaster & Gavin Cohen, Zebrium & Aran Khanna, Reserved.ai
- The session covers the following points.
- How the technology works
- Live demonstration of the technology against a Kubernetes demo app
- Case study: How Reserved.ai is using the technology to speed up incident resolution
The Technical
Tutorials, tools, and more that take you on a deep dive into the code.
Developing a Kong Gateway Plugin with Go
- It explains how to create a custom Kong Gateway plugin using Golang.
- The author also provides a sample plugin that adds an extra layer of security between consumers and producers.
AKS cost monitoring and governance with Kubecost
- A blog post on the Kubecost.com website.
- It introduces the open source “Kubecost”, which supports various self-managed and hosted Kubernetes environments including AKS (Azure Kubernetes Service), using diagrams drawn from the perspective of an AKS user.
Annotating Kubernetes services for humans
- As in DEVOPS WEEKLY ISSUE #539 above, it introduces annotations for Kubernetes Services, so it is worth reading alongside that entry.
Automate service mesh observability with Kuma
- It describes how to set up traffic metrics and traffic trace policies in “Kuma” so they can be used right away.
Kubernetes deployment strategies | Day 37 of #100DaysOfKubernetes
- A YouTube explainer video. As listed in its description, it covers Kubernetes deployment strategies and related topics along the following timeline.
○ 03:26 — Big-Bang
○ 05:22 — Rolling Updates
○ 07:28 — Blue-Green Deployment
○ 09:28 — A/B Testing
○ 10:56 — Canary Deployments
○ 13:03 — Progressive Delivery
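Canary deployments and progressive delivery both come down to sending a small, gradually increasing share of traffic to the new version. In Kubernetes that split is normally handled by an ingress controller, a service mesh, or a progressive delivery tool. The function below is only a hypothetical stand-in of my own, not something from the video. It shows a sticky percentage split that hashes a user ID into 100 buckets.

```python
import hashlib


def route_version(user_id: str, canary_percent: int) -> str:
    """Pick 'canary' or 'stable' for a user; the same user always gets the same answer."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


users = [f"user-{n}" for n in range(10_000)]
at_10 = {u for u in users if route_version(u, 10) == "canary"}
at_25 = {u for u in users if route_version(u, 25) == "canary"}
print(len(at_10), len(at_25))
print(at_10 <= at_25)  # True: raising the percentage only adds users
```

The two counts should come out close to 1,000 and 2,500. Hashing rather than picking randomly per request means a user does not bounce between versions, and widening the canary never moves anyone back to stable.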
The Editorial
Articles, announcements, and more that give you a high-level overview of challenges and features.
Results from the CNCF Cloud Native Survey China 2020
CNCF
- The content is exactly what the title says, and the write-up is tailored to its readers, for example by giving Chinese notation alongside the English text and Chinese text in names.
Alkin Tezuysal, Vitess maintainer
- An overview article by a Vitess maintainer on the release of Vitess 10.
- It covers the following points.
○ Compatibility (MySQL, frameworks)
○ Migration
○ Schema Management
○ Performance Optimizations
○ User Interface
○ Benchmarking
Swallow, with Alex Palesandro
Craig Box, Kubernetes Podcast from Google
- The Kubernetes Podcast from Google, hosted by Google employees. This time the host is Craig Box, joined by guest host Patrick Flynn, whose previous appearance was:
○ Episode 64 with Sarah D’Angelo and Patrick Flynn
- The guest is Alex Palesandro, a research assistant at the Polytechnic University of Turin and co-creator of Liqo.
- The topics from this week's news that interested me were:
○ Red Hat Virtual Summit announcements
○ Lens 5 Beta
○ Kubernetes moves to three releases per year
How Containers are helping IT catch up with the speed of business
Ziv Kedem, Zerto
- A Forbes article explaining the topic of the title for a business audience.
Turbocharge workloads with new multi-instance NVIDIA GPUs on GKE
Maulin Patel, Sr. Product Manager, Google Cloud and Pradeep Venkatachalam, Software Engineer, Google Cloud
- It introduces the feature through the following sections.
○ What customers are saying
○ Creating multi-instance GPU partitions
○ Deploying containers on a multi-instance GPU node
○ Getting started
Podcast: Building a business around popular open source tools for Kubernetes with Richard Li
Justin Dorfman, Richard Littauer, & Tzury Bar Yochay, Curiefense
- It includes interesting stories about how Datawire got started and about the projects that came out of it, such as Telepresence and the Ambassador API Gateway.
Reminder: Participate in CNCF microsurveys on Cloud Financial Management on Kubernetes and diversity
- A reminder about surveys I have mentioned here several times.
Take the 2021 CNCF Cloud Native Survey — Part 1
- It introduces the Cloud Native Survey 2021. This year’s survey is divided into two parts, and the theme of Part 1 is “cloud, containers, and Kubernetes”.
- Part 2 will run later this year, with the theme “CNCF projects and other cloud native technologies such as service mesh, serverless, and storage”.
Upcoming CNCF Online Programs
- No Online Programs are scheduled for the week of KubeCon + CloudNativeCon Europe 2021 Virtual. We will resume the week of May 10!
- Looking for more great curated content? Visit our Online Programs playlist on YouTube.
What did you think of these articles? Did any of them catch your interest?
To be honest, there is some content I cannot fully digest yet, so I will also use this aide-memoire and these links to catch up myself.
Bye now!!