SRE / DevOps / Kubernetes Weekly Collection #79 (Week 31, 2021)
- In this blog post series, I collect the following three weekly mailing lists I subscribe to and add some comments as an aide-memoire, along with useful links.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #553 August 1st, 2021
SRE Weekly Issue #281 August 1st, 2021
KubeWeekly #271 August 6th, 2021
DEVOPS WEEKLY ISSUE #553 August 1st, 2021
News
- The title is “DevOps and Cloud InfoQ Trends Report — July 2021”.
- Since it is covered in “KubeWeekly #269 July 23rd, 2021”, I will skip it.
- The title is “(All) DNS Resource Records”.
- As the editor’s comment and the title above suggest, this article comprehensively covers DNS Resource Records (RRs). Beyond the common RRs, it lists many that I did not recognize. I bookmarked it and will check it again.
- The title is “The SPACE of Developer Productivity”.
- It explicates several common myths and misconceptions about developer productivity and proposes the SPACE framework above.
- The SPACE framework is meant to help individuals, teams, and organizations identify pertinent metrics that present a holistic picture of productivity; this will lead to more thoughtful discussions about productivity and to the design of more impactful solutions.
- The title is “We need to talk about testing”.
- From the perspective of “how programmers and testers can work together for a happy and fulfilling life”, several topics related to testing are taken up and explained.
- The title is “The Unique Reliability Engineering Requirements of Microservices”.
- It discusses how managing reliability for a microservices-based app differs from working with a monolith. Its main point is: “In short, although the fundamental principles and concepts that undergird reliability engineering are the same in any context, SREs should adapt practices to the special requirements of whichever type of environment they are supporting.”
- A series that explains the theory of monitoring “from scratch” so that readers can broaden their horizons about the world of tech monitoring. The title of the first article is “Monitoring theory, from scratch — The definitions”. Summary/takeaways are below.
○ Monitoring is about efficiently and promptly observing things important to your business
○ Good monitoring is monitoring that allows a human to derive insights about your business’ operation, in order to prevent or minimize damage.
○ Good tech monitoring is monitoring that allows a human to ensure your business operates correctly, smoothly, efficiently, securely and in compliance while reducing false positives and false negatives to a minimum.
- The title of the second article is “Monitoring theory, from scratch — Indicators and synthetics”. Summary/takeaways are below.
○ Try to pick indicators that offer the smallest gap between your desired definition of ‘up’ and their ability to report it.
○ Document and train operators on such gaps and implementation limits.
○ Try to pick indicators that are as discrete as possible and leave as little room as possible for interpretation.
○ If you end up scratching your head much while looking at your indicators, revise and re-iterate.
○ Remember that indicators aren’t predictive by nature and need to be complemented with other measures/systems.
○ Remember that system state changes and you need to make sure your indicators are fresh and react to it. It’s always an on-going process.
○ Consider supporting synthetic transactions in your indicator strategy, in particular if you have a complex and distributed end to end system.
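The takeaways above about discrete indicators, freshness, and synthetic transactions could be sketched roughly as follows. This is a minimal illustration in Python, not code from the article: the names `IndicatorResult`, `run_synthetic_check`, and `is_fresh`, as well as the latency threshold, are assumptions made up for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class IndicatorResult:
    up: bool           # discrete state: up or down, nothing to interpret
    latency_s: float   # how long the synthetic transaction took
    checked_at: float  # clock value when the check started


def run_synthetic_check(
    transaction: Callable[[], bool],
    max_latency_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> IndicatorResult:
    """Run one synthetic transaction and reduce it to a discrete up/down state.

    `transaction` is a hypothetical end-to-end probe (for example, log in and
    fetch a page) that returns True on success. Exceptions count as 'down'.
    """
    start = clock()
    try:
        ok = bool(transaction())
    except Exception:
        ok = False
    latency = clock() - start
    # 'up' means the probe succeeded AND finished within the latency budget.
    return IndicatorResult(up=ok and latency <= max_latency_s,
                           latency_s=latency,
                           checked_at=start)


def is_fresh(result: IndicatorResult, max_age_s: float, now: float) -> bool:
    """An indicator older than max_age_s should not be trusted as current state."""
    return (now - result.checked_at) <= max_age_s
```

Passing the clock in as a parameter is only a convenience so that the latency and freshness logic can be exercised without real waiting; a real setup would also need to document the gap between what the probe covers and the desired definition of ‘up’.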
- The title is “Making Tracing as a part of Engineering DNA”.
- ShareChat’s platform team has built a solution to understand performance and how 10,000 CPU cores and RAM are being utilized by hundreds of microservices under various circumstances.
- They share meaningful insights from seeking detailed observability and monitoring metrics from all resources on ShareChat’s network.
Tools
- A GitHub page of “Authorino”, a cloud-native authentication/authorization enforcer for zero trust API protection.
- A GitHub page of “Naml (Not another markup language)”.
- You can replace Kubernetes YAML with Golang and write and deploy apps in Golang.
SRE Weekly Issue #281 August 1st, 2021
Articles
Learning from incidents — Formula 1
The incident: a Formula 1 car hit the side barrier just over 20 minutes before the race was about to start. The team sprang into action with an incredibly calm, orderly and speedy incident response to replace the damaged parts faster than they ever have before.
This article is a great analysis, and there’s also an excellent 8-minute video that I highly recommend. Listen to the way the sporting director and everyone else communicates so calmly. It’s a rare treat to get video footage of a production incident like this.
Chris Evans — incident.io
- Analyzing and commenting on a great incident response in F1. As commented by the Editor above, the embedded 8-minute video is very good and is highly recommended.
Observe a Service; Not a Server
The underlying components become the cattle, and the services become the new Pet that you tend to with your utmost care.
Piyush Verma — Last9
- Using the “pets vs. cattle” metaphor, the article explains its point through the following items.
○ Establishing Service KPIs
○ Features vs. Stability
○ Cascading
- The images in the article are cute.
aws-samples/aws-incident-response-playbooks
AWS posted these example/template incident response playbooks for customers to use in their incident response process.
AWS
- A playbook that covers some common scenarios that AWS users face. It outlines the procedure based on the NIST Computer Security Incident Handling Guide (Special Publication 800–61 Revision 2), which can be used for the following purposes:
○ Gather evidence
○ Contain and then eradicate the incident
○ Recover from the incident
○ Conduct post-incident activities, including post-mortem and feedback processes
(All) DNS Resource Records
A list with descriptions of all DNS record types, even the obscure ones. Tag yourself, I’m HIP.
Jan Schaumann
- Since it is covered in DEVOPS WEEKLY ISSUE #553 above, I will skip it.
What’s a Major Incident Anyway?
This one includes a useful set of questions to prompt you as you develop your incident response and classification process.
Hollie Whitehead — xMatters
- The author explains how to define a major incident and how to respond to it effectively, so that you can ensure your operations are running the way they should and focus your attention on the work that actually matters.
The author of this article shows us how they communicate actively, perform incident retrospectives, and even discuss “near misses” and normal work in order to better learn how their system works — all skills that apply directly to SRE.
Jason Koppe — Learning From Incidents
- Based on the idea that “some of the approaches that I have used to get better at mountain climbing can be applied to software engineering with a similar effect”, the author explains the following points.
○ Discuss everyday work
○ Learn from incidents
○ What the software industry can learn from the climbing community
The Unique Reliability Engineering Requirements of Microservices
Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.
JJ Tang — Rootly
- Since it is covered in DEVOPS WEEKLY ISSUE #553 above, I will skip it.
It’s Time to Rethink Outage Reports
This one uses Akamai’s incident report from their July 22 major outage as a jumping-off point to discuss openness in incident reports. The text of Akamai’s incident report is included in full.
Geoff Huston — CircleID
- As the title suggests, it encourages rethinking the contents of outage reports, from the following perspective.
○ It would be a positive step forward for this industry if Akamai’s outage report were not unusual in any way. It would be good if all service providers spent the time and effort post rectification of an operational problem to produce such outage reports as a matter of standard operating procedure.
Culture & Conduct Risk: The Normalization of Deviance
Drawing from the “normalization of deviance” concept introduced in the Challenger disaster study [Diane Vaughan], this article explores the idea of studying your organization culture to catch problems early, rather than waiting to respond after they happen.
Stephen Scott
- It explains the topic in the title, with the following point.
○ A prevailing assumption, among firms and regulators alike, is that misconduct problems can be discovered only after they occur: a ‘detect and correct’ mindset. But we’re beginning to see the emergence of a ‘predict and prevent’ approach to managing conduct risk in organizations.
Lorin Hochstein (Netflix) [StaffEng Podcast]
This episode of the StaffEng Podcast is an interview with Lorin Hochstein, whose writings I’ve featured here numerous times. My favorite part of this episode is when they talk about doing incident analysis for near misses. One of the hosts points out that it’s much easier for folks to talk about what happened, because there was no incident so they’re not worried about being blamed.
David Noël-Romas and Alex Kessinger– StaffEng Podcast
- A 48-minute podcast episode that talks about resilience with guest Lorin Hochstein, Senior Software Engineer at Netflix, whose articles are featured here every week.
Outages
- Let’s Encrypt
- Snapchat
- Wikipedia
To fact-check this one, I looked at their Grafana dashboard. Neat!
- Netflix
- Venmo
- Blackboard Learn
- eBay
KubeWeekly #271 August 6th, 2021
The Headlines
Editor’s pick of the highlights from the past week.
Kubernetes 1.22: Reaching new peaks
The Kubernetes release team is pleased to announce the release of Kubernetes 1.22, the second release of 2021!
This release consists of 53 enhancements: 13 enhancements have graduated to stable, 24 enhancements are moving to beta, and 16 enhancements are entering alpha. Also, three features have been deprecated. Learn more about the release from the blog, or listen to an interview with release team lead Savitha Raghunathan.
- An article about the Kubernetes 1.22 release on the Kubernetes Blog. The changes counted in the numbers above are explained under the following major items. There is a lot of information, so I will catch up gradually.
○ Major Themes
○ Major Changes
○ Other Updates
○ Release notes
○ Release Team
○ Release Logo
○ User Highlights
○ Project Velocity
○ Ecosystem Updates
○ Event Updates
○ Upcoming release webinar
○ Get Involved
- The Release Logo for Kubernetes 1.22 is below.
CNCF Unveils Schedule for KubeCon + CloudNativeCon North America 2021 in Los Angeles and Virtual
The schedule for KubeCon + CloudNativeCon North America 2021 is finally here! Attendees will experience over 230 sessions, including keynotes and breakouts, with over 70 presentations hosted by project maintainers. From non-technical and end user case studies to advanced engineering deep dives — the conference has content for everyone interested in cloud native technology. Explore the schedule and start planning your experience today.
- The schedule for “KubeCon + CloudNativeCon North America 2021”, to be held from October 11th, has been announced. I plan to participate online, but I forgot to buy an early bird ticket, so I bought a ticket for $75 and registered.
ICYMI: CNCF online programs this week
A weekly summary of CNCF online programs from this week.
Cloud Native Live: Humanising your cloud native platform
Lee Briggs, Pulumi
- A one-hour session that looks at the proliferation of platforms across cloud-native organizations and explains how Pulumi’s software delivery platform and infrastructure pipeline builds can satisfy and help users.
On-demand Webinars: Securing your continuous everything strategy
Abubakar Siddiq Ango, GitLab
- A 13-minute session that explains the vulnerabilities that can occur at various stages of the continuous development life cycle and how to mitigate them.
Kubernetes clusters need persistent data
James Spurin, StorageOS
- A 29-minute session that focuses on CSI, which is often overlooked, and explains the following.
○ Why the CSI and the installation of an effective data plane should be a key consideration for Kubernetes deployments
○ Opportunities for improvements that include multi-tenancy, high availability, compliance with encryption at rest, ease of use with GitOps and the transition of traditional and legacy workloads, dependent on persistent data
○ Live demo, showcasing the benefits of a Kubernetes data plane
- It was good that the materials and demos were very easy to follow.
Visit our Online Programs playlist on YouTube for more content.
The Technical
Tutorials, tools, and more that take you on a deep dive into the code.
Two year update: Building an open source marketplace for Kubernetes
Alex Ellis, OpenFaaS Ltd
- It describes the journey of the last two years from the initial creation of the Kubernetes open source marketplace to the growth of the community, the acquisition of the first sponsored app, and the next step.
Implementing traffic policies in Kubernetes
Cody De Arkland, Kong
- A guest post on the CNCF Blog, originally published on the Kong blog.
- It details how to use the latest distributed control plane, Kuma, which comes bundled with Envoy Proxy, to monitor Kubernetes traffic.
Flux blog
- Since it was previously covered in “KubeWeekly #267 July 9th, 2021”, I will skip it.
Encrypt your Kubernetes secrets with Mozilla SOPS
Thorsten Hans blog
- It explains how to use SOPS in combination with Azure Key Vault to encrypt and decrypt Kubernetes secrets (YAML files). This allows you to save your secrets directly to Git along with other Kubernetes manifests.
The Editorial
Articles, announcements, and more that give you a high-level overview of challenges and features.
NSA, CISA release Kubernetes hardening guidance
National Security Agency
- It details threats to the Kubernetes environment and provides configuration guidance to minimize risk.
A guide to Kubernetes chargeback
Kubecost blog
- It aims to explain the benefits of the chargeback model, introduce some best practices for implementing Kubernetes chargeback reports, and help you get started with Kubecost within minutes.
Keeping KubeCon + CloudNativeCon Los Angeles 2021 safe for onsite attendees
CNCF
- It introduces safety measures for on-site participants of KubeCon + CloudNativeCon Los Angeles 2021. In addition to securing space by reducing attendance to one-third of the venue’s capacity, various other measures are being taken in cooperation with local health authorities.
Cloud Native Wasm Day North America hosted by CNCF
CFP Closes, Monday, Aug 9 at 11:59 PM PST
- An introduction to Cloud Native Wasm Day. As mentioned above, the CFP deadline is Monday, Aug 9 at 11:59 PM PST.
Upcoming CNCF Online Programs
Cloud Native Live
- August 11 at 9am PT: Emissary and Linkerd — How to integrate your Service Mesh with K8s Ingress presented by Daniel Bryant, Ambassador Labs & Jason Morgan, Buoyant — RSVP
On-demand
- August 12: Bringing your Terraform to Crossplane presented by Nic Cope, Upbound — RSVP
- August 12: Choosing the right storage for stateful workloads on Kubernetes presented by Abhinivesh Jain, Wipro — RSVP
YouTube playlist submissions
- Looking for more great curated content? Visit our Online Programs playlist on YouTube.
How about those articles? Are you interested in any of them?
To be honest, there is some content I cannot digest at this stage, so I will use this aide-memoire and the links to catch up myself too.
Bye now!!