SRE / DevOps / Kubernetes Weekly Collection #42 (Week 47)
- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments as an aide-memoire along with useful links.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a useful reference for people who follow this kind of information.
DEVOPS WEEKLY ISSUE #516 November 15th, 2020
SRE Weekly Issue #244 November 15th, 2020
KubeWeekly : ←No Updates
- The title is “Video course on cloud-native CI/CD with Tekton & Argo CD”.
- It explains how to use Tekton and Argo CD to implement a continuous delivery pipeline suited to modern enterprise Java projects. The page embeds the author’s nine YouTube videos.
- The title is “Measuring DevOps”.
- Google’s research has developed and validated four metrics that provide a high-level system view of software delivery and performance and predict an organization’s ability to reach its goals. This video shows how GCP and Tekton can automate the generation and collection of these four metrics.
○ Deployment Frequency
○ Lead Time to Change
○ Change Fail Rate
○ Time to Restore
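As a memo for myself, here is a minimal sketch of how the four metrics above could be computed from a list of deployment records. It is not the automation shown in the video: the `Deployment` record, its field names, and the `four_metrics` function are all hypothetical, and using medians is my own assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_metrics(deployments: List[Deployment], window_days: int) -> dict:
    """Summarize the four metrics over an observation window of `window_days`."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [_hours(d.committed_at, d.deployed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        _hours(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_fail_rate": len(failures) / len(deployments),
        # None when nothing failed (or nothing has been restored yet).
        "median_time_to_restore_hours": median(restore_times) if restore_times else None,
    }
```

In a real setup the records would come from the CI/CD system and the incident tracker rather than being built by hand.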
- The title is “Self-Service Platform Development Made Easy with Pulumi”.
- Drawing on his experience consulting dozens of internal platform teams, the author discusses Puppet’s “The State of DevOps Report 2020”. He was surprised that every section was about “Scaling DevOps practices with internal platforms.”
- The author introduces “cloud platform”, a prototype whose code is open source. It gives platform teams an easy way to build self-service capabilities.
- The title is “Fixing leaky logs: how to find a bug and ensure it never returns”.
- It presents a case where security is put into the hands of developers. It explains how the author and a fellow developer identified data leaking into the logs, fixed the problem, and put a mechanism in place to prevent it from happening again.
- The title is “Facing Challenges at the Edge”.
- It details some of the platform and application manageability challenges at the edge, along with potential designs and solutions.
- This item points to two articles about DevOps. The first is titled “2020 State of DevOps Report is here!”. The research reveals a strong link between DevOps evolution and the use of internal platforms, and the analysis covers approval processes (orthodox and adaptive), automated testing and deployment, and advanced risk mitigation techniques.
- The report identifies the following four approaches to change management.
○ Operationally mature
○ Engineering driven
○ Governance focused
○ Ad hoc
- The second is titled “Two secret weapons DevOps can use to take over the entire enterprise”. It also touches on the article above and shares ideas about internal platform teams.
Events
- The newsletter continues to feature Container Solutions events this week. This time the guest is Matthew Skelton, co-author of “Team Topologies”, for a 90-minute course on 11/19 (Thursday) at 14:15 CET (Central European Time).
Books
- It introduces the free e-book “SRE: The Cloud Native Approach to Operations” provided by Container Solutions.
- Enter your full name and email address, and you will receive a link to the download page by email. The book is 30 pages long.
Tools
- The GitHub page of the CLI tool “ctlptl” for declaratively setting up a local Kubernetes cluster.
- The goals and non-goals are clearly stated, and I like that it also mentions and introduces other tools.
- The GitHub page of the open source platform “Athenz” for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructure.
- It supports provisioning and configuration (centralized authorization) use cases as well as serving / runtime (decentralized authorization) use cases. The Athenz authorization system utilizes X.509 certificates and industry-standard mutual TLS-bound OAuth2 access tokens.
- The name “Athenz” comes from “AuthNZ” (N for authentication, Z for authorization).
- I will skip it because it was covered in KubeWeekly last week.
- It introduces the new Kubernetes distribution “k0s”, which is developed by the Lens team.
- Click here for the io page.
- The answer to “Why is it zero?” is below.
○ The “zero” in k0s really captures our aspiration to not compromise as we build the ultimate Kubernetes distribution:
● Zero Friction
● Zero Dependencies
● Zero Overhead
● Zero Cost
● Zero Downtime
SRE Weekly Issue #244 November 15th, 2020
Articles
Type in the exact number of machines to proceed
If you’re gonna operate on a pile of computers all at once that numbers 6+ figures, making you type that number in is a way to make you pause and think about what you’re doing.
Rachel by the bay
- When a CLI operation affects many machines and shows a confirmation prompt as a sanity check, the article proposes that instead of asking for a Y/N answer, the prompt should make the operator read the number of machines and type it in, as follows.
○ Blah blah blah 123,456 machines will be affected. Proceed?
Enter number of machines to confirm: 123456
OK! Continuing.
- Certainly, when you are only asked whether to proceed, you tend to just press Yes. If the target is displayed and you have to confirm it by entering the number, you are less likely to misjudge the scope of impact or unknowingly involve a large number of machines.
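The prompt above can be sketched in a few lines of Python. This is only my own illustration of the idea, not code from the article: the function name `confirm_machine_count` and its `read_input` / `write` parameters are hypothetical, and accepting a comma-separated number is my assumption.

```python
from typing import Callable


def confirm_machine_count(
    expected: int,
    read_input: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> bool:
    """Ask the operator to type the exact number of affected machines.

    Returns True only when the typed number matches `expected`.
    `read_input` and `write` are injectable so the prompt can be tested.
    """
    write(f"{expected:,} machines will be affected. Proceed?")
    answer = read_input("Enter number of machines to confirm: ")

    try:
        # Assumption: tolerate surrounding spaces and thousands separators.
        entered = int(answer.strip().replace(",", ""))
    except ValueError:
        write("That is not a number. Aborting.")
        return False

    if entered != expected:
        write(f"You entered {entered:,}, expected {expected:,}. Aborting.")
        return False

    write("OK! Continuing.")
    return True
```

A caller would run the destructive operation only when the function returns True, so a reflexive “y” never gets through.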
IT metrics: Why the five 9s must go
Find out why they decided to focus less on nines, and what they did instead.
Robert Sullivan
- It points out the problems with targeting an uptime percentage expressed in “9s”, and introduces the following approach developed by the company’s SRE, L1 / L2 support, and operations teams.
- Count all the minutes that affect business performance. These are “impacted” minutes.
- Not all incidents are the same, so it’s important to agree on definitions. Here’s what we decided:
- Count all impacted minutes (global, partial, and degraded) against the total number of minutes in a month.
- Meet with business leadership regularly (we do this weekly) to discuss the numbers and the impact of service interruptions on the business.
- Track instances in which your monitoring leads to action that avoids impacted time. (We refer to these as “mitigated events.”)
- Count the minutes that high-availability services were not fully redundant.
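To make the “impacted minutes” idea concrete for myself, here is a minimal sketch that counts impacted minutes against the total minutes in a month. The function names and the sample category keys are hypothetical, and weighting global, partial, and degraded minutes equally is my reading of “count all impacted minutes”, not something the article spells out.

```python
import calendar
from typing import Mapping


def minutes_in_month(year: int, month: int) -> int:
    """Total number of minutes in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def impacted_ratio(impacted_minutes: Mapping[str, int], year: int, month: int) -> float:
    """Share of the month's minutes that were impacted in any way.

    `impacted_minutes` maps a category (e.g. "global", "partial",
    "degraded") to the minutes counted for it. Every category counts
    equally here, which is an assumption of this sketch.
    """
    total = minutes_in_month(year, month)
    impacted = sum(impacted_minutes.values())
    if impacted < 0 or impacted > total:
        raise ValueError("impacted minutes must be between 0 and the month's total")
    return impacted / total
```

For example, `impacted_ratio({"global": 12, "partial": 60, "degraded": 144}, 2020, 11)` divides 216 impacted minutes by the 43,200 minutes of November 2020.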
Reminds me of the classic:
It’s not DNS
There’s no way it’s DNS
It was DNS
Mike S.
- It’s a 2017 article, so it is fairly old. DNS is still difficult, though, and I laughed when I saw the link to “network solutions haiku”.
Moving OkCupid from REST to GraphQL
Their front-end made duplicate calls to the new API to test load and response time prior to cutting over.
Michael P. Geraci — OkCupid
- A case study of how OkCupid moved from a REST API to a GraphQL API on a site with millions of users without compromising performance.
- The author also shares the following four steps, along with the “what we should have done” lessons learned through the release.
- Pick an appropriate page to convert
- Build the schema
- Add a shadow request to call the new API while still fetching data via the REST API
- Do an A/B test with real users that changes the data source
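The shadow request step is the part I wanted to picture in code, so here is a minimal sketch. It is not OkCupid’s implementation (theirs lives in the front-end): `load_page`, `fetch_rest`, `fetch_graphql`, and `record` are hypothetical names, and running the shadow call on a thread pool is my own assumption.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Tuple

# Small pool for shadow calls so they never block the real request path.
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def _timed_shadow(fetch_graphql: Callable[[], Any],
                  record: Callable[[bool, float], None]) -> None:
    """Call the new API, discard its data, and record success and latency."""
    start = time.perf_counter()
    try:
        fetch_graphql()
        ok = True
    except Exception:
        # A failing shadow call must never affect the user-facing response.
        ok = False
    record(ok, time.perf_counter() - start)


def load_page(fetch_rest: Callable[[], Any],
              fetch_graphql: Callable[[], Any],
              record: Callable[[bool, float], None]) -> Tuple[Any, Future]:
    """Serve data from the REST API while exercising the GraphQL API in the background.

    Returns the REST data plus the shadow Future; the Future is returned
    only so that callers (or tests) can wait for the measurement.
    """
    shadow = _shadow_pool.submit(_timed_shadow, fetch_graphql, record)
    data = fetch_rest()  # the user still gets the REST response
    return data, shadow
```

The point of the design is that the new API receives production-like load and its latency is measured, while users keep seeing only the REST data until the A/B test switches the data source.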
New Arctic Air Crash Aftermath Role-Play Simulation Orchestrating a Fundamental Surprise
This is really cool. The researchers created a role-play scenario based on a real plane crash. They tried to get participants to blame “human error”, so that they could then surprise them with all of the (many) contributing factors that were involved.
Emily S. Patterson, Richard I. Cook, David D. Woods, Marta L. Render
- The “fundamental surprise” generated by the deliberate mismatch between the lessons learned during the role-play and the potential lessons allows oversimplified assumptions about complex systems to be challenged.
Tips from one Sysadmin’s journey to becoming an SRE.
Josh Duffney — Octopus Deploy
- In line with the title, the author shares thoughts on the following items for those who are new to the field.
○ It isn’t about tools, but…
○ Learn to code from the command-line
○ Start at the source
○ Pull requests mean deployments
Outages
- YouTube
- Macs
Mac users had issues launching applications, owing to an outage of ocsp.apple.com. Apple confirmed the issue.
- PrometheusKube
The link points to their awesome writeup of what went wrong and the on-the-fly reworking they had to do to fix it.
- Hotmail
- Various stock trading platforms
There’s some speculation that this was a result of increased trading volume following Pfizer’s announcement about vaccine trial results.
- Robinhood
- Increased Error Rates
KubeWeekly : ←No Updates
What did you think of these articles? Did any of them catch your interest?
To be honest, there are some items I cannot fully digest at this stage, so I will use this aide-memoire and its links to catch up myself too.
Bye now!!