SRE / DevOps / Kubernetes Weekly Collection #47 (Week 52)

Dec 28, 2020

In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and leave some comments as an aide-memoire along with useful links.
I have already published the same content on my Japanese blog, and I am catching up in English in this series.
I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #521 December 20th, 2020
SRE Weekly Issue #249 December 20th, 2020
KubeWeekly # 245 ← No Updates

DEVOPS WEEKLY ISSUE #521 December 20th, 2020

News

There are lots of tools for storing data, but how do you find the right dataset for analysis? This post explores a number of different architectural approaches and discusses pros and cons.

The title is “Data Hub: Popular metadata architectures explained”.
It describes three generations of metadata architecture that the industry has produced for data discovery tools, and where many of the well-known options fall along that spectrum.
○ First-generation architecture: Monolith everything
○ Second-generation architecture: 3-tier app with a service API
○ Third-generation architecture: Event-sourced metadata
The architectural progression between generations is mirrored by the evolution of DataHub at LinkedIn, which published this article. The company has shared its latest best practices through the following open source project.
○ (first open sourced and shared with the world as WhereHows in 2016, and then rewritten completely and re-shared with the open source community in 2019 as DataHub).

A good writeup from a recent AWS reInvent talk focused on AWS’s Serverless services. This post focused on what this means for operations, which is often neatly ignored in the marketing.

The title is “Does AWS Serverless care about IT Operations? Their service naming says ‘no’ but their breadth and quality of choice says ‘yes’”.
It opens with what “serverless” means: it does not literally eliminate servers. Instead, the author states the following.
○ “I believe quite the opposite, that serverless is the wave beyond VM configuration management in empowering operations-minded people to reclaim their focus, creativity, and business relevance.”
From the announcements at AWS re:Invent, it picks out those related to serverless and explains them by theme.
○ “I wrote operations in this post about as many times as AWS uses the word innovation in their presentations, but I’m walking away from re:Invent with the impression that AWS is serious about both.”

The title is “Raft does not Guarantee Liveness in the face of Network Faults”.
It touches on Cloudflare’s post-mortem “A Byzantine failure in the real world”, which I covered on this blog before. Based on the discussion on Twitter about Raft, the distributed consensus algorithm, the authors explain the issue through the following three questions.
○ Does Raft guarantee liveness in the presence of network failures?
○ So, does Raft with PreVote guarantee liveness then?
○ Does Raft with PreVote and CheckQuorum guarantee liveness?
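
As an aide-memoire for myself, here is a minimal Python sketch of the two extensions named in these questions, as I understand them from the Raft literature rather than from the article or any particular implementation. With PreVote, a server grants a pre-vote only when the candidate's log is at least as up to date as its own and it has not heard from a leader within the election timeout. With CheckQuorum, a leader steps down when it has not heard from a majority of the cluster within the election timeout. The names `Voter`, `grants_pre_vote`, and `leader_should_step_down` are hypothetical, and the sketch leaves out terms, RPCs, and persistence.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Voter:
    """Hypothetical view of the state a server consults for a pre-vote."""
    last_log_term: int
    last_log_index: int
    last_leader_contact: float  # time this server last heard from a leader
    election_timeout: float


def grants_pre_vote(voter: Voter, cand_last_log_term: int,
                    cand_last_log_index: int, now: float) -> bool:
    """PreVote (sketch): grant only if the candidate's log is at least as
    up to date AND this server has not heard from a leader recently."""
    # Raft's "up to date" rule: compare last term first, then last index.
    log_ok = (cand_last_log_term, cand_last_log_index) >= (
        voter.last_log_term, voter.last_log_index)
    leader_silent = now - voter.last_leader_contact >= voter.election_timeout
    return log_ok and leader_silent


def leader_should_step_down(last_ack_times: Dict[str, float], now: float,
                            election_timeout: float, cluster_size: int) -> bool:
    """CheckQuorum (sketch): step down if fewer than a majority of servers
    (counting the leader itself) were heard from within the timeout."""
    recent = 1 + sum(1 for t in last_ack_times.values()
                     if now - t < election_timeout)
    majority = cluster_size // 2 + 1
    return recent < majority
```

For example, in a three-server cluster, a leader that has heard from neither follower within the timeout would step down under this check, while hearing from one follower is enough to stay leader.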

This long read introduces YOLOSec and FOMOSec as terms to describe problematic but all-too-common approaches to security strategy, driven either by short-termism or by chasing fashion.

The title is “On YOLOsec and FOMOsec”.
The author, who proposes these terms, explains why both YOLO security (YOLOsec) and FOMO security (FOMOsec) are detrimental to infosec defense, and how to spot them and keep them out of your organization’s security strategy.
The moment I saw “33 minutes” at the upper left of the title, I gave up on reading it all in one sitting. Excerpts from the tl;dr and the conclusion are below.
○ The tl;dr is that #yolosec and #fomosec are disconnected from the goals and needs of the business, forsaking pragmatism and prudence in favor of fanatical flavors of recklessness. YOLOsec reflects a security strategy driven by a “you only live once” mentality — one that emboldens people to ignore future concerns around security to achieve today’s gratification. FOMOsec reflects a security strategy driven by a fear of missing out — one that frightens people into misallocating resources towards what makes them feel better about their security efforts.
○ If security must shun both YOLOsec and FOMOsec, how should it look instead? To simultaneously alleviate a longing for belonging, envy, and myopia, infosec defenders must seek out and share the identity of “builder” with software engineers. Aligning infosec metrics to software delivery metrics facilitates the alignment of infosec work to software delivery work. Acting upon this alignment — not just paying lip service — engenders the opportunity for security teams to more tangibly connect the work they perform with value and meaning produced.

More and more teams are now needing to manage multiple Kubernetes clusters. This post takes a look at the monitoring challenges that brings, and how to solve them with Prometheus and Grafana.

The title is “How to monitor multi-cloud Kubernetes with Prometheus and Grafana”.
I covered it in KubeWeekly #244 last week, so I will skip it.

A post exploring DNS routing in Kubernetes, stepping through several potential solutions to a specific problem.

The title is “Forbidden lore: hacking DNS routing for k8s”.
There are multiple Harbor registries, and the post works through DNS approaches for pointing to a different registry depending on the use case when pulling container images.

Ensuring SSL certificates don’t expire is an essential if annoying problem, and several services exist to help. This post runs down a list of different solutions.

The title is “10 Best Tools to Monitor SSL Certificate Expiry, Validity & Change”.
As the title suggests, it explains the following 10 tools for monitoring SSL certificate expiry, validity, and changes, using figures.

Sematext Synthetics
TrackSSL
Pingdom
Smartbear
Keychest
Site24x7
Juices
SSL Certificate Expiration Alerts
Certificate Expiry Monitor
SSL Certification Expiration Checker
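
For my own reference, the core check behind tools like these can be sketched in a few lines of Python with the standard library. This is only a minimal sketch, not how any of the listed services actually work: `days_until_expiry` and `fetch_not_after` are hypothetical names, and the alerting threshold is left to the caller.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter
    time, given in the "Jan  5 09:34:43 2018 GMT" format used by
    SSLSocket.getpeercert(). Negative means already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the certificate's
    notAfter string. Requires network access."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after(host), time.time())` and alert when the result drops below some number of days. Note that with the default context, a certificate that has already expired fails the handshake instead of returning a date, so that error would need to be treated as an alert as well.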

A look at building a Kubernetes-based platform using Argo Workflows and Argo Events.

The title is “Building Kubernetes Clusters using Kubernetes”.
It describes how to build Kubernetes clusters using Kubernetes itself, with Argo Events and Argo Workflows.
SAP Concur, the subject of this article, uses EKS, but the same concept can be applied to other cloud providers.
○ Note: SAP Concur uses AWS EKS, and a similar concept can be applied to Google’s GKE, Azure’s AKS, or any other cloud provider’s Kubernetes offering.

SRE Weekly Issue #249 December 20th, 2020

Articles

Generic mitigations

Every service needs a couple of big hammers that are easy to swing.

Jennifer Mace — O’Reilly and Google

The concept of “generic mitigation” is explained using cute illustrations.

How Facebook keeps its large-scale infrastructure hardware up and running

Answer: automation. Lots of automation. And automation of the automation.

Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook

The flow diagrams make it easy to follow how the tools for detecting hardware failures are connected: automatic and periodic detection, alert firing, automated repair, and so on.
The following four papers are also introduced for those who want the details.
○ Hardware remediation at scale.
○ Optimizing interrupt handling performance for memory failures in large scale data centers
○ Predicting remediations for hardware failures in large-scale datacenters.
○ Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment.

Tips for On Call Engineers During the Holidays

Oh, how quaint! This article was written back when people traveled for the holidays.

Ashley Roof — Transposit

It introduces tips for being on call during the holiday season.
The people at Transposit know the pain of being on call themselves, so they got together and came up with the following five tips to make holiday shifts as painless as possible.
○ Share the love (or spread the pain) when organizing on call shifts, and incentivize communal behavior.
○ Communicate early and often, with and without runbooks.
○ Plan around potential travel problems
○ Let friendly allies help you manage the social side of the situation
○ Pat yourself and your team on the back

Raft does not Guarantee Liveness in the face of Network Faults

Surprise! Fortunately, there are some ways to fix this limitation.

Heidi Howard, Ittai Abraham — Decentralized Thoughts

I covered it in DEVOPS WEEKLY ISSUE #521 above, so I will skip it.

Anatomy of Unsuccessful Incident Management

A common question when a company is implementing incident management is: why do we need this process?

It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.

Kintaba

To answer the question companies often ask when implementing incident management, “Why do we need this process?”, it describes the following characteristics of failed incident management:
○ Confusion about Process
○ Panic and Thrash
○ Lack of Awareness
○ Blame
○ Uncoordinated & Conflicting Response
○ Confusion over Ownership
○ Repeat Problems

Just Culture: Standardizing Fire Service Accountability

Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.

Tory Thompson — Firehouse

It describes “Just Culture” as an industry term used to describe a value-based accountability model that considers the behaviors, systems, and expectations that make up an organization.
Its viewpoint is that fostering a just culture requires a multifaceted approach to managing risk, and that it is important to take a holistic approach when investigating the issues and risks inherent in an organization’s operations. It covers the following topics.
○ Knowledge, systems, safeguards
○ Human performance
○ How we make mistakes
○ Safety and reporting culture
○ Systems and safeguards
○ Our experience
○ Standardization and bias reduction
○ Big data
○ Building trust

Let’s Talk: Full-Service Ownership

Not sold yet on full service ownership for development teams? This interview may help.

Vivian Chan — PagerDuty

It presents “full-service ownership” in an interview format, covering the issues involved and answering common questions. The questions are below.
○ Q: First things first, what exactly is a service?
○ Q: So what’s the big deal about full-service ownership? Why should IT and engineering leaders care? Paint me a picture.
○ Q: What is one of the biggest drivers for moving to a model of full-service ownership?
○ Q: Where does one even start?

Jeli.io: Supporting Grounded Incident Analysis

While ostensibly about Jeli.io, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.

John Allspaw — Adaptive Capacity Labs

An introductory article on Jeli.io, an analysis platform specializing in software-related incidents, by an angel investor.

Heroku incident #2130 follow-up: Heroku Connect Sync Issue

A new feature roll-out resulted in impaired service for some customers.

An incident report for Heroku Connect. A problem syncing with Salesforce affected 25% of production connections.

Uber’s adventures in the adaptive universe

The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.

Lorin Hochstein

It explains what the title and the editor’s comment describe. Since the article is based on a Twitter thread by former Uber engineer McLaren Stanley, the author highly recommends reading the original thread, as follows:
○ I highly recommend reading the original thread in full. My writing above is based solely on that thread, I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

The Shadow Request: Troubleshooting OkCupid’s First GraphQL Release

Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.

Another great example of making a duplicate request to a new API in the background to test it before deploying it.

Michael P. Geraci — OkCupid

Because they were building the GraphQL API on a whole new stack, they wanted to see how it performed under production load compared with the existing REST API, so that the release would not hurt the user experience. This is the story of how they devised and shipped the “Shadow Request”.
With a Shadow Request, the target page loads its data from the REST API as usual and displays the page; it then loads the same data from GraphQL, times the call, and discards the data.
It describes the improvements they found in the Docker and Node environments, in how GraphQL resolvers work with lists of entities, and in CORS requests.
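
To help myself remember the idea, here is a minimal Python sketch of the Shadow Request flow as summarized above: serve the page from the existing API, then call the new API with the same request, time it, and throw the result away. It is not OkCupid's code (theirs runs in the browser against REST and GraphQL). `load_with_shadow` and its callback parameters are hypothetical names, and the shadow call runs sequentially here for simplicity, where a real client would likely issue it in the background.

```python
import time
from typing import Any, Callable


def load_with_shadow(
    primary: Callable[[], Any],
    shadow: Callable[[], Any],
    render: Callable[[Any], None],
    record_timing: Callable[[float, bool], None],
) -> Any:
    """Render from `primary`, then time a discarded call to `shadow`."""
    # 1. Load and display the page from the existing API, as usual.
    data = primary()
    render(data)

    # 2. Issue the same request to the new API and time it.
    start = time.perf_counter()
    try:
        shadow()  # the result is intentionally discarded
        ok = True
    except Exception:
        # A failing shadow call must never affect what the user sees.
        ok = False
    record_timing(time.perf_counter() - start, ok)

    # 3. Only the primary data is ever returned to the caller.
    return data
```

The point of the design is that the user only ever sees data from the old API, so a slow or broken new API shows up in the recorded timings rather than in the user experience.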

Outages

Google Workspace Status Dashboard
All Google services that use OAuth were unreachable due to an issue with Google’s User ID service. Click through for their report. This one caused issues for the start of my daughters’ school day since Meet and Classroom were down.
Google Cloud Status Dashboard
Gmail
Delivery of messages to @gmail.com addresses failed permanently and would not be retried. This report by Google has the details.
Instagram
Microsoft Outlook
Galileo (satellite navigation system)
Spotify

The above is outage information for each of these services.

KubeWeekly # 245 ← No Updates

How did you find these articles? Did any of them interest you?

To be honest, there is some content here that I have not digested yet, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara