- In this blog post series, I go through the following three weekly mailing lists that I subscribe to, leaving some comments as an aide-memoire along with useful links.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a reference for people looking for this kind of information.
DEVOPS WEEKLY ISSUE #521 December 20th, 2020
- The title is “Data Hub: Popular metadata architectures explained”.
- It describes three generations of metadata architecture that the industry has produced for data discovery tools, and shows where many of the well-known options fall among them.
○ First-generation architecture: Monolith everything
○ Second-generation architecture: 3-tier app with a service API
○ Third-generation architecture: Event-sourced metadata
- The architectural progression between generations is mirrored by the evolution of the DataHub architecture at LinkedIn, which published this article. The company has shared its latest best practices through the following open source project.
○ (first open sourced and shared with the world as WhereHows in 2016, and then rewritten completely and re-shared with the open source community in 2019 as DataHub).
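To help myself picture the third generation, here is a minimal sketch of the event-sourcing idea: every metadata change is appended to a log, and each downstream view is built by applying those events. All of the names (`MetadataChangeEvent`, `MetadataLog`, `LatestAspectView`) are my own illustrations and not DataHub's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MetadataChangeEvent:
    """One change to one aspect of one entity (hypothetical shape)."""
    entity: str   # e.g. a dataset identifier
    aspect: str   # e.g. "ownership" or "schema"
    value: dict


class MetadataLog:
    """Append-only event log; subscribers first receive the full history."""

    def __init__(self) -> None:
        self._events: list[MetadataChangeEvent] = []
        self._subscribers: list[Callable[[MetadataChangeEvent], None]] = []

    def subscribe(self, handler: Callable[[MetadataChangeEvent], None]) -> None:
        # Replay history so a view added later can still catch up.
        for event in self._events:
            handler(event)
        self._subscribers.append(handler)

    def append(self, event: MetadataChangeEvent) -> None:
        self._events.append(event)
        for handler in self._subscribers:
            handler(event)


class LatestAspectView:
    """A derived view that keeps only the latest value of each aspect."""

    def __init__(self) -> None:
        self.state: dict[str, dict[str, dict]] = defaultdict(dict)

    def apply(self, event: MetadataChangeEvent) -> None:
        self.state[event.entity][event.aspect] = event.value
```

The point of the design is that the log is the source of truth, so a new kind of view (a search index, a graph) can be added later by subscribing and replaying.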
- The title is “Does AWS Serverless care about IT Operations? Their service naming says ‘no’ but their breadth and quality of choice says ‘yes’”.
- It opens by discussing what “serverless” means: it does not literally eliminate servers, and the author puts it as follows.
○ “I believe quite the opposite, that serverless is the wave beyond VM configuration management in empowering operations-minded people to reclaim their focus, creativity, and business relevance.”
- From the announcements at AWS re:Invent, it picks out the ones related to serverless and explains them by theme.
○ “I wrote operations in this post about as many times as AWS uses the word innovation in their presentations, but I’m walking away from re:Invent with the impression that AWS is serious about both.”
Related to a large CDN outage, this post looks at some of the theory behind the RAFT consensus algorithm, and whether it provides liveness guarantees during network failures and what can be done about it.
- The title is “Raft does not Guarantee Liveness in the face of Network Faults”.
- It touches on Cloudflare’s post-mortem “A Byzantine failure in the real world”, which was covered on this blog before. Prompted by the discussion on Twitter about the Raft distributed consensus algorithm, the authors organize their explanation around the following three questions.
○ Does Raft guarantee liveness in the presence of network failures?
○ So, does Raft with PreVote guarantee liveness then?
○ Does Raft with PreVote and CheckQuorum guarantee liveness?
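As a note to myself on what the two extensions in these questions check, here is a minimal sketch. It assumes the usual descriptions: with PreVote, a node asks its peers whether they would vote for it before it increments its term, and with CheckQuorum, a leader steps down when it has not heard from a majority within an election timeout. The class and function names are mine, and real implementations track much more state than this.

```python
from dataclasses import dataclass


@dataclass
class Voter:
    last_log_term: int
    last_log_index: int
    heard_from_leader_recently: bool  # within one election timeout


def majority(cluster_size: int) -> int:
    return cluster_size // 2 + 1


def grants_pre_vote(voter: Voter, cand_last_log_term: int,
                    cand_last_log_index: int) -> bool:
    """PreVote: would this voter vote for the candidate, with no term bump yet?"""
    if voter.heard_from_leader_recently:
        # The voter still sees a live leader, so it does not support a new election.
        return False
    # Raft's up-to-date rule: the higher last log term wins, then the longer log.
    return (cand_last_log_term, cand_last_log_index) >= (
        voter.last_log_term, voter.last_log_index)


def may_start_election(peers: list[Voter], cand_last_log_term: int,
                       cand_last_log_index: int) -> bool:
    """The candidate starts a real election only after a majority of pre-votes."""
    granted = 1 + sum(  # the candidate counts its own pre-vote
        grants_pre_vote(p, cand_last_log_term, cand_last_log_index)
        for p in peers)
    return granted >= majority(len(peers) + 1)


def leader_keeps_leadership(peers_heard_from: int, cluster_size: int) -> bool:
    """CheckQuorum: the leader stays only while it can reach a majority."""
    return 1 + peers_heard_from >= majority(cluster_size)
```

For example, in a three-node cluster a leader that has heard from no peer within the timeout gets `leader_keeps_leadership(0, 3) == False` and steps down.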
- The title is “On YOLOsec and FOMOsec”.
- The author explains why both YOLO security (YOLOsec) and FOMO security (FOMOsec) are detrimental to infosec defense, and how to spot them and keep them out of your organization’s security strategy.
- The moment I saw the “33 minutes” reading time at the upper left of the title, I gave up on reading it all in one go. Some excerpts from the tl;dr and the conclusion are below.
○ The tl;dr is that #yolosec and #fomosec are disconnected from the goals and needs of the business, forsaking pragmatism and prudence in favor of fanatical flavors of recklessness. YOLOsec reflects a security strategy driven by a “you only live once” mentality — one that emboldens people to ignore future concerns around security to achieve today’s gratification. FOMOsec reflects a security strategy driven by a fear of missing out — one that frightens people into misallocating resources towards what makes them feel better about their security efforts.
○ If security must shun both YOLOsec and FOMOsec, how should it look instead? To simultaneously alleviate a longing for belonging, envy, and myopia, infosec defenders must seek out and share the identity of “builder” with software engineers. Aligning infosec metrics to software delivery metrics facilitates the alignment of infosec work to software delivery work. Acting upon this alignment — not just paying lip service — engenders the opportunity for security teams to more tangibly connect the work they perform with value and meaning produced.
- The title is “How to monitor multi-cloud Kubernetes with Prometheus and Grafana”.
- I covered it in KubeWeekly #244 last week, so I will skip it.
- The title is “Forbidden lore: hacking DNS routing for k8s”.
- It covers a setup with multiple Harbor registries and the struggle to make DNS point to a different registry depending on the use case when pulling container images.
- The title is “10 Best Tools to Monitor SSL Certificate Expiry, Validity & Change”.
- As the title suggests, it explains 10 tools for monitoring SSL certificate expiry, validity, and changes, with figures. They include the following.
- Sematext Synthetics
- SSL Certificate Expiration Alerts
- Certificate Expiry Monitor
- SSL Certification Expiration Checker
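For reference, the core check behind this kind of tool can be sketched with Python's standard library alone. This is my own minimal example and not how any of the listed products works internally; the function names `fetch_not_after` and `days_until_expiry` are hypothetical.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left until expiry; negative once the certificate has expired."""
    # ssl.cert_time_to_seconds parses e.g. "Jan  5 09:34:43 2018 GMT" as UTC.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400
```

A monitor would call `days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))` on a schedule and alert below some threshold, for example 14 days.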
- The title is “Building Kubernetes Clusters using Kubernetes”.
- It describes how to build Kubernetes clusters using Kubernetes itself, with Argo Events and Argo Workflows.
- SAP Concur, the subject of this article, uses EKS, but the same concept can be applied to other cloud providers.
○ Note: SAP Concur uses AWS EKS, and a similar concept can be applied to Google’s GKE, Azure’s AKS, or any other cloud provider’s Kubernetes offering.
SRE Weekly Issue #249 December 20th, 2020
Every service needs a couple of big hammers that are easy to swing.
Jennifer Mace — O’Reilly and Google
- The concept of “generic mitigation” is explained using cute illustrations.
Answer: automation. Lots of automation. And automation of the automation.
Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook
- The flow diagram makes it easy to see how the tools for detecting hardware failures connect: automatic and periodic detection, alert firing, automatic repair, and so on.
- It also points to the following four papers for the details.
○ Hardware remediation at scale.
○ Optimizing interrupt handling performance for memory failures in large scale data centers
○ Predicting remediations for hardware failures in large-scale datacenters.
○ Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment.
Oh, how quaint! This article was written back when people traveled for the holidays.
Ashley Roof —
- It introduces tips for being on call during the holiday season.
- The people at Transposit know the pain of being on call themselves, so they got together and came up with the following five tips to make holiday shifts as painless as possible.
○ Share the love (or spread the pain) when organizing on call shifts, and incentivize communal behavior.
○ Communicate early and often, with and without runbooks.
○ Plan around potential travel problems
○ Let friendly allies help you manage the social side of the situation
○ Pat yourself and your team on the back
Surprise! Fortunately, there are some ways to fix this limitation.
Heidi Howard, Ittai Abraham — Decentralized Thoughts
- I covered it in DEVOPS WEEKLY ISSUE #521 above, so I will skip it.
A common question when a company is implementing incident management is: why do we need this process?
It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.
- To answer the question that often comes up when a company implements incident management, “Why do we need this process?”, it describes the following characteristics of failed incident management:
○ Confusion about Process
○ Panic and Thrash
○ Lack of Awareness
○ Uncoordinated & Conflicting Response
○ Confusion over Ownership
○ Repeat Problems
Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.
Tory Thompson — Firehouse
- It describes “Just Culture” as an industry term used to describe a value-based accountability model that considers the behaviors, systems, and expectations that make up an organization.
- Its viewpoint is that fostering a just culture requires a multifaceted approach to managing risk, and that it is important to take a holistic approach when investigating the issues and risks inherent in an organization’s operations. It covers the following topics.
○ Knowledge, systems, safeguards
○ Human performance
○ How we make mistakes
○ Safety and reporting culture
○ Systems and safeguards
○ Our experience
○ Standardization and bias reduction
○ Big data
○ Building trust
Not sold yet on full service ownership for development teams? This interview may help.
Vivian Chan — PagerDuty
- It introduces “full-service ownership” by answering questions in an interview format. The questions are below.
○ Q: First things first, what exactly is a service?
○ Q: So what’s the big deal about full-service ownership? Why should IT and engineering leaders care? Paint me a picture.
○ Q: What is one of the biggest drivers for moving to a model of full-service ownership?
○ Q: Where does one even start?
While ostensibly about Jeli.io, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.
John Allspaw — Adaptive Capacity Labs
- An introductory article on Jeli.io, an analysis platform specializing in software-related incidents, by an angel investor.
A new feature roll-out resulted in impaired service for some customers.
- An incident report for Heroku Connect. Syncing with Salesforce was affected on 25% of production connections.
The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.
- It explains the idea described in the title and the editor’s comment, but since the article is based on a Twitter thread by former Uber engineer McLaren Stanley, the author highly recommends reading the original thread, as follows:
○ I highly recommend reading the original thread in full. My writing above is based solely on that thread, I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.
Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.
Another great example of making a duplicate request to a new API in the background to test it before deploying it.
Michael P. Geraci — OkCupid
- Since they were building the GraphQL API on a whole new stack, they wanted to see how it measured up under production load against the previous REST API without adversely affecting the user experience. This is the story of how they came up with and released the “Shadow Request”.
- With a Shadow Request, the client on a target page loads the page data from the REST API as usual and displays the page, then loads the same data from GraphQL, times the call, and discards the data.
- It describes the improvements they found in the Docker and Node environments, in how GraphQL resolvers work with lists of entities, and in CORS requests.
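The Shadow Request idea can be sketched in a few lines. This is my own simplified, synchronous illustration and not OkCupid's code (theirs runs in the client after the page is displayed); `load_with_shadow` and its parameters are hypothetical names.

```python
import time
from typing import Any, Callable


def load_with_shadow(primary: Callable[[], Any],
                     shadow: Callable[[], Any],
                     record_timing: Callable[[float], None]) -> Any:
    """Serve data from the primary API and time a duplicate call to the new one."""
    data = primary()  # what the user actually sees (e.g. the REST API)

    start = time.perf_counter()
    try:
        shadow()  # same data from the new API (e.g. GraphQL); result is discarded
    except Exception:
        # A failure in the new API must never affect the user-facing path.
        pass
    else:
        record_timing(time.perf_counter() - start)

    return data
```

The design choice that matters is that only the timing of the shadow call is kept: its result and its errors never reach the user, so the new API can be measured under real production load before it is relied on.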
- Google Workspace Status Dashboard
All Google services that use OAuth were unreachable due to an issue with Google’s User ID service. Click through for their report. This one caused issues for the start of my daughters’ school day since Meet and Classroom were down.
- Google Cloud Status Dashboard
Delivery of messages to @gmail.com addresses failed permanently and would not be retried. This report by Google has the details.
- Microsoft Outlook
- Galileo (satellite navigation system)
Outage information for each of the above services.
KubeWeekly # 245 ← No Updates
How did you find these articles? Did any of them interest you?
To be honest, there is some content I cannot digest at this stage, so I will use this aide-memoire and these links to catch up myself, too.