SRE / DevOps / Kubernetes Weekly Collection #48 (Week 53)
- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and leave some comments and useful links as an aide-memoire.
- Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #522 December 27th, 2020
- The title is “How to sell SLOs to Engineering Directors”.
- It shares a redacted internal memo that aims to familiarize readers with SLOs, explain the value of an SLO culture, and describe how the team would implement and roll them out.
- The title is “How Shopify Uses WebAssembly Outside of the Browser”.
- It explains why Shopify chose WebAssembly, a universal format that ensures Partner code is performant, secure, and flexible, from the perspectives of security, performance, flexibility, being community-driven, architecture, and more.
○ We want Partners to focus on using their domain knowledge to solve problems, and not on managing scalable web services. To make this a reality we’re keeping the flexibility of untrusted Partner code, but executing it on our own infrastructure. We choose a universal format for that code that ensures it’s performant, secure, and flexible: WebAssembly.
Details of one team moving away from (some) microservices and merging code back into a monolithic application. Good discussion of the advantages and costs of microservices and how to right-size individual services.
- The title is “Why I’ve Been Merging Microservices Back Into The Monolith At InVision”.
- The author describes merging legacy microservices owned by his team back into the monolith in order to right-size it.
○ At the beginning, he points out: “I am not anti-microservices. This quest is intended to “right size” the monolith. What I am doing is solving a pain-point for my team.” The full passage is as follows.
- “To be very clear, I wanted to start this post off by stating unequivocally that I am not anti-microservices. My merging of services back into the monolith is not some crusade to get microservices out of my life. This quest is intended to “right size” the monolith. What I am doing is solving a pain-point for my team. If it weren’t reducing friction, I wouldn’t spend so much time (and opportunity cost) lifting, shifting, and refactoring old code”.
- It was very helpful to see the problems that microservices solve, how the company introduced them, and what the author would do if it were to be redone.
○ “In short, all the benefits of Conway’s Law for the organization have become liabilities over time for my “legacy” team. And so, we’ve been trying to “right size” our domain of responsibility, bringing balance back to Conway’s Law. Or, in other words, we’re trying to alter our service boundaries to match our team boundary. Which means, merging microservices back into the monolith.”
○ “A far more helpful term would have been, “right sized”. Microservices were never intended to be “small services”, they were intended to be “right sized services.””
○ “For my team, “right sized” means fewer repositories, fewer deployment queues, fewer languages, and fewer operational dashboards. For my rather small team, “right sized” is more about “People” than it is about “Technology”. So, in the same way that InVision originally introduced microservices to solve “People problems”, my team is now destroying those very same microservices in order to solve “People problems”.”
- The title is “Uber’s Real-Time Push Platform”.
- It describes how Uber moved its app updates from polling to its own push platform, built on a gRPC-based bi-directional streaming protocol.
- I did a double take at the heading “Eliminating polling, introducing RAMEN”. RAMEN is an abbreviation for Realtime Asynchronous MEssaging Network.
- I got hungry when I saw the words “RAMEN Server” and “Scaling RAMEN globally”. Will Uber Eats deliver ramen noodles to me?
- The title is “How GitOps Improves the Security of Your Development Pipelines”.
- A summary article of a session from the virtual event “GitOps Days 2020”. The YouTube video of the session is embedded.
- It explains GitOps in terms of the following three points, noting that you can control and view changes from a single source.
○ Config as Code
○ Changes are auditable
○ Production matches the desired state kept in Git
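To make the third point concrete, here is a minimal Python sketch of drift detection, the kind of comparison a GitOps tool performs before reconciling. It is my own illustration, not code from the session: `find_drift`, the `ABSENT` marker, and the sample keys (`image`, `replicas`, `debug`) are all hypothetical, and real tools compare full manifests rather than flat dictionaries.

```python
from typing import Any, Dict, Tuple

# Hypothetical marker for "this key does not exist on that side".
ABSENT = "<absent>"


def find_drift(desired: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, live_value)} for every key that differs."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want = desired.get(key, ABSENT)
        have = live.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Desired state as it would be read from Git (assumed sample values).
desired_state = {"image": "web:1.4.2", "replicas": 3}
# Live state as it would be read from the cluster (assumed sample values).
live_state = {"image": "web:1.4.2", "replicas": 5, "debug": True}

if __name__ == "__main__":
    for key, (want, have) in find_drift(desired_state, live_state).items():
        print(f"{key}: desired={want!r} live={have!r}")
```

An empty result means production matches Git. Anything else is either an unaudited manual change or a commit that has not been applied yet.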
- The title is “Compiling Qt with Docker multi-stage and multi-platform”.
- An article about building the cross-platform development framework Qt with Docker multi-stage and multi-platform builds.
- For embedded devices, compiling is not easy, and compiling Qt (and QtWebEngine) is a very heavy operation. Therefore, the author precompiles and distributes Qt so that the Dockerfile downloads and includes it in the build process (rather than compiling it as part of the installation process).
- The title is “Open Telemetry Java: All you need to know”.
- A tutorial that explains how to attach the OpenTelemetry Java Agent, how to trace methods, how to work with spans, and so on.
- Click here for the GitHub page.
- The GitHub page of Tobs (The Observability Stack for Kubernetes), a tool that makes it as easy as possible to install a full observability stack on a Kubernetes cluster.
- It provides CLI tools that make deployment and operation easy, and also provides Helm charts that can be used directly or as subcharts for other projects.
- As mentioned above, the GitHub page of the OSS tool “Singer” for ETL. It sends data between databases, web APIs, files, queues, and anything else you can think of.
- Click here for the GitHub page.
- As the name implies, the GitHub page of “grafana-sync”, a tool for synchronizing Grafana dashboards.
SRE Weekly Issue #250 December 27th, 2020
Here’s how Algolia was affected by the Salt Stack RCE vulnerability earlier this year and how they dealt with it.
Julien Lemoine — Algolia
- A post-mortem, dated 2020/05/05, of the Salt-related incident on 2020/05/03.
- The vulnerability “CVE-2020-11651” in the Salt configuration management tool was used to attack Algolia’s infrastructure, allowing two types of malware to infiltrate Algolia’s configuration manager.
Includes background information on SRE and example interview questions.
Marlo Vernon — Splunk
- This article is for those interviewing for an SRE position, and is divided into the following three sections. I think the contents will be helpful for employers as well.
○ What is a site reliability engineer? (SRE)
○ Primary roles and responsibilities of an SRE
○ Questions to expect in a site reliability engineer interview
DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.
Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang
- As the title suggests, in this Halloween feature each of the six CTOs shares their experience of outages to help prevent them from happening in the future.
In this story of a failover gone wrong, they discovered that they had had innodb_flush_log_at_trx_commit set incorrectly, explaining how they lost data when they weren’t expecting to.
Rajeev Rai — Razorpay
- It shares the timeline, response details, and lessons learned from a failed RDS Multi-AZ failover that the company experienced in 2019.
This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.
Piyush Verma — Last9
- As mentioned above, it’s a comic-style article. I felt that SREs themselves and the people around them need to understand the role and avoid relying on individual, ad hoc measures, so that SREs can focus on “engineering” and “observing” their platforms.
○ “SREs should’ve been engineering and observing the bridge, but instead they became the bridge.”
Lots of great concepts about human/computer systems, including this gem:
log facts, not interpretations
- A step-by-step write-up of the author’s talk at the online conference “Deserted Island DevOps Summer Send-Off”, held during the COVID-19 pandemic. The pasted images are lighthearted, but the content is substantial.
- The entire conference was held inside the game “Animal Crossing”.
- Click here for a retrospective article on the conference.
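As a small illustration of “log facts, not interpretations”, here is how I read the idea in Python. This is my own sketch, not an example from the talk: the logger name `checkout`, the helper `describe_upstream_failure`, and the sample values are all hypothetical.

```python
import logging

logger = logging.getLogger("checkout")


def describe_upstream_failure(status: int, elapsed_s: float, attempt: int) -> str:
    """Build a log line from observed values only, with no guess about the cause."""
    return f"upstream returned status={status} after {elapsed_s:.1f}s (attempt {attempt})"


# Interpretation (a guess that may turn out to be wrong):
#   logger.error("payment service is down")
# Fact (what was actually observed):
logger.error(describe_upstream_failure(502, 30.0, 3))
```

The factual line stays true even if the root cause later turns out to be something else, such as a proxy timeout.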
In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.
- As the title suggests, they investigated 502 errors that were occasionally returned for API requests. The cause was that the TCP backlog length was set to 1 instead of the default value, 128.
Jordan Place — Transposit
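For readers who have not set a TCP backlog by hand, here is a minimal Python sketch of where that number lives. It is not the stack used in the article: `open_listener` is a hypothetical helper, and the value 128 is simply the default quoted above. The backlog limits how many connections may wait to be accepted, so with a value of 1 a short burst of requests can be refused or dropped, which a proxy in front may then report as a 502.

```python
import socket


def open_listener(backlog: int = 128) -> socket.socket:
    """Open a TCP listener on localhost with an explicit accept backlog."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Port 0 asks the OS to pick any free port.
    server.bind(("127.0.0.1", 0))
    # The backlog is a hint to the kernel, which may clamp or adjust it.
    server.listen(backlog)
    return server


if __name__ == "__main__":
    server = open_listener(backlog=128)
    print("listening on port", server.getsockname()[1])
    server.close()
```

Because the value is usually passed deep inside a framework or library, a dependency upgrade can change it without any change in your own code.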
Google released an update to their post-analysis for the December 14th outage involving Google OAuth.
- The editor picked up the post-analysis listed under Outages last week again, since it was updated with the following correction.
○ “The following is a correction to the previously posted ISSUE SUMMARY, which after further research we determined needed an amendment. All services that require sign-in via a Google Account were affected with varying impact. Some operations with Cloud service accounts experienced elevated error rates on requests to the following endpoints: www.googleapis.com or oauth2.googleapis.com. Impact varied based on the Cloud Service and service account. Please open a support case if you were impacted and have further questions.”
KubeWeekly #245 ← No updates
How did you find these articles? Did any of them catch your interest?
Actually, there is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself too.