SRE / DevOps / Kubernetes Weekly Collection #48 (Week 53)

  • In this blog post series, I collect articles from the following 3 weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.

DEVOPS WEEKLY ISSUE #522 December 27th, 2020
SRE Weekly Issue #250 December 27th, 2020
KubeWeekly # 245 ← No Updates

DEVOPS WEEKLY ISSUE #522 December 27th, 2020


SLOs, like any sort of change, generally need management buy-in in order to be implemented properly. This post is a good example of how one team went about selling SLOs to engineering management.

  • The title is “How to sell SLOs to Engineering Directors”.

A nice description of using WebAssembly to allow for untrusted code execution. WASM as a generic extension capability I think will become pretty common, nice to see a specific approach.

  • The title is “How Shopify Uses WebAssembly Outside of the Browser”.

Details of one team moving away from (some) microservices and merging code back into a monolithic application. Good discussion of the advantages and costs of microservices and how to right-size individual services.

  • The title is “Why I’ve Been Merging Microservices Back Into The Monolith At InVision”.

A good architecture post on building a realtime platform API, moving from polling to gRPC-based bi-directional streaming.

  • The title is “Uber’s Real-Time Push Platform”.

A post on how applying GitOps practices can improve the security characteristics of your deployment pipeline.

  • The title is “How GitOps Improves the Security of Your Development Pipelines”.
  1. Config as Code

Most Dockerfiles are simple, but it’s possible to solve more complex problems too. This example shows cross-compilation patterns for expensive compilation targets.

  • The title is “Compiling Qt with Docker multi-stage and multi-platform”.

A look at OpenTelemetry and in particular its usage in Java applications.

  • The title is “Open Telemetry Java: All you need to know”.


Tobs is a distribution of monitoring tools for Kubernetes. Set up a full stack with Prometheus, Grafana, Promscale, Promlens and more with Helm or a custom CLI.

  • The GitHub page of Tobs (The Observability Stack for Kubernetes), a tool that makes it as easy as possible to install a full observability stack on a Kubernetes cluster.

Singer is an open source toolkit for ETL. At its core is a specification, and a system of taps (for extracting data) and targets (for saving it).

  • As mentioned above, the GitHub page of the OSS tool “Singer” for ETL. It sends data between databases, web APIs, files, queues, and anything else you can think of.
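
To make the tap/target idea a little more concrete, here is a minimal sketch of what a tap could look like. It assumes the message shapes described in the Singer specification (JSON lines of type SCHEMA and RECORD written to standard output); the stream name "users" and the sample row are made up for illustration, and a real tap would also handle configuration and STATE messages.

```python
import json
import sys


def build_messages(stream, rows):
    """Build Singer-style messages: one SCHEMA message, then one RECORD per row."""
    schema = {
        "type": "SCHEMA",
        "stream": stream,
        # Hypothetical schema for the illustrative "users" stream.
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    }
    records = [{"type": "RECORD", "stream": stream, "record": row} for row in rows]
    return [schema] + records


def emit(messages, out=sys.stdout):
    """Write each message as one JSON line, which a target would read from stdin."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    emit(build_messages("users", [{"id": 1, "name": "Ada"}]))
```

A target would sit on the other side of a shell pipe and consume these lines, which is why taps and targets can be mixed freely.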

Grafana-sync is a handy tool for syncing dashboards between Grafana installs using the Grafana API.

  • As the name implies, the GitHub page of “grafana-sync”, a tool for synchronizing Grafana dashboards.

SRE Weekly Issue #250 December 27th, 2020


Salt Incident: May 3rd 2020 Retrospective and Update

Here’s how Algolia was affected by the Salt Stack RCE vulnerability earlier this year and how they dealt with it.

Julien Lemoine — Algolia

  • A post-mortem, dated 2020/05/05, of the incident on 2020/05/03 caused by the Salt vulnerability.

How to Prepare for a Site Reliability Engineer Interview

Includes background information on SRE and example interview questions.

Marlo Vernon — Splunk

  • This article is for people interviewing for an SRE position, and is divided into the following three sections. I think the contents will be helpful for employers as well.
    ○ What is a site reliability engineer? (SRE)
    ○ Primary roles and responsibilities of an SRE
    ○ Questions to expect in a site reliability engineer interview

6 Scary Outage Stories from CTOs

DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.

Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang

  • As the title suggests, in this Halloween-themed article six CTOs each share their experience of an outage, to help prevent similar outages in the future.

The Day of the RDS Multi-AZ Failover

In this story of a failover gone wrong, they discovered that they had had innodb_flush_log_at_trx_commit set incorrectly, explaining how they lost data when they weren’t expecting to.

Rajeev Rai — Razorpay

  • It shares the timeline, the response details, and the lessons learned from an RDS Multi-AZ failover failure that the company experienced in 2019.
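
As a note to myself on why this setting matters, here is a small sketch of how the three values of innodb_flush_log_at_trx_commit differ, based on my understanding of the MySQL documentation. The helper names are my own, and the summary above does not say which value the team actually had, so this is only an illustration of the trade-off.

```python
# Meaning of each innodb_flush_log_at_trx_commit value, as I understand the MySQL docs.
FLUSH_MODES = {
    0: "log is written and flushed to disk about once per second; "
       "a crash can lose up to about a second of commits",
    1: "log is written and flushed to disk at every commit; "
       "the default, needed for full ACID durability",
    2: "log is written at every commit but flushed about once per second; "
       "an OS crash or power loss can lose up to about a second of commits",
}


def describe_flush_mode(value: int) -> str:
    """Return a short description of the given setting value."""
    if value not in FLUSH_MODES:
        raise ValueError(f"unsupported value: {value}")
    return FLUSH_MODES[value]


def is_fully_durable(value: int) -> bool:
    """Only value 1 flushes the redo log to disk on every commit."""
    describe_flush_mode(value)  # validates the value
    return value == 1


if __name__ == "__main__":
    for mode in sorted(FLUSH_MODES):
        print(mode, is_fully_durable(mode), describe_flush_mode(mode))
```

With anything other than 1, committed transactions that were not yet flushed can disappear when the instance goes down, which fits the kind of unexpected data loss described in the story.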

Much that we’ve gotten wrong about Site Reliability Engineering

This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.

Piyush Verma — Last9

  • As mentioned above, it’s a comic-style article. My takeaway is that both SREs and the people around them need to understand the role, and should be careful not to rely on personal, ad hoc measures, so that SREs can spend their time “engineering” and “observing” their platforms.
    ○ “SREs should’ve been engineering and observing the bridge, but instead they became the bridge.”

You Reap What You Code

Lots of great concepts about human/computer systems, including this gem:

log facts, not interpretations

Fred Hebert
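
  • To remind myself what “log facts, not interpretations” could look like in practice, here is a small sketch of my own (it is not taken from the talk). The event name and field names are hypothetical; the point is that the log line records what was observed and leaves the diagnosis to the reader.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("example")


def format_fact(event: str, **fields) -> str:
    """Render an observed event and its measured fields as 'event key=value ...'."""
    parts = [f"{key}={fields[key]}" for key in sorted(fields)]
    return " ".join([event] + parts)


if __name__ == "__main__":
    # Interpretation: guesses at a cause that the code cannot actually know.
    log.warning("database is overloaded")

    # Fact: what was observed, with the numbers attached.
    log.warning(format_fact("query_timed_out", elapsed_ms=5000, timeout_ms=5000))
```

The second line stays true even if the real cause later turns out to be something else, such as a network problem.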

The Mysterious Case of the Bad Gateway (502)

In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.

  • As the title suggests, they investigated the 502 error that was occasionally returned for API requests. The cause was that the TCP backlog length was set to 1 instead of the default value, 128.

Jordan Place — Transposit
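
  • For reference, the backlog is the argument passed to listen() on a server socket. The sketch below uses Python's standard socket module and is my own illustration, not the code from the article; the function name and the loopback address are assumptions. With a backlog of 1, only a very short queue of pending connections is kept, so a burst of incoming connections is more likely to be refused or dropped before the application accepts them.

```python
import socket

DEFAULT_BACKLOG = 128  # the default value mentioned in the summary above


def make_listener(backlog: int = DEFAULT_BACKLOG) -> socket.socket:
    """Create a TCP listening socket on an ephemeral loopback port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(backlog)         # backlog = requested length of the pending-connection queue
    return srv


if __name__ == "__main__":
    tiny = make_listener(backlog=1)   # the problematic setting from the story
    normal = make_listener()          # the usual default
    print("listening on ports", tiny.getsockname()[1], normal.getsockname()[1])
    tiny.close()
    normal.close()
```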

Google Cloud Platform

Google released an update to their post-analysis for the December 14th outage involving Google OAuth.

  • The editor picked up the post-analysis listed under Outages last week again, because the following point was updated.
    ○ “The following is a correction to the previously posted ISSUE SUMMARY, which after further research we determined needed an amendment. All services that require sign-in via a Google Account were affected with varying impact. Some operations with Cloud service accounts experienced elevated error rates on requests to the following endpoints: or Impact varied based on the Cloud Service and service account. Please open a support case if you were impacted and have further questions.”

KubeWeekly # 245 ← No Updates

What do you think of these articles? Are any of them of interest to you?

Actually, there is some content here that I cannot digest at this stage, so I’ll use this aide-memoire and the links to catch up myself too.

Bye now!!

Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece (1986–1992). #Network, #Kubernetes, #GCP, #Certified AWS SAP
