SRE / DevOps / Kubernetes Weekly Collection #8 (Week 13)

  • In this blog post series, I collect items from the three weekly mailing lists I subscribe to and add some comments as an aide-memoire, along with useful links.
  • I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people who browse this kind of information.

DEVOPS WEEKLY ISSUE #482 March 22nd, 2020
SRE Weekly Issue #212 March 23rd, 2020
KubeWeekly #209: March 27th, 2020

DEVOPS WEEKLY ISSUE #482 March 22nd, 2020

A detailed look at the Envoy proxy, focused on usage patterns rather than technology. It is interesting to look at the different use cases, both low level and integrated into higher-level service mesh tooling.

  • The title is “On the state of Envoy Proxy control planes”.
  • A personal blog post by Matt Klein, Software Engineer at Lyft and creator of Envoy, which also appeared in last week’s KubeWeekly #208 (March 20th, 2020). It surveys the state of Envoy Proxy control planes and gives his outlook for the next few years.

An interesting post looking at using Kubernetes DaemonSets to help cluster administrators manage the cluster. Nice examples of provisioning SSH access to cluster nodes and running virus scanning.

  • A proposal for using DaemonSets to manage the software, systems, and configuration required to run a production environment.
  • Personally, I was curious about the “aikido” label. When I opened the personal site of the author, James Hunt, a “Chrono Trigger” wallpaper jumped out at me.

A discussion of abstractions and how that maps to serverless architectures and some thoughts on configuration management.

  • The title is “Abstractions and serverless”.
  • It explains the significance of abstraction and digs deep into serverless in that context.
  • Creating a system with clear abstraction boundaries that can be understood, maintained, and evolved is a problem that many IT systems need to solve.

A discussion of incident management practices, in particular looking at involving developers in incident response and on-call activities.

  • The title is “Involving Engineers in Incident Management: QCon London Q&A”.
  • A Q&A based on the session by Samuel Parkinson, Principal Engineer at the Financial Times, at QCon London (held from March 2nd to March 6th). He talked about encouraging engineers to get involved in incident management and the benefits of learning from past incidents.
  • The author’s comment that there is always something new to discover in past incidents, and that new members bring perspectives the existing members did not notice, sounds natural. I felt that whether a team can take such a stance depends heavily on the character of its members, the leadership of its leader, and the team’s history.

Another post on architectural approaches to splitting up a large monolithic application, in particular looking at the strangler pattern and the importance of observability.

  • The title is “Break that big ball of mud!”, an article from October 2017.
  • The content was originally published on the “NDC 2016 Blog”. The author, a Star Wars fan, uses the Force, Yoda, the Death Star, and so on in his explanation.
  • Borrowing Yoda’s words from Star Wars, he explains that 15 years of coding experience dealing with legacy code brings a considerable amount of fear, anger, hate, and pain.

Analysis of a recent paper on analyzing characteristics of serverless usage, looking at the interesting optimisation where users want fast function start times and the cloud provider wants to minimise resources consumed.

  • The title is “Serverless in the wild: characterizing and optimising the serverless workload at a large cloud provider”.
  • Part of Adrian Colyer’s series that picks up CS research papers at random.
  • This time, he covers a paper from arXiv (pronounced “archive”) on the characteristics and optimization of serverless workloads at a major cloud provider (Azure). Click here for the PDF version.
  • The author noticed the paper because Jonathan Mace tweeted about it on Twitter.
  • It is explained with diagrams and graphs from the viewpoints of cold starts, pre-warming, keep-alive, idle time, resource management, and cost. Interesting.
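
To make the trade-off between cold starts and keep-alive a little more concrete, here is a minimal sketch. It is not the policy from the paper; it only assumes a fixed keep-alive window and a made-up invocation trace, and the function name `count_cold_starts` is hypothetical.

```python
def count_cold_starts(invocation_times, keep_alive):
    """Count cold starts for one function under a fixed keep-alive window.

    invocation_times: ascending timestamps in seconds (a made-up trace).
    keep_alive: seconds the instance stays loaded after each invocation.
    """
    cold_starts = 0
    warm_until = None  # no instance is loaded before the first call
    for t in invocation_times:
        if warm_until is None or t > warm_until:
            cold_starts += 1  # the instance was unloaded, so this call starts cold
        warm_until = t + keep_alive  # every call restarts the keep-alive timer
    return cold_starts


trace = [0, 30, 500, 710, 2000]  # hypothetical invocation times in seconds
print(count_cold_starts(trace, keep_alive=600))  # long window: fewer cold starts
print(count_cold_starts(trace, keep_alive=60))   # short window: more cold starts
```

A longer window gives users fewer cold starts, but the provider keeps idle instances in memory for longer, which is the tension the article describes.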

A post on the importance of egress filtering of network traffic. Although this particular post talks about Serverless, this is relevant to any architecture or infrastructure I think.

  • The title is “Egress Filtering in Serverless Applications”.
  • It explains why filtering outbound traffic, which is often overlooked in serverless apps, is important, along with methods and examples of the risks.
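
As a rough illustration of the idea of egress filtering, the sketch below checks an outbound URL against an allowlist before a request would be made. The host names and the function `is_egress_allowed` are hypothetical and not taken from the article. Real deployments usually enforce this at the network layer (for example with firewall or proxy rules) rather than only in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts this function is expected to call.
ALLOWED_HOSTS = {"api.example.com", "storage.example.com"}


def is_egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only if the URL's host is explicitly allowlisted."""
    host = urlsplit(url).hostname  # lower-cased host, or None if there is none
    return host is not None and host in allowed_hosts


print(is_egress_allowed("https://api.example.com/v1/items"))
print(is_egress_allowed("https://attacker.example.net/exfiltrate"))
```

The check is an exact match on the host name, so a look-alike such as `api.example.com.evil.test` is rejected.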

Backstage is described as a platform for building developer portals. It has an impressive vision, to become a standard toolbox for the open source infrastructure landscape.

  • The web page of “Backstage”, a tool that provides a unified front-end portal for developers.
  • Click here for the GitHub page.

Docker released a useful new GitHub Action which makes building and publishing Docker images easier. Some nice touches like automatic tagging and building multiple tags.

With a surge of developers and IT practitioners working remotely, there’s also a surge of confusion and operational inefficiency. See how data and automation is improving the way DevOps and IT operations engineers build, release and maintain reliable services remotely:

  • A blog post from VictorOps, a sponsor of DevOps Weekly.
  • The title is “Using Data and Automation to Help Engineering Teams Work Remotely”.
  • On remote work, the topic attracting the most attention from engineers these days, the post covers the automation and data integration that should be considered, with reference to the Network Operations Center (NOC) model and others, and proposes trying their own service for 14 days as a solution.

SRE Weekly Issue #212 March 23rd, 2020

Meaningful availability

This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.

Hauer et al. — NSDI’20 (original paper) Adrian Colyer — The Morning Paper (summary)

  • Another entry from Adrian Colyer’s series on CS research papers, which also appeared in DevOps Weekly above.
  • This time, he covers a paper from NSDI ’20 (Santa Clara, CA, from February 25 to February 27), hosted by USENIX, on the Google G Suite team’s search for a meaningful availability metric. Click here for the PDF.
  • It was recommended to the author by Damien Mathieu.
  • The text defines its keywords properly. For example, “meaningful” means capturing the user experience.
  • This is content worth reading repeatedly and discussing in depth. Homework for me.

Our Top 5 On-Call Practices — Blameless: Better Reliability Through SRE

Their top 5 are:

  • Use Meaningful Severity Levels
  • Create Detailed Runbooks
  • Load Balance Through Qualitative Metrics
  • Get Ahead of Incidents
  • Cultivate a Culture of On-Call Empathy

Emily Arnott — Blameless

  • Opening with the remark that you may consider on-call a necessary evil, the article presents the five best practices above as ways to make your team more responsive, build more resilient systems, and minimize repeated interruptions.

NTP: Building a more accurate time service at Facebook scale

Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.

Zoe Talamantes and Oleg Obleukhov — Facebook

  • An introduction to the importance and accuracy of NTP (Network Time Protocol) at Facebook’s scale.
  • The detailed comparison of chrony and ntpd is helpful.
  • I have not paid attention to PTP (Precision Time Protocol) so far, so I would like to look into it as well.

The Fallacy of Move Fast and Break Things

You might end up just breaking things.

Dawn Parzych — LaunchDarkly

  • The content of the article the author originally published on .
  • Mark Zuckerberg’s Facebook motto, “move fast and break things,” was adopted by many development teams and imitated by many companies hoping to become unicorns, but the author argues that it is not working well for teams across the industry.
  • “High-performance teams have good systems and processes that help this idea work, and don’t take it at face value,” the author says, and suggests putting the right tools in place.

InSearch: LinkedIn’s new message search platform

LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their messages. It only builds a search index the first time a user performs a search.

Suruchi Shah and Hari Shankar — LinkedIn
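
The idea of building an index only on the first search can be sketched as follows. This is a toy in-memory version under my own assumptions, not LinkedIn’s implementation. The class `LazyMessageSearch` and its methods are hypothetical names.

```python
from collections import defaultdict


class LazyMessageSearch:
    """Toy per-user message search that indexes a user only on first search."""

    def __init__(self):
        self._messages = defaultdict(list)  # user_id -> list of message texts
        self._indexes = {}  # user_id -> {word: set of message positions}

    def add_message(self, user_id, text):
        self._messages[user_id].append(text)
        index = self._indexes.get(user_id)
        if index is not None:
            # Only users who have already searched pay the indexing cost.
            self._index_message(index, len(self._messages[user_id]) - 1, text)

    def search(self, user_id, word):
        if user_id not in self._indexes:
            # First search by this user: build the index now.
            index = {}
            for position, text in enumerate(self._messages[user_id]):
                self._index_message(index, position, text)
            self._indexes[user_id] = index
        positions = self._indexes[user_id].get(word.lower(), set())
        return [self._messages[user_id][i] for i in sorted(positions)]

    def has_index(self, user_id):
        return user_id in self._indexes

    @staticmethod
    def _index_message(index, position, text):
        for word in text.lower().split():
            index.setdefault(word, set()).add(position)


store = LazyMessageSearch()
store.add_message("alice", "Lunch on Friday")
store.add_message("bob", "hello there")
print(store.has_index("alice"))          # no index until alice searches
print(store.search("alice", "friday"))   # the index is built here
print(store.has_index("bob"))            # bob never searched, so no index
```

Users who never search cost nothing beyond message storage, which is the saving the article points to.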

Destiny 2 Outage and Rollback

This followup post from Bungie covers two related incidents in February that caused loss of user data.


  • The story of an outage and rollback in the game “Destiny 2”, developed and operated by Bungie.

Involving Engineers in Incident Management: QCon London Q&A

An interview about how one company got their developers to join the on-call rotation. It covers how they trained them to help them build confidence and what benefits they got by joining.

Ben Linders — InfoQ

  • I will skip this one because it is covered under DEVOPS WEEKLY ISSUE #482 above.

The text of this incident originally mentioned Heroku, and it lines up with the Heroku outage below.
They also had this unrelated outage.

Heroku suffered two short bouts of 85% request failure to applications hosted on their platform.

Separately, they recently posted a couple of followup reports for previous incidents:
  • Incident #1961: logging outage
  • Incident #1968: EU application errors

KubeWeekly #209: March 27th, 2020

Editor’s pick of the highlights from the past week.

Kubernetes 1.18

Kubernetes 1.18 is the first release of 2020! Kubernetes 1.18 consists of 38 enhancements: 15 enhancements are moving to stable, 11 enhancements in beta, and 12 enhancements in alpha.

Kubernetes 1.18 is a “fit and finish” release. Significant work has gone into improving beta and stable features to ensure users have a better experience. An equal effort has gone into adding new developments and exciting new features that promise to enhance the user experience even more. Having almost as many enhancements in alpha, beta, and stable is a great achievement. It shows the tremendous effort made by the community on improving the reliability of Kubernetes as well as continuing to expand its existing functionality.

  • Release information for Kubernetes version 1.18. As mentioned above, the release consists of 38 enhancements (15 moving to stable, 11 in beta, and 12 in alpha).
  • Check out the release logo, major changes, release notes, GitHub download page, and other essential information and links.

Kubernetes 1.18, with release team manager Jorge Alarcon

Adam Glick and Craig Box, Kubernetes Podcast from Google

Kubernetes 1.18 is out — almost! A bug has pushed it back a day. While you’re waiting, release team lead Jorge Alarcon will tell you all about the fit and finish you can expect in the release when it’s out tomorrow. Adam and Craig bring you the other community news of the week, as well as some podcast follow-up.

Weekly recap of CNCF member and project webinars that you might have missed.

You can view all CNCF recorded and upcoming webinars here.

CNCF Project Webinar: How to Migrate a MySQL Database to Vitess

Liz van Dijk, Solution Architect and Field Operations @PlanetScale

  • A webinar video explaining how to migrate a MySQL database to Vitess, by Liz van Dijk, Solution Architect & Field Operations at PlanetScale.
  • It includes a demo and is easy to follow.
  • Note that the audio sometimes skips, possibly because of the speaker’s network environment during recording.

CNCF Member Webinar: Lowering the Barrier to Kubernetes Proficiency — Navigating the Stormy Seas of Information Overload

Angel Rivera, Developer Advocate @CircleCI

  • Angel Rivera, Developer Advocate at CircleCI, explains Kubernetes with the aim of lowering the barrier to proficiency for new Kubernetes learners.
  • The talk carefully covers the background that made Kubernetes necessary, along with abbreviations, resources, components, and so on.

Tutorials, tools, and more that take you on a deep dive into the code.

Anatomy of my Kubernetes Cluster

Antonin Stefanutti

  • Antonin Stefanutti, a Software Engineer at Red Hat, dissects the home Kubernetes cluster he built to his own requirements. Great!

Writing Kubernetes network policies with Inspektor Gadget’s Network Policy Advisor

Alban Crequy, Kinvolk

  • Alban Crequy, CTO & co-founder of Kinvolk, introduces how to write network policies with the “Network Policy Advisor” of Inspektor Gadget, an open source collection of gadgets for debugging and inspecting apps on Kubernetes.
  • They are looking for contributors and invite people to join the discussion in #inspektor-gadget on Kubernetes Slack.

Okteto Push — Your Code to Kubernetes in Seconds

Pablo Chico de Guzman, Okteto

Converting an Old MacBook Into an Always-On Personal Kubernetes Cluster

Sid Palas, DevOps Directive

  • The author wanted a Kubernetes cluster that was always up, so he describes turning an unused “2012 MacBook Air” he had on hand into one.

Quality of Service and OOM in Kubernetes

Ciro S. Costa, OpsTips

  • An article on the personal blog of Ciro S. Costa, a Software Engineer at VMware, Inc. (I checked his title as updated on Twitter).
  • He has been using Kubernetes for quite a long time, but had not personally dug deep into how resources work at the node level, which is the subject of this article.
  • Pod eviction for the three QoS (quality of service) classes, OOM scores, the cgroup tree, per-cgroup memory, and the kubelet are explained carefully.
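
For my own reference, the way a Pod ends up in one of the three QoS classes can be sketched as below. This is a simplified model based on the documented Kubernetes rules, not code from the article. The function `qos_class` and the dict layout are hypothetical, only `cpu` and `memory` are considered, and quantities are compared as plain values (so "500m" and "0.5" would not be treated as equal here).

```python
def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable, or BestEffort.

    containers: list of dicts with optional 'requests' and 'limits' dicts,
    each keyed by 'cpu' and/or 'memory' (a hypothetical, simplified layout).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults an unset request to the limit when a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"  # no container sets any CPU or memory request or limit
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "500m", "memory": "1Gi"}}]))
print(qos_class([{"requests": {"memory": "256Mi"}}]))
print(qos_class([{}]))
```

Under memory pressure, BestEffort Pods are the first candidates for eviction and Guaranteed Pods the last, which is why the class matters for the OOM behavior the article walks through.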

Kubernetes secrets

Ciro S. Costa, OpsTips

  • The same author as the one above.
  • An article that examines and explains how the kubelet makes Kubernetes Secrets available to processes on a node.
  • Secret management is a theme I see a lot these days, but I don’t understand it well myself. This is also homework for me.

Setting up a ProxySQL Sidecar Container

Jake Davis, Percona

  • An introduction to setting up a ProxySQL sidecar by Jake Davis, a DBA (Database Administrator) at Percona.
  • Their customer, Duolingo, had reached Aurora’s maximum of 16,000 connections (the hard limit for all instance classes), but with ProxySQL sidecars the peak (as of March 23, 2020) is said to stay at around 6,000.

OpenShift 4.4 OKD Bare Metal Install on VMWare Home Lab

Craig Robinson, East Carolina University

  • OKD is the upstream, community-supported version of Red Hat’s OCP (OpenShift Container Platform).
  • The author explains the setup so that you can test an OKD 4.4 cluster in your home environment.
  • All you need is basic knowledge of virtualization platforms and Linux, and the ability to search Google.
  • The screenshots and explanations are generous.

Building a TODO API in Golang with Kubernetes

Alex Ellis

  • A tutorial by CNCF Ambassador Alex Ellis for Kubernetes beginners who want to write a practical Go API for a to-do list and deploy and manage it on Kubernetes.

A Guide On The Installation Of Spinnaker in Kubernetes Cluster

Vikas Saini, Magalix

  • An article that explains the procedure to install Spinnaker on GKE using halyard, which is a tool for installing, configuring, and updating Spinnaker.

A Primer: Continuous Integration and Continuous Delivery (CI/CD)

Catherine Paganini, Kublr

  • A series of articles explaining IT concepts to business leaders.
  • The theme this time is CI/CD, and explanations are given alongside keywords and diagrams that are easy to imagine.

Articles, announcements, and more that give you a high-level overview of challenges and features.

Threading the Needle on Kubernetes Complexity with AI-Powered Observability

Andreas Grabner,

  • The article argues that the complexity of Kubernetes and the large amount of data it generates need to be handled by AI-powered observability products. It does not discuss concrete tools.

A ‘No-BS’ Checklist for Kubernetes

Oleg Chunikhin, Kublr

  • No-BS = no “Bad Staff”. The author has created and shared a checklist for identifying vendors and services that lack the requirements needed to run Kubernetes in an enterprise production environment. “Nice-to-Haves” are also listed as desirable elements.

Upcoming CNCF webinars
You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Container Security at Scale: Lessons Learned from the Front Lines with ABN AMRO and Palo Alto Networks
Wiebe de Roos, CI/CD Consultant @Flusso and ABN Amro
Keith Mokris, Technical Marketing Engineer @Palo Alto Networks
Member webinar
April 1, 2020 10:00 AM Pacific Time

Taming Your AI/ML Workloads with Kubeflow The Journey to Version 1.0
David Aronchick @Microsoft
Elvira Dzhureava, Technical Product Engineer AI/M @Cisco
Johnu George, Technical lead @Cisco Systems
Member webinar
April 2, 2020 9:00 AM Pacific Time

Welcome to CloudLand! An Illustrated Intro to the Cloud Native Landscape
Kaslin Fields, Developer Advocate @Google
Ambassador webinar
April 3, 2020 10:00 AM Pacific Time

Pravega: Rethinking storage for streams
Member webinar
April 7, 2020 10:00 AM Pacific Time

Best Practices for Deploying a Service Mesh in Production: From Technology to Teams
Member webinar
April 8, 2020 10:00 AM Pacific Time

New thoughts on distributed file system in the cloud native era
Member webinar
April 9, 2020 10:00 AM Pacific Time

Declarative Host Upgrades From Within Kubernetes
Adrian Goins, Director of Community and Evangelism @Rancher Labs
Dax McDonald, Software Engineer @Rancher Labs
Jacob Blain Christen, Principal Software Engineer @Rancher Labs
Member webinar
April 14, 2020 10:00 AM Pacific Time

杨雨 Alex Yang, 解决方案架构师 Solution Architect @Mirantis
张文墨 Larry Zhang, 解决方案架构师 Solution Architect @Mirantis
Member webinar
This webinar will be delivered in Chinese
April 23, 2020 10:00 AM China Standard Time

Kubernetes 1.18
Kubernetes team
Project webinar
April 23, 2020 9:00 AM Pacific Time

Pivoting Your Pipeline from Legacy to Cloud Native
Tracy Ragan, CEO of DeployHub and CDF Board Member
Member webinar
June 30, 2020 10:00 AM Pacific Time

How did you find these articles? Did any of them interest you?

To be honest, there is some content I have not been able to digest at this stage, so I will use this aide-memoire and these links to catch up myself, too.

Bye now!!

Yoshiki Fujiwara

Written by

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #GCP, #AWS SAP, #National Tour Guide for English
