SRE / DevOps / Kubernetes Weekly Collection #20 (Week 25)

  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
  • I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #494 June 14th, 2020
SRE Weekly Issue #223 June 14th, 2020
KubeWeekly #221 June 18th, 2020

DEVOPS WEEKLY ISSUE #494 June 14th, 2020

A look at using Nomad for dynamic scheduling across 200+ edge locations, in particular for running local management tooling.

  • The title is “How we use HashiCorp Nomad”.
  • Cloudflare explains how HashiCorp’s Nomad helps improve service availability in each data center, how Nomad is deployed, the challenges overcome along the way, and plans for its future use.
  • The explanation of the reliability model for services that Cloudflare runs in more than 200 edge cities around the world made it easy to picture the management units/partitions of the configuration.

A look at a recent paper, and some of the maths behind it, for setting effective service level objectives that take into account backend performance and user impact.

  • The titles are “The Tail at Scale” and its follow-up, “The Tail at Scale Revisited”.
  • I read the follow-up because the first post was covered in my previous blog post.
  • It presents graphs that help in understanding the relationship to user experience and examines ways to dramatically improve overall performance.

Cloud Native is more than just technology and tools. This post explores modern patterns for strategy and transformation, including how to make small investments pay off, build psychological safety, reduce the cost of experiments and more. See the webinar from the authors of these patterns below too.

  • The title is “Chapter II: The Patterns”.
  • It expands on the ideas in the strategy section of the O’Reilly book Cloud Native Transformation. It is chapter 2 of 3.
  • The first chapter clarifies what a strategy is and what it is not. I think the ability to verbalize and articulate strategies is important, so I would like to work on this as well.

Lots of teams and organisations are adopting monorepos, but one of the challenges at scale is Git repository performance. This post explores work from one team to speed things up.

  • The title is “Speeding up a Git monorepo at Dropbox with <200 lines of code”.
  • At Dropbox, after the move from Mercurial to Git in 2014, the monorepo gradually became a performance bottleneck, especially on macOS, which many in-house engineers use.
  • Upstream improvements in Git and a small wrapper of custom code improved speed without splitting the repository.

Hard won lessons from implementing Istio, pointing out both the good and the bad.

  • The title is “Riding the Tiger: Lessons Learned Implementing Istio”.
  • A painful story from a team that implemented Istio in a “real” production environment on managed Kubernetes from a cloud provider. It is the article the author wished he had been able to read before setting out on this adventure.

As organisations grow, the speed of tests invariably becomes a challenge. This post looks at using dynamic analysis to map files to tests, and then to selectively run a subset of the tests when changes are made.

  • The title is “Spark Joy by Running Fewer Tests”.
  • An example of test improvement at Shopify. The title reflects their satisfaction with the result.
  • It is said that dynamic analysis is used to map files to tests, the number of unnecessary tests is reduced, and all tests run within 2% of the pull request.
  • They adopted the dynamic analysis/test selection approach described above, but many developers, including the author, were skeptical of it and questioned it both individually and openly. Once people saw the results, the doubts were silenced. They also considered other approaches such as static analysis, machine learning, and adding machine resources, but ruled those out for the reasons explained in the article.
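The idea of mapping files to tests and then running only a subset can be sketched in a few lines of Python. This is a minimal illustration, not Shopify’s implementation: the coverage data, file names, and the rule of falling back to the full suite for unmapped files are all assumptions made for the example.

```python
from collections import defaultdict


def build_file_to_tests(coverage):
    """Invert a test -> touched-files mapping into file -> tests."""
    file_to_tests = defaultdict(set)
    for test, files in coverage.items():
        for path in files:
            file_to_tests[path].add(test)
    return file_to_tests


def select_tests(changed_files, file_to_tests, all_tests):
    """Pick the tests to run for a change set."""
    selected = set()
    for path in changed_files:
        if path not in file_to_tests:
            # Assumption: an unmapped file is treated conservatively,
            # so the whole suite runs.
            return set(all_tests)
        selected |= file_to_tests[path]
    return selected


# Hypothetical data that a dynamic analysis run might have recorded:
# each test and the source files it touched while executing.
coverage = {
    "test_cart": {"app/cart.py", "app/pricing.py"},
    "test_checkout": {"app/checkout.py", "app/pricing.py"},
    "test_profile": {"app/profile.py"},
}
mapping = build_file_to_tests(coverage)

print(sorted(select_tests(["app/pricing.py"], mapping, coverage)))
```

A change to `app/pricing.py` selects only the two tests that touched it, while a change to a file the mapping has never seen runs everything.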

A well reasoned argument for why one organisation doesn’t use Kubernetes. Looking at operational overhead and security challenges.

  • The title is “Container technologies at Coinbase”.
  • The author could not find many public articles of this kind, so he based this one on a blog post published internally at Coinbase and released it publicly. A good article that describes the historical flow, key resources, and so on.
  • Minimal edits have been made and images have been added to provide more flair.
  • They said “If you are interested in working on our next generation of container technologies, our dynamic configuration service or other technologies mentioned above — we are actively hiring on our Infrastructure team.”

A case study looking at a large data platform migration, including moving from Hive to Spark, and building a developer-friendly system.

  • The title is “Accelerating developers by ditching the data center”.
  • An article by Scribd’s R Tyler Croy (Director of Platform Engineering), posted as a guest post on the Databricks company blog.
  • With growing demand for machine learning, real-time data processing, and new data products, they are modernizing their existing data platform.

This post explores how a single SQL statement can drastically improve performance. The basics of SQL are well understood, but this is a good reminder of how powerful databases like PostgreSQL are.

  • The title is “How one word in PostgreSQL unlocked a 9x performance improvement”.
  • An article explaining how PostgreSQL performance was improved 9 to 10 times. The layout of images, spacing, and paragraphs makes the article easy to read.

A free webinar with Jamie Dobson to introduce a pattern language for strategy and execution. This language was developed after analysing five years of studying successful (and unsuccessful) Cloud Native transformations. Perfect for executives, leaders and engineers.

  • They are distributing webinars and e-books aimed at “Executives, Team leads, and Managers”.

King is looking for new members for the infrastructure engineering teams to help develop, manage and expand our software based networking setup across datacenters and (Google) cloud. Please take a look at the open role for networking engineers. We’re also still looking for both database and streaming data engineers, if that is more your style.

  • Continued job information from King. The posting does not seem to have changed. At that moment, they appeared to be looking for SRE, Database SRE, and Network SRE roles.

KIP is a virtual kubelet implementation which allows for launching pods in ephemeral EC2 or GCP instances, including spot instances in EC2.

  • An introductory article on the OSS tool KIP (Kloud Instance Provider).
  • I have covered articles by the author, Gokul Chandra, on this blog before. They always tackle big subjects, but they include careful commentary and many screenshots, so they are easy to follow.
  • Click here for the GitHub page.

SRE Weekly Issue #223 June 14th, 2020

Prevent application and network instability by serving stale content

I’ve used this technique in the past with a single-page app and a highly-cacheable API, to ensure stability even when the backend goes down.

Patrick Hamann

Full disclosure: Fastly is my employer.

The Impending Doom of Expiring Root CAs and Legacy Clients

Here’s a deep dive into how your CA’s certificate can affect your application’s reliability — at least in the eyes of your customers.

Scott Helme

[Coinbase] Incident Post Mortem: June 1, 2020

Here’s Coinbase’s followup from their outage last week.

Michael de Hoog — Coinbase

  • A postmortem of the Coinbase outage that was covered in the Outages section of the previous issue of this blog. Mobile apps were among the things affected.
  • When the price of BTC (Bitcoin) reached USD $10,000, traffic surged fivefold, and parts of the system could not keep up because of how autoscaling was designed. The postmortem covers the story up to the countermeasures taken to reduce the impact of price-related traffic surges.

Who’s afraid of serializability?

Kyle Kingsbury recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializability transaction isolation level.

I thought it would be fun to use one of his counterexamples to illustrate what serializable means.

Lorin Hochstein

  • Jepsen’s Kyle Kingsbury recently analyzed PostgreSQL 12.3 and found that under certain conditions it violated its transaction guarantees, including at the serializable isolation level. The author thought it would be fun to use one of Kingsbury’s counterexamples to illustrate what serializable means.
  • The article closes by explaining the difference between “Serializability” and “Linearizability”.

Achieving FMEA goals faster with Chaos Engineering

Failure mode and effects analysis (FMEA) is a decades-old method for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service.

If you’ve been tasked with applying FMEA in your SRE work, this article will get you started.

Matthew Helmke

  • It explains what FMEA and SRE have in common.

  • Here are some of his points.
    ○ SRE and FMEA have one major goal: “preventing things that negatively impact customers.”
    ○ Both in the physical world of manufacturing and in the virtual world of computing, there are expectations and agreements that must be met, defining how to measure success.
    ○ Reliability and customer satisfaction are the goals.
  • FMEA, short for “Failure mode and effects analysis”, is a core technique for evaluating the potential risks of products and processes, mainly at the design stage, and eliminating those risks as far as possible.
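A common way to rank failure modes in FMEA is the Risk Priority Number (RPN), the product of severity, occurrence, and detection scores, each on a 1 to 10 scale. The sketch below shows that arithmetic. The failure modes and their scores are hypothetical, and the article itself may not use RPN at all.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost always detected) .. 10 (almost never detected)

    def rpn(self) -> int:
        """Risk Priority Number = severity * occurrence * detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError(f"score out of range 1..10: {score}")
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes for a web service.
modes = [
    FailureMode("database primary failure", severity=9, occurrence=3, detection=2),
    FailureMode("expired TLS certificate", severity=8, occurrence=2, detection=6),
    FailureMode("cache node loss", severity=4, occurrence=5, detection=3),
]

# Highest RPN first: these are the risks to address (or test with chaos
# experiments) before the others.
ranked = sorted(modes, key=lambda m: m.rpn(), reverse=True)
for mode in ranked:
    print(mode.rpn(), mode.name)
```

In this made-up data the expired certificate ranks first even though its severity is lower than a database failure, because it is much harder to detect.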

KubeWeekly #221 June 18th, 2020

Editor’s pick of the highlights from the past week.

CNCF Supports the Black Lives Matter movement

This week, CNCF issued a statement on Black Lives Matter. Here is a short excerpt:

“CNCF stands in solidarity with the Black Lives Matter movement and racial equality for all. As a foundation that serves a diverse, global ecosystem of members, we also stand in solidarity with members of our community who challenge us all to do better — not just for right now — but for two months from now, two years from now, and beyond that.” — Priyanka Sharma, general manager of CNCF.

Read the full blog, including project supporting responses here.

  • A blog post with a statement on the Black Lives Matter movement from Priyanka Sharma, the new GM of CNCF.
  • Although she has been involved in various activities, including CNCF’s diversity scholarships, she says recent events have taught her that more work is needed. She called on members of the community to work together, even if it takes time, by sharing the stories they want to share.

Introducing the CNCF Technology Radar

This week, the CNCF End User Community introduced Technology Radar. The EUC is a group of over 140 top companies and startups who meet regularly to discuss challenges and best practices when adopting cloud native technologies. The goal of the CNCF Technology Radar is to share what tools are actively being used by end users, the tools they would recommend, and their patterns of usage. Learn more about it and find out how you can get involved here.

  • An article and YouTube video introducing “Technology Radar” announced by the end user community of CNCF.
  • There are great discussions within the end user community, but some members cannot share specifics publicly because they need legal/PR approval from their companies, so a lot of information never saw the light of day. The Technology Radar publishes that information in anonymized form. The definitions of the terms used are explained. The theme this time is “Continuous Delivery”.

Editor’s note: We are sending this week’s newsletter a day early due to the Juneteenth holiday on Friday, June 19. CNCF and the Linux Foundation offices are closed in observance.

  • As the editor’s note explains, the CNCF offices were closed on Friday, June 19 for “Juneteenth”, an American holiday commemorating the end of slavery, so the newsletter was sent one day ahead of schedule.

Weekly recap of CNCF member and project webinars that you might have missed.
You can view all CNCF recorded and upcoming webinars here.

CNCF Project Webinar: Charting Your Voyage To Helm 3

Matt Farina, Senior Staff Engineer @Samsung

Martin Hickey, Senior Software Engineer @IBM

Adam Reese, Senior Software Engineer @Microsoft and Bridget Kromhout, Principal Program Manager @Microsoft

  • It explains the differences between Helm 2 and Helm 3 and how to make a smooth transition.

CNCF Community Webinar: What end users really recommend for Continuous Delivery

Cheryl Hung, Director of Ecosystem @CNCF

  • I will skip it because it is a webinar video about technology radar featured in “The Headlines”.

CNCF Member Webinar: Multi Cluster Service Mesh Operations and Extensibility with WebAssembly

Idit Levine, Founder and CEO and Christian Posta, Global Field CTO

  • It describes the extensibility of WebAssembly for an application environment.

CNCF Member Webinar: Learning from the visible past to accelerate the observable future

Curtis Hrischuk, Technical Product Manager @Instana

  • It explains lessons learned from building and using an observability platform.

CNCF Member Webinar: Multitenancy Webinar: Better walls make better tenants

Adrian Ludwin, Senior Engineer @Google

  • It explains how to think about tenants, organizations, and unique health needs, and how to build a robust and secure multi-tenant solution.

CNCF Member Webinar: How to better understand K8s workloads using Octant

Wayne Witzel III, Octant Maintainer @VMware

  • It explains how to create and troubleshoot Kubernetes workloads using Octant.

Tutorials, tools, and more that take you on a deep dive into the code.

Monitoring Services like an SRE in OpenShift ServiceMesh

Raffaele Spazzoli, Red Hat

  • It explains how to calculate error budgets and configure alerts related to services running in ServiceMesh.
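As a rough illustration of what calculating an error budget means, the sketch below derives the budget from an availability SLO and a request count. The function name, the 99.9% SLO, and the traffic numbers are assumptions for the example. The article’s own queries and alert rules for ServiceMesh are not reproduced here.

```python
def error_budget(slo, total_requests, failed_requests):
    """Summarize an availability error budget for one SLO window.

    slo is the target success ratio, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    # The budget is the share of requests that is allowed to fail.
    allowed_failures = (1 - slo) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }


# Hypothetical window: 1,000,000 requests, 250 of which failed.
budget = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(budget)
```

With these numbers about 1,000 failures are allowed in the window and roughly a quarter of the budget has been spent. An alert could fire when `consumed_fraction` grows faster than the window elapses.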

New in Prometheus v2.19.0: Memory-mapping of full chunks of the head block reduces memory usage by as much as 40%

Ganesh Vernekar, Grafana Labs

  • It introduces a new feature in the newly released Prometheus 2.19.0, “memory-mapping full chunks of the head (in-memory) block from disk”, which reduces memory usage by up to 40%, with benchmark results.
  • The author notes the following: “It is natural to assume that you can now reduce the resource allocation for Prometheus as it’s using less memory. But it cannot be ignored that the chunks are loaded into memory when required. So if you run heavy queries that would touch a lot of series simultaneously for the past few hours of data, then Prometheus is going to take up a little more memory than the ideal reduction. Plan the resources keeping this in mind.”

Hard lessons learned about Kubernetes garbage collection

Oleg Matskiv,

  • An article that conveys the sentiment in its subtitle: “Why I’ll never skim Kubernetes documentation again.”
  • He hit a bug in which a namespace was unintentionally deleted in his PoC environment. The root cause was an “ownerReference” that crossed scope boundaries between namespaced and cluster-scoped resources, which the documentation states is disallowed by design.
  • I also tend to get hands-on first and only skim the documentation, so this could easily have been me. We should work with an awareness of where the relevant documentation is.
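The scoping rules in the Kubernetes documentation can be written down as a small check: a namespaced dependent may have a cluster-scoped owner or a namespaced owner in the same namespace, and a cluster-scoped dependent may only have a cluster-scoped owner. The function below is a hypothetical helper for illustration, not part of any Kubernetes client library, and it models where the intended owner actually lives, because an ownerReference has no namespace field of its own.

```python
from typing import Optional


def owner_reference_problem(dependent_namespace: Optional[str],
                            owner_namespace: Optional[str]) -> Optional[str]:
    """Return a description of the scope violation, or None if the pair is allowed.

    A namespace of None means the object is cluster-scoped.
    """
    if dependent_namespace is None and owner_namespace is not None:
        return "cluster-scoped dependent cannot have a namespaced owner"
    if (dependent_namespace is not None
            and owner_namespace is not None
            and dependent_namespace != owner_namespace):
        return "namespaced owner must be in the same namespace as the dependent"
    return None


# Hypothetical (dependent namespace, owner namespace) pairs.
cases = [
    ("team-a", "team-a"),  # same namespace: allowed
    ("team-a", None),      # namespaced dependent, cluster-scoped owner: allowed
    ("team-a", "team-b"),  # cross-namespace: disallowed
    (None, "team-a"),      # cluster-scoped dependent, namespaced owner: disallowed
]
for dependent_ns, owner_ns in cases:
    print(dependent_ns, owner_ns, owner_reference_problem(dependent_ns, owner_ns))
```

Running a check like this before creating objects would flag the kind of cross-scope reference described in the article, instead of leaving it to the garbage collector to act on an owner it cannot find.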

Enterprise Kubernetes development with odo: The CLI tool for developers

Jason Dudash, Red Hat

How to create ephemeral environments using Crossplane and ArgoCD?

Suraj Banakar, InfraCloud

  • An article that explains how to use a custom resource on Crossplane and ArgoCD to spin up a temporary cluster, test your application, and configure it to be automatically deleted after a certain period of time.
  • The custom resource still has a number of things to improve and is in an early prototype stage.
  • At that moment, only GKE clusters were supported and GKE was used as the environment, so a GCP account was required to follow the hands-on article.

High availability load balancers with Maglev

Terin Stock, Cloudflare

  • A technical blog post that explains the background and the key points of Cloudflare’s Maglev implementation for a highly available load balancer.
  • I’ve seen someone on Twitter mention it several times.
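For readers unfamiliar with Maglev, its core is a consistent-hashing lookup table in which every backend fills slots in its own pseudo-random order, taking turns round-robin. The sketch below follows the table-population scheme described in Google’s Maglev paper. The SHA-256 based hash, the backend names, and the tiny table size are assumptions for illustration, and Cloudflare’s implementation may differ.

```python
import hashlib


def _hash(value: str, seed: str) -> int:
    """Stable 64-bit hash (assumption: any two independent hashes would do)."""
    digest = hashlib.sha256((seed + value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def build_lookup_table(backends, table_size=13):
    """Populate a Maglev-style lookup table; table_size must be prime."""
    if not backends:
        raise ValueError("at least one backend is required")
    if not _is_prime(table_size):
        raise ValueError("table_size must be prime")
    # Each backend's preference list is (offset + j * skip) % table_size.
    offsets = [_hash(b, "offset") % table_size for b in backends]
    skips = [_hash(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_index = [0] * len(backends)
    table = [None] * table_size
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Walk this backend's permutation until a free slot is found.
            slot = (offsets[i] + next_index[i] * skips[i]) % table_size
            while table[slot] is not None:
                next_index[i] += 1
                slot = (offsets[i] + next_index[i] * skips[i]) % table_size
            table[slot] = backend
            next_index[i] += 1
            filled += 1
            if filled == table_size:
                return table


def pick_backend(table, connection_key: str) -> str:
    """Map a connection identifier (e.g. a 5-tuple string) to a backend."""
    return table[_hash(connection_key, "key") % len(table)]


backends = ["lb-a", "lb-b", "lb-c"]  # hypothetical backend names
table = build_lookup_table(backends, table_size=13)
print(table)
print(pick_backend(table, "203.0.113.7:51514->198.51.100.1:443"))
```

Because the backends take turns, their shares of the table differ by at most one slot. A real deployment would use a much larger prime table size than 13, so that adding or removing a backend disturbs only a small fraction of entries.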

Misconfigured Kubeflow workloads are a security risk

Yossi Weizman, Azure Security Center

  • A post on the Microsoft Azure Security Center (ASC) blog. It introduces a new organized attack (campaign), recently observed by ASC, that targets Kubeflow, a machine learning toolkit for Kubernetes.
  • ASC confirmed that this attack, which targets misconfigured Kubeflow workloads, affected dozens of Kubernetes clusters.

Articles, announcements, and more that give you a high-level overview of challenges and features.

The Financial Times, with Sarah Wells and Dimitar Terziev

Adam Glick and Craig Box, Kubernetes Podcast from Google

  • Kubernetes Podcast by Google employees. The current co-hosts are Craig Box and Adam Glick.
  • Sarah Wells (Technical Director for Operations and Reliability) of Financial Times and Dimitar Terziev (the current platform lead for the CM team) of the company are invited as guests.
  • In a keynote at KubeCon EU two years earlier, Sarah described how the company moved from monoliths to microservices, and how the content and metadata platform team moved specifically to Kubernetes. This podcast recaps that transition and covers what happened afterwards.
  • Speaking of the FT, I remember hearing that Nikkei acquired it in 2015. Even though they are in the same group, I don’t think there is much interaction between engineers (this is just a passing thought, with no particular reason or intention).
  • The topics of interest in News of the week are:
    Zerto for Kubernetes
    Cloudera Data Platform for Private Clouds
    Cloudbees introduces DoD compliant CI, now with a CtF to deploy into an environment with an ATO, which meets DISA STIG and NIST RMF security guidelines
    Episode 44, with Tracy Miranda
    Gokul Chandra writes up Anthos

Google internships go virtual with the help of open source

Eric Brewer, Google

  • Google has relied on open source to keep running its internship program, which started in 1999, even after switching to a virtual format. Thousands of interns from 43 countries around the world are participating.
  • Because of COVID-19, the more than 1,000 technical interns do not have access to technical resources at Google’s offices, but they are actively contributing to open source projects.

Supporting the Evolving Ingress Specification in Kubernetes 1.18

Alex Gervais, Datawire

  • A Kubernetes blog post. Earlier this year, the Kubernetes team released Kubernetes 1.18, extending Ingress. In this blog, you’ll learn what’s new in the new Ingress spec, what it means for your app, and how to upgrade to an Ingress controller that supports this new spec.

Kubernetes by the numbers, in 2020: 12 stats to see

Kevin Casey, Red Hat

  • Again this week, an easy-to-understand article by Kevin Casey of Red Hat, organized as a numbered list of points.
  • The theme is “How is Kubernetes impacting enterprise IT?” and “12 compelling Kubernetes statistics”.

Google’s Anthos from the eyes of a Kubernetes developer

Janakiram MSV, The New Stack

  • Janakiram MSV (Analyst) is writing a limited series of articles for The New Stack about Anthos, a Kubernetes service on Google Cloud Platform. This article provides an overview of Anthos and its major components.
  • Each part of the series focuses on specific aspects of Anthos, covering cluster registration, Anthos configuration management, launching “click to deploy” applications from the GCP Marketplace, and more.

Kubernetes startup Kubermatic, formerly Loodse, open-sources its core technology

Ron Miller, TechCrunch

  • A TechCrunch article reporting that the German Kubernetes automation company “Loodse” has renamed itself “Kubermatic” and announced that it will open-source the Kubermatic Kubernetes Platform under the Apache 2.0 license.
  • TechCrunch interviewed co-founder Sebastian Scheele and shares his comments on the circumstances leading up to this announcement and on future plans.

MayaData launches Kubera; a Kubernetes management service

Chris Mellor, Blocks & Files

You can check some Recorded Webinars and Upcoming Webinars here. The following are posted as Upcoming CNCF webinars at that moment.

Member Webinar: How to Promote the use of Best Practices and Automate Security Policies Using Tools Like OPA and Kubernetes
Gary Duan, CTO and Co-Founder @NeuVector
June 18, 2020 10:00 AM Pacific Time

Member Webinar: Fast packet processing with KubeVirt
David Vossel, Principal Software Engineer @RedHat
Petr Horacek, Senior Software Engineer @Red Hat
June 23, 2020 7:00 AM Pacific Time

Member Webinar: Introduction to Cloud Provider Sub Sig BaiduCloud // 介绍SIG Cloud Provider子项目BaiduCloud
Ti Zhou 周倜, Senior Architect 高级架构师 @Baidu 百度
Zichao Ye 叶子超, Senior Software Engineer 高级软件工程师 @Baidu 百度
Tianyuan Sun 孙天元, Senior Software Engineer 高级软件工程师 @Baidu 百度
This webinar will be delivered in Chinese.
June 24, 2020 10:00 AM China Standard Time

Member Webinar: Cloud Infrastructure for Network Functions — Requirements and testing
Dana Nehama, Director, Product Management Network Cloud @Intel Corporation
Petar Torre, Principal Engineer @Intel Corporation
June 24, 2020 7:00 AM Pacific Time

Member webinar: Kubernetes Cost Allocation Done Right
Webb Brown, Co-founder and CEO @Kubecost
June 24, 2020 10:00 AM Pacific Time

Member Webinar: Monitoring Kubernetes clusters by “chatting” with them
Prasad Ghangal, Creator of BotKube and Software geek @InfraCloud
Vishal Biyani, CTO @InfraCloud
Hrishikesh Deodhar, Director of Engineering @InfraCloud
June 25, 2020 10:00 AM Pacific Time

Ambassador Webinar: Commoditise Kubernetes with cluster-api
Gianluca Arbezzano, Senior Staff Software Engineer @Packet
June 26, 2020 10:00 AM Pacific Time

Member Webinar: Best Practices for Running and Implementing Kubernetes
Kendall Miller, President @Fairwinds
Robert Brenna, Director of Open Source @Fairwinds
June 30, 2020 10:00 AM Pacific Time

Member Webinar: 7 Critical Reasons for Kubernetes-Native Backup
Niraj Tolia, CEO and Co-Founder @Kasten
Mark Severson, Member of Technical Staff @Kasten
July 1, 2020 7:00 AM Pacific Time

Member Webinar: Pivoting Your Pipeline from Legacy to Cloud Native
Tracy Ragan, CEO of DeployHub and CDF Board Member
July 1, 2020 1:00 PM Pacific Time

Member Webinar: Stay on top of ongoing Kubernetes security hygiene
Zohar Kaufman, Co-Founder and VP R&D
Ariel Shuper, VP Product
July 2, 2020 10:00 AM Pacific Time

Member Webinar: Optimize your Kubernetes Clusters on Azure with Built-in Best Practices
Jorge Palma, Senior Program Manager @Microsoft
July 7, 2020 10:00 AM Pacific Time

Member Webinar: The Challenges and Countermeasures of Service Mesh Practice
Fei Pei 裴斐, Cloud Computing Technical Expert and Architect 云计算技术专家、架构师, NetEase Hangzhou Research Institute 网易杭州研究院 @NetEase 网易
This webinar will be delivered in Chinese.
July 8, 2020 10:00 AM China Standard Time

Project Webinar: What’s new in Linkerd 2.8 : Multi-cluster Kubernetes made simple and secure by default
Oliver Gould, Linkerd Project Lead, co-founder & CTO @Buoyant
July 8, 2020 10:00 AM Pacific Time

Member Webinar: Building Production-ready Services with Kubernetes and Serverless Architectures
Mike Metral, Software Architect and Engineer @Pulumi
Jason (Jay) Smith, App Modernization Specialist @Google Cloud
July 8, 2020 1:00 PM Pacific Time

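Member Webinar: How to Put Service Mesh into Practice, from Technology Selection to Implementation // 如何落地 Service Mesh — 从技术选型到实践
Ruofei Ma 马若飞, Principal Engineer, FreeWheel Beijing R&D Center 北京研发中心首席工程师 @FreeWheel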
This webinar will be delivered in Chinese.
July 9, 2020 10:00 AM China Standard Time

Member Webinar: The top 10 most-useful Kubernetes APIs for comprehensive cloud-native observability
Caleb Hailey, Co-founder and CEO @Sensu
July 9, 2020 10:00 AM Pacific Time

Member Webinar: Securing and Accelerating the Kubernetes CNI Data Plane with Project Antrea and NVIDIA Mellanox ConnectX SmartNICs
Antonin Bas, Maintainer of Project Antrea and Staff Engineer @VMware
Moshe Levi, Sr. Staff Engineer @NVIDIA
July 14, 2020 10:00 AM Pacific Time

Member Webinar: Serving Millions of Customers with Cloud Native and DevSecOps
Chris Hollies, CTO, Oracle Practice @Capgemini
Akshai Parthasarathy, Principal Director, Cloud Native and DevOps @Oracle Cloud
July 15, 2020 7:00 AM Pacific Time

Member Webinar: Advancing image security and compliance through Container Image Encryption!
Brandon Lum, Senior Software Engineer @IBM
July 15, 2020 10:00 AM Pacific Time

Member Webinar: Kubernetes and storage. Kubernetes for storage. An overview.
Kiran Mova, Chief Architect at MayaData and core maintainer of OpenEBS @MayaData
July 16, 2020 10:00 AM Pacific Time

Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
TiKV team
July 31, 2020 10:00 AM Pacific Time

How did you find these articles? Did any of them catch your interest?

There is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself, too.

Bye now!!

Yoshiki Fujiwara

Written by

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #GCP, #AWS SAP, #National Tour Guide for English
