SRE / DevOps / Kubernetes Weekly Collection #20 (Week 25)

- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a useful reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #494 June 14th, 2020
SRE Weekly Issue #223 June 14th, 2020
KubeWeekly #221 June 18th, 2020
DEVOPS WEEKLY ISSUE #494 June 14th, 2020
News
- The title is “How we use HashiCorp Nomad”.
- Cloudflare explains how HashiCorp’s Nomad helps improve service availability in each data center, how Nomad is deployed, the challenges they overcame along the way, and their plans for future use.
- The explanation of the reliability model for the services Cloudflare runs in more than 200 edge cities around the world helped me picture how the configuration is partitioned into management units.
- The titles are “The Tail at Scale” and its follow-up, “The Tail at Scale Revisited”.
- I read the follow-up, because the first post was covered in a previous issue of this blog.
- It presents graphs that help in understanding how tail latency relates to the user experience, and examines ways to dramatically improve overall performance.
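A minimal sketch of the arithmetic behind tail latency at scale, assuming each backend call is independently slow with the same small probability. The function name and the 1% figure are illustrative assumptions, not values taken from the articles.

```python
def prob_slow_request(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent backend calls is slow.

    Assumes every call is slow with the same probability `p_slow`
    and that calls are independent of each other.
    """
    return 1.0 - (1.0 - p_slow) ** fan_out


if __name__ == "__main__":
    # A user request that waits for all backends is as slow as its slowest call.
    for n in (1, 10, 100):
        share = prob_slow_request(0.01, n)
        print(f"fan-out {n:>3}: {share:.1%} of requests wait on a slow backend")
```

With a 1% slow rate per backend, a fan-out of 100 leaves roughly 63% of user requests waiting on at least one slow backend, which is why a rare slow response becomes a common experience at scale.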
- The title is “Chapter II: The Patterns”.
- It expands on the ideas in the strategy section of the O’Reilly book Cloud Native Transformation. It is chapter 2 of 3.
- The first chapter clarifies what a strategy is and what it is not. I think the ability to verbalize and articulate strategies is important, so I would like to work on this as well.
- The title is “Speeding up a Git monorepo at Dropbox with <200 lines of code”.
- At Dropbox, after the move from Mercurial to Git in 2014, the monorepo gradually became a performance bottleneck, especially on macOS, which many in-house engineers use.
- Upstream improvements to Git and a small wrapper of custom code improved speed without splitting the repository.
Hard won lessons from implementing Istio, pointing out both the good and the bad.
- The title is “Riding the Tiger: Lessons Learned Implementing Istio”.
- A painful story from a team implementing Istio in a “real” production environment on a cloud provider’s managed Kubernetes. It is the article the author himself wanted to read before embarking on this adventure.
- The title is “Spark Joy by Running Fewer Tests”.
- An example of improving code testing at Shopify. The title reflects how satisfying the result was.
- Tests are mapped to files through dynamic analysis, so that a change runs only the tests related to the changed files and unnecessary tests are skipped. Reportedly, the full test suite ends up running for no more than 2% of pull requests.
- They adopted the dynamic analysis and test selection approach described above, although many developers, including the author, were skeptical of it and questioned it both privately and openly. Once people saw the results, the doubts went quiet. They also considered other approaches such as static analysis, machine learning, and adding machine resources, but ruled those out for the reasons explained in the article.
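The test selection idea can be sketched roughly as follows. This is a minimal illustration, not Shopify’s implementation: the function names, the shape of the coverage map, and the fallback to the full suite for unknown files are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable


def build_file_to_tests(coverage: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert a {test: files touched while the test ran} map into {file: tests}."""
    file_to_tests: dict[str, set[str]] = defaultdict(set)
    for test, files in coverage.items():
        for path in files:
            file_to_tests[path].add(test)
    return dict(file_to_tests)


def select_tests(
    changed_files: Iterable[str],
    file_to_tests: dict[str, set[str]],
    all_tests: Iterable[str],
) -> set[str]:
    """Pick only the tests that touched at least one changed file."""
    selected: set[str] = set()
    for path in changed_files:
        if path not in file_to_tests:
            # Assumption: a file with no recorded mapping is treated as risky,
            # so the whole suite is run instead of guessing.
            return set(all_tests)
        selected |= file_to_tests[path]
    return selected


if __name__ == "__main__":
    # Hypothetical coverage data recorded by running each test once.
    coverage = {
        "test_cart": {"cart.rb", "money.rb"},
        "test_user": {"user.rb"},
    }
    mapping = build_file_to_tests(coverage)
    print(select_tests(["money.rb"], mapping, coverage))
```

The mapping only stays trustworthy if it is regenerated regularly, since the files a test touches change as the code changes.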
- The title is “Container technologies at Coinbase”.
- The author could not find many articles like this published outside the company, so this one was adapted from a blog post originally published inside Coinbase and then made public. A good article that describes the history, key resources, and so on.
- Minimal edits have been made and images have been added to provide more flair.
- They said “If you are interested in working on our next generation of container technologies, our dynamic configuration service or other technologies mentioned above — we are actively hiring on our Infrastructure team.”
- The title is “Accelerating developers by ditching the data center”.
- A guest post by Scribd’s R Tyler Croy (Director of Platform Engineering) on the Databricks company blog.
- With growing demand for machine learning, real-time data processing, and new data products, they are modernizing their existing data platform.
- The title is “How one word in PostgreSQL unlocked a 9x performance improvement”.
- An article explaining how PostgreSQL performance was improved by 9 to 10 times. The layout of the images, spacing, and paragraphs makes it easy to read.
Events
- Webinars and e-books for “Executives, Team leads, and Managers” are being offered.
Jobs
King is looking for new members for the infrastructure engineering teams to help develop, manage and expand our software based networking setup across datacenters and (Google) cloud. Please take a look at the open role for networking engineers. We’re also still looking for both database and streaming data engineers, if that is more your style.
- Continued job listings from King. The open positions do not seem to have changed. They appear to be looking for SRE, Database SRE, and Network SRE roles (at that moment).
Tools
- An article introducing the OSS tool KIP (Kloud Instance Provider).
- I have covered articles by the author, Gokul Chandra, on this blog before. They are always substantial pieces, but they include plenty of careful commentary and screenshots, so they are easy to follow.
- Click here for the GitHub page.
SRE Weekly Issue #223 June 14th, 2020
Articles
Prevent application and network instability by serving stale content
I’ve used this technique in the past with a single-page app and a highly-cacheable API, to ensure stability even when the backend goes down.
Patrick Hamann
Full disclosure: Fastly is my employer.
- An article by Fastly about serving stale content. It covers settings and features that help when the origin server has a problem or when fetching from it takes too long.
- Click here for the company’s Japanese article “Serving Stale Content”, which covers the same content.
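The idea of serving stale content when the origin fails can be sketched as a small in-process cache. This is only an illustration of the concept, not Fastly’s configuration or API: the class name, the `ttl` and `stale_if_error` parameters, and the injected clock are assumptions made for the example.

```python
import time
from typing import Callable


class StaleCache:
    """Cache that falls back to an expired entry when the origin fetch fails."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl: float,
        stale_if_error: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch                    # function that calls the origin
        self._ttl = ttl                        # seconds an entry counts as fresh
        self._stale_if_error = stale_if_error  # extra seconds stale data may be served on error
        self._clock = clock
        self._store: dict[str, tuple[str, float]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh, no origin call needed
        try:
            value = self._fetch(key)
        except Exception:
            # Origin is failing: serve the old copy if it is not too old.
            if entry is not None and now - entry[1] < self._ttl + self._stale_if_error:
                return entry[0]
            raise
        self._store[key] = (value, now)
        return value
```

The trade-off is that users may briefly see outdated content, which is usually preferable to an error page for highly cacheable responses.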
The Impending Doom of Expiring Root CAs and Legacy Clients
Here’s a deep dive into how your CA’s certificate can affect your application’s reliability — at least in the eyes of your customers.
Scott Helme
[Coinbase] Incident Post Mortem: June 1, 2020
Here’s Coinbase’s followup from their outage last week.
Michael de Hoog — Coinbase
- A postmortem of the Coinbase outage that was covered in the Outages section of the previous issue of this blog. coinbase.com, pro.coinbase.com and mobile apps were affected.
- When the price of BTC (the bitcoin currency unit) reached USD $10,000, traffic surged fivefold, and parts of the system could not keep up under the autoscaling design. The post tells the story up to the countermeasures taken to reduce the impact of price-related traffic surges.
Who’s afraid of serializability?
Kyle Kingsbury recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializability transaction isolation level.
I thought it would be fun to use one of his counterexamples to illustrate what serializable means.
Lorin Hochstein
- Jepsen’s Kyle Kingsbury recently analyzed PostgreSQL 12.3 and found that under certain conditions it violated its transaction guarantees, including the serializable transaction isolation level. In this article, Lorin Hochstein uses one of Kingsbury’s counterexamples to illustrate what serializable means.
- The article closes by explaining the difference between “Serializability” and “Linearizability”.
Achieving FMEA goals faster with Chaos Engineering
Failure mode and effects analysis (FMEA) is a decades-old method for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service.
If you’ve been tasked with applying FMEA in your SRE work, this article will get you started.
Matthew Helmke
- It explains what FMEA and SRE have in common.
- Here are some of his points.
○ SRE and FMEA have one major goal: “preventing things that negatively impact customers.”
○ Both in the physical world of manufacturing and in the virtual world of computing, there are expectations and agreements that must be met, defining how to measure success.
○ Reliability and customer satisfaction are the goals.
- FMEA stands for “Failure mode and effects analysis” and is regarded as one of the core tools. It is a technique for evaluating the potential risks of products and processes, mainly at the design stage, and eliminating those risks as far as possible.
Outages
- IBM Softlayer
○ This is the big one this week, with downstream effects on lots of sites and services hosted in Softlayer. There’s a bit of detail from IBM that seems to indicate that a BGP error by a third party flooded IBM with misrouted traffic.
- Squarespace
- Snapchat
- Microsoft Teams
KubeWeekly #221 June 18th, 2020
The Headlines
Editor’s pick of the highlights from the past week.
CNCF Supports the Black Lives Matter movement
This week, CNCF issued a statement on Black Lives Matter. Here is a short excerpt:
“CNCF stands in solidarity with the Black Lives Matter movement and racial equality for all. As a foundation that serves a diverse, global ecosystem of members, we also stand in solidarity with members of our community who challenge us all to do better — not just for right now — but for two months from now, two years from now, and beyond that.” — Priyanka Sharma, general manager of CNCF.
Read the full blog, including project supporting responses here.
- A blog post with a statement on the Black Lives Matter movement from Priyanka Sharma, the new GM of CNCF.
- Although she has been engaged in various activities, including CNCF’s diversity scholarships, she says recent events have taught her that more needs to be done. She called on members of the community to work together, even if it takes time, by sharing the stories they want to share.
Introducing the CNCF Technology Radar
This week, the CNCF End User Community introduced Technology Radar. The EUC is a group of over 140 top companies and startups who meet regularly to discuss challenges and best practices when adopting cloud native technologies. The goal of the CNCF Technology Radar is to share what tools are actively being used by end users, the tools they would recommend, and their patterns of usage. Learn more about it and find out how you can get involved here.
- An article and YouTube video introducing “Technology Radar” announced by the end user community of CNCF.
- There are great discussions within the end-user community, but some members cannot share specifics publicly because they need legal or PR approval from their companies, so a lot of information never saw the light of day. The Technology Radar publishes that information in anonymized form. The definitions of the terms used are explained. The theme this time is “Continuous Delivery”.
Editor’s note: We are sending this week’s newsletter a day early due to the Juneteenth holiday on Friday, June 19. CNCF and the Linux Foundation offices are closed in observance.
- The editor’s note explains that the CNCF offices were closed on Friday, June 19 for Juneteenth, an American holiday commemorating the end of slavery, so the newsletter was sent one day ahead of schedule.
ICYMI: CNCF Webinars
Weekly recap of CNCF member and project webinars that you might have missed.
You can view all CNCF recorded and upcoming webinars here.
CNCF Project Webinar: Charting Your Voyage To Helm 3
Matt Farina, Senior Staff Engineer @Samsung
Martin Hickey, Senior Software Engineer @IBM
Adam Reese, Senior Software Engineer @Microsoft and Bridget Kromhout, Principal Program Manager @Microsoft
- It explains the differences between Helm 2 and Helm 3 and how to make a smooth transition.
CNCF Community Webinar: What end users really recommend for Continuous Delivery
Cheryl Hung, Director of Ecosystem @CNCF
- I will skip it because it is a webinar video about technology radar featured in “The Headlines”.
CNCF Member Webinar: Multi Cluster Service Mesh Operations and Extensibility with WebAssembly
Idit Levine, Founder and CEO @Solo.io and Christian Posta, Global Field CTO @Solo.io
- It describes the extensibility of WebAssembly for an application environment.
CNCF Member Webinar: Learning from the visible past to accelerate the observable future
Curtis Hrischuk, Technical Product Manager @Instana
- It explains lessons learned from building and using an observability platform.
CNCF Member Webinar: Multitenancy Webinar: Better walls make better tenants
Adrian Ludwin, Senior Engineer @Google
- It explains how to think about tenants, organizations, and unique health needs, and how to build a robust and secure multi-tenant solution.
CNCF Member Webinar: How to better understand K8s workloads using Octant
Wayne Witzel III, Octant Maintainer @VMware
- It explains how to create and troubleshoot Kubernetes workloads using Octant.
The Technical
Tutorials, tools, and more that take you on a deep dive into the code.
Monitoring Services like an SRE in OpenShift ServiceMesh
Raffaele Spazzoli, Red Hat
- It explains how to calculate error budgets and configure alerts related to services running in ServiceMesh.
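As a rough illustration of the error budget arithmetic (not the queries or alert rules from the article), the sketch below derives the allowed failures from an SLO and a burn rate from an observed error rate. The function names and sample numbers are assumptions made for the example.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict[str, float]:
    """Summarize how much of the error budget a service has used.

    `slo` is the target success ratio, for example 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = (1.0 - slo) * total_requests  # failures the SLO tolerates
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "consumed_ratio": failed_requests / allowed,
    }


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return observed_error_rate / (1.0 - slo)


if __name__ == "__main__":
    print(error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250))
    print(burn_rate(observed_error_rate=0.002, slo=0.999))
```

An alert would typically fire when the burn rate stays above a chosen threshold over a time window; the threshold and the window are policy decisions that this sketch leaves out.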
Ganesh Vernekar, Grafana Labs
- It introduces a new feature in the newly released Prometheus 2.19.0, “memory-mapping full chunks of the head (in-memory) block from disk”, which reduces memory usage by up to 40%, along with benchmark results.
- The author notes the following: “It is natural to assume that you can now reduce the resource allocation for Prometheus as it’s using less memory. But it cannot be ignored that the chunks are loaded into memory when required. So if you run heavy queries that would touch a lot of series simultaneously for the past few hours of data, then Prometheus is going to take up a little more memory than the ideal reduction. Plan the resources keeping this in mind.”
Hard lessons learned about Kubernetes garbage collection
Oleg Matskiv, Opensource.com
- An article whose message is summed up by its subtitle: “Why I’ll never skim Kubernetes documentation again.”
- A bug caused namespaces to be deleted unintentionally in his PoC environment. The root cause was an “ownerReference” that crossed the scope boundary between namespaced and cluster-scoped resources, which the documentation states is disallowed by design.
- I also tend to get hands-on first and only skim the documentation, so this could easily have been me. We should read carefully and keep track of where the relevant documentation is.
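The scoping rules for owner references, as documented by Kubernetes, can be expressed as a small check. This is a sketch of the rules only, not Kubernetes code: the `ObjectRef` type and the function name are hypothetical, and a namespace of `None` is my convention for a cluster-scoped object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectRef:
    kind: str
    name: str
    namespace: Optional[str] = None  # None stands for a cluster-scoped object


def owner_reference_allowed(owner: ObjectRef, dependent: ObjectRef) -> bool:
    """Return True if `dependent` may list `owner` in its ownerReferences."""
    if dependent.namespace is None:
        # Cluster-scoped dependents can only have cluster-scoped owners.
        return owner.namespace is None
    # Namespaced dependents may have a cluster-scoped owner,
    # or a namespaced owner in the same namespace.
    return owner.namespace is None or owner.namespace == dependent.namespace


if __name__ == "__main__":
    # Hypothetical objects: a namespaced owner for a cluster-scoped Namespace.
    config = ObjectRef("ConfigMap", "settings", namespace="team-a")
    ns = ObjectRef("Namespace", "team-b")
    print(owner_reference_allowed(owner=config, dependent=ns))
```

A namespaced object listed as the owner of a cluster-scoped object such as a Namespace fails this check, which is the kind of cross-scope reference the documentation warns against.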
Enterprise Kubernetes development with odo: The CLI tool for developers
Jason Dudash, Red Hat
- An article introducing the CLI tool for developers “odo (OpenShift Do)” developed by the Developer Tools team of Red Hat.
- It illustrates the benefits of using Kubernetes and odo together with a hands-on example. Here is the official odo document.
How to create ephemeral environments using Crossplane and ArgoCD?
Suraj Banakar, InfraCloud
- An article that explains how to use a custom resource on Crossplane and ArgoCD to spin up a temporary cluster, test your application, and configure it to be automatically deleted after a certain period of time.
- This custom resource has a number of things to improve and is still in its early prototype stages.
- Only GKE clusters were supported at that moment, and GKE is used as the environment, so a GCP account is required to follow the hands-on steps.
High availability load balancers with Maglev
Terin Stock, Cloudflare
- A technical blog post that explains the history and the points Cloudflare focused on when implementing Maglev for its high availability load balancers.
- I’ve seen someone on Twitter mention it several times.
Misconfigured Kubeflow workloads are a security risk
Yossi Weizman, Azure Security Center
- A Microsoft Azure Security Center (ASC) blog post. It introduces a new organized attack (campaign), recently observed by ASC, that targets Kubeflow, a machine learning toolkit for Kubernetes.
- ASC confirmed that this attack, which targets misconfigured Kubeflow workloads, affected dozens of Kubernetes clusters.
The Editorial
Articles, announcements, and more that give you a high-level overview of challenges and features.
The Financial Times, with Sarah Wells and Dimitar Terziev
Adam Glick and Craig Box, Kubernetes Podcast from Google
- Kubernetes Podcast by Google employees. The current co-hosts are Craig Box and Adam Glick.
- The guests are Sarah Wells (Technical Director for Operations and Reliability) and Dimitar Terziev (the current platform lead for the CM team), both of the Financial Times.
- In a keynote at KubeCon EU two years ago, Sarah described how the company moved from monoliths to microservices, and how the content and metadata platform team moved specifically to Kubernetes. This podcast recaps that transition and covers what happened afterwards.
- Speaking of the FT, I remember hearing that Nikkei acquired it in 2015. Even though they are group companies, I don’t think there is much interaction between their engineers (just a passing thought, with no particular basis or intention).
- The topics of interest in News of the week are:
○ Zerto for Kubernetes
○ Cloudera Data Platform for Private Clouds
○ Cloudbees introduces DoD compliant CI, now with a CtF to deploy into an environment with an ATO, which meets DISA STIG and NIST RMF security guidelines
●Episode 44, with Tracy Miranda
○ Gokul Chandra writes up Anthos
Google internships go virtual with the help of open source
Eric Brewer, Google
- Google is using open source to keep running its internship program, which started in 1999, now that it has switched to a virtual format. Thousands of interns from 43 countries around the world are participating.
- Because of COVID-19, the interns do not have access to technical resources at Google’s offices, but over 1,000 technical interns are actively contributing to open source projects.
Supporting the Evolving Ingress Specification in Kubernetes 1.18
Alex Gervais, Datawire
- A post on the Kubernetes blog at Kubernetes.io. Earlier this year, the Kubernetes team released Kubernetes 1.18, extending Ingress. In this post, you’ll learn what’s new in the new Ingress spec, what it means for your app, and how to upgrade to an Ingress controller that supports this new spec.
Kubernetes by the numbers, in 2020: 12 stats to see
Kevin Casey, Red Hat
- Again this week, an easy-to-understand article by Kevin Casey of Red Hat that makes its points through numbers.
- The theme is “How is Kubernetes impacting enterprise IT?” and “12 compelling Kubernetes statistics”.
Google’s Anthos from the eyes of a Kubernetes developer
Janakiram MSV, The New Stack
- Part of a limited series of articles by Janakiram MSV (Analyst) on The New Stack about Anthos, a Kubernetes service on Google Cloud Platform. This article provides an overview of Anthos and its major components.
- Each part of the series focuses on specific aspects of Anthos, covering cluster registration, Anthos configuration management, launching “click to deploy” applications from the GCP Marketplace, and more.
Kubernetes startup Kubermatic, formerly Loodse, open-sources its core technology
Ron Miller, TechCrunch
- A TechCrunch article reporting that Loodse, the German company behind a Kubernetes automation platform, has renamed itself “Kubermatic” and announced that it will open-source the Kubermatic Kubernetes Platform under the Apache 2.0 license.
- They interviewed co-founder Sebastian Scheele and share the comments and background they obtained about the circumstances leading up to this announcement and the plans for the future.
MayaData launches Kubera; a Kubernetes management service
Chris Mellor, Blocks & Files
- The launch announcement on MayaData’s web page for Kubera, a SaaS for Kubernetes operations management. Free for personal use.
- Click here for a blog post on Kubera by the company’s CEO, Evan Powell.
Upcoming CNCF webinars
You can check some Recorded Webinars and Upcoming Webinars here. The following are posted as Upcoming CNCF webinars at that moment.
Member Webinar: How to Promote the use of Best Practices and Automate Security Policies Using Tools Like OPA and Kubernetes
Gary Duan, CTO and Co-Founder @NeuVector
June 18, 2020 10:00 AM Pacific Time
Member Webinar: Fast packet processing with KubeVirt
David Vossel, Principal Software Engineer @RedHat
Petr Horacek, Senior Software Engineer @Red Hat
June 23, 2020 7:00 AM Pacific Time
Member Webinar: Introduction to Cloud Provider Sub Sig BaiduCloud // 介绍SIG Cloud Provider子项目BaiduCloud
Ti Zhou 周倜, Senior Architect 高级架构师 @Baidu 百度
Zichao Ye 叶子超, Senior Software Engineer 高级软件工程师 @Baidu 百度
Tianyuan Sun 孙天元, Senior Software Engineer 高级软件工程师 @Baidu 百度
This webinar will be delivered in Chinese.
June 24, 2020 10:00 AM China Standard Time
Member Webinar: Cloud Infrastructure for Network Functions — Requirements and testing
Dana Nehama, Director, Product Management Network Cloud @Intel Corporation
Petar Torre, Principal Engineer @Intel Corporation
June 24, 2020 7:00 AM Pacific Time
Member webinar: Kubernetes Cost Allocation Done Right
Webb Brown, Co-founder and CEO @Kubecost
June 24, 2020 10:00 AM Pacific Time
Member Webinar: Monitoring Kubernetes clusters by “chatting” with them
Prasad Ghangal, Creator of BotKube and Software geek @InfraCloud
Vishal Biyani, CTO @InfraCloud
Hrishikesh Deodhar, Director of Engineering @InfraCloud
June 25, 2020 10:00 AM Pacific Time
Ambassador Webinar: Commoditise Kubernetes with cluster-api
Gianluca Arbezzano, Senior Staff Software Engineer @Packet
June 26, 2020 10:00 AM Pacific Time
Member Webinar: Best Practices for Running and Implementing Kubernetes
Kendall Miller, President @Fairwinds
Robert Brenna, Director of Open Source @Fairwinds
June 30, 2020 10:00 AM Pacific Time
Member Webinar: 7 Critical Reasons for Kubernetes-Native Backup
Niraj Tolia, CEO and Co-Founder @Kasten
Mark Severson, Member of Technical Staff @Kasten
July 1, 2020 7:00 AM Pacific Time
Member Webinar: Pivoting Your Pipeline from Legacy to Cloud Native
Tracy Ragan, CEO of DeployHub and CDF Board Member
July 1, 2020 1:00 PM Pacific Time
Member Webinar: Stay on top of ongoing Kubernetes security hygiene
Zohar Kaufman, Co-Founder and VP R&D @Portshift.io
Ariel Shuper, VP Product @Portshift.io
July 2, 2020 10:00 AM Pacific Time
Member Webinar: Optimize your Kubernetes Clusters on Azure with Built-in Best Practices
Jorge Palma, Senior Program Manager @Microsoft
July 7, 2020 10:00 AM Pacific Time
Member Webinar: The Challenges and Countermeasures of Service Mesh Practice
Fei Pei 裴斐, Cloud Computing Technical Expert and Architect 云计算技术专家、架构师, NetEase Hangzhou Research Institute 网易 杭州研究院 @NetEase 网易
This webinar will be delivered in Chinese.
July 8, 2020 10:00 AM China Standard Time
Project Webinar: What’s new in Linkerd 2.8 : Multi-cluster Kubernetes made simple and secure by default
Oliver Gould, Linkerd Project Lead, co-founder & CTO @Buoyant
July 8, 2020 10:00 AM Pacific Time
Member Webinar: Building Production-ready Services with Kubernetes and Serverless Architectures
Mike Metral, Software Architect and Engineer @Pulumi
Jason (Jay) Smith, App Modernization Specialist @Google Cloud
July 8, 2020 1:00 PM Pacific Time
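Member Webinar: How to Put Service Mesh into Practice, from Technology Selection to Implementation // 如何落地 Service Mesh — 从技术选型到实践
Ruofei Ma 马若飞, Chief Engineer 首席工程师, FreeWheel Beijing R&D Center 北京研发中心 @FreeWheel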
This webinar will be delivered in Chinese.
July 9, 2020 10:00 AM China Standard Time
Member Webinar: The top 10 most-useful Kubernetes APIs for comprehensive cloud-native observability
Caleb Hailey, Co-founder and CEO @Sensu
July 9, 2020 10:00 AM Pacific Time
Member Webinar: Securing and Accelerating the Kubernetes CNI Data Plane with Project Antrea and NVIDIA Mellanox ConnectX SmartNICs
Antonin Bas, Maintainer of Project Antrea and Staff Engineer @VMware
Moshe Levi, Sr. Staff Engineer @NVIDIA
July 14, 2020 10:00 AM Pacific Time
Member Webinar: Serving Millions of Customers with Cloud Native and DevSecOps
Chris Hollies, CTO, Oracle Practice @Capgemini
Akshai Parthasarathy, Principal Director, Cloud Native and DevOps @Oracle Cloud
July 15, 2020 7:00 AM Pacific Time
Member Webinar: Advancing image security and compliance through Container Image Encryption!
Brandon Lum, Senior Software Engineer @IBM
July 15, 2020 10:00 AM Pacific Time
Member Webinar: Kubernetes and storage. Kubernetes for storage. An overview.
Kiran Mova, Chief Architect at MayaData and core maintainer of OpenEBS @MayaData
July 16, 2020 10:00 AM Pacific Time
Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
TiKV team
July 31, 2020 10:00 AM Pacific Time
How did you find these articles? Did any of them catch your interest?
To be honest, there is some content I have not been able to digest yet, so I will use this aide-memoire and these links to catch up myself too.
Bye now!!