SRE / DevOps / Kubernetes Weekly Collection #36 (Week 41)

- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and leave some comments and useful links as an aide-memoire.
- Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #510 October 4th, 2020
SRE Weekly Issue #238 October 4th, 2020
KubeWeekly #236 October 9th, 2020
DEVOPS WEEKLY ISSUE #510 October 4th, 2020
News
- The web page of HIPAA (Health Insurance Portability and Accountability Act) compliance policies, an open source version of the Datica HIPAA Compliance Policies. I think it’s great that these policies are shared, and I am interested in reading them as well.
- Click here for the GitHub page.
- The title is “The Three Ways of DevOps”.
- It describes how to apply the concept of least privilege security to cloud instances within a DevOps environment without adding bulk or latency to your pipeline. It explains the following three points about DevOps.
○ The first way of DevOps is to emphasize the speed and efficiency of the entire system, instead of just your part.
○ The second way of DevOps is fast feedback.
○ The third way of DevOps is continuous learning.
- The title is “Complex Adaptive Systems (ii): thinking about emergence and ITSM”.
- It focuses on “emergence”, one particular characteristic of such a system, in the context of better understanding its impact on IT service management.
- I could picture what the author was describing, but I couldn’t read it smoothly because it uses many phrases and expressions I’m not familiar with. I think that is a problem unique to me, so I will skip the details.
- The title is “Monolith -> Services: Theory & Practice”.
- In response to the question, “How can we get from a monolith to micro-services quickly?”, he said,
○ Can’t answer that question.
● First, “quickly” is right out the window. You didn’t make this mess in a month; you’re not going to fix it in a month.
● Second, you want some benefit you aren’t currently getting that you expect from micro-services. What is that benefit? Micro-services aren’t the point.
- He explains the following points.
○ Managing Coupling
○ Not Quickly
○ Good News
- The title is “Why you need a service registry”.
- It describes how GoCardless built the registry and some of the use cases it found.
- It reminded me that I had seen Jsonnet used this way in the SRE Workbook and elsewhere. When I found that they are heavy users of GCP, it made sense to me.
- The title is “Security September: Still Early Days for ABAC”.
- It describes exploits that allow someone to tag S3 buckets without permission, even with explicit deny, and what this means for attribute-based access control (ABAC). The following two points seem to be what the author would like to say the most.
- I strongly believe that at this point tags are a tool for auditing, billing and automation — not for access control.
- With the amount of services utilizing tags for their own purposes and subsequently tagging permissions being seen as a low risk permission, I recommend cloud security practitioners stick to traditional resource-based access control.
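As a memo for myself, the concern above can be pictured with a toy tag-matching check in Python. This is only a minimal sketch under my own assumptions, not AWS's policy engine: the `Principal`, `Resource`, and `is_allowed` names and the `team` tag key are all hypothetical. It only illustrates that when tags drive authorization, being able to change a tag is effectively being able to change who gets access.

```python
from dataclasses import dataclass, field


@dataclass
class Principal:
    name: str
    tags: dict = field(default_factory=dict)


@dataclass
class Resource:
    name: str
    tags: dict = field(default_factory=dict)


def is_allowed(principal: Principal, resource: Resource, key: str = "team") -> bool:
    """Toy ABAC rule: allow only when both sides carry the same value for `key`."""
    value = principal.tags.get(key)
    return value is not None and value == resource.tags.get(key)


# Hypothetical principals and bucket, invented for illustration.
alice = Principal("alice", {"team": "payments"})
mallory = Principal("mallory", {"team": "marketing"})
bucket = Resource("payments-ledger", {"team": "payments"})

before = is_allowed(mallory, bucket)  # mallory's tag does not match the bucket's tag
bucket.tags["team"] = "marketing"     # simulate a re-tag that should not have been possible
after = is_allowed(mallory, bucket)   # the unchanged rule now lets mallory in
```

The rule itself never changed between `before` and `after`; only the tag did, which is why tagging permissions cannot be treated as low risk once tags are used for access control.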
- The title is “Istio at Scale: Sidecar”.
- The following behavior is fine if the mesh has only a handful of services, but as the mesh begins to grow, it will consume resources (CPU, network I/O, memory) and take longer to propagate changes.
○ By default; all proxies on your mesh will receive all the config required in order to reach any other proxy.
○ It’s important to note that by “Config” — that isn’t just what you put in your DestinationRule or VirtualService, but also the state of all of the Endpoints for your services too, therefore whenever you do a Deployment, or a Pod becomes unready, or your service scales, or anything else that changes state — config is pushed to all the proxies.
- As a solution to the above problem, it explains the following three topics, including the Sidecar resource provided by Istio.
○ Configuring the Sidecar
○ Monitoring Pilot
○ Detecting Misconfiguration
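To get a feel for why the default behavior stops scaling, I find a back-of-the-envelope count helpful. The Python sketch below is not Istio code. It is a rough model under my own assumptions (one proxy per service, hypothetical service names and function names) that compares "every proxy knows every service" with "every proxy knows only its declared dependencies", which is the kind of scoping the Sidecar resource is meant to provide.

```python
def config_entries_default(num_services: int) -> int:
    """Every proxy receives config for every service in the mesh."""
    return num_services * num_services


def config_entries_scoped(dependencies: dict) -> int:
    """Each proxy receives config only for the services it declares it talks to."""
    return sum(len(deps) for deps in dependencies.values())


# Hypothetical mesh: 4 services, each calling at most two others.
deps = {
    "frontend": {"cart", "catalog"},
    "cart": {"payments"},
    "catalog": set(),
    "payments": set(),
}

default_total = config_entries_default(len(deps))  # 4 proxies x 4 services
scoped_total = config_entries_scoped(deps)         # 2 + 1 + 0 + 0
```

In this model the default case grows quadratically with the number of services, while the scoped case grows only with the number of real dependencies.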
- The title is “Use Terraform to Create and Manage a HA AKS Kubernetes Cluster in Azure”.
- I will skip it because it was covered in KubeWeekly #235 last week.
Tools
- I will skip it because it was covered in KubeWeekly #235 last week.
- The project page of Apache Pinot, a real-time distributed OLAP data store that provides low-latency, scalable real-time analytics.
- Click here for the GitHub page.
SRE Weekly Issue #238 October 4th, 2020
Articles
On Call Shouldn’t Suck: A Guide For Managers
Lots of really great advice in here. And really, with a title like that, I couldn’t resist reading it!
Charity Majors
- A blog post written for managers on making sure that on-call does not suck. Its main point is as follows.
It is engineering’s responsibility to be on call and own their code. It is management’s responsibility to make sure that on call does not suck. This is a handshake, it goes both ways, and if you do not hold up your end they should quit and leave you.
Follow-up for Google Cloud Infrastructure Components Incident #20010
Last week, I mentioned a Google Cloud Platform outage that affected multiple services. Here’s the detailed post-analysis by Google.
- A follow-up on the GCP failure that occurred on 9/24 from 18:00 to 18:33 (US/Pacific).
- The report starts by explaining the roles of Global Service Load Balancer (GSLB) and Google Front End (GFE) in its background part, so if you are not familiar with them, I recommend grasping that part first before reading the rest (I would not say that I understand all of it).
Team Play with a Powerful and Independent Agent: A Full-Mission Simulation Study
This one is along the lines of the classic Ironies of Automation paper by Bainbridge. How can automation be a team player, and what happens when it isn’t?
Nadine Sarter and David Woods (original paper)
Thai Wood — Resilience Roundup (summary)
- An article summarizing the paper. The blog author’s main points are as follows.
○ While this paper focuses on the A320 aircraft specifically, the authors are very clear that this is not an issue with this one aircraft or even just aviation in general, but is a problem that can occur when automation doesn’t act as a team player.
○ Automation acts as more of a “team player” when it acts in a way that allows humans to keep track of what it’s doing and anticipate what it will do in the future. The authors even suggest that this measure, how much of a team player a given technology is, can even help predict how effective and successful that it is.
○ Conversely, when automation isn’t a team player, automation can be clumsy and confusing.
Applying Chaos Engineering in Healthcare: Getting Started with Sensitive Workloads
How can you use chaos engineering when failures in the system can be critical and even life-threatening?
Carl Chesser — InfoQ
- Chaos engineering has attracted growing interest in recent years, but teams often hesitate to adopt it when a system is viewed as “critical” or too important to fail.
○ A team at Cerner Corporation, a healthcare information technology company, shares what they found effective when introducing this practice to their systems.
- The key takeaways are below.
○ While chaos engineering is a proven technique for improving the resilience of systems, there is often a reluctance amongst stakeholders to introduce the practice when a system is viewed as critical.
○ With critical systems, it can be a good idea to first run experiments in your dev/test type environments to minimize both actual and perceived risk. As you learn new things from these early experiments, you can explain to stakeholders that production is a larger and more complex environment which would further benefit from this practice.
○ Using real production traffic, as opposed to a synthetic workload, can improve the usefulness of the experiments in these early stages.
○ A good chaos engineering practice helps you to improve both the resilience of the system, and its observability when incidents do occur.
This is your Guide for Implementing SRE in NOCs
In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.
Emily Arnot — Blameless
- The following four points explain how to practice SRE in NOCs (Network Operations Centers) to improve their functions. I used to be on the NOC side, so it’s an interesting theme for me.
○ Monitor smarter by focusing on complex metrics
○ Escalate and triage with classification and on-call
○ Get the most from incidents with meaningful response
○ Manage ticketing from a customer-focused perspective
Is your microservice a distributed monolith?
This article suggests using chaos engineering to tell if your microservice-based architecture is secretly a monolith in disguise.
Andre Newman — Gremlin
- It describes what a “distributed monolith”, a microservices anti-pattern, is, why it should be avoided, and how to use chaos engineering to check whether your application falls under this anti-pattern.
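To picture the validation step, I sketched a simulated failure injection over a dependency graph in Python. This is only a paper model under my own assumptions, not the article's method or any chaos tool's API: the service names and the `impacted_by` helper are hypothetical, and a real experiment would inject the failure into a running system and observe it.

```python
def impacted_by(failed: str, hard_deps: dict) -> set:
    """Return every service that stops working when `failed` goes down.

    `hard_deps` maps a service to the services it cannot function without.
    """
    impacted = {failed}
    changed = True
    while changed:
        changed = False
        for service, deps in hard_deps.items():
            # A service fails if any of its hard dependencies has failed.
            if service not in impacted and deps & impacted:
                impacted.add(service)
                changed = True
    return impacted


# Hypothetical architecture: orders and billing depend (directly or
# transitively) on users, while search is independent.
hard_deps = {
    "users": set(),
    "orders": {"users"},
    "billing": {"orders"},
    "search": set(),
}

blast_radius = impacted_by("users", hard_deps)
```

If taking down one service drags most of the others with it, as `users` does here, the services are tightly coupled, and that is the symptom of a distributed monolith.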
Outages
- Slack
- Radware
An accidental BGP hijack by Telstra took down Radware.
- Tokyo Stock Exchange
The Tokyo Stock Exchange was down for an entire day, the first time that’s ever happened.
- Fastly
- Squarespace
- Google Search Indexing
- Microsoft Azure outage #SM79-F88
A problem with Azure Active Directory caused trouble for Office365 and other Microsoft services. Click through for their detailed follow-up.
- Outage information for each of the companies above.
KubeWeekly #236 October 9th, 2020
The Headlines
Editor’s pick of the highlights from the past week.
Cloud Native Computing Foundation announces Rook graduation
Congratulations to Rook on hitting graduated status! Rook is an open source cloud native storage orchestrator for Kubernetes, providing the platform, framework, and support for a diverse set of storage solutions to natively integrate with cloud native environments. Rook delivers its services via a Kubernetes Operator for each storage provider. Originally accepted into CNCF in 2018, it is the 13th CNCF project — and the first based on block, file, or object storage — to graduate.
- The CNCF announced that Rook has become the 13th CNCF project to reach “Graduated” status.
- Rook is an open source cloud native storage orchestrator for Kubernetes. It was accepted as a project by the CNCF in 2018.
ICYMI: CNCF Webinars
You can view all CNCF recorded and upcoming webinars here.
CNCF Project webinar: Transactional microservices with Vitess — coordination without state
Daniel Kozlowski, Minister of Engineering @PlanetScale
- It shows how to use the cloud native database “Vitess” to create secure ACID distributed transactions that behave exactly like transactions on a monolithic system without increasing complexity.
CNCF Member webinar: Multi-cluster & multi-cloud service mesh with CNCF’s Kuma and Envoy
Marco Palladino, CTO & Co-Founder @Kong
- It explains how to launch Kubernetes clusters in multiple regions and secure, route, connect, and monitor service connections with a distributed service mesh with the following points.
○ Use Kuma’s multi-zone deployment to spin up a multi-cluster and multi-region service mesh.
○ Leverage the global/remote control separation to scale reliability with HA.
○ Use the built-in service discovery and ingress capability for out of the box service connectivity across multiple zones, clusters and regions.
○ Use Kuma’s policy to determine the behavior of traffic across different clusters, like Traffic Route, mTLS, Traffic Permission and so on.
CNCF Member webinar: Kubernetes native two-level resource management for AI/ML workloads
Diana Arroyo Software Engineer @IBM Research and Alaa Youssef, Manager, Container Cloud Platform @IBM Research
- It explains the challenges of resource management for AI/ML workloads in Kubernetes-based environments, and how two-level resource management using Multi-Cluster App Dispatcher addresses them.
Boyan Krosnov, CPO @StorPool
- It explains how various orchestration systems, such as public clouds, OpenStack, and Kubernetes, have taken very similar paths in evolving their storage capabilities, and why persistent storage is important.
CNCF Member webinar: Building dynamic machine learning pipelines with KubeDirector
Tom Phelan, Fellow, Software Organization @Hewlett Packard Enterprise, Kartik Mathur, Master Technologist @Hewlett Packard Enterprise and Donald Wake, Technical Marketing Engineer @Hewlett Packard Enterprise
- It explains the machine learning (ML) pipeline through the following points. An ML pipeline is complex to set up and even more difficult to maintain when the data model is constantly changing, which is why a dynamic ML pipeline is essentially required.
○ Discussing an example ML Pipeline centered around supporting an application that must predict travel times based upon a large data set of taxi ride data.
○ Walking through the development of the full ML pipeline using Kubernetes and another Open Source application called KubeDirector. You will learn how to train, register, and finally, query your model for answers.
○ Learning how a new capability of KubeDirector called “Connections” enables a dynamic, always up-to-date ML model.
The Technical
Tutorials, tools, and more that take you on a deep dive into the code.
Identity federation for multi-cluster Kubernetes and service mesh
Dennis Jannot, Solo.io
- A blog series article detailing specific challenge areas of multi-cluster Kubernetes and service mesh architectures, as well as considerations and approaches to solving them.
- This installment explains how to federate identities between multiple clusters for authentication between services.
- There are two types of authentication required for this environment, and this one is focusing on “authentication between services”. Click here for another one, “end-user authentication”.
Call an existing REST service with Apache Camel K
Mary Cochran, Red Hat
- It shows how to call an existing REST service and create a Camel K integration that uses an existing data format.
An architect’s guide to APIs: REST, GraphQL, and gRPC
Bob Reselman, Code District
- It describes the history of data exchange and compares REST, GraphQL, and gRPC APIs with each other.
Self-hosted Github Actions runners in Kubernetes
Vito Botta, Dynablogger
- As the title says, it explains how to run self-hosted GitHub Actions runners on Kubernetes using the simple official Docker-in-Docker image.
Kubernetes integration and more in odo 2.0
Serena Chechile Nichols and Steve Speicher, Red Hat
- It introduces the highlights of the v2.0 release of “odo”, the developer CLI for OpenShift and Kubernetes.
Provisioning Kubernetes clusters on AWS with Terraform and EKS
Kristijan Mitevski, LiveDOOH
- The content is as described in the title and the TL;DR below. The careful explanations make you want to try it yourself. I bookmarked it.
○ TL;DR: In this guide, you will learn how to create clusters on the AWS Elastic Kubernetes Service (EKS) with eksctl and Terraform. By the end of the tutorial, you will automate creating three clusters (dev, staging, prod) complete with the ALB Ingress Controller in a single click.
Saiyam Pathak, Civo
- It explains how to install a standalone Kubeflow pipeline on a Civo managed k3s cluster. I have the impression that there are more articles about ML pipelines this week.
How to perform a CNI live migration from Flannel+Calico to Cilium
Josh Van Leeuwen, Sky Betting and Gaming
- It explains why we need to change CNI, what the author learned about developing a live migration solution, and how everything works.
- I thought, “I’ve seen this diagram somewhere before”, so I checked the previous article and found that I covered it in my blog last month.
How to set up PostgreSQL monitoring in Kubernetes
Jonathan S. Katz
- The content is as titled. In the future, it seems that they will publish articles on how to interpret some visualizations that come with the PostgreSQL operator monitoring stack.
- A GitHub page of “Kube DOOM”, a tool to shoot and kill pods in a Kubernetes cluster on the screen of the game “DOOM”
- While I was wondering, “Are you sure you want to do it in 3D?”, I found that some people are actually playing it, as in this YouTube video.
What’s new in Goldilocks 3.0.0
Andy Suderman, Fairwinds
- It explains the changes in v3.0.0 of Goldilocks, a tool for reviewing recommendations for setting resource requests and limits on Kubernetes deployments, and shares the story of how they got there.
The Editorial
Articles, announcements, and more that give you a high-level overview of challenges and features.
Adam Glick and Craig Box, Kubernetes Podcast from Google
- The Kubernetes Podcast, run by Google employees. The current co-hosts are Craig Box and Adam Glick.
- Webb Brown, a former Google PM and the founder of “Kubecost”, is invited as a guest. Kubecost aims to reduce spending and prevent resource-induced failures.
- The topics that interested me in the news of the week are as follows.
○ Announcing Java support for cdk8s
○ Good: Envoy on Windows
○ Kubenav 3.0.0 announced
○ Cisco acquires Portshift
Helm Hub is moving to Artifact Hub
Matt Farina, Rancher Labs and Helm
- The announcement of the migration of Helm Hub to Artifact Hub.
- There are some reasons for the migration as follows.
○ Helm Hub was built to handle a limited number of Helm repositories, and was designed for use cases slightly different from publicly listing as many chart repositories as possible.
○ As the number of repositories increased, it ran into some limitations.
Introducing DigitalOcean App Platform
Apurva Joshi, Digital Ocean
- An introductory article about DigitalOcean’s new PaaS platform “DigitalOcean App Platform”. A video explaining the procedure up to deploying an application is also embedded.
- Someone has already tried it and written a blog post in Japanese, so I think it will be helpful as well: “DigitalOcean App Platformを試してみる”.
Red Hat OpenShift Product Management Team
- A YouTube video of Red Hat’s OpenShift Product Management Team staff explaining the changes in OpenShift 4.6 in detail for nearly two hours in a relay fashion.
Call for questions! sig-HONK AMA KubeCon NA keynote panel
Ian Coldwater, Salesforce
- They are looking for questions for KubeCon NA’s keynote AMA panel session.
- You can also find a short introduction to sig-HONK, an unofficial SIG (Special Interest Group), there.
Bill Johnson, Microsoft
- It shows an example of how to extend the Kubernetes scheduler to take advantage of the natural variation in the carbon intensity of existing power grids, in order to minimize the carbon emissions that Kubernetes clusters are responsible for. I was so surprised by the theme that I read the title twice.
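The core idea can be pictured as a simple scoring step: among the candidate nodes, prefer the one whose power grid is currently the cleanest. The Python sketch below is only my own illustration under stated assumptions. It is not the Kubernetes scheduler API or the article's implementation, and the node names, the `pick_lowest_carbon_node` helper, and the gCO2/kWh figures are all made up.

```python
def pick_lowest_carbon_node(intensity_by_node: dict) -> str:
    """Return the node whose power grid currently has the lowest carbon intensity.

    `intensity_by_node` maps a node name to grams of CO2 per kWh
    (the values used below are invented for illustration).
    """
    if not intensity_by_node:
        raise ValueError("no candidate nodes")
    return min(intensity_by_node, key=intensity_by_node.get)


# Hypothetical readings for three nodes in different regions.
readings = {"node-west": 210.0, "node-north": 45.0, "node-east": 380.0}
chosen = pick_lowest_carbon_node(readings)
```

A real scheduler extension would have to combine such a score with the usual constraints (resources, affinity, and so on), so this shows only the carbon-related part of the decision.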
Service meshes, Envoy, and WebAssembly with Christian Posta
Christian Posta, solo.io and Chris Short, Red Hat
- Christian Posta, Global Field CTO of solo.io and former Chief Architect at Red Hat, is invited as a guest and asked various questions on the theme, such as “Why are there so many service mesh products?” and “Why do so many products use Envoy as their data plane?”.
Open Service Mesh (OSM) is hosting first community call on Oct 13
OSM Community
- The GitHub page of OSM (Open Service Mesh), a lightweight and extensible cloud-native service mesh.
- The OSM project is based on the ideas and implementations of many cloud-native ecosystem projects such as Linkerd, Istio, Consul, Envoy, Kuma, Helm, and SMI specifications.
Upcoming CNCF webinars
You can check some Recorded Webinars and Upcoming Webinars here. The following are posted as Upcoming CNCF webinars at that moment.
Member Webinar: You can be a Kubernetes contributor too!
Jeremy L. Morris, Software Engineer @DigitalOcean
Oct 13, 2020 10:00 AM Pacific Time
Member Webinar: ephemeral.run: A full application environment for every PR–before you merge to master!
Vishal Biyani, CTO @InfraCloud
Jono Spiro, Staff Software Engineer, Engineering Operations @OpenGov
Oct 14, 2020 10:00 AM Pacific Time
Member Webinar: GitOps at scale for a multicloud, multi-region stateful application
Rick Spencer, Head of Platform @InfluxData
Oct 14, 2020 1:00 PM Pacific Time
Member Webinar: S&P experience report: multicloud serverless on Knative
Evan Anderson, Software Engineer @VMware
Mark Wang, Head of Cloud Engineering @S&P Global Ratings
Oct 15, 2020 10:00 AM Pacific Time
Member Webinar: Delivering cloud native apps to Kubernetes using werf
Dmitry Stolyarov CTO, @Flant
Oct 16, 2020 10:00 AM Pacific Time
Member Webinar: How to migrate NF or VNF to CNF without vendor lock-in
Grzegorz Sikora, VP Business Development @OVOO
Oct 20, 2020 10:00 AM Pacific Time
Member Webinar: Deploying Kubernetes to bare metal using cluster API
Seán McCord, Principal Senior Software Engineer @Talos Systems, Inc.
Oct 21, 2020 1:00 PM Pacific Time
Member Webinar: K8s audit logging deep dive
Randy Abernethy, Managing Partner @RX-M
Oct 22, 2020 10:00 AM Pacific Time
Member Webinar: Building 12 factor streaming data apps on Kubernetes
Stelios Charmpalis, Frontend Engineer @Lenses.io
Francisco Perez, Senior Backend Engineer @Lenses.io
Oct 23, 2020 10:00 AM Pacific Time
Member Webinar: Admission controllers: one part of your Kubernetes security and governance toolkit
Gunjan Patel, Cloud Architect @Palo Alto Networks
Robert Haynes, Cloud Security Evangelist @Palo Alto Networks
Oct 28, 2020 7:00 AM Pacific Time
Member Webinar: Security in the world of service meshes
John A. Joyce, Principal Engineer @Cisco
Nov 4, 2020 1:00 PM Pacific Time
Member Webinar: Developer-friendly platforms with Kubernetes and infrastructure as code
Lee Briggs, Staff Software Engineer @Pulumi
Nov 6, 2020 10:00 AM Pacific Time
Member Webinar: Metal³: Kubernetes-native bare metal host management
Maël Kimmerlin, Senior Software Engineer @Ericsson Software Technology
Dec 10, 2020 10:00 AM Pacific Time
How about those articles? Are you interested in any of them?
Actually, there is some content that I cannot digest at this stage, so I’ll make use of this aide-memoire and these links to catch up myself too.
Bye now!!