- In this blog post series, I go through the following three weekly mailing lists I subscribe to, leaving comments as an aide-memoire along with useful links.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a useful reference for people looking for this kind of information.
- The title is “Addressing Build & Deploy Anti-patterns for Continuous Delivery Success”.
- To address anti-patterns in continuous delivery, the author explained that “Everyone needs to be involved and actively listening in on this feedback loop.”
- In conclusion, he wrote “Software needs to be delivered. Software needs to be used.” and “Real users will tell us if the software fulfills their requirements.”
- The title is “Porter: An Open Source Cloud Native Load Balancer in CNCF Landscape”.
- An article that describes the OSS load balancer “Porter” for bare metal Kubernetes clusters.
- Porter entered the CNCF Landscape last week, and the article also touches on its parent project, KubeSphere.
I mentioned Vector last week in the tools section. Anyone interested in a new monitoring tool should find this post interesting. Basic installation and usage instructions, and some observations about design choices.
- The title is “A Bit of a Vector”.
- An article explaining “Vector”, which the author, who is always looking for new monitoring and logging tools, has been checking out recently.
- The author has written several books, including one on logging, and they feature logging as part of broader monitoring frameworks.
- The title is “Kubernetes and Networks-why is this so dang hard?”.
- Tim Hockin of Google explains why Kubernetes networking is so difficult, using slides published on Speaker Deck.
- Information is added step by step to the same base figure, which makes the slides easy to follow. The trade-offs of each structure are explained carefully.
- The title is “Operators and Sidecars Are the New Model for Software Delivery”.
- The original article was published on The New Stack. It describes the Operator model, the sidecar model, and the transition from a single-runtime to a multi-runtime application architecture.
- The title is “Btrfs at Facebook”.
- An article from LWN.net. I didn’t know LWN.net existed, but it is described as follows: “Over the years LWN has grown with Linux and become one of the definitive Linux news sites.” It was interesting that Facebook occasionally brings down entire data centers to test how well its disaster-recovery mechanisms work and to identify and fix problems.
- The title is “Scaling Linux Services: Before accepting connections”.
- An article summarizing what the author found while investigating errors in a GitHub production system.
- It focuses on some standard resource limits that exist before the client socket is passed to the application. I will read it again.
- The title is “Introducing Choria Scout”.
- Click here for the article introducing Scout components.
- The new project “Choria Scout” is “a highly scalable system health monitoring framework and monitoring data pipeline released under the Apache 2.0 license”.
- The GitHub page of the OSS tool “Dockle”, which has the following three features.
- Container image linter for security
- Help build Docker images based on best practices
- Easy to get started
- I recognized some of the contributors’ profile images from elsewhere.
- The GitHub page of “Krane”, a simple Kubernetes RBAC static analysis tool.
- It identifies potential security risks in the K8s RBAC design and suggests ways to mitigate them.
- The Krane dashboard shows the current RBAC security status and guides you through its definition.
SRE Weekly Issue #227 July 12th, 2020
This is the first of a pair of articles this week on a major Slack outage in May. This one explores the technical side, with a lot of juicy details on what happened and how.
Laura Nolan — Slack
- An article that details Slack’s first serious outage on May 12, 2020 from a technical perspective.
- The top highlight says, “The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time, and didn’t require any change.” That is a terrifying sentence for anyone who manages infrastructure. However, the author’s sincerity comes through across the article, including the closing words.
- It also touches on the ongoing migration from HAProxy to Envoy Proxy, so I hope another article will be published in the future.
This is the companion article that describes Slack’s incident response process, using the same incident as a case study.
Ryan Katkov — Slack
- An article that describes the response process for the same outage as the Slack article above. I’m impressed with Slack’s established culture and philosophy of feeding lessons from outages and response processes back into feedback loops.
The author saw room for improvement in the retrospective process at Indeed. The article explains the recommendations they made and why, including de-emphasizing the generation of remediation items in favor of learning.
- Drawing on the incident retrospectives conducted at Indeed, where the author works, this article sets out to improve the process, starting from the observation that “It became apparent to me that we were not using every incident to realize our full potential to learn.”
The datacenter was purposefully switched to generator power during planned power maintenance, but unfortunately the fuel delivery system failed.
- Google Compute Engine (GCE)-based components in the us-east1-c and us-east1-d zones were affected, due to a fuel supply system failure that occurred during maintenance by a power equipment provider in the us-east1 region.
This is a good primer on the ins and outs of running a post-incident analysis.
Anusuya Kannabiran — Squadcast
- An article summarizing the importance of effective incident postmortems, the perspectives to cover, and six steps for writing them:
- Start with an incident timeline
- Conduct a postmortem meeting with anyone internal to the team who was affected by the incident
- Define roles and owners along with having a moderator
- Determine the urgency of an incident by setting the right thresholds
- Devil’s in the Details — incident metrics and other key information captured
- Publish and track post mortems promptly
This article goes through an interesting technique for setting up SLO metrics and alerts in GCP using Terraform and OpenCensus.
Cindy Quach — Google
- An article by Google Cloud that explains “How to create SLOs out of the box with service monitoring, and how to create SLOs for custom metrics to get better observability for your customer-focused metrics.”, with an example built using Terraform and OpenCensus.
GitHub is committing to publishing a report on their availability each month with detail on incidents. This intro includes the reports for May and June with a description of 4 incidents.
Keith Ballinger — GitHub
- An article from GitHub introducing the “Availability Report”, a new initiative.
- They will publish a GitHub availability report on the first Wednesday of each month, including a description of any incidents that occurred and an update on how they are evolving their engineering systems and practices in response.
- They also said that “You should expect these updates to include a summary of what happened, as well as a technical explanation for incidents where we believe the occurrence was novel and contains information that helps engineers around the world learn how to improve product operations at scale.” I appreciate this and will use it as a reference in the future.
This is neat: Blameless transitioned from “startup mode” toward an SRE methodology, becoming customer 0 of their own product in the process.
- An article explaining Blameless’s SRE efforts and how they implemented important best practices.
- I have always found Blameless’s articles persuasive and realistic, even when they explain principles and theories. Now I see that it is because they overcame difficult situations themselves by implementing and improving SRE best practices.
- Facebook SDK
○ Like in May, a Facebook SDK release caused problems on iOS for Spotify, Pinterest, Tinder.
- Uber Eats
KubeWeekly #225 July 17th
Editor’s pick of the highlights from the past week.
Fluent Bit, a sub-project under the umbrella of CNCF graduated project Fluentd, has reached its version v1.5.
One of the biggest highlights of this major release is the joint work of different companies contributing with Fluent Bit core maintainers to bring improved and new connectors for observability cloud services provided by Google, Amazon, LogDNA, New Relic and Sumo Logic within others.
- An article announcing that Fluent Bit, a subproject of Fluentd, has reached version 1.5.
- The highlight is the joint work of different companies with the Fluent Bit core maintainers to bring new and improved connectors for the observability cloud services provided by Google, Amazon, LogDNA, New Relic, and Sumo Logic, among others. The following three new output connectors were added.
○ Amazon CloudWatch Logs
○ New Relic
KubeCon + CloudNativeCon EU Virtual — Complimentary pass option!
This week, we launched a complimentary pass option for KubeCon + CloudNativeCon EU Virtual. Unsure which pass is the best option for you? We’ve outlined the benefits of the full pass and complimentary option below.
A reminder that the complimentary pass offer is only available through August 7, 23:59 CEST, so act fast!
- Complimentary pass option announcement for KubeCon + CloudNativeCon EU Virtual.
- With this option, you can join all keynote sessions, the Sponsor Showcase, and the Sponsor Demo Theater, and engage with project maintainers and leads.
- The Full Event Pass includes 50% off the CKA or CKAD training + exam bundle, so it is a good deal for those who plan to take either exam in the future.
KubeCon + CloudNativeCon EU Virtual Session Spotlight
The countdown to KubeCon + CloudNativeCon EU Virtual on August 17–20, 2020 is on! As we approach the event, we curated a few recommended sessions that we don’t want you to miss. Please see the feature for this week and be sure to register today!
Hosted by the Network Service Mesh Community
August 17, 2020
Why Attend NSMCon? Are you running workloads in multiple clusters? Across multiple clouds: on-premises, hybrid, multicloud, or public cloud? Do they need to interact with legacy workloads running in less “cloudy” environments? Network Service Mesh (NSM) ties them all together, at the granularity of individual workloads, not cluster/VPCs/data centers. NSM is a community-driven, CNCF Sandbox project that is rapidly gaining momentum because of its ability to simplify connectivity between workloads, independent of where they are running. It extends an IP reachability domain to workloads running in multiple clusters, legacy environments, on-premises, or in a public cloud, communicating with the protocols they are currently using.
Join the people building and using NSM at NSMCon for a day of tutorials, deep dives, and use cases to learn how NSM works, what it can do for you, and, most importantly, what’s coming next.
- This week’s KubeCon + CloudNativeCon EU Virtual session spotlight is the “Day Zero Co-located Event: NSMCon”. It starts on Monday, August 17 at 13:00 CEST (Central European Summer Time).
- A one-day event from the Network Service Mesh (NSM) community with tutorials, deep dives, and use cases, covering how NSM works, what it can do, and, most importantly, what is coming next.
ICYMI: CNCF Webinars
Weekly recap of CNCF member and project webinars that you might have missed.
You can view all CNCF recorded and upcoming webinars here.
Antonin Bas, Maintainer of Project Antrea and Staff Engineer @VMware Moshe Levi, Sr. Staff Engineer @NVIDIA and Itay Ozery, Director, Product Management for Networking @NVIDIA
- It explains how to secure and accelerate the Kubernetes CNI data plane with the Kubernetes CNI network plugin “Antrea” and NVIDIA Mellanox ConnectX SmartNICs.
Zhang Kai, Staff Engineer @Alibaba & Wang Qingcan, Senior Engineer @Alibaba
- The language for this webinar is Chinese.
- The engineers from Alibaba explain the Kubernetes scheduling framework, how to support complex workloads such as AI/ML and mixed resource environments including GPUs, how to implement job-level scheduling policies, and more.
Brandon Lum, Senior Software Engineer @IBM
- It explains container image encryption, a recently introduced feature.
Chris Hollies, CTO, Oracle Practice @Capgemini Akshai Parthasarathy, Principal Director, Cloud Native and DevOps @Oracle Cloud
- It explains how they re-architected the system of their client, a leading financial organization with millions of customers, by applying cloud native and DevSecOps practices.
Kiran Mova, Chief Architect at MayaData and core maintainer of OpenEBS @MayaData
- Kiran Mova, chief architect and community leader for the CNCF project OpenEBS, discussed “Container Attached Storage”, an approach that enables storage on a per-workload basis with Kubernetes itself.
- There is also a brief demo explaining how to operate OpenEBS’s Container Attached Storage with other open source software and services.
Tutorials, tools, and more that take you on a deep dive into the code.
Jessica Cherry, Opensource.com
Jessica Cherry, Opensource.com
- An article that describes Lens as an IDE (Integrated Development Environment) tool that manages Kubernetes clusters with a GUI instead of a CLI.
Dominique Vernier, Red Hat
- An article that describes how to create operators with profile reporting and deploy them in a Kubernetes environment to check code coverage.
- The author recently rebuilt the printfacts server based on the warp web server framework.
- The server is set up to deploy automatically to a Kubernetes cluster on each commit to the source repository, and the article explains how to configure this.
Ronak Nathani, LinkedIn
- The author digs a little deeper into the iptables rules for NodePort type Services, sharing some of what he learned along with answers to the following questions he encountered while working with kube-proxy.
- “What happens when a non-kubernetes process starts using a port that’s allocated as a NodePort to a service?”
- “Does the service endpoint continue to route traffic to pods if kube-proxy (configured in iptables mode) process dies on the node?”
Nandor Kracser, Banzai Cloud
- The author prefaces the article with “Today’s post is going to be rather technical, since we’ll be discussing authenticating Kubernetes applications with external systems through OIDC issuer discovery.” and then explains the details with easy-to-read diagrams and code blocks.
- He uses Vault on Kubernetes as an OIDC (OpenID Connect) consumer, and a simple client application running in the cluster accesses the Vault instance with the ServiceAccount token provided to it.
- An article that describes how to deploy a 5G core PoC using the KubeOne project. The implementation runs in an AWS environment.
Betty Junod, Solo.io
- It is a tutorial as the title says, with a sample YAML configuration file and a demo video.
Solo.io Engineering Team
- The first post in the blog series. They started with “We will dig into specific challenge areas for multi-cluster Kubernetes and service mesh architecture, considerations and approaches in solving them. For our first post, we’ll focus on service discovery and how we need to approach it for multi-cluster environments.”
Chris Short, Matt Dorn, and Michael Hrivnak, Red Hat
- A Twitch video from Red Hat Summit titled “Building Kubernetes Operators with Ansible Workshop”.
Tom Sweeney (Red Hat)
- An article explaining a problem the author recently experienced and its solution. He commented, “I recently experienced a problem where image version information wasn’t updating after a build trigger event. This issue left us with out-of-date images. Clearly, not acceptable.”
Sudip Sengupta, Hackernoon
- The three steps Kubernetes uses to enforce security access and permissions are Authentication, Authorization, and Admission, and this article focuses on Authentication.
Ameya Shenoy, BrowserStack
- The story of the author trying to block ads at the DNS level using WireGuard VPN to improve the web browsing experience and privacy of the family.
- He uses the uBlock Origin extension in his browser to block ads.
Articles, announcements, and more that give you a high-level overview of challenges and features.
Robert Brennan, Fairwinds
- They discussed the 5 key issues that impact cost estimates, and looked at the strategies used to overcome these issues in Fairwinds Insights cost dashboards.
- Bin Packing
- Resource Asymmetry
- Cost Attribution
- Resource Ranges
- Noisy Neighbors
- They said that “We’ll present a more in-depth review of right sizing in a forthcoming blog.”
Forrest Brazeal, A Cloud Guru
- The story of the EKS migration of Basecamp’s email application “HEY”. It was interesting, covering topics such as the migration from GKE to EKS and the use of Spot Instances.
- HEY is being talked about more and more, but I haven’t tried it yet, so I decided to give it a try.
Puja Abbassi, Giant Swarm
- The first article in the blog series. It takes a closer look at some of the most popular and interesting technologies aimed at addressing configuration management issues.
- It explores the following tools to see what they offer and to understand their pros and cons. The Helm Package Manager and Kustomize articles have already been published (as of August 7th, 2020), so if you are interested, follow the link below.
○ Helm Package Manager
Mike Elsmore, Logz.io
- It examines the principles of chaos engineering, how and why it works, and how to use chaos engineering to protect your Kubernetes environment.
Gedalyah Reback, Logz.io
- While explaining some of the core concepts behind Kubernetes, it focuses on providing five resources to help you get it right.
- Cluster Administration
- Kubernetes Storage
- Learn Kubernetes Security
- Kubernetes Observability
Kaslin Fields, Kubernetes Upstream Marketing Team
- It puts the spotlight on “SIG-Windows” and tells the story of how Kubernetes contributors work together to provide a container orchestrator that works for both Linux and Windows.
- A transcript of the “Livin’ on the Edge” podcast. This week’s guest is Dave Sudia (Senior DevOps Engineer at GoSpotCheck).
- They talked about creating an effective local developer experience for Kubernetes, migrating from Heroku, building a Kubernetes-based platform as a service (PaaS), and more.
- The story of “How his team developed an understanding of all of the personas involved with creating a platform” was interesting.
Mia Platform Blog
- Starting with “We are living in the API economy.”, they talked about “API Security, what it is and how to approach it” and explained best practices with infrastructure strategies.
Hrishikesh Deodhar, InfraCloud Technologies
- It describes the “K10 Data Management Platform” by Kasten.
- The platform functions as a way to perform backup/restore of Kubernetes applications and their volumes.
- CNCF announced the start of a new SIG (Special Interest Group), “CNCF SIG Contributor Strategy”.
- Check the Charter and Contributing Guide for details.
- Chairs are Paris Pittman, Josh Berkus, and Stephen Augustus.
- They hold one-hour regular meetings biweekly on Thursdays from 10:30 a.m. PT (17:30 UTC).
Upcoming CNCF webinars
You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time.
Member Webinar: Kubernetes Security Anatomy and the Recently Disclosed CVEs
Gadi Naor, CTO & Co-Founder @Alcide
July 21, 2020 10:00 AM Pacific Time
Ambassador Webinar: Building application management platform with Open Application Model
Lei Zhang (Harry), Staff Engineer @Alibaba
This webinar will be delivered in Chinese.
July 22, 2020 10:00 AM China Standard Time
Member Webinar: Kubernetes Secrets Management: Build Secure Apps Faster Without Secrets
Jody Hunt, Director of DevOps Security @CyberArk
July 22, 2020 7:00 AM Pacific Time
Member Webinar: Implementing Canary Releases on Kubernetes w/ Spinnaker, Istio, and Prometheus
Oleg Chunikhin, CTO @Kublr
July 22, 2020 1:00 PM Pacific Time
Member Webinar: Observability of multi-party computation with OpenTelemetry
Antoine Toulme, Engineering Manager @Splunk
Dave McAllister, Sr. Technical Evangelist @Splunk
July 23, 2020 10:00 AM Pacific Time
Member Webinar: One large cluster or lots of small ones? Pros, cons and when to apply each approach
Flavio Castelli, Distinguished Engineer @SUSE
July 24, 2020 10:00 AM Pacific Time
Member Webinar: Kubernetes Policies 101
Eran Leib, Founder, VP Product Management @Apolicy
Spenser Paul, Director of Sales, North America @DoiT International
July 28, 2020 10:00 AM Pacific Time
Member Webinar: GitOps Continuous Delivery with Argo and Codefresh
Dan Garfield, Chief Technology Evangelist @Codefresh
July 29, 2020 1:00 PM Pacific Time
Member Webinar: Cluster API — Yesterday, Today, Tomorrow
Saad Malik, CTO & Co-Founder @Spectro Cloud
Jun Zhou, Chief Architect @Spectro Cloud
July 30, 2020 10:00 AM Pacific Time
Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
July 31, 2020 10:00 AM Pacific Time
Member Webinar: Comparing eBPF and Istio/Envoy for Monitoring Microservice Interactions
Roko Kruze, Solutions Engineer @Flowmill
Mike Cohen, Co-Founder and COO @Flowmill
Aug 4, 2020 10:00 AM Pacific Time
Member Webinar: Debugging your debugging tools; What to do when your service mesh goes down in production?
Neeraj Poddar, Co-founder and Chief Architect @Aspen Mesh
Aug 5, 2020 7:00 AM Pacific Time
Member Webinar: Making Data Work for Developers with Kubernetes & Cassandra
Chris Splinter, Sr. Product Manager — Developer Solutions @DataStax
Patrick McFadin, VP of Developer Relations @DataStax
Aug 5, 2020 1:00 PM Pacific Time
Member Webinar: Maximizing M3 — Pushing performance boundaries at scale in a cloud-native distributed metrics engine
Ryan Allen, Senior Software Engineer @Chronosphere
Aug 6, 2020 10:00 AM Pacific Time
Member Webinar: Hardware for Kubernetes, Peeling Back the Layers
Erik Reidel, SVP Compute & Storage Solutions @ITRenew
Aug 11, 2020 10:00 AM Pacific Time
Project Webinar: Kubernetes 1.19
Kubernetes release team
Aug 28, 2020 10:00 AM Pacific Time
Member Webinar: Getting started with container runtime security using Falco
Loris Degioanni, CTO and Founder @Sysdig
Sept 2, 2020 1:00 PM Pacific Time
What did you think of these articles? Did any of them interest you?
To be honest, there is some content I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself too.