- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
- Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
The fundamental importance of learning from incidents in building resilient systems is often hard to fully understand when fighting to fix issues. This presentation neatly summarises a bunch of recent research.
- The title is “Presentation: “Findings From The Field””.
- Slides the author presented a few weeks ago at the virtual event “DevOps Enterprise Summit London”. The video has not been released yet.
The slides are full of phrases I’ve seen before, and sure enough they are from John Allspaw, whom I’ve covered in this blog before. They summarise the results of two years of research in which he worked closely with incidents in the field.
- The title is “BRAZEAL: THE LIFT-AND-SHIFT SHOT CLOCK”.
- It explains how the balance between value and risk shifts over time when you migrate to the cloud with “lift-and-shift”, argues that the true cloud transformation should not be put off, and makes the following three points.
- You need SRE
- You need governance
- You need a culture of comprehensive learning
- The title is “OPERATING UNDER UNCERTAINTY”.
- A page that publishes videos and slides from the titled webinar.
- The presenter, from a certified AWS/Kubernetes partner, explains the best practices and architecture of AWS. The topics are very well organized, and I recommend it.
- The title is “What is SRE?”
- An article summarizing the elements of SRE and how to get started. It also points to useful sources of information.
- The title is “Dockerfile Security Checks using OPA Rego Policies with Conftest”.
- It also introduces a hands-on Katacoda scenario where you can learn and practice.
- The title is “7 Pipeline Design Patterns for Continuous Delivery”.
- An article that describes seven pipeline design patterns for CD (Continuous Delivery) that help an organization make a huge leap in speed and stability and help teams perform at an elite level.
The question of whether you should directly call one Lambda function from another comes up regularly in Serverless architecture conversations. This post has some tips why this isn’t always a good idea and when to avoid.
- The title is “Are Lambda-to-Lambda calls really so bad?”
- The author examines the question in the title and concludes “It depends!”. The article provides a decision tree to help readers decide.
- The title is “Logging — let’s do it right!”
- After moving to a startup, the author realized that his past approach to logging was wrong, and he explains both the past and present approaches with concrete examples.
- The GitHub page of the CDK (Cloud Development Kit) for Terraform, an OSS tool that allows developers to define cloud resources in a programming language using Terraform.
SRE Weekly Issue #228 July 19th, 2020
They don’t. They just don’t.
[…] one slow block device can affect the performance of processes even when those processes don’t use the slow block device.
Alex Yates — Octopus Deploy
- The author explains his view under the title “Change Advisory Boards Don’t Work”. A CAB (Change Advisory Board/Change Approval Board) reviews changes to the production environment.
- The following sentence is painful, but it could be true.
○ It’s a noble goal but, unfortunately, CABs mostly do more harm than good.
- Finally, he gives advice on "What to do if your organization is thinking about forming a CAB" and "What to do if you already have a CAB".
Whoops, forgot to include this one last week.
On June 30, Google’s email delivery service was targeted in what we believe was an attempt to bypass spam classification. The result was delayed message processing and increased message queuing.
- A follow-up article on Google Cloud’s Gmail issue.
- Emails were delayed in sending and receiving, and more emails were classified as spam.
My favorite part is the focus on blame awareness:
But it’s not enough to just be blameless — it’s also important to be blame-aware. Being blame-aware means that we are aware of our biases and how they may impact our ability to view an incident impartially.
Isabella Pontecorvo — PagerDuty
- Read about the talk by J. Paul Reed (Senior Applied Resilience Engineer at Netflix) on postmortems, best practices, and what steps you can take to succeed.
- It’s a good idea to vote on the top five follow-up actions after a postmortem and review them six weeks later. This gives a sense of accomplishment and keeps action items from being dropped. Before that, when listing the factors that contributed to an outage, it is essential to focus on “how an incident was triggered” instead of “who caused it”.
Netflix has a team dedicated to the overall reliability of their service.
Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.
Hank Jacobs– Netflix
- A tech blog post explaining the company’s centralized SRE best practices, by Hank Jacobs (Senior Site Reliability Engineer) of the CORE (Critical Operations and Reliability Engineering) team, which is responsible for the reliability of the entire Netflix service.
- The “Service Ownership Model” and the composition and roles of the CORE team are very interesting.
The highlight of this article for me is that it says “Incident management at Netflix doesn’t follow common management practices like the ITIL model.” and then describes their model.
Another good reference if you’re looking to bootstrap SRE at your organization.
Rich Burroughs — FireHydrant
- I will skip it because I mentioned it in DEVOPS WEEKLY ISSUE #499 above.
Bill Duncan’s back with an easy and very close approximation for the “Tail at Scale” formula. The question it answers is: how many nines do you need on all of your backend microservices for X nines on the frontend?
- It presents a quick and easy approximation to the probabilistic formulas for measuring customer experience described in the author’s two earlier articles, which I’ve previously covered in this blog.
○ The Tail at Scale
○ The Tail at Scale Revisited
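To make the question concrete for myself, here is a minimal sketch of the usual independence-based estimate. It is not taken from the linked article, and it may differ from the author's own approximation. It assumes a frontend request fans out to `fanout` backend calls that fail independently and must all succeed. The function names are my own.

```python
import math


def frontend_availability(backend_availability: float, fanout: int) -> float:
    """Exact value under the independence assumption: all calls must succeed."""
    return backend_availability ** fanout


def approx_frontend_availability(backend_availability: float, fanout: int) -> float:
    """First-order approximation: the failure probabilities simply add up."""
    return 1.0 - fanout * (1.0 - backend_availability)


def backend_nines_needed(frontend_nines: float, fanout: int) -> float:
    """Approximate nines per backend: frontend nines plus log10 of the fan-out."""
    return frontend_nines + math.log10(fanout)


if __name__ == "__main__":
    # Ten backends at 99.9% each give roughly 99% on the frontend.
    print(frontend_availability(0.999, 10))
    print(approx_frontend_availability(0.999, 10))
    # Three nines on the frontend with a fan-out of 100.
    print(backend_nines_needed(3, 100))
```

The rule of thumb from this sketch is that every tenfold increase in fan-out costs about one extra nine on each backend. It only holds while the backend failure probability is small and failures are independent.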
Tons of great links in here with enticing descriptions to make you want to read them. Includes books, tools, hiring, certification, and general SRE goodness.
Emily Arnot — Blameless
- This blog post lists important SRE resources for those who want to “look to get up to speed on SRE fundamentals with the best SRE books and best DevOps books” or hope to “expand its SRE knowledge into new domains”.
- I appreciate this comprehensive post including tools and adoption as topics.
SRE is all about keeping the user experience working, and working with product-focused folks can really help. For more on this, check out my former coworker Jen Wohlner’s awesome SRECon19 talk on SRE & product management.
Samantha Coffman — HelloFresh
- A two-part article that describes HelloTech’s efforts to advocate a product mindset within their platform team.
- As background, it first notes that “Having mastered Product Management in their end products and realised the benefit of improving the customer experience, more companies are shifting their attention to how they can apply the same techniques to the rest of their organisation.”
- Part 1 covers the following three points.
- What is Wrong with Traditional Platform Teams Without Product Representation
- The Benefits of Adopting a Product First Mindset for Platform at HelloTech
- Best Practices from HelloTech for Adopting a Product Platform Mentality
○ Cloudflare had a 50% drop in traffic served by their network subsequent to a BGP issue. Linked is their analysis including snippets of router configurations. Lots of services suffered contemporaneous outages possibly stemming from Cloudflare’s, including Discord, Postmates, Hosted Graphite, and DownDetector.
John Graham-Cumming — Cloudflare
○ Twitter had a major security breach, and as part of their response, they temporarily cut off large parts of their service. Click for their post about what happened.
- Microsoft Outlook
○ Notably, the outage involved the Outlook application that people run on their computer, not the cloud version.
○ Also a control plane incident later that day. Full disclosure: Fastly is my employer.
KubeWeekly #226 July 24th
Editor’s pick of the highlights from the past week.
Following the successful launch of the Cloud Engineer Bootcamp last month, The Linux Foundation and CNCF heard from many sysadmins, developers, engineers, and others who wanted a similarly structured program to help them learn the skills necessary to move into cloud engineering but with more advanced training. These individuals did not need the beginner training courses included Cloud Engineer Bootcamp.
That led us to launch a new Advanced Cloud Engineer Bootcamp. The program includes six training courses and registration for the Certified Kubernetes Administrator (CKA) exam, along with dedicated online support forums and weekday live video chat with instructors. Designed for working professionals, the program can be completed in as little as six months with around 10 hours per week of study time.
Learn more about the program and sign up today!
- Following the Cloud Engineer Bootcamp launched in June, the CNCF now offers the Advanced Cloud Engineer Bootcamp. At the time of writing, the regular price of $999 was discounted to $599 until July 31st.
- KubeCon + CloudNativeCon EU Virtual Session Spotlight
The countdown to KubeCon + CloudNativeCon EU Virtual on August 17–20, 2020 is on! As we approach the event, we curated a few recommended sessions that we don’t want you to miss. Please see the feature for this week and be sure to register today!
Tutorial: KubeEdge Hands on Workshop — Build Your Edge AI App on Real Edge Devices
Presented by: Zefeng Wang, Huawei & Zhang Jie, China Unicom
This workshop is intended to invite participants to get hands on experience building a real edge computing solution with KubeEdge, end-to-end.
Starting from deploying and provisioning an edge node(e.g. Raspberry Pi), followed with device modeling and connectivity setup, then building a video stream machine learning based solution.
Through this exercise, participants will get first hand experience to understand the orchestration engine build on top of Kubernetes, understand the edge computing node setup mechanism, learn the device modeling concept for IoT Edge scenarios. And develop a state-of-art AI based video stream processing flow, all in a 30 minutes session.
- KubeCon + CloudNativeCon EU Virtual highlights the “Tutorial: KubeEdge Hands on Workshop — Build Your Edge AI App on Real Edge Devices” session. Schedule: 8/17 (Monday) 16:55–18:15 CEST (Central European Summer Time).
- The session has the following three goals.
- Understanding the orchestration engine built on top of Kubernetes
- Understanding the edge computing node setup mechanism
- Learning the device modeling concepts for IoT Edge scenarios
ICYMI: CNCF Webinars
You can view all CNCF recorded and upcoming webinars here.
Eduardo Silva, Principal Engineer @Treasure Data, Masoud Koleini, Staff Research Software Engineer @Arm & Wesley Pettit, Software Developer Engineer @AWS
- Introducing Kubernetes logging and performance improvements and new features included in the new major v1.5 release of Fluent Bit.
- It dives into the new features in this major release, which include performance improvements and new connectors for Google Stackdriver, Amazon Cloudwatch, LogDNA, New Relic and PostgreSQL.
Melissa Sussmann, Product Marketing Lead @Puppet & Kenaz Kwa Principal Product Manager @Puppet
- The explanation is based on the following points.
- For a lot of organizations, home-grown glue logic is inconsistent, not repeatable, and expensive to maintain across hundreds of event-based workflows and thousands of combinations.
- They believe that the answer lies in automation workflows. In particular, workflows-as-code that can be triggered by events.
- They want to replace engineers’ home-grown digital duct tape with reusable, event-driven workflows.
Gadi Naor, CTO & Co-Founder @Alcide
- They touch on two CVE disclosures recently made by the community (CVE-2020-8555, CVE-2020-8552) and review a holistic, preventative approach to Kubernetes security and how it can be used to detect and prevent exploits of this kind and others.
- I’ll take a closer look at this later.
Oleg Chunikhin, CTO @Kublr
- It describes implementing a canary release on Kubernetes using Spinnaker, Istio, and Prometheus. It’s a one-hour session with plenty of demos, which start around the 23-minute mark.
Jody Hunt, Director of DevOps Security @CyberArk
- It describes best practices and secret management challenges for securing application access within Kubernetes. The explanation and the demo make it easy to understand.
Lei Zhang (Harry), Staff Engineer @Alibaba
- The language for this webinar is Chinese.
- It introduces the practices and principles for building an application management platform using Kubernetes.
Antoine Toulme, Engineering Manager @Splunk & Dave McAllister, Sr. Technical Evangelist @Splunk
- It introduces reference architectures for integrating technologies such as Kubernetes, OpenTelemetry Collector, and Hyperledger Fabric.
Tutorials, tools, and more that take you on a deep dive into the code.
Andrew Block, Red Hat
- Introducing the “OpenShift Log Forwarding API,” which makes it easier to integrate OpenShift and Splunk.
- It manages the life cycle of Fluentd transfer instances in combination with the Helm chart. Instead of managing complex configurations, both OpenShift administrators and end users can easily deploy the entire solution so they can focus on their business-critical tasks.
Jorge Salamero Sanz, Sysdig
- Step-by-step instructions on best practices for alerting on the Kubernetes platform and orchestration. It contains examples of PromQL alerts.
- If you’re new to Kubernetes and monitoring, the author recommends the earlier article, Monitoring Kubernetes in Production.
Marton Sereg, Banzai Cloud
- Banzai Cloud had written about Istio’s connectivity, monitoring, and security features, but not much about control. To fill that gap, this article describes AuthorizationPolicy, Istio’s access control model.
Hiro Osaki, IT Next
- As the title says, it argues that “CRD (Custom Resource Definition) is just a Kubernetes table”, and explains this carefully using database diagrams, YAML, and CLI screens.
Shell-operator is a tool for running event-driven scripts in a Kubernetes cluster
- The GitHub page for “Shell-operator”, an OSS tool for event-driven scripting on Kubernetes clusters. It provides an integration layer between Kubernetes cluster events and shell scripts, treating the scripts as hooks triggered by events.
Christian Posta, Solo.io
- It explains some common use cases for Istio and some good practices for tasks such as rotating root certificates, intermediates, and, where necessary, the other certificates for your Certificate Authority.
- Explanatory videos for Parts 0 to 4 are embedded. I would like to work through this hands-on along with the videos.
- It briefly describes some lesser-known strengths of kops that may help when provisioning Kubernetes clusters.
Articles, announcements, and more that give you a high-level overview of challenges and features.
Harrison Sweeney and Mark Church, Google Cloud
- It walks through the various factors to consider when publishing an application on GKE.
- It describes how each factor affects the exposure of your application and highlights which networking solution each requirement leads to.
- Assuming you’re familiar with Kubernetes concepts such as deployments, services, and ingress resources, it distinguishes between the different exposure methods, from internal to external to multi-cluster.
Adam Glick and Craig Box, Kubernetes Podcast from Google
- Kubernetes Podcast by Google employees. The current co-hosts are Craig Box and Adam Glick.
- Google Cloud’s David Ashpole (TL of Kubernetes SIG Instrumentation, and the maintainer of cadvisor) is a guest.
- The topics that interested me in the News of the week are:
○ Spring Cloud Data Flow for Kubernetes from VMware; part of the Spring Runtime package
○ Custom Pod Autoscaler (and docs) by Jamie Thompson
○ Ingress support added to AWS App Mesh
○ Threat Alert: Attacker Building Malicious Images Directly on Your Host from Aqua Security
Stewart Reichling and Srini Polavarapu, Google Cloud
- It introduces Traffic Director, a control plane managed by Google Cloud to solve the first barrier to service mesh adoption.
Chanwit Kaewkasi, WeaveWorks
- It shows how to create a GitOps plugin, with a UI that you can provide through the developer portal, using Backstage, Spotify’s open source framework.
Satyajit Das, Red Hat
- To explain how Kubernetes works, it uses an analogy, the story of “using a rental house”.
Idit Levine, Solo.io
- An article introducing “Gloo Federation”, a new management feature for multi-cluster API gateways.
- It has two goals: to simplify and globalize the management of multiple Gloo instances across multiple clusters, and to provide advanced multi-cluster control capabilities, with the following characteristics:
- Global Dashboard
- Federated Configuration
- Multi-Cluster Failover Routing
- Location-Based Routing
- Role-Based Access Control
Mike Melanson, The New Stack
- An article about the history and future of Red Hat’s “Operator Framework” donated to the CNCF.
- Operator Hub is separate from the Operator Framework, and there are no plans to donate it to the CNCF at this time.
Daniel Hochman and Derek Shaller, Lyft
- An announcement of the open sourcing of “Clutch,” an extensible UI and API platform for Lyft’s infrastructure tools.
Kevin Casey, Enterprisers Project
- Introducing the following six workflows and processes that can be automated with Kubernetes.
- App setup/Installation
- Pod and node scaling
- Persistent storage management
- Chaos testing
- Deployment and versioning of Custom Resource Definitions (CRDs)
- Container and Kubernetes security
Emily Omier, Nirmata
- It explains the necessity of introducing appropriate tools to avoid the “unmanageable Day 2 nightmare”, the operation phase after the introduction of containers and Kubernetes.
- Finally, it ties this to the company’s own dashboard, introducing it as easy to understand and easy to control.
William Morgan, Buoyant
- It introduces case studies of the adoption of Linkerd by four companies: Nordstrom, finleap connect, Paybase and Subspace. These case studies highlight three main themes:
- Security is a driving force behind service mesh adoption.
- Istio is the starting point, but Linkerd is the final destination.
- Latency matters, and the service mesh can help.
Gareth Rushgrove, Open Policy Agent maintainer
- Gareth Rushgrove, a maintainer of OPA (Open Policy Agent), announced that Conftest has joined the OPA project.
Ramon Guiu, New Relic
- New Relic’s announcement to make their agents, integrations, and SDKs available under an open source license.
- It said that “Starting today, our agents for C, Go, .NET, Node, Python, and Ruby, as well as the Infrastructure agent, Infrastructure integrations, the Infrastructure Integrations SDK, and Telemetry SDKs are available and open to contributions in New Relic’s GitHub organization”.
- The remaining agents will be available soon — Java in September, PHP in October, and Browser and Mobile in 2021.
Upcoming CNCF webinars
You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.
Member Webinar: One large cluster or lots of small ones? Pros, cons and when to apply each approach
Flavio Castelli, Distinguished Engineer @SUSE
July 24, 2020 10:00 AM Pacific Time
Member Webinar: Kubernetes Policies 101
Eran Leib, Founder, VP Product Management @Apolicy
Spenser Paul, Director of Sales, North America @DoiT International
July 28, 2020 10:00 AM Pacific Time
Member Webinar: GitOps Continuous Delivery with Argo and Codefresh
Dan Garfield, Chief Technology Evangelist @Codefresh
July 29, 2020 1:00 PM Pacific Time
Member Webinar: Cluster API — Yesterday, Today, Tomorrow
Saad Malik, CTO & Co-Founder @Spectro Cloud
Jun Zhou, Chief Architect @Spectro Cloud
July 30, 2020 10:00 AM Pacific Time
Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
July 31, 2020 10:00 AM Pacific Time
Member Webinar: Comparing eBPF and Istio/Envoy for Monitoring Microservice Interactions
Roko Kruze, Solutions Engineer @Flowmill
Mike Cohen, Co-Founder and COO @Flowmill
Aug 4, 2020 10:00 AM Pacific Time
Member Webinar: Debugging your debugging tools; What to do when your service mesh goes down in production?
Neeraj Poddar, Co-founder and Chief Architect @Aspen Mesh
Aug 5, 2020 7:00 AM Pacific Time
Member Webinar: Making Data Work for Developers with Kubernetes & Cassandra
Chris Splinter, Sr. Product Manager — Developer Solutions @DataStax
Patrick McFadin, VP of Developer Relations @DataStax
Aug 5, 2020 1:00 PM Pacific Time
Member Webinar: Maximizing M3 — Pushing performance boundaries at scale in a cloud-native distributed metrics engine
Ryan Allen, Senior Software Engineer @Chronosphere
Aug 6, 2020 10:00 AM Pacific Time
Member Webinar: Hardware for Kubernetes, Peeling Back the Layers
Erik Reidel, SVP Compute & Storage Solutions @ITRenew
Aug 11, 2020 10:00 AM Pacific Time
Member Webinar: The Open-Source Observability Playbook
Hen Peretz, Head of Solutions Engineering @Epsagon
Aug 12, 2020 7:00 AM Pacific Time
Member Webinar: Migrating Real-Time Communication Applications to Kubernetes at Scale: Learnings from 8×8’s Experience
Michael Laws, Sr. Site Reliability Engineer/DevOps at 8×8
Pankaj Gupta, Sr. Director at Citrix
Aug 12, 2020 1:00 PM Pacific Time
Project Webinar: Kubernetes 1.19
Kubernetes release team
Aug 28, 2020 10:00 AM Pacific Time
Member Webinar: Getting started with container runtime security using Falco
Loris Degioanni, CTO and Founder @Sysdig
Sept 2, 2020 1:00 PM Pacific Time
What did you think of these articles? Did any of them interest you?
To be honest, there is some content I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself, too.