SRE / DevOps / Kubernetes Weekly Collection #28 (Week 33)

  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments as an aide-memoire along with useful links.
  • Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #502 August 9th, 2020
SRE Weekly Issue #231 August 10th, 2020
KubeWeekly #229 August 14th

DEVOPS WEEKLY ISSUE #502 August 9th, 2020

News

A list of anti-patterns for transformation projects in large organisations. Good advice on choosing technology, technical management, roadmaps and more.

  • The title is “Ten ‘antipatterns’ that are derailing technology transformations”.
  • The following “10 anti-patterns”, which the authors observed across over 50 major organizations, are explained as obstacles to digital transformation.
  1. Force-fitting technology solutions: Are you choosing technology out of context?
  2. Adopting cutting-edge tech that’s not fully mature: Are you adopting new technology that seems promising but doesn’t have a proven track record?
  3. Building out your own cloud infrastructure without sufficient capabilities: Have you let security and regulation block your adoption of public cloud?
  4. Initiating big-system-replacement programs: Are you focusing on system replacement rather than improving existing systems in a way that is faster and more cost-effective?
  5. Focusing on architecture and tooling improvements without enhancing process and delivery discipline: Did you re-architect and implement new tooling but forget to adapt the delivery processes?
  6. Focusing on outputs rather than business outcomes: Are your technologists focused on output instead of business/technology outcome?
  7. Managing IT purely for cost: Are you sacrificing significant value by overindexing on price and cost?
  8. Investing in developing new platforms without involving the business: Is your primary focus platform development instead of platform adoption by the business?
  9. Outsourcing your core value streams: Are vendors doing the work that creates the most value for your business?
  10. Building up an army of managers rather than developing an engineering culture: Do you value your managers more than your engineers?

A detailed incident review for the recent Quay container registry outage. Reading these sorts of reviews can help everyone learn from incidents, this one related to a storm of database connections.

  • The title is “About the Quay.io Outage: Post Mortem”.
  • Quay.io’s postmortem for two outages (5/19 early morning, Eastern Daylight Time (EDT), and 5/28 pre-day (EDT)).

A great post on dashboard design, with lots of reasoning, hints, tips and examples.

  • The title is “Building dashboards for operational visibility”.
  • An article from the AWS “Amazon Builders’ Library” (“SOFTWARE DELIVERY AND OPERATIONS | LEVEL 300”) that explains dashboards at AWS under the following headings.
    ○ Dashboarding at Amazon
    ○ Types of dashboards
    ○ High-level dashboards
    ○ Low-level dashboards
    ○ Dashboard design
    ○ Dashboard maintenance
    ○ Conclusion

A look at several tools that are useful for validating and testing Kubernetes configuration files. Useful comparison table and examples of each of the different tools.

  • The title is “Validating Kubernetes YAML for best practice and policies”.
  • TL;DR:
  • This article compares six static tools for validating and scoring Kubernetes YAML files to ensure best practices and compliance.
  1. Kubeval
  2. Kube-score
  3. Config-lint
  4. Copper
  5. Conftest
  6. Polaris
  • The Kubernetes YAML file static checking ecosystem can be grouped into the following three categories:
  1. API validators — Tools in this category validate a given YAML manifest against the Kubernetes API server.
  2. Built-in checkers — Tools in this category bundle opinionated checks for security, best practices, etc.
  3. Custom validators — Tools in this category allow writing custom checks in several languages such as Rego and Javascript.
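
As a rough illustration of the second and third categories, here is a minimal Python sketch that runs two opinionated checks over a manifest already parsed into a dict. The function name `check_deployment`, the two rules, and the sample manifest are assumptions made up for this example; this is not the API of Kubeval, Kube-score, Config-lint, Copper, Conftest, or Polaris.

```python
# Hypothetical checker-style rules over an already-parsed manifest.
# Nothing here is the API of any tool listed above.

def check_deployment(manifest):
    """Return a list of human-readable findings for a Deployment manifest."""
    findings = []
    containers = (
        manifest.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for c in containers:
        name = c.get("name", "<unnamed>")
        image = c.get("image", "")
        # Rule 1: the image should carry an explicit tag other than "latest".
        # Only the part after the last "/" is inspected, so a registry port
        # such as "registry.example.com:5000" is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            findings.append(f"{name}: image '{image}' is not pinned to a tag")
        # Rule 2: a memory limit should be set.
        limits = c.get("resources", {}).get("limits", {})
        if "memory" not in limits:
            findings.append(f"{name}: no memory limit set")
    return findings


# Sample manifest (made up for this example).
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "nginx:latest"},
        {"name": "api", "image": "registry.example.com:5000/api:1.4.2",
         "resources": {"limits": {"memory": "256Mi"}}},
    ]}}},
}

for finding in check_deployment(deployment):
    print(finding)
```

The real tools in the article read YAML files directly and ship many more rules; the point here is only that a "check" is a small function from a manifest to a list of findings.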

A good discussion of all things Service Mesh and the SMI specification.

  • The title is “Service Mesh With Michelle Noorali and Delyan Raychev”.
  • Service mesh, SMI (Service Mesh Interface), and Open Service Mesh are explained by members who contribute to OSS/CNCF and work at Microsoft.
  • The relationship with Kubernetes is also explained, and it would be good preparation for KubeCon + CloudNativeCon EU.

A post on using Conftest and Regula to help write secure Terraform code and test as part of a CI process.

  • The title is “Securing Your Terraform Pipelines with Conftest, Regula, and OPA”.
  • Describes tools (Terraform/OPA/Conftest/Regula) that let security and operations teams express their requirements as “policy-as-code”.

Tools

Open Service Mesh is a new lightweight, extensible, service mesh for dynamic microservice environments. It provides out-of-the-box observability features and uses SMI for configuration.

Sysbox is a new container runtime that makes it easier to run low-level software, like Systemd, Docker, and Kubernetes, in containers. You can run it with Docker too due to the pluggable runtimes feature.

  • GitHub page of the new OSS container runtime “Sysbox”.

We’re starting to see application frameworks and developer tools provide high-level abstractions for running on platforms like Kubernetes. Tye is an interesting .NET tool that eases running .NET applications on cloud native platforms.

  • GitHub page of “Tye”, a developer tool that makes it easy to develop, test, and deploy microservices and distributed apps.

Turandot allows for using TOSCA with Kubernetes. TOSCA provides a high-level service description aimed at portability and interoperability between underlying infrastructure.

  • Web page of “Turandot”, a tool for orchestrating and configuring Kubernetes workloads using TOSCA (Topology and Orchestration Specification for Cloud Applications).
    ○ It answers the following questions in the FAQ.
    ○ Is this a lifecycle manager (LCM) for Kubernetes workloads?
    ○ Why doesn’t Turandot include a workflow engine?
    ○ Why is there a built-in inventory? Shouldn’t the inventory be managed externally?
    ○ Why use TOSCA and CSARs instead of packaged Helm charts?
    ○ Why is it called “Turandot”?

Copper is a configuration file validator for Kubernetes. It supports writing bespoke tests using a built-in JavaScript DSL.

  • GitHub page of “Copper”, a simple OSS tool that validates configuration files such as those for Kubernetes.

SRE Weekly Issue #231 August 10th, 2020

Articles

Improving Postmortems from Chores to Masterclass with Paul Osman

The lead SRE at Under Armour(!) has a ton of interesting things to share about how they do SRE. I love their approach to incident retrospectives that starts with 1:1 interviews with those involved.

Paul Osman — Under Armour (Blameless Summit)

  • Transcript of a presentation at the 2019 Blameless Summit on the theme of “How to take postmortems or incident retrospectives to a new level”. The video is embedded.
  • They are changing the way they look at postmortems. It was also a meaningful prompt to reconsider the idea of a root cause.
  • It’s coming to the conclusion that our goal in doing these postmortems is not actually to understand what happened. It’s not to understand a clear causal chain of events that led to an incident. It’s actually to understand the context that people were operating within when responding to an incident that either helped or hindered their ability to make decisions.

About the Quay.io Outage: Post Mortem

A routine infrastructure maintenance had unintended consequences, saturating MySQL with excessive connections.

Daniel Messer — Red Hat

  • I will skip it, since it is covered in DEVOPS WEEKLY ISSUE #502 above.

The 2020 Midland County Dam Failure

This report details the complex factors that contributed to the failure of a dam in Michigan in May of this year.

Jason Hayes — Mackinac Center for Public Policy

  • Report on the Mid-Michigan floods (2020/05/19), which damaged more than 2,500 houses and buildings.

Heroku Incident #2090 Follow-up

This incident involved a DNS failure in Heroku’s infrastructure provider (presumably AWS).

Heroku

  • Follow-up information for the outage that occurred at Heroku from July 28, 2020 08:22 UTC to 10:28 UTC.
  • There were user impacts such as intermittent errors from APIs and other tools, and possible problems connecting to US data services.

Theory vs. Practice: Learnings from a recent Hadoop incident

This incident at LinkedIn impacted multiple internal customers with varying requirements for durability and latency, making recovery complex.

Sandhya Ramu and Vasanth Rajamani — LinkedIn

  • An article by LinkedIn comparing DR strategy in theory with how it played out in practice, in a cloud environment, during a recent Hadoop incident.

GitHub Availability Report: July 2020

This report includes a description of an incident involving Kubernetes pods and an impaired DNS service.

Keith Ballinger — GitHub

  • The July edition, published on Wednesday, August 5th, of the Availability Report that GitHub has started publishing on the first Wednesday of every month, which I mentioned earlier in this blog.
  • It covers the incident that began at 08:18 UTC on 7/13 and lasted 4 hours and 25 minutes.
  • The causes were as follows.
  1. A single container within the Pod was exceeding its defined memory limits and being terminated.
  2. The container in the Pod was configured with an ImagePullPolicy of Always, which instructed Kubernetes to fetch a new container image every time.
  3. Due to a routine DNS maintenance operation that had been completed earlier, our clusters were unable to successfully reach our registry resulting in Pods failing to start.
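
To see why these three causes compound, here is a deliberately simplified Python model of whether a terminated container can be started again on a node. The function names `needs_registry` and `can_restart` are hypothetical, and the model ignores retries, back-off, and everything else the kubelet really does. It only encodes the documented meaning of the three `imagePullPolicy` values.

```python
# Simplified, hypothetical model of the failure chain described above.
# It is not GitHub's tooling and not real kubelet logic.

def needs_registry(image_pull_policy, image_cached):
    """Whether starting the container requires contacting the registry."""
    if image_pull_policy == "Always":
        # The registry is consulted on every start, even with a cached image.
        return True
    if image_pull_policy == "IfNotPresent":
        return not image_cached
    if image_pull_policy == "Never":
        return False
    raise ValueError(f"unknown imagePullPolicy: {image_pull_policy}")


def can_restart(image_pull_policy, image_cached, registry_reachable):
    """Whether a terminated container can start again on this node."""
    if image_pull_policy == "Never":
        # Only a locally present image can be used.
        return image_cached
    if needs_registry(image_pull_policy, image_cached):
        return registry_reachable
    return True


# The incident: container killed for exceeding its memory limit, policy
# "Always", registry unreachable after the DNS maintenance.
print(can_restart("Always", image_cached=True, registry_reachable=False))
# Same situation with "IfNotPresent" and a cached image.
print(can_restart("IfNotPresent", image_cached=True, registry_reachable=False))
```

In this model the first call returns `False` and the second returns `True`: with `Always`, every restart depends on the registry, so a registry that cannot be reached turns an ordinary container restart into a Pod that fails to start.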

Incident Report: Investigating an Incident That’s Already Resolved

In this report, Honeycomb describes how they investigated an incident from the prior week that their monitoring had missed.

Martin Holman — Honeycomb

  • The story of how a user asked about a failure one week after it occurred, and how Honeycomb then investigated it.
  • Personally, I was glad to see a “where we got lucky” postmortem example, following a comment from SRE guru Liz Fong-Jones.

Outages

KubeWeekly #229 August 14th

The Headlines

Back by Popular Demand — Free Keynote and Expo Hall Pass for KubeCon + CloudNativeCon EU 2020 Virtual!

KubeCon + CloudNativeCon EU 2020 Virtual is happening next week! Are you new to cloud native or CNCF? First time attending? We have you covered! Our free Keynote + Expo Hall Only Pass brings the key pieces of the conference together, including access to:

All Keynote Sessions available with closed captioning in 18 languages
Virtual Expo Hall where you can visit our sponsors to try the latest demos, talk to experts, and score some swag.
Sponsor Demo Theater showcasing community leaders as they demonstrate how they are adopting Kubernetes and other open source technologies
Project Pavilion where you can engage with Project Maintainers + Leads
Looking for the immersive experience including keynote and breakout sessions, sponsor showcase, and conference activities (co-located events not included)? The Full Access Pass is for you. Register today!

  • Pass, available-session, and registration information for KubeCon + CloudNativeCon EU 2020 Virtual, which is being held at the beginning of this week.

Special Offer — Save on LF Training with KubeCon + CloudNativeCon EU 2020 Virtual

When you register for the Full Event Pass AND attend KubeCon + CloudNativeCon Europe Virtual, you are eligible to receive a training discount. The offer includes:

50% off CKA exam OR CKAD exam

30% off other courses or exams from LF Training!*

Be sure to reserve your spot today and save on an upcoming training session. It’s a win-win!

Details on how to access and download the coupon will be provided to registered attendees in the pre-event attendee email (coming the week of August 10). The coupon is only available for attendees of KubeCon + CloudNativeCon Europe Virtual and will expire on 23:59 UTC on August 20, 2020. Cannot be combined with any other discount or promotion. Only valid for net new training purchases, cannot be applied to previously purchased exams or bundles.

50% off CKA exam OR CKAD exam

Visit the Linux Foundation training websites below to purchase your CKA or CKAD exam for $150 (typically $300). You may choose between the exam-only option listed at the top or a course+exam bundle listed in the Combine & Save section.

30% off other courses or exams from LF Training!* Visit the Linux Foundation training website and view the course catalog. This voucher may be applied towards any course or exam available for purchase in the catalog. Just add your coupon code during checkout to see your total discount.

  • Information on the training and exam discounts available if you register for the Full Event Pass and attend KubeCon + CloudNativeCon EU 2020 Virtual.

KubeCon + CloudNativeCon EU Virtual Session Spotlight

The countdown to KubeCon + CloudNativeCon EU Virtual on August 17–20, 2020 is on! As we approach the event, we curate a few recommended sessions that we don’t want you to miss. Please see the feature for this week and be sure to register today!

Don’t miss our co-located events — August 17 (additional registration required)!

Jump start your education or get that topic deep dive by attending a co-located event! There are many options, so you are sure to find that extra something you’ve been looking for.

AWS Container Day 2020 hosted by AWS

Building a DevOps Pipeline with Kubernetes and Apache Cassandra™ hosted by DataStax

Cloud Native ROOST Hack-a-thon presented by Zettabytes

Cloud Native Security Day hosted by CNCF

KubeAcademy: Introduction to Containers and Kubernetes hosted by VMware

NSMCon hosted by the Network Service Mesh Community

Serverless Practitioners Summit hosted by CNCF

ServiceMeshCon hosted by CNCF

It’s easy to add a co-located event to an existing registration! Log into your existing registration, enter your confirmation number, and modify by going back through the registration pages to add.

Register now!

  • A spotlight on the co-located events at KubeCon + CloudNativeCon EU Virtual.
  • “AWS Container Day 2020 hosted by AWS” runs for APAC on 8/19, 10:00~18:00 (JST), so check it out. Click here to apply.

ICYMI: CNCF Webinars

You can view all CNCF recorded and upcoming webinars here.

CNCF Ambassador Webinar: GitOps, DSL and App Model — Getting Started Building Developer Centric Kubernetes

Lei Zhang, Staff Engineer @Alibaba

  • Lessons learned from providing services to end users across the broad cloud-native community are explained through the following five points.
  1. Why are end users not satisfied in Kubernetes?
  2. Is PaaS the right answer?
  3. What is “developer-centric” Kubernetes?
  4. How can we build it? Anything is missing in the picture?
  5. Is GitOps and DSL part of the story? What about OAM?

CNCF Member Webinar: Hardware for Kubernetes, Peeling Back the Layers

Erik Riedel, SVP Compute & Storage Solutions, ITRenew

  • It shows how to harness the power of a broad and deep hardware ecosystem to enable cloud native applications.
  • It peels back the usually hidden infrastructure layers and demonstrates the latest innovations in hyperscale design.

CNCF Member Webinar: Migrating Real-Time Communication Applications to Kubernetes at Scale: Learnings from 8x8’s Experience

Lance Johnson, Director of Engineering, Cloud R&D @8x8; Michael Laws, Sr. Site Reliability Engineer @8x8; Pankaj Gupta, Senior Director @Citrix

  • Shares the first-hand experience of the DevOps team at 8x8, a company that provides a global cloud communication platform to customers.
  • It describes important considerations and lessons learned when successfully migrating VoIP from their on-premises environment to Kubernetes on AWS.

CNCF Member Webinar: The Open-Source Observability Playbook

Hen Peretz, Head of Solutions Engineering @Epsagon

  • The goal is not only to understand the tools themselves, but also to learn best practices for using them.

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

5 reasons to run Kubernetes on your Raspberry Pi homelab

Seth Kenlon, Red Hat

  1. Network-attached storage for your home
  2. Education and upskilling
  3. Web server
  4. Containers
  5. Web development

7 Best Practices for Writing Kubernetes Operators: An SRE Perspective

Manuel Dewald, Red Hat

  • The article starts by introducing Red Hat OpenShift Dedicated (OSD), which runs on AWS and GCP, and how the OSD operations team changed from a conventional operations team into an SRE team. What the SRE team learned from creating and maintaining Operators is then presented as the following seven best practices.
  1. Use the Operator SDK
  2. Avoid Overstuffed Functions
  3. Idempotent Subroutines
  4. One Custom Resource Modification at a Time
  5. Wrap External Dependencies
  6. Test Your Code
  7. Reconciling Return Values
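
To give a feel for what an idempotent subroutine (item 3) means in a reconcile loop, here is a small Python sketch. Operators are normally written in Go with the Operator SDK, so this is only an illustration of the idea. The helper `ensure_labels` and the sample resource are hypothetical and are not taken from the article.

```python
# Hypothetical illustration of an idempotent reconcile step.
# Real Operators would do this against the Kubernetes API, usually in Go.

def ensure_labels(resource, desired):
    """Make `resource` carry the `desired` labels.

    Returns True if the resource was modified, False if it already matched.
    Running it any number of times leads to the same final state.
    """
    labels = resource.setdefault("metadata", {}).setdefault("labels", {})
    changed = False
    for key, value in desired.items():
        if labels.get(key) != value:
            labels[key] = value
            changed = True
    return changed


resource = {"metadata": {"name": "example", "labels": {"team": "sre"}}}
desired = {"team": "sre", "managed-by": "example-operator"}

first = ensure_labels(resource, desired)   # applies the missing label
second = ensure_labels(resource, desired)  # nothing left to do
print(first, second, resource["metadata"]["labels"])
```

Because the function compares current state with desired state instead of blindly applying a change, a reconcile loop can call it on every pass, and the return value tells the caller whether an update actually needs to be written back.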

DevNation Tech Talk: 10 awesome Kubernetes tools every user should know

Alex Soto and Burr Sutter, Red Hat

  • From the Red Hat OpenShift team, a Twitch video introduces the following 10 tools that every Kubernetes user should know:
  1. k9s
  2. Kubectl Aliases
  3. Stern
  4. Dive
  5. cube
  6. Kube-PS1
  7. Kubectx
  8. KubeSpy
  9. Kube-shell
  10. Kubectl

How to monitor etcd

David Lorite Solanas, Sysdig

  • It covers the importance and mechanics of etcd, common mistakes, and how to monitor it.

The “podman play kube” command now supports deployments

Matthew Heon, Red Hat

  • It introduces support for the Kubernetes Deployment resource in the podman play kube command as of Podman v2.0, along with future plans.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Independent Open Source, with Alex

Craig Box and Adam Glick, Kubernetes Podcast from Google

Service Mesh With Michelle Noorali and Delyan Raychev

Arrested DevOps Podcast, Bridget Kromhout, Microsoft

  • I will skip it, since it is covered in DEVOPS WEEKLY ISSUE #502 above.

Introducing Tekton Hub

CD Foundation

  • Red Hat, collaborating with the Tekton community, announced a preview release of Tekton Hub, which facilitates search and discovery of Tekton Tasks and Pipelines.

Envoy 1.15 introduces a new Postgres extension with monitoring support

Fabrízio Mello and Álvaro Hernández, OnGres

  • CNCF blog post featuring the new Postgres extension with monitoring support in Envoy v1.15.

Protecting Kubernetes applications data using Kanister

Vivek Singh, InfraCloud

  • An article that explains how to protect the data of applications running on Kubernetes using the OSS tool “Kanister”.

21 CNCF Interns Graduate from the Q2 2020 Linux Foundation CommunityBridge Program

CNCF staff

  • Twenty-one CNCF interns completed the Q2 2020 round of the Linux Foundation’s “CommunityBridge” program.
  • They took part in 14 CNCF graduated, incubating, and sandbox projects. The post introduces each participating project and mentor, with comments and photographs.

Upcoming CNCF webinars

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at that time.

Ambassador Webinar: Navigating the service mesh ecosystem

Aug 14, 2020 10:00 AM Pacific Time

Member Webinar: Modern Software Development Pipeline: A Security Reference Architecture

Aug 25, 2020 10:00 AM Pacific Time
REGISTER NOW »

Member Webinar: Local Development in The Age of Kubernetes

Aug 26, 2020 7:00 AM Pacific Time
REGISTER NOW »

Member Webinar: MLOps automation with Git Based CI/CD for ML

Aug 26, 2020 1:00 PM Pacific Time
REGISTER NOW »

Member Webinar: How to migrate databases into Kubernetes?

Aug 27, 2020 10:00 AM Pacific Time
REGISTER NOW »

Project Webinar: Kubernetes 1.19

Aug 28, 2020 10:00 AM Pacific Time
REGISTER NOW »

Member Webinar: Let’s Untangle The Service Mesh

Dominik Tornow, Principal Engineer @Cisco

Sept 1, 2020 10:00 AM Pacific Time
REGISTER NOW »

Member Webinar: Getting started with container runtime security using Falco

Sept 2, 2020 1:00 PM Pacific Time
REGISTER NOW »

Member Webinar: Running the next generation of cloud-native applications using Open Application Model (OAM)

Sept 3, 2020 10:00 AM Pacific Time
REGISTER NOW »

Member Webinar: Arm Developer Experience Spanning Cloud, 5G and IoT

Sept 8, 2020 10:00 AM Pacific Time
REGISTER NOW »

Member Webinar: Building a Cloud-Native Technology Stack that Supports Full Cycle Development

Sept 9, 2020 7:00 AM Pacific Time
REGISTER NOW »

Member Webinar: Achieving Least Privilege Access in Kubernetes

Sept 11, 2020 10:00 AM Pacific Time
REGISTER NOW »

Member Webinar: Effective Kubernetes Onboarding

Sept 16, 2020 1:00 PM Pacific Time
REGISTER NOW »

How about those articles? Are you interested in any of them?

Actually, there is some content I have not been able to digest yet, so I’ll use this aide-memoire and these links to catch up myself too.

Bye now!!

Written by Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece(1986–1992). #Network, #Kubernetes, #GCP, #AWS SAP, #National Tour Guide for English
