SRE / DevOps / Kubernetes Weekly Collection #31 (Week 36)

  • In this blog post series, I collect the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
  • Actually, I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #505 August 30th, 2020
SRE Weekly Issue #233 August 30th, 2020
KubeWeekly #231 September 4th, 2020

DEVOPS WEEKLY ISSUE #505 August 30th, 2020

A good post on the problems with the word automation when getting buy-in across an organisation for change, and why perception is important.

  • The title is “Why You Should Avoid Calling it Automation”.
  • It explains why you shouldn’t call it automation, or at least the pitfalls of describing what you are doing as “automation.”
  • The following points explain the good, bad, and ugly points of automation.
    ○ The good — Executives understand the word and will likely buy into initiatives promising automation because they equate it with cost reduction and quality improvement.
    ○ The bad — Automation sets a very high bar.
    ○ The ugly — Someone or some group in your organization is going to equate your efforts to develop or improve automation with their loss of job, function, or importance.

A look at the hidden world of legacy IT systems. Explores some notable incidents and the need for us to learn more about building systems that remain operable in the long term.

  • The title is “Inside the Hidden World of Legacy IT Systems”.
  • An article that looks at what happened at the state and federal levels in the United States, arguing that “The pandemic has acted as a powerful outgoing tide that has exposed government’s dependence on aging legacy IT systems.”
  • I think the main message is contained in this sentence:
    ○ The best way to deal with legacy IT is to never let IT become legacy.
  • I would like to go through the details of “Some Major Legacy System Debacles of The Last Decade” one by one.

A post on the risks of adding a catch-all platform team as you grow your organisation and why evolving towards platform components and reuse can be more scalable.

  • The title is “A case against “Platform Teams””.
  • While many tech companies are considering an internal platform team, this article describes some versions the author encountered that didn’t work well. Tl;dr is below.
    ○ It is better to operate multiple platform teams specializing in their own techno-business domains than to operate a single platform team.
    ○ Platform Thinking is not about reuse, it is about facilitating evolution, and fixating on reuse destroys that opportunity.
    ○ Instil platform thinking in all teams and allow self-reliant domains and platforms to emerge organically instead of forcing the issue up front.

How different organisations apply SRE is either pragmatism or heresy, depending on your point of view. This post explores the tension between the Google-envisaged application and that of other organisations.

  • The title is “You don’t need SRE. What you need is SRE.”
  • I’ve covered it on this blog before, so I will skip it.

A look at 5 large recent outages from different services, looking at what they handled well during and after the incident. Lots of good observations.

  • The title is “5 Biggest Outages of Q2 2020.”
  • StatusGator’s list of the top five outages for Q2 2020. StatusGator monitors the status pages of over 800 cloud-based services and delivers instant notifications to users.
  • The authors comment: “We hope our findings will motivate DevOps engineers to see how others deal with service outages so they can improve their own reliability.”
  • The top five are below.
  1. World-wide Slack outage, May 12, 2020
  2. Zoom shutdown, May 17, 2020
  3. GitHub was inaccessible… Again. June 29, 2020
  4. IBM Cloud went down along with the status page, June 10, 2020
  5. T-Mobile Flushes Its Network Down the Drain, June 15, 2020

A good writeup of several of the sessions at the recent KubeCon event, including lots of links and twitter threads from the keynotes and several other talks.

  • The title is “KubeCon Europe 2020 Wrapup”.
  • An article summarizing KubeCon Europe 2020, citing mainly Twitter comments from the author, Rich Burroughs.

The Tekton project (which provides a pipeline resource for Kubernetes) has been maturing and is now launching a preview of Tekton Hub to make it easier to find tasks and pipelines.

  • The title is “Introducing Tekton Hub”.
  • It introduces the preview release of Tekton Hub from the CD Foundation.

A good reminder about the limits of containers as a security boundary, looking at escaping from a container and how to secure the underlying Docker socket.

  • The title is “A Tale of Escaping a Hardened Docker container”.
  • At a customer’s request, Red Timmy Security tested the robustness of a new Docker infrastructure architecture by trying to bypass it. The customer had previously had its robustness checked by other security companies and then built this new architecture.

Open Policy Agent has mainly been associated with Kubernetes use cases, but it’s increasingly being used in other places too. This post explores using it to test Terraform plans.

  • The title is “Policy As Code using Conftest with Terraform”.
  • An article explaining how to test the Terraform Plan using OPA.
  • The next article in the series will cover process automation using GitHub custom hooks and Atlantis.
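
To make the idea concrete for myself, here is a minimal Python sketch of the kind of check such a policy performs. In the article the policies are written in Rego and run with Conftest; the Python below only mimics the logic and is not the article's code. It assumes the JSON shape produced by `terraform show -json`, where each entry of `resource_changes` carries `address`, `type`, `change.actions` and `change.after`. The rule itself (every `aws_instance` needs a non-empty `Owner` tag), the `deny` function name and the sample plan are my own hypothetical examples.

```python
import json

# Hypothetical policy: every aws_instance that is created or updated
# must carry a non-empty "Owner" tag.
REQUIRED_TAG = "Owner"


def deny(plan):
    """Return one message per resource change that violates the policy."""
    messages = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_instance":
            continue
        change = rc.get("change", {})
        # Skip resources that are only deleted or left untouched.
        if not {"create", "update"} & set(change.get("actions", [])):
            continue
        # "after" and "tags" can both be null in plan JSON, hence the "or {}".
        tags = (change.get("after") or {}).get("tags") or {}
        if not tags.get(REQUIRED_TAG):
            messages.append(f"{rc['address']}: missing required tag '{REQUIRED_TAG}'")
    return messages


# Hand-written sample in the shape of `terraform show -json` output (not real plan data).
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.web", "type": "aws_instance",
     "change": {"actions": ["create"], "after": {"tags": {"Owner": "sre"}}}},
    {"address": "aws_instance.batch", "type": "aws_instance",
     "change": {"actions": ["create"], "after": {"tags": null}}},
    {"address": "aws_instance.old", "type": "aws_instance",
     "change": {"actions": ["delete"], "after": null}}
  ]
}
""")

if __name__ == "__main__":
    for message in deny(SAMPLE_PLAN):
        print(message)
```

With the sample above, only `aws_instance.batch` is reported: the first instance has the tag and the third is only being deleted. A real setup would keep the rule in Rego so that Conftest can evaluate it in CI.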

Shameless plug. My employer, Snyk, is holding an online event about all things application security and devops on October 21st/22nd. Free to register and the CFP is open now.

  • A link to “SnykCon”, sponsored by Snyk, the employer of Gareth Rushgrove, the editor of this newsletter.
  • A free multi-track event designed for development, security and operations teams to accelerate the development of secure software.
  • 100% of the proceeds go to the charity partner Bill & Melinda Gates Foundation. The catchphrase is below.
    ○ Join the world’s strongest community of DevSecOps practitioners and leaders

Werf is a tool designed to simplify and speed up the delivery of applications from Git. It will build Docker images and deploy to Kubernetes, and is designed to integrate with existing CI systems.

  • The GitHub page of “werf”, an OSS CLI tool that simplifies and accelerates application delivery. It is written in Go.

KubeCarrier is an open source project for managing applications and services across multiple Kubernetes Clusters. The accompanying blog post explains the use case nicely and provides an introduction.

  • An introduction page for “KubeCarrier” by Kubermatic (formerly Loodse).
  • An OSS tool that manages applications and services across multiple Kubernetes clusters. It provides a framework for centralizing the management of services and provides these services to external users through a self-service hub.

Click here for the GitHub page.

SRE Weekly Issue #233 August 30th, 2020

Keeping Google Meet ahead of usage demand during COVID-19

In this post, I’ll share how we ensured that Meet’s available service capacity was ahead of its 30x COVID-19 usage growth, and how we made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.

Samantha Schaevitz — Google

  • GCP’s article on the rapid increase in Google Meet usage associated with COVID-19.
  • It shares how the team ensured that Meet’s available service capacity stayed ahead of its 30x COVID-19 usage growth, and how they made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.

Battleshorts, exaptations, and the limits of STAMP

I love the concept of “battleshorts” just as much as I’ve been enjoying this series of articles analyzing STAMP.

Lorin Hochstein

  • This author’s articles are featured almost every week, and recently they have all been about STAMP. This time, he talks about two concepts (battleshorts and exaptation) that made him think about the limits of STAMP.
  • Battleshort is a term with an official definition in NATO documentation:
    ○ The capability to bypass certain safety features in a system to ensure completion of the mission without interruption due to the safety feature

Incident Review: Meta-Review, August 2020

Honeycomb had 5 incidents in just over a week, prompting not only their normal incident investigation process, but a meta-analysis of all five together.

Emily Nakashima — Honeycomb

  • A review of Honeycomb’s incidents. From July 28th to August 6th, the company had a total of five incidents in its production and dogfooding environments, three of which affected users.

Chromium’s impact on root DNS traffic

Why is Chromium responsible for half of the DNS queries to the root nameservers? And why do they all return NXDOMAIN?

Matthew Thomas — APNIC

  • It describes the impact of Chromium, an open source project that is the foundation of the Google Chrome browser, on DNS traffic.
  • As a Chrome user, I learned about both the benefits and the impact of the Omnibox, which made me think again.
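
A rough Python sketch of the probing idea as I understand it from the article: the browser looks up a few random single-label hostnames of 7 to 15 characters, and if two of them resolve to the same address, it concludes that the network rewrites NXDOMAIN answers. Because such names have no valid TLD, the queries end up at the root servers and normally return NXDOMAIN. This is not Chromium's code; the function names and the injectable `resolve` parameter are my own assumptions for illustration.

```python
import random
import socket
import string


def random_probe_names(count=3, rng=random):
    """Generate single-label hostnames of 7 to 15 lowercase letters."""
    return [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(7, 15)))
        for _ in range(count)
    ]


def system_resolve(name):
    """Resolve a name with the OS resolver; return None when it does not exist."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def looks_intercepted(names, resolve=system_resolve):
    """Guess that NXDOMAIN is being rewritten if two random names share an address."""
    answers = [resolve(name) for name in names]
    resolved = [answer for answer in answers if answer is not None]
    return len(resolved) != len(set(resolved))


if __name__ == "__main__":
    probes = random_probe_names()
    print(probes, looks_intercepted(probes))
```

Passing a fake `resolve` function makes it easy to try the logic without sending any real DNS queries.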

That Moment

“That Moment” when your fire suppression system triggers and the fire department shows up. This is part war story and part description of incident response practices.

Ariel Pisetzky — Taboola

  • The story of how the author responded one Friday when firefighters rushed to his data center.
  • There was no fire in the data center; miswiring in the fire suppression system triggered a fire alarm, which cut the power and dumped a suppression agent onto the “offending” electrical circuit. As a result, firefighters were called in and the data center suffered a power outage.

Google Cloud Issue Summary Multiple Products — 2020–08–19

An overload in an internal blob storage system impacted many dependent services.


  • GCP issue summary for 8/19.
  • Many products in GCP and G Suite were affected. The disruption lasted 6 hours and 35 minutes in total.

Scaling services with Shard Manager

Sharding as a service, now there’s an interesting idea.

Gerald Guo, Thawan Kooburat — Facebook

  • A post from the Facebook Engineering blog. It tells the story of building Shard Manager to address the following two problems.
    ○ It’s no trivial task to scale the wide range of back-end services needed for Facebook’s products.
    ○ Many of our teams were building their own custom sharding solutions with overlapping functionalities.
  • The post claims: “The concept of using sharding to scale services is not new. However, to the best of our knowledge, we are the only generic sharding platform in the industry that achieves wide adoption at our scale.”
    ○ Shard Manager manages tens of millions of shards hosted on hundreds of thousands of servers across hundreds of applications in production.
    ○ The sharding is explained in detail. I will read it again later.
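
As a note to myself, here is a toy Python sketch of the general idea of sharding: keys are hashed to a fixed number of shards, and a table maps each shard to a server so that a shard can be moved without rehashing keys. This is not how Shard Manager is implemented; `NUM_SHARDS`, `shard_for`, `ShardMap` and the server names are all hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical fixed shard count


def shard_for(key):
    """Map a key to a shard id with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class ShardMap:
    """Tracks which server currently hosts each shard."""

    def __init__(self, servers):
        # Initial placement: spread the shards round-robin over the servers.
        self.assignment = {
            shard: servers[shard % len(servers)] for shard in range(NUM_SHARDS)
        }

    def server_for(self, key):
        """Return the server that hosts the shard owning this key."""
        return self.assignment[shard_for(key)]

    def move_shard(self, shard, server):
        """Reassign one shard, e.g. for load balancing or server maintenance."""
        self.assignment[shard] = server


if __name__ == "__main__":
    shard_map = ShardMap(["server-a", "server-b"])
    print(shard_for("user:42"), shard_map.server_for("user:42"))
```

Moving a shard only changes one table entry, which is what lets a central service rebalance load while applications keep using the same key-to-shard mapping.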

What is a Kubernetes Operator and Why it Matters for SRE

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

Emily Arnot — Blameless

  • It explains the Kubernetes Operator, the Kubernetes function at the heart of customized automation, and discusses how it can evolve your SRE solution.
    ○ What the Kubernetes Operator can do
      ● Kubernetes Operators complete sophisticated tasks
      ● Kubernetes Operators control custom resources and applications
      ● Kubernetes Operators make stateful decisions
    ○ Kubernetes Operators and SRE
      ● Operator monitoring, SLIs, and SLOs
      ● Automating SRE application deployment
      ● Operators and incident management

KubeWeekly #231 September 4th, 2020

Editor’s pick of the highlights from the past week.

Cloud Native Computing Foundation Announces TiKV Graduation

Congratulations to the TiKV team on the project’s graduation within CNCF! TiKV is an open source distributed transactional key-value database built in Rust. It provides transactional key-value APIs with ACID guarantees. The project provides a unifying distributed storage layer for applications that need data persistence, horizontal scalability, distributed transactions, high availability, and strong consistency, making it an ideal database for the next-generation cloud native infrastructure.

  • The CNCF has announced that TiKV is the 12th project to reach the Graduated level.
  • TiKV is a distributed transactional key-value database written in Rust. It was developed by PingCAP as a storage backend for TiDB, entered the CNCF as a Sandbox project in August 2018, and was at the Incubating level in April 2019.

Submit to speak at EnvoyCon 2020 — submissions due on Sept 4!

EnvoyCon is happening on October 15, 2020, and we invite you to share your knowledge with the community! Envoy is a widely adopted, open source network proxy, designed as a layer 7 edge and service proxy for cloud native applications, initially developed by Lyft and now hosted under Cloud Native Computing Foundation (CNCF). EnvoyCon brings together the community to share best practices, recent developments, and see live demos. The call for proposals closes 23:59 UTC on Friday, September 4.

  • The CFP deadline for “EnvoyCon”, to be held on October 15th, is 23:59 UTC on September 4th. As of September 5th, there seems to be no news of a deadline extension.

You can view all CNCF recorded and upcoming webinars here.

CNCF Member Webinar: Let’s untangle the Service Mesh

Dominik Tornow, Principal Engineer @Cisco

  • It describes distributed systems, services, and service meshes in detail, and untangles the concept of the service mesh.
  • It provides accurate and concise answers to the following questions.
    ○ “What is a Service Mesh?”
    ○ “What does a Service Mesh do?”
    ○ “How does a Service Mesh work?”

CNCF Member Webinar: Getting started with container runtime security using Falco

Loris Degioanni, CTO and Founder @Sysdig

  • It provides an overview of cloud native security, various aspects centered on the runtime, and what has influenced the development of the CNCF container security project Falco.
  • Through demonstrations, it shows the CNCF community how Falco is being used for real-world workloads, and closes with updates on Falco’s adoption, maturity, and future prospects within the CNCF.

CNCF Member Webinar: CNCF has 99+ K8S distros, and this is how (and why) we built one more: OKD4 on FCOS

Christian Glombek, Vadim Rutkovsky, Charro Gruver and Dusty Mabe @Red Hat

  • It describes the open collaboration between the Kubernetes, OpenShift, and Fedora communities, how the latest GA release of OKD4 was built, and how this open source community accelerates innovation and development.

CNCF Member Webinar: Running the next generation of cloud-native applications using Open Application Model (OAM)

Dr. Ryan Zhang, Staff Software Engineer @Alibaba Cloud

  • It describes the Open Application Model (OAM), which defines a standard way to build and run cloud-native applications. A live demo shows the latest developments in the OAM space.

Tutorials, tools, and more that take you on a deep dive into the code.

Let’s learn about getting involved with open source in different ways

  • A YouTube recording of the Zoom session “Learn about Getting involved with Open Source in different ways”. It uses the Kubernetes community as an example.

Building a multi-cluster authentication portal

  • It explains how to build a multi-cluster authentication portal and the key points to consider.
  • TL;DR
    ○ Support authentication to multiple clusters from a single portal
    ○ Use the dashboard and kubectl on each cluster
    ○ Integrate both managed and on-premises clusters
    ○ Automate cluster authentication and onboarding

Portainer 2.0 CE extended for Kubernetes

  • An article that introduces Portainer 2.0 and some of its features.
  • Portainer is a lightweight management UI that allows you to easily manage various Docker environments (Docker hosts or Swarm clusters). Starting with 2.0, it can also manage Kubernetes.
  • The following scenarios are introduced.
    ○ Installing Portainer on Managed k3s by Civo Cloud
    ○ Installing Portainer Agent on GKE

ContainerSolutions/kubernetes-examples: Minimal self-contained examples of standard Kubernetes features and patterns in YAML

  • The Kubernetes Examples GitHub page. A reference repository of YAML files, each a canonical, as-simple-as-possible demonstration of a Kubernetes feature.

How a Kubernetes pod gets an IP address

  • It shares what the author learned about the various network components and how they work together so that each pod in a Kubernetes cluster gets an IP address.
  • I had already seen someone share this article! The kubelet/containerd/CNI diagram is wonderful and easy to understand. The commentary is also very thorough and will be very helpful in understanding Kubernetes networking. I bookmarked it.

Global load balancer for OpenShift clusters: an operator-based approach

  • It illustrates an Operator-based approach to configuring a global load balancer in front of a fleet of OpenShift clusters.

Public Sector on Air: SPARTA OCP 4.5 air-gapped AWS GovCloud

  • A Twitch video by Red Hat’s OpenShift Public Sector team (Team Sparta).
  • A case study of OCP 4.5 in an air-gapped environment. It is explained using air-gapped AWS GovCloud as a reference architecture.

Hide my secrets — visual studio marketplace

  • It introduces a Visual Studio Code extension that hides secret values in YAML files.

redhat-cop/rego-policies: Rego policies collection

  • A GitHub page with a collection of Rego policies. Click here for a list of policies.

Articles, announcements, and more that give you a high-level overview of challenges and features.

Keptn, with Alois Reitbauer

Craig Box and Adam Glick, Kubernetes Podcast from Google

  • A Kubernetes podcast hosted by Google employees. The current co-hosts are Craig Box and Adam Glick.
  • The guest is Alois Reitbauer, Chief Technical Strategist at Dynatrace and co-chair of the CNCF App Delivery SIG.
  • The topics that interested me in the News of the week are:
    ○ Kubernetes client comparison by Yolan Vloeberghs and Pieter Vincken
    ○ Distributed tracing overview by Jonathan Gold

The cloud native landscape: The provisioning layer explained

Catherine Paganini and Jason Morgan, Buoyant

  • A series of articles about cloud native landscape from The New Stack.
  • The previous article provided an overview of CNCF’s cloud-native ecosystem, and this article is in a series that examines each layer. The theme this time is provisioning.
  • The next article will cover the runtime layer: cloud native storage, container runtimes, and networking.

Learning Kubernetes: The need for a realistic playground

Daniel Bryant, Datawire

  • Another article from The New Stack.
  • It says that “Providing a Kubernetes playground is essential for the learning journey associated with this framework.” It introduces Katacoda, the Go playground, the Open Policy Agent Rego playground, and others, and discusses using Helm charts, Terraform, kubeadm, and similar tools to create and delete simple playgrounds that match engineers’ needs.

The Kubernetes handbook

Farhan Hasin Chowdhury, Freecodecamp

  • A handbook that carefully explains the basics of Kubernetes with illustrations and hands-on exercises. There is also a Docker Handbook. It is organized on one page, so you can check and fill in your own understanding in order from the top.

Manage all your Kubernetes clusters with Anthos attached clusters

Matthew DeLio and Bradley Wong, Google Cloud

  • An article by GCP introducing “Anthos attached clusters”, a new Anthos feature.
  • By connecting AWS EKS and Azure Kubernetes Service clusters to the control plane on GCP, you can manage clusters across multi-cloud and hybrid environments from a single pane of glass. Support for other clouds is expected to follow.

Kubernetes troubleshooting: 7 essential steps for delivering reliable applications

Alex Zhitnitsky, Product Marketing Director, OverOps

  • An article based on a recent webinar, created in collaboration with CNCF, by Brandon Groves and Ben Morrise of the OverOps engineering team. The original webinar is also embedded in the page.
  • The outline is below.
    Phase 1: Build & Testing
    ○ Item #1: Static Analysis
    ○ Item #2: Unit Tests
    ○ Item #3: Integration and End-to-end testing
    Phase 2: Staging / User Acceptance Testing (UAT)
    ○ Item #4: Performance/Scale testing
    ○ Item #5: Staging go-no-go decision
    Phase 3: Production
    ○ Item #6: Build Rollout
    ○ Item #7: Rollback Criteria

GitOps gains momentum among Kubernetes deployment tools

Beth Pariseau, Tech Target

  • An article on the progress and future of GitOps that touches on points such as the following.
    ○ The rivalry between Argo CD and Flux CD within the CNCF
    ○ Push-style versus pull-style approaches
    ○ Even at some companies that advocate GitOps, Jenkins is used as the mature CI/CD tool in enterprise production environments (Argo/Flux are not mature yet)
    ○ The opinion that truly mature CD tools are lacking in the cloud native world
    ○ The opinion that there are already too many tools

Upcoming CNCF webinars

You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.

Member Webinar: Arm Developer Experience Spanning Cloud, 5G and IoT
Darragh Grealish, Co-Founder @56K.Cloud
Marc Meunier, Sr. Manager, SW Ecosystem Development @Arm
Sept 8, 2020 10:00 AM Pacific Time

Member Webinar: Building a Cloud-Native Technology Stack that Supports Full Cycle Development
Daniel Bryant, Product Architect @Datawire
Sept 9, 2020 7:00 AM Pacific Time

Member Webinar: Highly scalable SaaS Apps on Kubernetes: Real Life Case Studies
Ram Kailasanathan, Senior Director Product Management @Oracle
Sept 9, 2020 1:00 PM Pacific Time

Member Webinar: Kubernetes and Networks: why is this so dang hard?
Tim Hockin, Principal Software Engineer @Google
Sept 10, 2020 10:00 AM Pacific Time

Member Webinar: Achieving Least Privilege Access in Kubernetes
Eran Leib, Co-Founder and VP Product Management @Apolicy
Gregg Ogden, Senior Product Marketing Manager @Aqua Security
Sept 11, 2020 10:00 AM Pacific Time

Ambassador Webinar: Hybrid Serverless Development using Quarkus and Kubernetes
Daniel Oh, Principal Technical Marketing Manager @RedHat and CNCF Ambassador
Sept 11, 2020 1:00 PM Pacific Time

Member Webinar: ChubaoFS Best Practices
Wei Ding, Staff Engineer
Sept 15, 2020 10:00 AM Pacific Time

Member Webinar: How To Run Kubernetes Securely and Efficiently
Joe Pelletier, VP, Products Fairwinds @Fairwinds
Robert Brennan, Director, Open Source @Fairwinds
Sept 16, 2020 7:00 AM Pacific Time

Member Webinar: Effective Kubernetes Onboarding
Kathleen Juell, Developer, DODX @DigitalOcean
Sept 16, 2020 1:00 PM Pacific Time

Member Webinar: Declaratively managing apps in a multi-cluster world
Fernando Ripoll, Solution Engineer @Giant Swarm
Sept 17, 2020 10:00 AM Pacific Time

Member Webinar: Critical DevSecOps considerations for multicloud Kubernetes
Nutanix and Sysdig
Sept 18, 2020 10:00 AM Pacific Time

Member Webinar: Using KubeVirt in telcos
Abhinivesh Jain, Distinguished Member of Technical Staff @Wipro
Sept 23, 2020 7:00 AM Pacific Time

Member Webinar: Mitigating Kubernetes attacks
Wei Lien Dang, Head of Strategy @StackRox
Sept 23, 2020 1:00 PM Pacific Time

Member Webinar: AWS controllers for Kubernetes — AWS services, now Kubified!
Jay Pipes, Principal Open Source Engineer @Amazon Web Services
Sept 24, 2020 10:00 AM Pacific Time

Project Webinar: Kubernetes 1.19
Kubernetes Release Team
Sept 25, 2020 8:00 AM Pacific Time

Member Webinar: VanillaStack as a platform for a truly vendor-agnostic open-source ecosystem
Karsten Samaschke, CEO @Cloudical
Sept 29, 2020 10:00 AM Pacific Time

Member Webinar: Self service Kubernetes for enterprises
Jim Bugwadia, Founder and CEO @Nirmata
Sept 30, 2020 10:00 AM Pacific Time

Member Webinar: Kubernetes native two-level resource management for AI/ML workloads
Diana Arroyo, Software Engineer @IBM Research
Alaa Youssef, Manager, Container Cloud Platform @IBM Research
Oct 7, 2020 10:00 AM Pacific Time

How did you find those articles? Did any of them interest you?

Actually, there is some content I have not been able to digest yet, so I’ll use this aide-memoire and these links to catch up myself.

Bye now!!

Written by Yoshiki Fujiwara

An infra engineer in Tokyo, Japan. Grew up in Athens, Greece (1986–1992). #Network, #Kubernetes, #GCP, #AWS SAP, #National Tour Guide for English
