SRE / DevOps / Kubernetes Weekly Collection #82 (Week 34, 2021)

Yoshiki Fujiwara
Aug 29, 2021
  • In this blog post series, I collect the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
  • I have already published the same content on my Japanese blog, and I am catching up in English in this series.
  • I hope it serves as a useful reference for people browsing this kind of information.

DEVOPS WEEKLY ISSUE #556 August 22nd, 2021
SRE Weekly Issue #284 August 22nd, 2021
KubeWeekly #274 August 27th, 2021

DEVOPS WEEKLY ISSUE #556 August 22nd, 2021

News

Access and identity is a deep topic, never more so than when it comes to AWS. This post does a good job of explaining the current situation and related problems, and discusses potential improvements.

  • The title is “AWS Doesn’t Know Who I Am. Here’s Why That’s A Problem.”.
  • It discusses the current state of access and identity in AWS and shares ideas for improving IAM.

Gatekeeper (the Kubernetes policy enforcement tool) now has an alpha capability to mutate resources. This post is a good introduction, including an example of automatically fixing pod security admission issues.

  • The title is “Mutating Kubernetes resources with Gatekeeper”.
  • It explains how to mutate Kubernetes resources with Gatekeeper. Gatekeeper’s mutation feature is still in alpha and should not be used in production.

Everyone likes the idea of a single root cause when a problem occurs. This post compares that to how we think about successes, to make the point about the fragility of looking for a singular root cause.

  • The title is “Root cause of failure, root cause of success”.
  • As the title and the editor’s comment above suggest, it explains that looking for a single root cause of a system failure is just as meaningless as looking for a single root cause of a system’s success.

A detailed post on how best to audit and secure an AWS account.

  • The title is “How to audit and secure an AWS account”.
  • The article covers the following items in detail.
    ○ How to audit an AWS account
      1. Generate and maintain a complete list of assets
      2. Secure IAM
      3. Find public resources
      4. Use AWS Organizations
      5. Ensure audit logs are enabled
      6. Turn on security controls
      7. Build data flow diagrams and network maps if none exist
      8. Pick a standard
      9. Build a risk register to track findings
    ○ How to secure an AWS account
    ○ Improving security of AWS environments

A good post on service level objectives. It starts out with a good introduction, but it’s nice to see concrete examples and discussion of how to implement this in code, in this case with Ruby and Java.

  • The title is “Focusing on What Matters: Using SLOs to Pursue User Happiness”.
  • It outlines some philosophical approaches to defining SLOs, how they can help with prioritization, and the tools currently available to “Betterment Engineers” to make this process a little easier.

If you’ve read much about SRE, you’ll probably have heard of the four golden signals of monitoring. This post provides a quick introduction and suggests some improvements and gaps.

  • The title is “How to Improve Upon Google’s Four Golden Signals of Monitoring”.
  • The article covers the following points.
    ○ The Four Golden Signals, defined
    ○ Benefits of the Four Golden Signals
    ○ Thinking beyond the Four Golden Signals
    ○ Conclusion: Getting more from the Golden Signals
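
As a memo for myself, here is a minimal sketch of how the four golden signals (latency, traffic, errors, saturation) could be computed for one time window. It is not taken from the article: the `Request` record, the choice of the 99th percentile, counting only 5xx responses as errors, and using CPU as the saturation measure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used: float, cpu_capacity: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), in integer arithmetic.
    rank = -(-99 * len(durations) // 100)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_used / cpu_capacity,
    }


# Example: 100 requests observed over a 10-second window.
sample = [Request(10, 200)] * 98 + [Request(500, 500), Request(1000, 503)]
print(golden_signals(sample, window_s=10.0, cpu_used=3.0, cpu_capacity=4.0))
```

In practice these values usually come from a metrics system rather than raw request lists, but the arithmetic behind each signal is the same.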

The infrastructure for storage and usage of internal data is an ever-growing part of many operations teams’ responsibilities. This post provides a useful high-level view of such a modern data platform.

  • The title is “The Anatomy of an Active Metadata Platform”.
  • The article covers the following points.
    1. The metadata lake: A single central store for metadata
    2. Programmable-intelligence bots
    3. Embedded collaboration plugins
    4. Data process automation
    5. Reverse metadata
    ○ Summing up

Tools

Secrets and Kubernetes can be a challenge. This webhook provides one option, injecting secrets into Kubernetes resources from various secrets managers including Vault, AWS, GCP and Azure secrets managers.

  • This is the GitHub page of “k8s-vault-webhook”, a Kubernetes admission webhook that listens for events on Kubernetes resources and injects secrets from a secrets manager directly into Pods, Secrets, and ConfigMaps.

Kubescape is a new security scanning tool for checking the setup of a Kubernetes cluster, based on the recently published NSA and CISA guidance.

SRE Weekly Issue #284 August 22nd, 2021

Articles

Alerting on SLOs like Pros

SoundCloud is very clear on the fact that they are not at Google scale. It’s interesting to see how they apply SRE principles at their scale.

Björn “Beorn” Rabenstein — SoundCloud

  • An article dated June 4th, 2019. It describes how SoundCloud implemented SLOs and error budgets, with corrective action triggered by error budget burn, which lets them meet their SLOs without flooding their on-call engineers with an unsustainable number of pages.
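
To remind myself what alerting on error budget burn means in practice, here is a small sketch. It is not SoundCloud’s implementation: the function names are hypothetical, and the numbers (a 99.9% SLO, a 30-day period, paging when 2% of the budget burns within one hour) are assumptions chosen for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Return how fast the error budget is burning (1.0 = exactly on budget)."""
    budget = 1.0 - slo  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("SLO must be below 100%")
    return error_ratio / budget


def threshold_for(budget_fraction: float, window_hours: float,
                  period_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the window."""
    return budget_fraction * period_hours / window_hours


def should_page(long_errors: float, short_errors: float,
                slo: float, threshold: float) -> bool:
    """Page only when both the long and the short window exceed the threshold."""
    return (burn_rate(long_errors, slo) >= threshold
            and burn_rate(short_errors, slo) >= threshold)


# Assumed example: 99.9% SLO, page if 2% of the 30-day budget burns in 1 hour.
threshold = threshold_for(0.02, window_hours=1)
print(should_page(0.02, 0.02, slo=0.999, threshold=threshold))    # both windows hot
print(should_page(0.02, 0.0005, slo=0.999, threshold=threshold))  # short window recovered
```

Requiring a short window in addition to the long one is what keeps the pager quiet once a problem has already stopped, which is one way to avoid unnecessary pages.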

Distributed Troubleshooting

Here’s why Target set up their ELK stack, and how they used it to troubleshoot a problem in Elasticsearch itself.

Dan Getzke — Target

  • An article dated April 5th, 2017. Target shares insights into extending its platform for troubleshooting distributed systems.

Error Budgets and their Dependencies

A key point in this article is that calculating your error budget as just “100% - SLO” goes about things backward.

Adam Hammond — Squadcast

  • It describes who is responsible for the planned and unplanned outages that can occur in a system, and how a team can calculate its error budget efficiently.
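
As a rough worked example of the arithmetic involved (my own sketch, not the article’s method): if a service needs all of its dependencies to be up, its availability cannot exceed the product of their availabilities, and the error budget for a period can be expressed in minutes and then reduced by planned and unplanned downtime. The function names and the 30-day window are assumptions.

```python
from typing import Iterable

MINUTES_PER_DAY = 24 * 60


def composite_availability(dependencies: Iterable[float]) -> float:
    """Availability ceiling when every listed dependency is required."""
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for the given SLO and window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def remaining_budget_minutes(slo: float, planned_minutes: float,
                             unplanned_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting planned and unplanned downtime."""
    return (error_budget_minutes(slo, window_days)
            - planned_minutes - unplanned_minutes)


# Assumed example: two required dependencies and a 99.9% SLO over 30 days.
print(composite_availability([0.999, 0.9995]))
print(error_budget_minutes(0.999))
print(remaining_budget_minutes(0.999, planned_minutes=20, unplanned_minutes=10))
```

Starting from the dependency ceiling rather than from a target picked in isolation is one way to read the “backward” remark quoted above.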

Capacity Planning at Scale

They periodically scale up their systems just to test and be sure they’ll be ready for big events like Black Friday / Cyber Monday.

Kathryn Tang — Shopify

  • It describes Shopify’s approach to capacity planning, how they rolled it out across dozens of teams in their organization, and how they validate their capacity plans with scalability tests to make sure they work.

How to drive ownership in microservices

In this post, we’ll focus on service ownership. Why is service ownership important? How should teams self-organize to achieve it? Where’s the best place to start?

Cortex

  • Focusing on service ownership, it poses and answers the three questions quoted above.

One, Two, Skip a Few…

This fun troubleshooting story hinges around the internal details of how PostgreSQL’s sequences work.

Pete Hamilton — incident.io

  • Starting from an unexpected jump in incident IDs noticed by a customer, it describes the investigation into the inner workings of Postgres and how the issue was dealt with.

KubeWeekly #274 August 27th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

KubeCon + CloudNativeCon North America co-located event schedules are live!

We’re excited to announce that the schedules* for all CNCF-hosted co-located events are now available! Co-located events, including those that are sponsor-hosted, will take place on October 11–12 both in-person and virtually. Be sure to add to your registration today! Please note that an additional fee is required to attend.

*The schedule for the Kubernetes Contributor Summit will be announced on October 5.

  • As mentioned above, the schedules for the co-located events of KubeCon + CloudNativeCon North America have been released. I am considering which ones to attend.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Easy, secure Kubernetes authentication with pinniped

Matt Moyer & Margo Crawford, VMware

  • An approximately 35-minute session in which two maintainers of “Pinniped”, Matt Moyer and Margo Crawford, show how to install and use Pinniped and how to connect a Kubernetes cluster to common enterprise SSO solutions, giving cluster users a simple login flow.

Gear up for performance — Leveraging eBPF on Openshift with Project Calico

Chris Tomkins, Tigera

  • An approximately 1-hour session that introduces how to use the Calico eBPF data plane on OpenShift, covering the following points.
    ○ How to leverage Calico eBPF on OpenShift
    ○ How eBPF brings improved performance
    ○ How eBPF brings improved service handling
    ○ Best practices for an eBPF implementation in OpenShift

What is the cost of a secret?

Presented by Steve Giguere, Bridgecrew by Prisma Cloud and sponsored by Palo Alto Networks

  • A 22-minute session that explains the pitfalls of accidentally exposing Secrets and best practices for avoiding future exposure.

Next generation observability using open source monitoring

Presented by Karl Gouverneur, Northwestern Mutual and sponsored by OpsCruise

  • A 57-minute session that covers the following three points.
  1. Get deep insights into your application from open-source CNCF monitoring
  2. Leverage real-time analytics for proactively detecting, isolating and resolving problems, and
  3. Learn how Ops teams can stay on top of your modern applications and infrastructure

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

How to secure containers with Cosign and Distroless images

Jeswin K. Ninan, InfraCloud

  • As the title suggests, it introduces Cosign and Distroless container images, which help you deploy your application containers more securely and run them in production.

Writing a Kubernetes validating webhook using Python

Kristijan Mitevski, Mitevski blog

  • As the title suggests, the author writes their own validating webhook in Python.
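
For my own reference, the core of a validating webhook is a function that receives an `AdmissionReview` request and returns an `AdmissionReview` response with the same `uid` and an `allowed` flag. The sketch below is not the code from the article; the policy (requiring a `team` label) and the function name `review` are hypothetical, and it uses only the Python standard library.

```python
import json


def review(admission_review: dict, required_label: str = "team") -> dict:
    """Build an AdmissionReview response for a hypothetical 'label required' policy."""
    request = admission_review["request"]
    labels = request["object"].get("metadata", {}).get("labels") or {}
    allowed = required_label in labels
    # The response must echo the uid of the request it answers.
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {
            "message": f"missing required label '{required_label}'"
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Example request body, trimmed to the fields this sketch reads.
incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "example-uid",
        "object": {"metadata": {"name": "demo", "labels": {"app": "demo"}}},
    },
}
print(json.dumps(review(incoming), indent=2))
```

A real webhook also has to be served over HTTPS and registered with the cluster through a `ValidatingWebhookConfiguration`; the function above covers only the decision logic.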

How to leverage Insomnia as a GraphQL client

Garen Torikian, Kong

  • As the title suggests, it covers some of the highlights of using “Insomnia” as a GraphQL client. A demo video of about 11 minutes is embedded in the web page.

Serverless storage for your functions from the Datastax Astra DB

Alex Ellis, OpenFaaS

  • It introduces Datastax’s Astra DB and explains how it can provide convenient pay-as-you-go storage for serverless functions.

Kubescape is the first tool for testing if Kubernetes is deployed securely as defined in Kubernetes Hardening Guidance by NSA and CISA

  • Since it is covered in DEVOPS WEEKLY ISSUE #556 above, I will skip it.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

KEDA, with Tom Kerkhove

Craig Box, Kubernetes Podcast from Google

Maintaining Envoy proxy with Snow Pettersen
Curiefense podcast

  • An approximately 34-minute podcast episode, with a transcript, featuring Snow Pettersen, a member of Lyft’s Resilience team and senior maintainer of Envoy Proxy. It touches on Pettersen’s experience working with cloud native technology at Square, Netflix, and Lyft, starting with the changes over the last few years.

Understanding artifact registry vs. container registry

Dustin Ingram, Google Cloud

  • It introduces GCP’s fully managed service, “Artifact Registry”, and describes the major improvements that Artifact Registry offers over Container Registry and the steps to start using Artifact Registry today.

OpenTelemetry becomes a CNCF incubating project

  • As the title suggests, the “OpenTelemetry” project has moved from the Sandbox to the Incubating maturity level. The post also gives an overview of OpenTelemetry.

Take Part 2 of the CNCF Cloud Native Survey today!

  • As with last week, this is a reminder about the survey above.

Upcoming CNCF Online Programs

Live Webinar

Cloud Native Live

YouTube playlist submissions

Learn more about CNCF Online Programs

What did you think of these articles? Did any of them interest you?

To be honest, there is some content here that I cannot digest at this stage, so I will use this aide-memoire and these links to catch up myself too.

Bye now!!

Yoshiki Fujiwara
