SRE / DevOps / Kubernetes Weekly Collection #54 (Week 6, 2021)
- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and add some comments and useful links as an aide-memoire.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #528 February 7th, 2021
SRE Weekly Issue #256 February 7th, 2021
KubeWeekly #250 February 12th, 2021
DEVOPS WEEKLY ISSUE #528 February 7th, 2021
News
- The title is “How the Bottlerocket build system works” from the AWS Open Source Blog.
- It explains in detail Bottlerocket, an open source OS from AWS created for running containers on VMs and bare metal.
- I felt nostalgic when I jumped to the Cargo page and saw the photo of the palletized cardboard boxes. It reminds me of my days in the logistics industry, when I wrapped and unwrapped pallets.
A post on how to best defend your software build pipeline from targeted supply chain attacks.
- The title is “Defending software build pipelines from malicious attack”.
- Following last week’s article “Securing the NCSC’s web platform,” this is another article from the UK’s National Cyber Security Centre (NCSC).
- It explains why the build pipeline is one of the foundations of system security and why you should give it particular attention, along with the following:
○ The benefits of automation
○ Defend the pipeline
○ Protect builds from each other
○ Establish a chain of custody
○ Consider a managed service for your build pipelines
○ Hard work, but worth the effort
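One small part of establishing a chain of custody can be sketched as checking that the artifact a later stage consumes is byte-for-byte the one an earlier stage produced. The following is a minimal Python sketch under that assumption; the function names and the artifact content are hypothetical and do not come from the NCSC article.

```python
import hashlib
import hmac


def artifact_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Check an artifact against the digest recorded when it was built."""
    # compare_digest avoids leaking, through timing, where the strings differ.
    return hmac.compare_digest(artifact_digest(data), expected_digest)


# Hypothetical artifact produced by the build stage.
built = b"example build output"
recorded = artifact_digest(built)

print(verify_artifact(built, recorded))                 # unchanged artifact
print(verify_artifact(built + b" tampered", recorded))  # modified artifact
```

A digest check like this only shows that the bytes did not change between stages. It says nothing about who produced them, which is why real pipelines usually add signing on top.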
- The title is “A visual language for digital integration”.
- It explains how to visually capture the right information (and only the right information) on one page when designing the integration for digital systems.
- A future post will explore the process of determining components in more detail.
- The title is “Build packs vs Docker files”.
- It tells the story of the author’s development team migrating from Dockerfiles to buildpacks, and explains the following six perspectives as the factors behind the decision.
○ Developer Productivity
○ Security
○ Performance
○ Customizability
○ Community
○ Kubernetes Support
- The web page of “Threat Modeling Manifesto”.
- Threat Modeling is explained according to the following items.
○ What is threat modeling?
○ Values
○ Principles
○ About
- The title is “Putting a VIP in your Kubernetes Clusters”.
- It touches on Tim Hockin’s “Bringing Traffic Into Your Kubernetes Cluster” and discusses Type: LoadBalancer (or, in most cases, a virtual IP address) from a different perspective.
- The title is “Analyzing gRPC messages using Wireshark”.
- It explains how to configure and use the protocol-specific components “Wireshark gRPC dissector” and “Protocol Buffers (Protobuf) dissector” that allow Wireshark to analyze gRPC messages.
A case study for building a Kubernetes-powered CI/CD pipeline using GitLab and Helm.
- The title is “Building a Kubernetes CI / CD Pipeline with GitLab and Helm”.
- Since it was covered in KubeWeekly #249 last week, I will skip it.
Events
- An introductory article on the event “WTF is SRE?” featured in last week’s SRE Weekly Issue #255.
Tools
- The GitHub page of “Vorteil”, an operating system for running cloud applications in micro virtual machines.
Kubenav provides desktop, web and mobile apps for monitoring the status of a Kubernetes cluster.
- The GitHub page of “kubenav”, a mobile, desktop, and web app for managing Kubernetes clusters and getting an overview of resource status.
SRE Weekly Issue #256 February 7th, 2021
Articles
Slack’s Outage on January 4th 2021
Here’s a blog post from Slack giving even more information about what went wrong on January 4. Bravo, Slack, there’s a lot in here for us to learn from.
Laura Nolan — Slack
- The article on this outage that I covered in SRE Weekly Issue #254 the other day was Slack’s incident report; this one is from Slack’s engineering blog.
- As corrective actions, AWS has assured Slack that it is reviewing the scaling algorithms of AWS Transit Gateways (TGWs) for large packet-per-second increases as part of its post-incident process, Slack has set a reminder to request preemptive upscaling of its TGWs for the next holiday season, and more.
Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website
This academic paper from Facebook explains how they release code without disrupting active connections, even for a small number of users.
Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, and Theophilus A. Benson — Facebook
- The abstract page of the Facebook paper. You can download the paper from the link.
- It’s about Zero Downtime Release, a framework that leverages various components of the end-to-end network infrastructure to prevent or mask interruptions in the face of a release.
Another lesson we can learn from aviation: have one place where engineers can find out about temporary infrastructure changes that are important.
Bill Duncan
- It explores how to communicate the temporary state of each environment effectively to the entire team in a many-to-many situation, where SREs handle many environments together with many other team members. It uses the aviation term “NOTAM (Notices to Airmen)” as the keyword to explain the idea.
Incident Post Mortem: January 29, 2021 [Coinbase]
Coinbase posted this detailed analysis of their January 29th incident.
Coinbase
- It details the outage, explains what caused it, and describes changes to prevent similar failures in the future.
- They are working on changes such as reviewing monitoring, utilizing read-only replicas, and breaking down monolithic app servers into individual services.
Council Post: How Cloud Services Platform Teams Can Drive The Adoption Of Effective SRE Practices
Interesting thesis: a company moving into the cloud is in a unique position to adopt SRE practices — and better situated than cloud-first companies.
Tina Huang (CTO, transpose) — Forbes
- In line with the title, it explains two contrasting approaches that legacy software companies can take in their cloud strategies, which lead to significantly different SRE outcomes.
○ Adopt Cloud Services For Individual Services And Teams
○ Build A Cloud Platform Team
“I’m Just Doing my Job,” An SRE Myth
We need to push past surface-level mitigation of an incident and really dig in and learn.
Darrell Pappa — Blameless
- The author, who heard the line in the title from a customer service representative, suggests that SREs should take the customer’s perspective, treat problems as systemic, and apply SRE best practices, as follows:
- Customers deserve better, and we should always be their biggest advocate. So, next time you find yourself saying, “Sorry, but I’m just doing my job,” try to shift your perspective to the customer. View these problems as systemic, use SRE best practices like SLOs and error budgets, and embrace a blameless culture to help make a change.
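The SLOs and error budgets mentioned in the quote come down to simple arithmetic. Here is a minimal Python sketch, assuming an availability SLO measured over a 30-day window; the function names and the sample numbers are my own illustration, not from the article.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(f"{error_budget_minutes(0.999):.1f}")
# After 21.6 minutes of downtime, about half of the budget is left.
print(f"{budget_remaining(0.999, 21.6):.2f}")
```

Seen this way, an incident is measured by how much of a shared budget it consumed rather than by who caused it, which fits the blameless culture the quote asks for.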
GitHub Availability Report: January 2021
GitHub’s database failed in a manner that wasn’t detected by their automated failover system.
Keith Ballinger — GitHub
- It describes one incident in January that caused significant impact and degraded availability of the GitHub Actions service, along with the countermeasures.
Open source update: School of SRE
LinkedIn published their SRE training documentation in the form of a full curriculum covering a range of topics.
Akbar KM and Kalyanasundaram Somasundaram — LinkedIn
- It introduces the School of SRE, a curriculum curated for ambitious SREs, which LinkedIn has published on GitHub.
- It would be nice to have this kind of information in order to improve your skills as a team.
Push some big numbers through your system and look for bugs
Your code may be designed to handle 64-bit integers, but what if a library (such as a JSON decoder) converts them to floating point numbers?
rachelbythebay
- It introduces ways to look for bugs by pushing big numbers through JSON handling and seeing what comes out.
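The precision loss described above can be reproduced in a few lines of Python. Python’s own `json` module keeps integers exact, so this sketch simulates a float-converting decoder by passing `parse_int=float` to `json.loads`; the `id` field name is just an illustration.

```python
import json

# 2**53 + 1 is the smallest positive integer a 64-bit float cannot represent.
big = 2**53 + 1

payload = json.dumps({"id": big})

# Python's json module keeps integers exact by default.
exact = json.loads(payload)["id"]

# Simulate a decoder that turns every number into a float.
# parse_int is a real json.loads parameter; using float here is the simulation.
lossy = json.loads(payload, parse_int=float)["id"]

print(exact == big)       # True
print(int(lossy) == big)  # False
print(int(lossy))         # 9007199254740992
```

The value silently comes back one lower than what was sent, which is the kind of bug that only shows up once real IDs or counters grow past 2**53.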
KubeWeekly #250 February 12th, 2021
The Headlines
Editor’s pick of the highlights from the past week.
Last chance to register for KubeCon + CloudNativeCon Europe 2021 — Virtual for $10!
KubeCon + CloudNativeCon Europe 2021 Virtual is happening on May 4–7, 2021. Be sure to register for a full All Access Pass for just $10 through February 14 at 23:59 CEST! The price will increase to $75 on February 15, so act fast to take advantage of this great deal.
Don’t forget — the CFP deadline for KubeCon + CloudNativeCon Europe 2021 Virtual co-located events closes on February 19!
See the full list of co-located events below:
Cloud Native Rust Day– hosted by CNCF — May 3
Cloud Native Security Day Europe — May 4
Cloud Native Wasm Day — May 4
FluentCon Cloud Native Logging day with Fluent Bit & Fluentd — May 4
Kubernetes AI Day — May 4
Kubernetes on Edge Day — May 4
ServiceMeshCon Europe — May 4
- KubeCon + CloudNativeCon Europe 2021 Virtual and its co-located events are scheduled for May 4–7, 2021. The $10 All Access Pass is available through February 14. I have already registered and am considering which co-located events to sign up for.
The Technical
Tutorials, tools, and more that take you on a deep dive into the code.
Getting started with Kubernetes audit logs and Falco
Pawan Shankar, Sysdig
- It describes what Kubernetes audit logs are, the information they provide, and how to integrate them with the open source runtime security tool “Falco” to detect suspicious activity in your cluster.
Building container images in Go
Ahmet Alp Balkan
- It explains how to build an OCI container image without using Docker by programmatically building the layers and image manifests using the go-containerregistry module.
Cloud Development Environments: Using Skaffold and Telepresence on Kubernetes for fast dev loops
Peter O’Neill, Ambassador Labs
- It explains how to use Skaffold to build and deploy a local environment, launch Telepresence, project the local services you are building to a remote cluster, and iterate quickly on development.
Achieving Cloud Native Security and Compliance with Teleport
Ninad Desai, InfraCloud
- It touches on the need for Zero Trust Architecture and introduces “Teleport” as a product that fits into the area of “Zero Trust Network” for cloud-native apps.
Kubernetes Liveness Probes — Examples & Common Pitfalls
Levent Ogut, Loft
- It mentions that the readiness probe and liveness probe, described in a previous post, behave differently, and explains each component, the configuration, and how to troubleshoot them.
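To make the liveness probe behavior concrete, here is a small Python simulation of the `failureThreshold` rule as I understand it: a container is restarted once the probe has failed that many times in a row, and any success resets the count. The function name and the sample probe results are hypothetical; this is a sketch of the rule, not kubelet code.

```python
from typing import Iterable, Optional


def first_restart(results: Iterable[bool], failure_threshold: int = 3) -> Optional[int]:
    """Return the 1-based index of the probe that triggers a restart, or None."""
    consecutive_failures = 0
    for index, ok in enumerate(results, start=1):
        if ok:
            # Any successful probe resets the failure streak.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
    return None


# Two failures, a success, then three failures in a row: restart on probe 7.
print(first_restart([True, False, False, True, False, False, False]))
# Failures that never reach three in a row do not trigger a restart.
print(first_restart([False, True, False, True]))
```

This is why an application that is slow but alive can survive occasional probe failures, while a threshold set too low leads to needless restarts.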
Saiyam Pathak, Civo
- It introduces an open source hyper-converged infrastructure (HCI) software running on Kubernetes. It is presented as an open source alternative to products such as vSphere and Nutanix.
ICYMI: CNCF online programs this week
A weekly summary of CNCF online programs from this week.
How to Manage Kubernetes Application Lifecycle Using Carvel
Helen George and Joao Pereira @VMware
- It introduces Carvel, an open source project that provides a reliable, single-purpose, configurable set of tools to help you build, configure, and deploy your apps to Kubernetes.
- It shows how to take advantage of Carvel and explains how to use each tool individually or together.
Debugging Kubernetes On The Fly
Josh Hendrick @Rookout
- It describes the traditional challenges of debugging Kubernetes-based apps and how real-time debugging of production workloads can help solve them.
The Editorial
Articles, announcements, and more that give you a high-level overview of challenges and features.
The State of Cloud Native Application Security survey — 2021
Matt Jarvis, Snyk
- It introduces Snyk’s Cloud native application security (CNAS) 2021 survey and shares plans for its report.
- The first 500 people will get free coffee, and the survey results will be released to the community free of charge, so if you are interested, please check it out.
Garden: The Configure-Once Kubernetes Platform for Seamless Dev/Prod Integration
Thor Sigurdsson & Mike Winters, Garden
- It states that “Most of the problems developers run into CI are caused by a) discrepancies between dev and CI environments and b) insufficient, slow integration testing.” and that one possible approach to solving these problems is to use a consistent configuration for every pre-production environment, from development to testing to CI.
- In this context, it introduces “Garden”, which takes that approach.
- Garden is an open source project that describes the entire stack (all services, tests, dependencies) and allows you to launch an on-demand full-stack environment at every step of your development pipeline.
- At the beginning, it explains how developer experience is affected when development environments use a completely separate (and often pared-down) configuration compared to CI, which builds, tests, and deploys in a more production-like setting:
○ First, the discrepancy between development and CI environments leads to hard-to-predict errors in CI.
○ Second, developers have no idea if integration tests will pass when they push to CI.
○ Third, the process of troubleshooting CI is slow and tedious.
○ And fourth, even just writing integration tests takes a lot of time and effort.
Upcoming CNCF Online Programs
CNCF Live webinar: Toward Hybrid Cloud Serverless Transparency with Lithops Framework
presented by IBM
February 16, 2021 at 10:00 am PT
This Week in Cloud Native (Livestream): KCD El Salvador
February 17, 2021 at 12:00 pm PT
CNCF Online Programs Playlist on YouTube
Check out our playlist for more curated content you don’t want to miss! New content is added every Friday.
- For more information, please visit our updated Online Programs page.
What did you think of these articles? Did any of them interest you?
There are some I have not fully digested yet, so I’ll use this aide-memoire and these links to catch up myself too.
Bye now!!