SRE / DevOps / Kubernetes Weekly Collection #87 (Week 39, 2021)
- In this blog post series, I collect items from the following three weekly mailing lists I subscribe to and add some comments and useful links as an aide-memoire.
- I have already published the same content on my Japanese blog and am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
DEVOPS WEEKLY ISSUE #561 September 26th, 2021
SRE Weekly Issue #289 September 26th, 2021
KubeWeekly #279 October 1st, 2021
DEVOPS WEEKLY ISSUE #561 September 26th, 2021
News
- The title is “2021 Accelerate State of DevOps report addresses burnout, team performance”.
- The DORA (DevOps Research and Assessment) team from Google Cloud has introduced the 2021 Accelerate State of DevOps Report with the new findings from this year’s report.
- The title is “How Paul Osman thinks about long-term strategies, open telemetry, and the value of boring systems”.
- An article in “Level-Up”, a series of interviews with standout engineering leaders to learn what’s top of mind for them. As the title suggests, the author has extracted the key points from a video interview with Paul Osman, a Staff Platform Engineer at Honeycomb.
- A video of the entire interview, about 30 minutes long, is embedded at the end.
A post on the evolution of distributed tracing over the past 5 years.
- The title is “Five years evolution of open-source distributed tracing”.
- It summarizes what the author has learned in the five years since they started working on an OSS distributed tracing project, shows the evolution of the OSS distributed tracing space in chronological order, and peeks into the future.
- The explanation focuses on the following projects, which are listed in the article's References section.
○ Hawkular APM: https://hawkular.gitbooks.io/hawkular-apm-user-guide/content/
○ Jaeger Tracing: https://www.jaegertracing.io/
○ Zipkin: https://zipkin.io/
○ OpenTelemetry: https://opentelemetry.io/
○ OpenCensus: https://opencensus.io/
○ Hypertrace: https://www.hypertrace.org/
○ SigNoz: https://signoz.io/
○ Apache SkyWalking: https://skywalking.apache.org/
○ W3C Trace-Context: https://www.w3.org/TR/trace-context/
○ Grafana Tempo: https://github.com/grafana/tempo
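As a side note on one item in this list: W3C Trace-Context is the piece that lets tracers hand a trace from one service to the next through the `traceparent` HTTP header. Below is a minimal Python sketch of parsing that header, assuming the version 00 layout of four hex fields (version, trace-id, parent-id, trace-flags). The names `parse_traceparent` and `TraceParent` are my own for illustration and do not come from the article or from any of the projects above.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout (version 00 of the traceparent header):
#   version "-" trace-id "-" parent-id "-" trace-flags
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    """Hypothetical container for the parsed header fields."""
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it does not look valid.

    Simplification: only the four-field layout is accepted, so headers
    from future versions with extra fields are rejected here.
    """
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    # Version "ff" and all-zero ids are treated as invalid.
    if match["version"] == "ff":
        return None
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    # The lowest bit of trace-flags is the "sampled" flag.
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(match["version"], match["trace_id"],
                       match["parent_id"], sampled)


example = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

Real tracers ship their own propagation code, so this is only meant to show what the header carries.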
- The title is “How DevOps teams are using — and abusing — DORA metrics”.
- As the title suggests, it discusses how DORA (DevOps Research and Assessment) metrics are used and misused, under the following headings.
○ DORA metrics can be a double-edged sword
○ Here’s the problem you really need to solve
○ It all starts with building the right culture
○ Keep learning
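For reference, the DORA metrics discussed here are usually described as four delivery measures: deployment frequency, lead time for changes, change failure rate, and time to restore service. The following Python sketch shows one rough way to compute them from deployment records. The `Deployment` record, its fields, and `dora_summary` are hypothetical, and the article does not prescribe this calculation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime           # when the change was committed
    deployed_at: datetime            # when it reached production
    failed: bool = False             # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored


def dora_summary(deploys: List[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a period (a sketch, not a standard)."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deploys) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Illustrative data only: four deployments over two days, one of which failed.
t0 = datetime(2021, 9, 20, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=1)),
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=3), failed=True,
               restored_at=t0 + timedelta(hours=3, minutes=30)),
    Deployment(t0, t0 + timedelta(hours=4)),
]
summary = dora_summary(sample, period_days=2)
```

Numbers like these are easy to produce, which may be part of why the title talks about abusing them.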
- The title is “Anatomy of a Cloud Infrastructure Attack via a Pull Request”.
- It focuses on an April 2021 vulnerability that allowed a malicious pull request to a GitHub repository to gain access to production, and on the key areas where the authors are improving their CI/CD tools and practices. It is explained in the following points.
○ Context
○ Technical details
○ Response
○ Advice to others
○ Wrap-up
- The title is “SRE deep dive into Linux Page Cache”.
- This is the Chapter 0 page of a series that describes the Linux page cache. The chapter structure is as follows.
0. Linux Page Cache for SRE
1. Prepare environments
2. Essential theory
3. Basic file operations
4. Eviction and page reclaim
5. More about mmap()
6. Cgroup v2
7. Unique set and working set
8. Direct IO
9. Advanced tools
- The title is “Common Kubernetes Errors Made by Beginners [2021]”.
- From the common Kubernetes beginner’s errors the author has noticed over years of working with Kubernetes and consulting for various clients, the article selects and explains the following six.
○ The selector of the labels on the service does not have a match with the pods
○ Wrong container port mapped to the service
○ CrashLoopBackOff
○ Liveness and readiness probes
○ Resources — Requests and Limits
○ Too many load balancer–type services
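The first error in the list is easy to reproduce on paper. A Service with an equality-based selector only picks up pods whose labels contain every key/value pair in the selector, so a single typo leaves it without endpoints. Here is a small Python sketch of that matching rule using plain dictionaries shaped like manifests. The helper names and the sample `web` objects are made up for illustration, and the code does not talk to a real cluster.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in the labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def matching_pods(service: dict, pods: List[dict]) -> List[str]:
    """Return the names of pods that the Service's selector would pick up."""
    selector = service["spec"].get("selector") or {}
    if not selector:
        # A Service without a selector does not select any pods by itself.
        return []
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod["metadata"].get("labels", {}))
    ]


# Illustrative objects only: the second pod has a typo in its label value.
service = {
    "metadata": {"name": "web"},
    "spec": {"selector": {"app": "web"}},
}
pods = [
    {"metadata": {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}}},
    {"metadata": {"name": "web-2", "labels": {"app": "wbe", "tier": "frontend"}}},
]

print(matching_pods(service, pods))
```

On a real cluster, a Service that shows no endpoints in `kubectl get endpoints` is a common symptom of this mismatch.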
Events
- The Data on Kubernetes Community (DOK) will host the following events.
○ DoK Day North America 2021 @ KubeCon
○ Tuesday, October 12 9:00 AM — 5:00 PM PDT
○ Virtual + Los Angeles, California
Tools
- A web page for “jsPolicy”, a simpler and faster Kubernetes policy engine that uses JavaScript or TypeScript.
- Under “Why yet another policy engine for Kubernetes?”, there is a comparison table with OPA and Kyverno, which makes it easy to see where jsPolicy claims an advantage.
- Click here for the GitHub page.
- The GitHub page of “GitOops”, a tool that helps attackers and defenders identify lateral movements and privilege escalation paths in GitHub organizations by exploiting CI/CD pipelines and GitHub access controls.
SRE Weekly Issue #289 September 26th, 2021
Articles
How SREs are unique in their approach to work
Here are some things that make SREs a unique breed in software work:
The one about Scrum caught my eye, and I followed the links through to the Stack Overflow post about SRE and Scrum.
Ash P — Cruform
- It explains the following points that make SREs a unique breed in software work.
○ SREs look at the broader picture
○ SREs thrive in ambiguity
○ SREs work beyond constraints like Scrum
○ SREs don’t stay in their lane
○ SREs don’t have a monolith job description
○ Comparison with software developers
An in-depth explainer on the Linux page cache, full of details and experiments.
Viacheslav Biriukov
- Since it is covered in DEVOPS WEEKLY ISSUE #561 above, I will skip it.
Just got my first SRE job. I start tomorrow, any advice?
There’s some great advice in this reddit thread… and maybe some tongue-in-cheek advice too.
Take production down the first day they give access — then it’s nothing but up from there!
Various — reddit
- The thread is full of advice in response to a question from someone about to start their first SRE job.
Using two real-world case studies, this article explains how developer self-service can go wrong, and then discusses how to avoid these pitfalls.
Kaspar von Grünberg — humanitec
- Using real-life examples, it discusses how to strike the right balance between golden cages and golden paths.
What is expected in the SRE role? We analyzed 30 job postings to find out.
What a great idea! I found it especially interesting that only 34% of SRE job postings mention defining SLIs/SLOs/error budgets.
Pruthvi — Spike.sh
- Based on an analysis of 30 SRE job listings from major companies such as Google, Twitter, and Slack, it summarizes the top responsibilities of the SRE role as of 2021.
10 questions teams should be asking for faster incident response
For the first time, we’ve created the State of Digital Operations Report which is based on PagerDuty platform data.
[…]
we will walk through some of these findings and share 10 questions teams can ask themselves to improve their incident response.
Hannah Culver — PagerDuty
- As the title suggests, it shares the following 10 questions that teams can ask themselves to improve their incident response.
○ What is our organization’s incident classification scheme?
○ Which incidents require coordinated response?
○ How do you get the right people involved?
○ How can you respond faster through automation?
○ Can you track whether a change led to an issue?
○ How do you map the overall impact?
○ How do you collaborate when time is the most valuable resource?
○ How do you keep stakeholders informed?
○ Do you have an efficient retrospective process?
○ How do you incorporate learnings into your response process?
How to avoid bad assumptions during incidents
Incident response so often gets mired in assumptions that need to be re-evaluated. This article uses an incident as a case study.
Lawrence Jones — incident.io
- As the title and the editor’s comment above indicate, it walks through an incident as a case study, covering the following points.
○ Trust, but verify!
○ How the incident started
○ Where the incident went wrong
○ 3 lessons we learned during this incident investigation
SRE vs. DevOps: What are the Differences?
This one lays out clear definitions of SRE and DevOps and compares and contrasts them.
Matthew Gurgel — Rootly
- It unpacks the complex relationship between DevOps and SRE and, as the title suggests, explains the differences between them and how to adopt both, covering the following points.
○ What are SRE and DevOps?
○ What are the differences between SRE and DevOps?
○ Why do you hear more about DevOps than SRE?
○ Conclusion: Using SRE and DevOps together
Merlion: A Machine Learning Library for Time Series
This week, Salesforce released Merlion, a Python library for time series machine learning and anomaly detection. Linked is an in-depth research paper on Merlion, explaining its theory of operation and experimental results.
Bhatnagar et al. — Salesforce
- As mentioned above, this is a research paper on “Merlion”, the Python library for time series intelligence released by Salesforce.
Outages
- Atlassian Statuspage.io
- Giant Pay
- Trello
Trello had major outages on two consecutive days.
KubeWeekly #279 October 1st, 2021
The Headlines
Editor’s pick of the highlights from the past week.
What to expect at KubeCon + CloudNativeCon North America
The New Stack
Priyanka Sharma and Jasmine James sit down with Joab Jackson of The New Stack to discuss this year’s schedule and agenda, how it will all compare to KubeCon+CloudNativeCon of years past and general cloud native trends.
- KubeCon + CloudNativeCon North America 2021 is finally coming this month, on October 11–15. As mentioned above, CNCF GM Priyanka Sharma and co-chair Jasmine James are interviewed about this year’s highlights, why next year’s KubeCon + CloudNativeCon Europe 2022 will be held in Valencia, Spain’s third-largest city, and more.
ICYMI: CNCF online programs this week
A weekly summary of CNCF online programs from this week.
Kanister — Application level data operations on Kubernetes
Michael Cade & Pavan Devaraj, Kasten by Veeam
- An approximately 61-minute session explaining the framework “Kanister” that supports application level data management in Kubernetes.
- For developers and ops teams interested in stateful applications in Kubernetes. It demonstrates protection operations on a live MongoDB cluster.
Trace-based testing with OpenTelemetry
Michael Haberman, Aspecto
- An approximately 52-minute session that introduces “Malabi”, a JavaScript framework based on OpenTelemetry that makes trace data easy to use for testing.
Shifting security left: simplifying security for K8s & OpenShift environments
Jody Hunt, CyberArk
- An approximately 51-minute session that explains best practices for protecting sensitive information in DevOps, GitOps, and CI/CD pipelines, including secrets and credentials, with centralized secret management, including self-service for developers.
- Careful explanations with easy-to-understand images and demonstrations will help you learn.
Redefining cloud native debugging
Noa Goldman, Rookout
- An approximately 18-minute session that discusses debugging challenges when migrating to the cloud and best practices and tools to make migrating to the cloud easier and more secure.
Kiran Mova, MayaData
- As the title suggests, an approximately 33-minute session explaining the changes in OpenEBS 3.0 and offering glimpses into what is coming in 4.0 and beyond.
The thing about your software supply chain…
Eylam Milner, Argon Security
- An approximately 33-minute session that covers various security risks and vulnerabilities within the software supply chain and recreates live how attackers executed a few of the recent supply chain breaches.
The Technical
Tutorials, tools, and more that take you on a deep dive into the code.
Giving your legacy applications an API facelift
David La Motta, Kong
- It shows how to create a very simple but powerful “Kong Gateway Lua plugin” and how to put the API in front of legacy applications that can only be accessed via the CLI.
Kiran Mova, OpenEBS
- It begins with an outline of OpenEBS and, under “A quick summary”, the features of the major versions so far, followed by “What’s new in OpenEBS 3.0?” and “What is Next after OpenEBS 3.0?”. Watching the webinar in the “ICYMI” section above is also recommended.
Flux Server-side reconciliation is coming
Daniel Holbach, Flux
- As the title suggests, it explains the background, precautions, countermeasures, future prospects, etc. of migrating to the new server-side reconciler with the release of Flux 0.18.
- The tl;dr is as follows.
○ tl;dr: Server-side reconciliation will make Flux more performant, improve overall observability and going forward will allow us to add new capabilities, like being able to preview local changes to manifests without pushing to upstream.
The Editorial
Articles, announcements, and more that give you a high-level overview of challenges and features.
Podman, with Daniel Walsh and Brent Baude
- An episode of the Kubernetes Podcast, hosted by Google employees. This time the host is Craig Box, with guest host Jimmy Moore.
- The guests are Daniel Walsh, the leader of Red Hat’s containers team, and Brent Baude, an architect at the company and the primary maintainer of Podman.
- The topics from the news of the week that interested me are as follows.
○ Mirantis Flow “reinvents the datacenter”
○ Deis Labs introduces Hippo
○ Accelerating new features in Docker Desktop
○ CNCF DevSecOps radar
CNCF
- The “Experiences” page for KubeCon + CloudNativeCon North America 2021 introduces “Battlesnake: KubeCon Cup 2021”. Information on the “Virtual Games Lounge” and other content is also available on the same page.
CNCF
- As the title suggests, KubeCon + CloudNativeCon + Open Source Summit China will be held virtually on December 9–10, 2021. There are 105 sessions, with many speakers scheduled from companies such as Alibaba, GitLab, Huawei, and Intel.
Automation is the future of cloud cost optimization
Laurent Gil, CAST.AI
- On the theme of automated cloud cost optimization, it explains in the following points how automation is already helping companies cut their cloud bills.
○ How to control cloud costs? 4 approaches
○ Manual vs. automated approach to optimization
○ Here’s an example of automated optimization
○ 4 reasons why manual cost optimization just doesn’t work in the cloud
○ Automated cost optimization — case study
○ Conclusion: Automation is becoming the new normal
Upcoming CNCF Online Programs
Live Webinar
- October 5 at 10am PT: Kubernetes 1.22 release presented by Savitha Raghunathan, James Laverack, & Jesse Butler, Kubernetes 1.22 Release Team — RSVP
Cloud Native Live
- October 6 at 9am PT: Next generation observability using open source monitoring presented by Scott Fulton, Opscruise — RSVP
Looking for more great curated content? Visit our Online Programs playlist on YouTube.
Learn more about CNCF Online Programs
How about those articles? Did any of them catch your interest?
To be honest, there is some content I cannot fully digest at this stage, so I’ll also use this aide-memoire and these links to catch up myself.
Bye now!!