- In this blog post series, I collect items from the three weekly mailing lists I subscribe to (Devops Weekly, SRE Weekly, and KubeWeekly) and add some comments and useful links as an aide-memoire.
- I have already published the same content on my Japanese blog, and I am catching up in English in this series.
- I hope it serves as a reference for people browsing this kind of information.
Here are 5 posts still worth reading from Devops Weekly issues through the years.
But what is Devops? I know a number of people have signed up to this newsletter having only recently come across the term. It’s safe to say Devops means different things to different people at this early stage, but I’m going to start out by pointing everyone to James Turnbull’s WHAT DEVOPS MEANS TO ME
- The title is “What DevOps Means to Me.”
- It first explains why DevOps is needed, asking “Why should we merge or bring together the two realms?” In answer to the question in the title, the author closes the article with this phrase: “The best thing about the movement for me is that it is trying to foster behaviours and environments where people work together towards joint goals rather than at cross-purposes or at odds. That’s a world I’d much rather use my skills in.”
Interesting set of blog posts, describing protocols or patterns for Devops adoption. The first two talk about the advantages of starting small and fixing a real problem quickly and about configuration management and limiting manual changes.
- The titles are “Devops Protocols: Start Small” (linked above) and “Devops Protocol: No Manual Changes”.
- A blog series describing patterns for adopting DevOps. “Start Small” and “No Manual Changes” are each broken down into three points. They are easy to understand because the explanations let you picture the actual situation.
- The title is “5 Years of Metrics & Monitoring”.
- Slides uploaded to Speaker Deck. There is a video link, but the link is broken.
- Looking back on five years of metrics and monitoring, the speaker poses questions to himself and answers them. The chart and dashboard examples make it easy to see what is wrong.
- The titles are “Operability.io 2016 — Operations is crucial (Day 1)” (linked above) and “Operability.io 2016 — Automation, monitoring and communication (Day 2)”.
- These posts summarize the key points from the two days of sessions at Operability.io 2016.
- The title is “SHIPPING SOFTWARE SHOULD NOT BE SCARY”.
- It opens by stating that “Rolling out new software is the proximate cause for the overwhelming majority of incidents, at companies of all sizes” and that “most issues are still caused by humans and our pesky need for ‘improvements’”, then explains what can be done.
- The explanation is organized around the following five points.
- Get someone to own the deploy software
- Value the work
- Create a culture of software ownership
- LOOK at what you’ve done after you do it
- Be suspicious of new versions until they prove themselves
- The web page for “Puppet’s 2020 State of DevOps survey”. The survey appears to be conducted around this time every year.
- The title is “Design Docs at Google”.
- The article starts from the observation that “One of the key elements of Google’s software engineering culture is defining software design through design documents” and explains the practice as follows. I bookmarked it because I want to reread it later.
- These are relatively informal documents that the primary author or authors of a software system or application create before they embark on the coding project.
- The design doc documents the high level implementation strategy and key design decisions with emphasis on the trade-offs that were considered during those decisions.
- The title is “Introducing the State of Open Source Terraform Security Report”.
- An article introducing the “State of Open Source Terraform Security Report”.
- They analyze the public Terraform registry that contains thousands of open source modules used to provision cloud resources.
- It uses the OSS static analysis tool “Checkov” to scan the registry and measure module compliance across categories and cloud providers.
- I was shocked by the finding that “Nearly 1 in 2 modules used to build resources for AWS, Azure, and Google Cloud is misconfigured.”
- The title is “Using Azure Pipelines to validate my Sysmon configuration”.
- The author has been maintaining a Sysmon repository for two years with manual generation from the included script.
- A pull request by Ján Trenčanský that utilised GitHub Actions sparked the idea to take this a step further.
- It describes the solution that was easiest for him to run, its configuration, and tips for getting it working with Azure Pipelines.
- The title is “How we migrated application servers from Unicorn to Puma”.
- GitLab’s application server has been migrated from Unicorn to Puma. It has been running on Puma since GitLab 12.9, and Puma has been the default since 13.0.
- Both are application servers for Ruby on Rails; the big difference is that Unicorn uses a multi-process model with single-threaded workers, while Puma uses a multi-threaded model.
- They investigated and carried out the migration in order to address memory problems and scalability.
- The title is “Top 7 Challenges to Becoming Cloud Native”.
- It describes the 7 most common issues enterprise companies face in the cloud native journey with the following introduction.
- While great in theory, the problem with cloud native computing is that it isn’t always easy or straightforward to implement — especially if you’re an enterprise with long-standing, legacy applications.
- Not only must you adopt cloud native tools that suit your unique requirements, you must nurture their use with cultural shifts.
- Change should be implemented incrementally but holistically.
- The 7 most common issues are as follows.
- Slow release cycles and accelerated pace of change
- Outdated technologies
- Service provider lock-in and limited flexibility for growth
- Lack of technical expertise to handle data
- High operational and technology costs
- Cloud native concepts are difficult to communicate
- The GitHub page for the OSS Kubernetes tool “kube-iptables-tailer”, which gives you more visibility into network issues in your cluster.
- It detects traffic denied by iptables and surfaces corresponding information to the affected Pods via Kubernetes events.
Kconmon is a Kubernetes connectivity monitoring tool that runs frequent tests (tcp, udp and dns), and exposes Prometheus metrics that are enriched with the node name, and the locality information (such as zone), enabling you to correlate issues between availability zones or nodes.
- The GitHub page for the OSS Kubernetes tool “Kconmon”, which monitors connectivity between nodes.
- It runs frequent tests (tcp, udp, dns) and exposes Prometheus metrics enriched with node names and locality information (zones, etc.).
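As a note to myself, the kind of probe described above can be sketched in a few lines of Python. This is not Kconmon’s code: the function names, the metric name `probe_tcp_success`, and the label names `node` and `zone` are my own assumptions, chosen only to illustrate a TCP check whose result is rendered as a labelled sample in the Prometheus text exposition format.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 1.0) -> tuple[bool, float]:
    """Try to open a TCP connection; return (success, elapsed seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Refused, unreachable, or timed out: all count as a failed probe.
        ok = False
    return ok, time.monotonic() - start


def render_metric(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a line of Prometheus text exposition format.

    Label values are not escaped here, so this sketch assumes they contain
    no quotes, backslashes, or newlines.
    """
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"
```

A caller would run `probe_tcp` against each peer on a timer and pass the result to `render_metric` with the peer’s node name and zone, which is what makes it possible to compare failures across zones.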
SRE Weekly Issue #229 July 26th, 2020
More details have emerged about the Twitter break-in last week, leading some to utter the quote above. Here’s a take on how to see it as not being about “stupidity”.
- An article about the incident on July 15th (US time) in which Twitter accounts were hijacked one after another and abused for a bitcoin transfer scam.
- The comment “How could they be so stupid?” in the title was directed at Twitter’s engineers across the Internet after it was reported that the attacker obtained credentials from a message in Twitter’s internal Slack.
- The author writes: “I don’t personally know any engineers at Twitter, but I have confidence that they have excellent engineers over there, including excellent security folks. So, how do we explain this seemingly obvious security lapse?”
- He offers the following analysis and argues that you need to understand the system more deeply in order to eliminate the problems that motivated the workaround.
○ There are countless possibilities for why people employ workarounds. Maybe some system that’s required for doing it the “right” way is down for some reason, or maybe it simply takes too long or is too hard to do things the “right” way. Combine that with production pressures, and a workaround is born.
○ Some workarounds, like the Twitter example, are dangerous. But simply observing “they shouldn’t have done that” does nothing to address the problems in the system that motivated the workaround in the first place.
○ When you see a workaround, don’t ask “how could they be so stupid to do things the obviously wrong way?” Instead, ask “what are the properties of our system that contributed to the development of this workaround?”
The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have framework and a set of tools to deal with it.
Paul Hammond and Samantha Stoller — Slack
- A blog post by Slack engineers. It opens with “An entire ecosystem of monitoring and administrative tools exist for operating our databases, making sure they replicate, scale and are generally performant. Similarly, a number of tools accompany the databases’ query language from linters and beautifiers to query builders and object mappers. But after our application has written data, there is very little tooling to verify that the data is as expected and remains as such.” and then mainly explains the “Consistency Check Pattern.”
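The article’s own code is not reproduced here, but the general idea of verifying data after it has been written can be sketched as follows. The row shapes (`channels`, `memberships`), the `Violation` type, and the check itself are hypothetical examples of mine, not Slack’s framework.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Violation:
    """One record that failed a consistency check."""
    check: str
    record_id: int
    detail: str


def check_memberships_have_channels(
    channels: Iterable[dict], memberships: Iterable[dict]
) -> Iterator[Violation]:
    """Yield a violation for each membership that points at a missing channel."""
    channel_ids = {channel["id"] for channel in channels}
    for membership in memberships:
        if membership["channel_id"] not in channel_ids:
            yield Violation(
                check="membership_has_channel",
                record_id=membership["id"],
                detail=f"channel {membership['channel_id']} does not exist",
            )


# Hypothetical rows, as they might come back from two tables.
channels = [{"id": 1}, {"id": 2}]
memberships = [
    {"id": 10, "channel_id": 1},
    {"id": 11, "channel_id": 3},  # dangling reference
]

violations = list(check_memberships_have_channels(channels, memberships))
for violation in violations:
    print(violation)
```

Reporting violations as data rather than raising an error keeps the check read-only, so it can run routinely and leave the decision about repair to a separate step.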
I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.
Thai Wood — Learning from Incidents
- The article explains the “Obstacles to Learning from incidents” of its title through the following points. Like the editor of SRE Weekly, I found that “Distancing through differencing” stuck with me.
○ Distancing through differencing
○ Root cause
○ Only trying to learn from “bad” things
○ High pressure reporting requirements
○ Making sure this never happens again
○ Confusing writing, distribution, or meetings with learning
[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.
- Here are the points the author wanted to convey in the title and the first section.
○ You do not need SRE. Don’t get me wrong, you need (service/system) Reliability Engineering.
○ You still need to automate repetitive, typical tasks in operations.
○ You just don’t need to, and really should not do it the Google way.
○ You are not Google. Very few organizations are.
- The points the author wanted to convey in the second section “SRE for the Enterprise” are as follows.
○ You are not replacing your current Ops team, your sys admins with software Engineers. You need your ops team.
○ Renaming your DevOps teams as SRE is a no-no. DevOps is DevOps. SRE is SRE.
○ According to the article, dated February 20, 2020, readers who completed the survey and sent a screenshot at the time were apparently offered a 20-minute “free consultation between yourself and the team regarding SRE”. The survey is now closed.
Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).
- The author was asked “Any advice/reading on how to establish a team’s critical path?” and wrote down her thoughts.
- Her answer is “What makes you money?”, which leads to the following actions.
○ The idea here is to draw up a list of the things that are absolutely worth waking someone up to fix immediately, night or day, rain or shine.
○ That list should be as compact and well-defined as possible.
○ This allows you to be explicit about the fact that anything else can wait til morning, or some other less-demanding service level agreement.
- “What makes us money?” is a stand-in for the actual questions below.
○ “What actions allow us to survive as a business?”
○ “What do our customers care the absolute most about?”
○ “What makes us us?”
This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at https://srefromhome.com/. I love how they refer to the pandemic as a months-long incident, and point out that if you’re always in an incident then you’re never in an incident.
Julie Gunderson and Mandi Walls — Page it to the Limit
- J. Paul Reed (Senior Applied Resilience Engineer at Netflix) is the guest on the podcast “Page It to the Limit.”
- The episode is titled “Postmortems and More With J. Paul Reed.” It was nice to hear about Netflix’s CORE (Critical Operations and Reliability Engineering) team, which came up in last week’s post through the article “Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix.”
I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.
Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn
- Part 2 of the “Rebuilding Messaging” series describes a major migration of existing data to a new database, commonly referred to as bootstrapping data from a legacy system into a new system.
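To remind myself what “dual-write” and “bootstrapping” mean in practice, here is a minimal sketch using two in-memory dictionaries as stand-ins for the legacy and new stores. The class and function names are hypothetical, and this is a generic illustration of the pattern, not LinkedIn’s implementation.

```python
class DualWriter:
    """Send every write to both stores while the migration is in progress."""

    def __init__(self, legacy: dict, new: dict) -> None:
        self.legacy = legacy
        self.new = new

    def write(self, key: str, value: str) -> None:
        self.legacy[key] = value  # legacy store stays the source of truth
        self.new[key] = value     # new store receives the same write


def bootstrap(legacy: dict, new: dict) -> int:
    """Copy records that predate dual writes; return how many were copied.

    Keys already present in the new store are skipped so that values
    written through DualWriter are not overwritten by the bulk copy.
    """
    copied = 0
    for key, value in legacy.items():
        if key not in new:
            new[key] = value
            copied += 1
    return copied


def find_mismatches(legacy: dict, new: dict) -> list:
    """Return keys whose value is missing or different in the new store."""
    return sorted(k for k in legacy if k not in new or new[k] != legacy[k])


legacy_store = {"msg-1": "hello"}  # data written before the migration began
new_store = {}

writer = DualWriter(legacy_store, new_store)
writer.write("msg-2", "hi there")          # live traffic during the migration
copied = bootstrap(legacy_store, new_store)  # backfill the older record
mismatches = find_mismatches(legacy_store, new_store)
```

Comparing the two stores after the backfill is the step that tells you whether it is safe to start reading from the new system.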
○ GGPoker had issues during a World Series of Poker (WSOP) event.
- Fastly (control plane)
○ Full disclosure: Fastly is my employer.
○ Squarespace had a rough week, with the following incidents:
- July 21
- July 22 (includes a detailed follow-up analysis)
- July 24
- July 24
- Google Cloud Platform
○ Several GCP components were impacted, including Layer 7 Load Balancers.
KubeWeekly #227 August 1st
Editor’s pick of the highlights from the past week.
David’s work with Borg, Omega and now Kubernetes over the past 13 years, puts him, in the words of Tim Hockin, “among the world’s experts in scheduling systems”. On this week’s episode of the Kubernetes Podcast from Google, David talks about his experience with scheduling systems, how learnings from Omega became Kubernetes features, and what he thinks the biggest challenges facing the cluster management space are today.
- Kubernetes Podcast by Google employees. The current co-hosts are Craig Box and Adam Glick.
- They welcomed David Oppenheimer (software engineer) from Google as a guest.
- The topics of interest in News of the week are:
○ Google Traffic Director supports proxyless gRPC
○ Changes to Aqua Wave and Aqua Enterprise
○ Emissary, from GitHub
○ VS Code Docker extension can now run containers in Azure Container Instances
○ Debugging Incidents in Google’s Distributed Systems by Beth Cooper and Charisma Chan
This week, CNCF published a project journey report for Jaeger. This is the sixth such report compiled for CNCF graduated projects. The report assesses the state of the Jaeger project and how CNCF has impacted its progress and growth.
Jaeger is an open source, end-to-end distributed tracing platform built to help companies of all sizes monitor and troubleshoot their cloud native architectures. Contributors to Jaeger include many of the world’s largest tech companies, such as Uber, Red Hat, Ryanair, IBM, and Ticketmaster as well as fast-growing mid-size companies like Cloudbees. Read the full report.
- CNCF released the Jaeger Project Journey Report; these reports are compiled for CNCF graduated projects.
- It attempts to objectively assess the current state of Jaeger and how CNCF has influenced its development and growth.
- Unsurprisingly, Uber, the original developer and donor, is the top contributing company. Red Hat’s share is growing, but so is the “Others” share, and the contributors come from a diverse set of countries and companies. It looks like healthy growth.
Register by August 3, 2020 at 23:59 PDT and you will be entered into a drawing* to win one of the below gift boxes.
Keep Cloud Native Delighted Swag Box (500 available) which includes:
KubeCon + CloudNativeCon Europe t-shirt
Keep Cloud Native Connected patch
Project logo face mask
Diamond sponsor surprise
Kubernetes fidget spinner
Grand Prize! Keep Cloud Native Delighted Deluxe Swag Box (10 available) which includes:
All the above items PLUS
$150 gift card to the CNCF online store
*The drawing is open to both pass types, Full Event and Keynote + Expo Hall only, whether already registered or registering between now and August 3. Winners must be registered by the August 3 deadline AND attend the conference. Drawing will be held and winners notified by email on August 24, 2020. Limit of (1) box per participant.
Not only do you have the opportunity to win swag but by registering now, time is blocked on your calendar so you won’t miss a thing. It’s a win-win!
- Information on the KubeCon + CloudNativeCon Europe drawing. If you register by August 3, 2020 at 23:59 PDT, you are automatically entered. The drawing is for 500 swag boxes, and the winners will be notified by email on August 24. It seems that both ticket types are eligible: the $75 Full Event pass and the free Keynote + Expo Hall only pass.
ICYMI: CNCF Webinars
You can view all CNCF recorded and upcoming webinars here.
Flavio Castelli, Distinguished Engineer @SUSE
- It explains the pros and cons of two approaches to running Kubernetes, “one large cluster” and “lots of small clusters”, and which solutions can alleviate some of their drawbacks.
- The aim is to understand the trade-offs of both approaches, and to help evaluate which path to follow based on your requirements.
Eran Leib, Founder & VP Product Management @Apolicy and Spenser Paul, Director of Sales, North America @DoiT International
- Kubernetes policy is explained focusing on the following topics.
○ What type of Policies exist?
○ How do we define and enforce Policies?
○ What best practices are available?
Brandon Phillips, Solutions Architect @Codefresh
- It uses Argo and Codefresh as examples of how to use GitOps to achieve reliable, fast releases repeatably.
Sebastien Goasguen, CTO @TriggerMesh
- The explanation focuses on the following points. The commentary also mentions the products of TriggerMesh.
○ Discuss a set of serverless use-cases, from LEGO to HSBC, and highlight common patterns.
○ Show how these patterns can be reproduced with technologies like Knative and the CloudEvents specification.
○ Finish by weighing the pros and cons of going serverless directly in the cloud vs. running some of the backing infrastructure yourself.
Saad Malik, CTO & Co-Founder @Spectro Cloud, and Jun Zhou, Chief Architect @Spectro Cloud
- It describes cluster APIs and common Kubernetes lifecycle management options.
Tutorials, tools, and more that take you on a deep dive into the code.
Adam Gluck, Uber
- It introduces a general approach to microservice architecture called “DOMA (Domain-Oriented Microservice Architecture)” which Uber is working on.
- The following points are highlighted early in the text.
Our goal with DOMA is to provide a way forward for organizations that want to reduce overall system complexity while maintaining the flexibility associated with microservice architectures.
- I would like to deepen my understanding and mental picture of architecture, so I bookmarked this article as well.
Pawan Shankar, Sysdig
- Stating that “In this blog, we will cover many image scanning best practices and tips that will help you adopt an effective container image scanning strategy,” the author explains the 12 best practices below.
- Bake image scanning into your CI/CD pipelines
- Adopt inline scanning to keep control of your privacy
- Perform image scanning at registries
- Leverage Kubernetes admission controllers
- Pin your image versions
- Scan for OS vulnerabilities
- Make use of distroless images
- Scan for vulnerabilities in third-party libraries
- Optimize layer ordering
- Scan for misconfigurations in your Dockerfile
- Flag vulnerabilities quickly across Kubernetes deployments
- Choose a SaaS-based scanning solution
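Of the practices above, “Pin your image versions” is easy to check mechanically, so here is a small sketch for my own reference. The function name and the rule it applies (a digest, or an explicit tag other than `latest`, counts as pinned) are my assumptions rather than the article’s definition. Tags are mutable, so a digest remains the stricter form of pinning.

```python
def is_pinned(image: str) -> bool:
    """Return True if an image reference names a digest or a non-latest tag."""
    if "@" in image:
        # e.g. nginx@sha256:<digest> identifies exact image content
        return True
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000.
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return False  # no tag given, so the runtime falls back to "latest"
    tag = last_component.split(":", 1)[1]
    return tag not in ("", "latest")


# Hypothetical image references, for illustration only.
images = ["nginx", "nginx:latest", "nginx:1.19.1", "localhost:5000/nginx"]
unpinned = [image for image in images if not is_pinned(image)]
```

A check like this could run in a CI step over the image references found in manifests, failing the build when `unpinned` is not empty.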
Szabolcs Berecz, Banzai Cloud
- Following on from a recent post on the blog titled “Certificate management on Kubernetes”, it focuses on the differences that the Istio service mesh makes.
- The hands-on explanations and diagrams are clear and thorough.
Vasumathy Seenuvasan and Ravi Bukka, eBay
- A story about using Kubernetes Secrets to manage eBay’s Jenkins credentials.
- The company is containerizing Jenkins to provide a continuous build infrastructure for Kubernetes clusters, enhancing the e-commerce marketplace experience.
Gareth Rushgrove, Snyk
- I’ll skip it because it’s an article I covered last week.
Mario Vazquez, Ryan Cook, Chris Short, Red Hat
- A nearly 100-minute Twitch video featuring Tekton CI and Argo CD, presented by Red Hat members.
- It details the new seccomp notification feature they have developed in both the kernel and user space. It is explained in great detail, and I haven’t finished reading it yet.
Brian Brazil, Sysdig
- It mainly covers the following two points.
- Analyzing your service to find the most useful places to add metrics, how to add that instrumentation, getting it exposed and scraped.
- Basic query to publish and scrape it, then use these metrics in a graph
- It introduces how to easily manage fine-grained access control as a custom policy using Kyverno.
Articles, announcements, and more that give you a high-level overview of challenges and features.
- A story about CPU being affected by a “ghost” living in a Kubernetes cluster, and how the issue was investigated and the cause identified.
- The following two measures were implemented as countermeasures.
Catherine Paganini, Kublr
- It breaks down the CNCF Cloud Native Landscape map and provides an overview of the entire landscape, its layers, columns, and categories.
- This is the first article in the series. Subsequent articles will expand each layer and column to explain in detail what each category is, the problem it solves and how.
- Personally, I have only ever looked at this landscape casually, so I will take this opportunity to review it in a structured way.
Guinevere Saenger, GitHub
- In this YouTube video the speaker talks against a dark background, which, together with the title, felt a little creepy to me.
- How can we go under the CLA (Contributor License Agreement) check?
Mina Benothman, Heavybit Industries
- An interview with Priyanka Sharma, the new GM of CNCF.
- It covers the background that shaped her career, marketing, OSS, and the people who paved the way for her.
Electro Monkeys podcast
- A French-language podcast that I cover on this blog from time to time. This time, the guest is Denis Jannot (Sales Engineer at D2iQ).
- He explains various framework issues and why D2iQ chose to create KUDO.
Emily Omier, Nirmata
- It demonstrates some specific ways to streamline Day 2 operations by automating configuration using an intelligent policy engine.
- Kyverno is mentioned as a specific example.
Deepthi Sigireddi, Vitess maintainer
- Vitess 7 release article. The original article was by Deepthi Sigireddi, maintainer at Vitess.io.
- The four main themes are as follows.
- Improved SQL Support
Upcoming CNCF webinars
You can check recorded and upcoming webinars here. The following were listed as upcoming CNCF webinars at the time of writing.
Project Webinar: How We Doubled System Read Throughput with Only 26 Lines of Code
July 31, 2020 10:00 AM Pacific Time
Member Webinar: Comparing eBPF and Istio/Envoy for Monitoring Microservice Interactions
Roko Kruze, Solutions Engineer @Flowmill
Mike Cohen, Co-Founder and COO @Flowmill
Aug 4, 2020 10:00 AM Pacific Time
Member Webinar: Debugging your debugging tools; What to do when your service mesh goes down in production?
Neeraj Poddar, Co-founder and Chief Architect @Aspen Mesh
Aug 5, 2020 7:00 AM Pacific Time
Member Webinar: Making Data Work for Developers with Kubernetes & Cassandra
Chris Splinter, Sr. Product Manager — Developer Solutions @DataStax
Patrick McFadin, VP of Developer Relations @DataStax
Aug 5, 2020 1:00 PM Pacific Time
Member Webinar: Maximizing M3 — Pushing performance boundaries at scale in a cloud-native distributed metrics engine
Ryan Allen, Senior Software Engineer @Chronosphere
Aug 6, 2020 10:00 AM Pacific Time
Ambassador Webinar: GitOps, DSL and App Model — Getting Started Building Developer Centric Kubernetes
Lei Zhang, Staff Engineer @Alibaba
Aug 7, 2020 10:00 AM Pacific Time
Member Webinar: Hardware for Kubernetes, Peeling Back the Layers
Erik Reidel, SVP Compute & Storage Solutions @ITRenew
Aug 11, 2020 10:00 AM Pacific Time
Member Webinar: The Open-Source Observability Playbook
Hen Peretz, Head of Solutions Engineering @Epsagon
Aug 12, 2020 7:00 AM Pacific Time
Member Webinar: Migrating Real-Time Communication Applications to Kubernetes at Scale: Learnings from 8×8’s Experience
Michael Laws, Sr. Site Reliability Engineer/DevOps at 8×8
Pankaj Gupta, Sr. Director at Citrix
Aug 12, 2020 1:00 PM Pacific Time
Member Webinar: MLOps automation with Git Based CI/CD for ML
Yaron Haviv, Co-Founder and CTO, Iguazio
Aug 26, 2020 1:00 PM Pacific Time
Project Webinar: Kubernetes 1.19
Kubernetes release team
Aug 28, 2020 10:00 AM Pacific Time
Member Webinar: Getting started with container runtime security using Falco
Loris Degioanni, CTO and Founder @Sysdig
Sept 2, 2020 1:00 PM Pacific Time
What did you think of these articles? Did any of them catch your interest?
To be honest, there is some content I cannot fully digest at this stage, so I will use this aide-memoire and these links to catch up myself too.