SRE / DevOps / Kubernetes Weekly Collection #47 (Week 52)

  • In this blog post series, I collect items from the following three weekly mailing lists I subscribe to, and leave some comments and useful links as an aide-memoire.
  • I have already published the same content on my Japanese blog and am catching up in English in this series.
  • I hope it serves as a useful reference for people who follow this kind of information.

DEVOPS WEEKLY ISSUE #521 December 20th, 2020

News

  • The title is “Data Hub: Popular metadata architectures explained”.
  • It describes the three generations of metadata architecture the industry has produced for data discovery tools, and where many of the well-known options fall along that spectrum.
    ○ First-generation architecture: Monolith everything
    ○ Second-generation architecture: 3-tier app with a service API
    ○ Third-generation architecture: Event-sourced metadata
  • The progression between generations is mirrored by the evolution of LinkedIn’s DataHub architecture (LinkedIn published this article). The company has shared its latest best practices through open source:
    ○ It was first open sourced as WhereHows in 2016, then completely rewritten and re-shared with the open source community as DataHub in 2019.
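To make the third-generation idea concrete, here is a toy sketch, assuming a simplified model: producers publish metadata change events to an append-only log, and each consumer (an entity store, a search index) builds its own view by replaying that log. The event fields, the `urn` strings, and the in-memory list standing in for a stream such as Kafka are all hypothetical illustrations, not DataHub’s actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class MetadataChangeEvent:
    urn: str      # hypothetical entity identifier, e.g. "dataset:orders"
    aspect: str   # which facet of the metadata changed, e.g. "ownership"
    value: Any    # new value of that aspect


@dataclass
class MetadataLog:
    """Append-only log; a stand-in for a real event stream."""
    events: List[MetadataChangeEvent] = field(default_factory=list)

    def publish(self, event: MetadataChangeEvent) -> None:
        self.events.append(event)


def materialize(log: MetadataLog) -> Dict[str, Dict[str, Any]]:
    """Replay the log into a per-entity view; the latest event for an aspect wins."""
    view: Dict[str, Dict[str, Any]] = {}
    for event in log.events:
        view.setdefault(event.urn, {})[event.aspect] = event.value
    return view


def owner_index(log: MetadataLog) -> Dict[str, List[str]]:
    """A second, independent consumer: maps each owner to the entities it owns."""
    index: Dict[str, List[str]] = {}
    for urn, aspects in materialize(log).items():
        owner = aspects.get("ownership")
        if owner is not None:
            index.setdefault(owner, []).append(urn)
    return index
```

The point of the design is that consumers never query each other: any new view (search, lineage graph) can be rebuilt from the same log.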
  • The title is “Does AWS Serverless care about IT Operations? Their service naming says ‘no’ but their breadth and quality of choice says ‘yes’”.
  • It opens with the meaning of “serverless”: it does not literally eliminate servers, and the author puts it as follows.
    ○ “I believe quite the opposite, that serverless is the wave beyond VM configuration management in empowering operations-minded people to reclaim their focus, creativity, and business relevance.”
  • From the AWS re:Invent announcements, it picks out the serverless-related ones and explains them by theme.
    ○ “I wrote operations in this post about as many times as AWS uses the word innovation in their presentations, but I’m walking away from re:Invent with the impression that AWS is serious about both.”
  • The title is “Raft does not Guarantee Liveness in the face of Network Faults”.
  • It touches on Cloudflare’s post-mortem “A Byzantine failure in the real world”, which was covered on this blog before, and, based on a Twitter discussion about the Raft distributed consensus algorithm, the author works through the following three questions.
    ○ Does Raft guarantee liveness in the presence of network failures?
    ○ So, does Raft with PreVote guarantee liveness then?
    ○ Does Raft with PreVote and CheckQuorum guarantee liveness?
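As a thought aid for those three questions, here is a heavily simplified sketch of the two safeguards as they are commonly described (for example, the `PreVote` and `CheckQuorum` options in etcd’s raft library). It is not the article’s code and not a full Raft implementation; `NodeState` and its fields are hypothetical bookkeeping, and whether these checks are enough for liveness under partial network partitions is exactly what the article examines.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class NodeState:
    """Hypothetical per-node bookkeeping; field names are illustrative only."""
    term: int
    last_log_term: int
    last_log_index: int
    ms_since_leader_contact: int
    election_timeout_ms: int = 150


def log_is_up_to_date(cand_last_term: int, cand_last_index: int, node: NodeState) -> bool:
    """Raft's election restriction: compare last log terms first, then last log indexes."""
    if cand_last_term != node.last_log_term:
        return cand_last_term > node.last_log_term
    return cand_last_index >= node.last_log_index


def grant_pre_vote(node: NodeState, proposed_term: int,
                   cand_last_term: int, cand_last_index: int) -> bool:
    """PreVote: 'would I vote for you?' The answering node does not change its own term."""
    if node.ms_since_leader_contact < node.election_timeout_ms:
        return False  # a leader still looks healthy from here, so refuse to disrupt it
    if proposed_term < node.term:
        return False  # stale candidate
    return log_is_up_to_date(cand_last_term, cand_last_index, node)


def leader_should_step_down(leader_id: int, heard_from: Set[int], cluster_size: int) -> bool:
    """CheckQuorum: step down if fewer than a majority (counting the leader itself)
    were heard from during the last election timeout."""
    active = len(heard_from | {leader_id})
    majority = cluster_size // 2 + 1
    return active < majority


follower = NodeState(term=5, last_log_term=5, last_log_index=10, ms_since_leader_contact=20)
print(grant_pre_vote(follower, proposed_term=6, cand_last_term=5, cand_last_index=10))  # False
print(leader_should_step_down(1, {2}, cluster_size=5))  # True
```

PreVote stops a partitioned node from bumping the cluster’s term when it rejoins, and CheckQuorum stops a leader that can no longer reach a majority from lingering; the sketch only shows the local decision each node makes.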
  • The title is “On YOLOsec and FOMOsec”.
  • The author explains why both YOLO security (YOLOsec) and FOMO security (FOMOsec) are detrimental to infosec defenses, and how to spot them and keep them out of your organization’s security strategy.
  • The moment I saw the “33 minutes” reading time at the top of the article, I gave up on reading it all the way through. Some excerpts from the tl;dr and the Conclusion are below.
    ○ The tl;dr is that #yolosec and #fomosec are disconnected from the goals and needs of the business, forsaking pragmatism and prudence in favor of fanatical flavors of recklessness. YOLOsec reflects a security strategy driven by a “you only live once” mentality — one that emboldens people to ignore future concerns around security to achieve today’s gratification. FOMOsec reflects a security strategy driven by a fear of missing out — one that frightens people into misallocating resources towards what makes them feel better about their security efforts.
    ○ If security must shun both YOLOsec and FOMOsec, how should it look instead? To simultaneously alleviate a longing for belonging, envy, and myopia, infosec defenders must seek out and share the identity of “builder” with software engineers. Aligning infosec metrics to software delivery metrics facilitates the alignment of infosec work to software delivery work. Acting upon this alignment — not just paying lip service — engenders the opportunity for security teams to more tangibly connect the work they perform with value and meaning produced.
  • The title is “How to monitor multi-cloud Kubernetes with Prometheus and Grafana”.
  • I covered it in KubeWeekly #244 last week, so I will skip it here.
  • The title is “Forbidden lore: hacking DNS routing for k8s”.
  • They run multiple Harbor registries and describe how they wrestled with DNS so that container image pulls are routed to different registries depending on the use case.
  • The title is “10 Best Tools to Monitor SSL Certificate Expiry, Validity & Change”.
  • As the title suggests, it explains the following 10 tools for monitoring SSL certificate expiry, validity, and changes, with figures.
  1. Sematext Synthetics
  2. TrackSSL
  3. Pingdom
  4. SmartBear
  5. KeyChest
  6. Site24x7
  7. Juices
  8. SSL Certificate Expiration Alerts
  9. Certificate Expiry Monitor
  10. SSL Certification Expiration Checker
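The tools above are hosted services, but the core check they automate is small. Here is a minimal sketch using only the Python standard library, assuming a 30-day alert threshold (an arbitrary choice, not taken from the article): read the certificate’s `notAfter` field over a TLS connection and compare it with the current time.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left until a certificate's notAfter time, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # epoch seconds, UTC
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def needs_alert(not_after: str, threshold_days: float = 30, now: Optional[float] = None) -> bool:
    """True when the certificate expires within threshold_days (or has already expired)."""
    return days_until_expiry(not_after, now) < threshold_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS with certificate verification and return the peer's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            return cert["notAfter"]
```

For a live check, something like `needs_alert(fetch_not_after("example.com"))` would run on a schedule. Because `create_default_context()` verifies the chain, an already-expired or otherwise invalid certificate raises `ssl.SSLCertVerificationError` during the handshake, so a real monitor should treat that exception as an alert too.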
  • The title is “Building Kubernetes Clusters using Kubernetes”.
  • It describes how to build Kubernetes clusters from within Kubernetes itself, using Argo Events and Argo Workflows.
  • The article is based on SAP Concur’s setup on EKS, but the same concept can be applied to other cloud providers.
    ○ Note: SAP Concur uses AWS EKS, and a similar concept can be applied to Google’s GKE, Azure’s AKS, or any other cloud provider’s Kubernetes offering.

SRE Weekly Issue #249 December 20th, 2020

Articles

  • The concept of “generic mitigation” is explained using cute illustrations.
  • It introduces tips for being on call during the holiday season.
  • The team at Transposit knows the pain of being on call firsthand, so they came up with the following five tips to make holiday shifts as painless as possible.
    ○ Share the love (or spread the pain) when organizing on call shifts, and incentivize communal behavior.
    ○ Communicate early and often, with and without runbooks.
    ○ Plan around potential travel problems.
    ○ Let friendly allies help you manage the social side of the situation.
    ○ Pat yourself and your team on the back.
  • I covered it in DEVOPS WEEKLY ISSUE #521 above, so I will skip it.
  • When a company introduces incident management, a frequently asked question is “Why do we need this process?” The article argues that the simplest answer is to describe what failed incident management looks like:
    ○ Confusion about Process
    ○ Panic and Thrash
    ○ Lack of Awareness
    ○ Blame
    ○ Uncoordinated & Conflicting Response
    ○ Confusion over Ownership
    ○ Repeat Problems
  • It describes “Just Culture” as an industry term used to describe a value-based accountability model that considers the behaviors, systems, and expectations that make up an organization.
  • It argues that fostering a just culture requires a multifaceted approach to managing risk, and that it is important to take a holistic view when investigating the issues and risks inherent in running an organization. It covers:
    ○ Knowledge, systems, safeguards
    ○ Human performance
    ○ How we make mistakes
    ○ Safety and reporting culture
    ○ Systems and safeguards
    ○ Our experience
    ○ Standardization and bias reduction
    ○ Big data
    ○ Building trust
  • It introduces “full-service ownership” in an interview format. The questions are below.
    ○ Q: First things first, what exactly is a service?
    ○ Q: So what’s the big deal about full-service ownership? Why should IT and engineering leaders care? Paint me a picture.
    ○ Q: What is one of the biggest drivers for moving to a model of full-service ownership?
    ○ Q: Where does one even start?
  • An article by an angel investor introducing Jeli.io, an analysis platform specializing in software-related incidents.
  • An incident report for Heroku Connect, in which syncing with Salesforce was impacted for 25% of production connections.
  • The content is summarized by its title and the SRE Weekly editor’s comment, but since the article is based on a Twitter thread by former Uber engineer McLaren Stanley, its author highly recommends reading the original thread:
    ○ I highly recommend reading the original thread in full. My writing above is based solely on that thread, I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.
  • Because they were building their GraphQL API on a whole new stack, they wanted to see how it performed under production load compared with the existing REST API, so that the switch would not hurt the user experience. This is the story of how they designed and released “Shadow Requests”.
  • With Shadow Requests, a target page loads its data from the REST API as usual and is displayed; it then fetches the same data from GraphQL, times the call, and discards the result.
  • It also describes improvements they found in their Docker and Node environments, how GraphQL resolvers handle lists of entities, and CORS requests.
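The Shadow Request flow described above can be sketched as follows. This is a language-neutral illustration in Python, not the team’s code (their version ran in their own web stack); `fetch_rest`, `fetch_graphql`, and `render` are hypothetical callables, and a production version would fire the shadow call asynchronously so it never delays the page.

```python
import time
from typing import Any, Callable, Dict, List

PageData = Dict[str, Any]


def load_page_with_shadow(
    fetch_rest: Callable[[], PageData],
    fetch_graphql: Callable[[], PageData],
    render: Callable[[PageData], str],
    shadow_latencies_ms: List[float],
) -> str:
    """Serve the page from REST, then time an identical GraphQL fetch and discard its result."""
    # 1. The user is served from the existing REST API, exactly as before.
    page = render(fetch_rest())

    # 2. Only after rendering, issue the same query to GraphQL purely for measurement.
    start = time.perf_counter()
    try:
        fetch_graphql()  # result intentionally discarded
        shadow_latencies_ms.append((time.perf_counter() - start) * 1000)
    except Exception:
        # A failing shadow call must never change what the user sees.
        pass
    return page
```

The collected latencies can then be compared against REST timings for the same pages before any real traffic depends on GraphQL.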

Outages

KubeWeekly #245 ← No Updates

Yoshiki Fujiwara

・Cloud Solutions Architect - AWS@NetApp in Tokyo, Japan. #AWS Certified Solution Architect&DevOps Professional, #Kubernetes, ・Opinions are my own.