SRE & Platform Engineering Insights

Practical advice on SRE, Platform Engineering, SLOs, and making reliability a repeatable habit.

209 Articles

Latest Articles

Incident Response and Learning: Turning Failures into Growth

Master the art of incident response and conduct blameless post-mortems for continuous improvement.

The SRE Operating Model: How to Organize for Reliability

Explore different SRE operating models, from embedded SREs to centralized platform teams.

Reducing Toil in SRE: Automation and Efficiency

Learn how to identify, measure, and eliminate manual, repetitive work to focus on high-value engineering tasks.

Observability for SRE: Beyond Simple Monitoring

Discover why observability is critical for SRE and the difference between monitoring and observability.

Error Budgets Explained: Balancing Innovation and Reliability

Understand the concept of Error Budgets in SRE and how to use them to manage feature velocity and system stability.

What is an SLO? Service Level Objectives Explained

Learn what a Service Level Objective (SLO) is, why it matters for SRE, and how to define meaningful reliability targets.

Internal Developer Platform: What It Is and Why Engineering Teams Need One

Discover what an Internal Developer Platform (IDP) is, how it differs from a developer portal, and why it's essential for platform engineering and reducing cognitive load.

The Trouble With Chasing 100%

There is always someone who wants 100% reliability, 100% uptime, 100% certainty, 100% confidence, and ideally by next quarter without reducing feature velocity.

The SLA That Sales Invented

There is a special kind of optimism that appears in technology companies right before engineering gets invited to a “quick alignment meeting.”

Reliability Is a Feature, Even If Nobody Put It in the Roadmap

Somewhere in every organization, there is a roadmap bursting with ambition. It has glossy feature names, strategic themes, and enough arrows pointing upward to...

Announcing MeloSlo Early Beta

My SLO strategy used to be “hope, dashboards, and a strong coffee.” Apparently that is not an official framework. Announcing MeloSlo Early Beta: The SLO...

SLO Bleed

Our runbook says reliability is a feature, but somehow the dashboard keeps interpreting that as “creativity is a feature” too.

Fail-Soft

Why do SREs love “degraded mode”? Because “everything is on fire, but technically still serving traffic” is somehow considered progress.

Bounded Staleness

Today we’re talking about bounded staleness. Yes, that’s right. The quiet, unassuming hero of distributed databases.

Limited and Fragile Context Handling

Why Bigger Context Windows Still Don’t Save Us From Ourselves For a while, the AI industry treated larger context windows like cloud teams once treated bigger...

Sensitive Data Leakage in LLM Systems

The New Way to Break Prod Without Touching Prod There was a time when “data leakage” usually meant a bad S3 bucket policy, a stray debug log, or someone...

Prompt Injection

How to Spot It Before It Hijacks Your LLM, and How to Prevent It Without Turning Your AI Into a Brick Prompt injection has become one of those wonderfully...

Hallucinations and Ungrounded Answers

Why LLMs Make Things Up, and How to Stop Letting Them Break Prod If you have spent more than ten minutes with a large language model in a real production...

Human-Sustainability SLOs

Human-Sustainability SLOs: Reliability Targets for the People Who Keep Reliability Real We’ve spent years getting serious about Service Level Objectives.

Reliability for AI is now core SRE

If your on-call has recently involved debugging “why our tokens-per-minute hit a brick wall at lunch” rather than “why service X returned 500s,” welcome to a...

“Seatbelts and Speedometers: Why SLOs Aren’t Error Budgets”

If you’ve ever argued in a retro about whether a 502 is “really” an error if the user just refreshes, this one’s for you.

The Swiss Cheese Model for SREs

The Swiss Cheese Model for SREs: why every slice matters (and how to stop the holes from lining up) The Swiss Cheese Model is a safety classic from James...

Apple as the “budget” AI cluster

The late-2025 plot twist: Apple as the “budget” AI cluster By the end of 2025, two things were simultaneously true: NVIDIA still ruled large-scale training,...

Is ITIL Still Relevant for SREs?

Is ITIL Still Relevant for SREs? A Battle-Tested Yes — With Guardrails If you’ve ever sat through a 90-minute Change Advisory Board that approved a one-line...

Process-Heavy Rollouts vs. Automated Guardrails

Process-Heavy Rollouts vs. Automated Guardrails: Stop Choosing, Start Combining If you’ve ever shipped on a Friday “because the CAB finally approved it,” you...

SREs never sleep.

This is technically false. We just sleep in 15-minute increments between alerts. Our circadian rhythm is aligned to incident frequency, not daylight.

You can’t measure reliability — it’s just uptime.

Myth : “You can’t measure reliability — it’s just uptime.” Oh, the dream of simplicity. If reliability were just uptime, Windows 95 plug-n-play would be the...

SREs Aren’t Allergic to Meetings

SREs Aren’t Allergic to Meetings — We’re Allergic to Meetings That Don’t Earn Their Keep Somewhere along the way, “SREs are allergic to meetings” became a...

The 15 Personalities of SRE

The 15 Personalities of SRE (And Why Your Error Budget Thinks They’re Hilarious) The cast you already know (and secretly are) Every SRE org is an ensemble...

25 years of prices versus freelance day rates

The uncomfortable math: 25 years of prices versus freelance day rates Let’s rip off the Band-Aid with numbers. If you bought a basket of Dutch goods and...

SRE Is Not About Kubernetes

SRE Is Not About Kubernetes — It’s Culture On Call The line “SRE is about technology, not culture” sounds tidy until you meet reality at 03:17 on a Sunday when...

SRE is just Ops with a cooler name

“SRE is just Ops with a cooler name” — Myth-busting the Ferrari vs Fancy Red Bicycle Why the myth persists You’ve heard it before: “SRE is just operations...

SLI, SLO & SLA Setup, How?

If you’re setting up SLOs, SLAs, and SLIs for the first time—or rebooting them for systems that have been running happily-chaotically in prod—welcome to the...

Working in the Noise: How to Thrive on LinkedIn Without Getting Spammed to Death

If my LinkedIn inbox were a monitoring dashboard, it would be paging me for a “critical: unsolicited pitch storm” every five minutes—followed by an incident...

Is Cloud Vendor Lock-In a Good Thing or a Bad Thing?

Few phrases trigger more eye-rolls in engineering than “vendor lock-in.” It’s the great bogeyman of platform decisions, invoked whenever someone suggests...

The Art of Paving Roads Without Building Cages

Golden Paths vs. Developer Autonomy: The Art of Paving Roads Without Building Cages “According to our incident runbook, Step 1 is panic; Step 2 is Google; Step...

Observability: OpenTelemetry-First vs. Vendor Agent-First

Observability: OpenTelemetry-First vs. Vendor Agent-First — What SREs Should Measure Before Picking a Side Why this debate won’t die (and why SREs should...

Privacy-first observability: PII in telemetry, GDPR/data-minimization, and redaction at the pipeline

Why are we still leaking secrets into the void? Every SRE has had that 3 a.m. moment: tailing logs during an incident and suddenly spotting a customer email, a...

Headless, Frontend-First Observability vs. Backend-First

Headless, Frontend-First Observability vs. Backend-First: Why Starting at the User Changes the Whole Debugging Game If you’ve ever followed a red error dot...

The map is not the territory

When teams go serverless, reality quickly replaces slides. You wire up Step Functions, sprinkle in a dozen Lambdas, toss in API Gateway, SQS, SNS, and...

The real question behind “Who owns observability?”

We ask “SRE, platform, or product?” as if there’s a single, eternal answer. There isn’t. Observability is a capability, not a team.

Trunk-Based CD vs. Gated Releases

Trunk-Based CD vs. Gated Releases: Why This Debate Refuses to Die If you’ve been anywhere near a deployment pipeline lately, you’ve heard the dueling...

Prometheus Native Histograms & Quantiles

Accuracy vs. Cost vs. Complexity (DDSketch / HDR / NH)—When to Migrate and How to Sell It Upstairs If you’ve ever tried to explain p95 to an executive at 3 a.m.

eBPF-first telemetry vs. agents/sidecars (and what “ambient” meshes mean for observability)

If that felt a little too real, welcome. Today we’re unpacking one of the spiciest debates in modern observability: go eBPF-first, stick with agents and...

Multi-cloud vs. Single-cloud-Multi-Region

Multi-cloud vs. Single-cloud-Multi-Region: A Decision Framework (with Failure Modes, Sovereignty headaches, and Cost gremlins) There’s a reason “high...

SLOs That Actually Matter: Per-Service vs. User-Journey/RUM-Driven SLOs

Why this argument won’t die (and why it matters) If you’ve been around Site Reliability Engineering long enough, you’ve seen the SLO pendulum swing.

“OpenTelemetry everywhere” vs. vendor agents: is auto-instrumentation mature enough for prod at scale?

The elevator pitch we all wish were true Everyone wants the same happy ending: flip a switch, auto-instrument everything with OpenTelemetry, send it to any...

The observability cost war (and the hidden bill it sends to your MTTR)

“Our incident runbook says: Step 1 — panic. Step 2 — Google. Step 3 — realize your logs were ‘cost-optimized’ last quarter.

Reducing Toil, Spending Error Budgets, and Keeping Your Sanity

“My on-call strategy is simple: automate everything I do twice, and never admit to the third time.” Why toil feels inevitable—and why SREs refuse to accept it...

Enhanced Observability for SREs

The Four Golden Signals (latency, traffic, errors, and saturation) are a great starting point, but modern SRE work needs explainability, not just dashboards.

Enhanced Observability for SREs: From Golden Signals to Real Insight

“Monitoring told me everything was green—right up until the users started tweeting in all caps.” Why “enhanced” observability—why now? If you’re running modern...

Why 2025 Feels Like the Year the Pipes Finally Standardized

OpenTelemetry Everywhere: Why 2025 Feels Like the Year the Pipes Finally Standardized If you’ve worked on-call this year, you’ve probably noticed the same...

Your On-Call Copilot That Doesn’t Need Coffee

AI/AIOps for Incident Management in 2025: Your On-Call Copilot That Doesn’t Need Coffee “According to our incident runbook, Step 1 is panic. Step 2 is Google.

AI Veganism: Ethical Imperative or Symbolic Gesture?

The surprising rise of “AI veganism” A phrase that sounded tongue-in-cheek a year ago is suddenly everywhere: AI veganism, the deliberate choice to abstain...

New Disaster Recovery Setups You Can Actually Ship

DR is changing (again) Classic DR (backup/restore, pilot-light, warm standby) isn’t dead—but the way we set it up is changing fast.

Web Scraping: Protecting Rights or Hindering Innovation?

Why this fight matters to SREs and builders If you run production websites or platforms, you’re probably stuck between two loud forces.

Cybersecurity: Rising Fears or Earned Preparedness?

Why this debate matters now If you work anywhere near reliability or operations, you’ve probably felt the cognitive whiplash.

Why Green IT Matters Now More Than Ever

Dear LinkedIn colleagues and sustainability champions, Today’s tech landscape is at a crossroads. On one hand, our digital world enables innovation and...

Hyperautomation: Full Workflow Efficiency or Autonomous Risk?

The pitch for hyperautomation and AI agents—through an SRE lens If you’ve spent any time in SRE or DevOps, you know the gravitational pull of automation.

Low-Code, No-Code: Democratising Development or Lowering the Bar?

Walk the halls of any large enterprise right now and you’ll hear the same chorus from IT and the business: we need apps faster.

Is Data Really the New Product, or Just Another Asset?

When I hear companies call data “the new oil,” I can’t help but wonder: is data truly the product, or just another corporate asset waiting to expire or clutter...

Green IT: The Rise, Reality, and What’s at Stake

Green IT is evolving fast. Recent research shows that the Green Tech sector is booming: from an estimated $25.5 billion in 2025, it’s projected to reach nearly...

This Week in Reliability: AI Agents for IR, Safer Platforms & Smarter DB Observability

Why it matters: The past few days brought practical updates SRE/DevOps teams can use now — from AI agents that auto-recover workloads, to tighter platform...

Quantum Computing: Imminent Cryptographic Crisis or Overhyped Future?

A New Digital Dawn—or a Screenwriter’s Plot? Let’s set the scene. Imagine waking up to headlines warning quantum computers will dismantle the internet’s...

Vibe Coding: Creative Synergy or Diluting Technical Rigor?

When Andrej Karpathy coined “vibe coding” in early 2025, he ushered in a provocative new chapter for software development—one where natural language and...

More Tools, More Problems? The Cybersecurity Integration Debate.

If there’s a paradox in modern cybersecurity, it’s this: we’ve layered our defenses so much that they’re tangling us up.

Cloud Repatriation: Strategic Move or Step Backward?

Cloud Repatriation: Strategic Move or Step Backward? When cloud was the shining path to infrastructure nirvana—scalable, flexible, and cost-efficient—few...

The Hidden Politics of Incident Management

Incidents are supposed to be technical. A service fails. An alert fires. Engineers swarm. The issue gets mitigated. A postmortem is written.

Capacity Planning – Engineering or Astrology?

Capacity Planning – Engineering or Astrology? Capacity planning: the science—or is it art?—of figuring out how much infrastructure you’ll need to support your...

The Use of LLMs in Operational IT Work

A LinkedIn blog post on the use of LLMs in operational IT work, structured with a conversational tone, real-world...

Tooling vs. Culture – What Really Drives Reliability?

Ask any Site Reliability Engineer what makes a team successful, and you’ll likely get two answers: good tools and good culture.

Is Chaos Engineering Worth the Risk?

At first glance, chaos engineering sounds counterintuitive—even reckless. Intentionally break your own systems? Inject failure on purpose? Simulate outages...

Why Your DevOps Isn't Reliable

Lecture: Human Factors in DevOps Reliability. Takeaways from the talk I gave at SREday on 27 June. The SREday video can be found at: https://www.

Burnout in SRE – Is It Inevitable?

You wake up tired. Not because you were paged, but because you might be. Every Slack ping feels like a warning. Deploys bring dread.

Platform Engineering: Evolution or Overcorrection?

Platform engineering is the new buzzword echoing across the halls of DevOps, SRE, and cloud-native communities. It’s the latest answer to complexity, scale,...

Are Incident Reviews Just Blame in Disguise?

It’s the day after an outage. The system is back online. The alerts have stopped. Customers are recovering. And now, it’s time for the incident review.

The Myth of 100% Reliability

“Five nines.” It’s the gold standard. 99.999% uptime. Just over 5 minutes of downtime per year. It sounds impressive, and it is.

Automation Gone Too Far?

Automation is the holy grail of Site Reliability Engineering. It’s what separates resilient, scalable systems from fragile, human-dependent ones.

The Dark Side of Infrastructure as Code: When IaC Becomes a Liability

Introduction Infrastructure as Code (IaC) has revolutionized the way we manage and provision infrastructure. However, as with any technology, it has its...

On-Call Compensation: Fair or Flawed?

It’s Saturday night. You’re out with friends, half-listening to a conversation when your phone buzzes. PagerDuty. CPU utilization spiked.

The Delicate Balance Between SLOs and Innovation

As software development teams strive for excellence, they often find themselves torn between two competing priorities: reliability and innovation.

The Great Reliability Debate: Devs vs. SREs

In the world of software development, a longstanding question has been: who owns reliability? Is it the developers who build the code, or the Site Reliability...

The Myth of Toil: Rethinking the Way We Approach Work

In the world of Site Reliability Engineering (SRE), there's a term that's often thrown around: "toil." It's defined as manual, repetitive work that's...

SRE Metrics Are Misleading?

Metrics are the lifeblood of Site Reliability Engineering. Uptime, latency, throughput, error rate—these numbers define how we measure system health, team...

Incident Commanders Are Too Rigid

Incident Commanders: How to Lead Without Becoming a Bureaucratic Robot I’ll never forget the first major incident I had to run point on.

Error Budgets Aren’t Dead

Error Budgets Aren’t Dead—They Just Grew Up There was a time when error budgets were the toast of the SRE world. People talked about them with a kind of...

Is AI Replacing the SRE?

Is AI Replacing the SRE? Or Just Giving Us Better Tools? First, it was smarter alerting. Then came anomaly detection.

SRE vs. Platform Engineering

SRE vs. Platform Engineering: Different Missions, Shared DNA Ask a group of engineers to explain the difference between Site Reliability Engineering and...

SRE for Startups vs. Enterprises

SRE at Two Speeds: Why Startups and Enterprises Do Reliability Differently You can spot the difference a mile away.

Blameless Postmortems

The incident was rough. An early-morning deploy introduced a memory leak that spiraled into a full-blown outage by lunch. Customers were impacted.

Tooling Overload

It starts with good intentions. You want to monitor your system, so you add Prometheus. Then you want pretty dashboards, so you add Grafana.

Too Much Observability?

The dashboards are glowing. The graphs are dancing. Alerts are flying across Slack channels. You have Grafana, Prometheus, Datadog, OpenTelemetry, Splunk, New...

Burnout and 24/7 On-Call

It’s 3:47 AM. You’ve been asleep for maybe two hours when your phone buzzes with a familiar notification tone: “High CPU usage on production node 18.

SRE and Security

There’s a moment during every serious incident when someone asks, “Wait—is this a reliability issue or a security issue?” The truth is, the lines are blurring.

SLIs/SLOs Are Too Rigid

There’s a moment in almost every SRE's life where they go from being wildly enthusiastic about service-level indicators (SLIs) and service-level objectives...

SRE Teams as Ops 2.0

The day the infrastructure team at a mid-size SaaS company rebranded itself as “SRE” was the day everything—and yet nothing—changed. The nameplates changed.

Toil Isn’t the Enemy.

Toil Isn’t the Enemy. Misunderstanding It Is. I’ll be honest with you: the first time I heard the word “toil” at a Site Reliability Engineering meeting, I...

SRE vs. DevOps

It’s one of the most persistent and surprisingly emotional debates in modern infrastructure and operations: Is Site Reliability Engineering (SRE) just DevOps...

Error Budgets vs. Business Demands

A few years ago, I was sitting in a cross-functional meeting between product, business, and SRE teams. The air was tense.

Toil vs. Valuable Work

It’s 2:00 AM, and I’m staring at a terminal window that’s begun to blur into itself. The room is dark except for the faint glow of a monitor and the blinking...

The Fine Print Trap

Risky Clauses Freelancers in the Netherlands Shouldn’t Ignore You know that moment—you’re staring at a fresh contract, the client seems promising, the project...

Generative AI and API Integration

The Future of Seamless IT Automation The best AI models in the world are useless if they can’t communicate with your existing systems.

The Freelance IT Rate Paradox in the Netherlands

The Freelance IT Rate Paradox in the Netherlands: Navigating the Disconnect Between Inflation and Compensation A Personal Reflection A lot of years back, I...

Generative AI and Cloud Computing

How AWS, Azure, and GCP Are Powering the Future Cloud computing and AI have long been on a collision course, and we’re finally seeing the full potential of...

The Future of DevOps

How Generative AI is Transforming IT Automation IT operations used to be a game of reaction. Something would break, alarms would go off, engineers would...

AI-Powered Code Generation

The Future of Software Development There was a time when writing code meant meticulously typing out every function, debugging for hours, and sifting through...

Large Language Models and Prompt Engineering

A New Era of AI for IT Engineers It’s no secret that AI is reshaping the IT landscape. From automating workflows to generating complex code snippets, Large...

How Generative AI Models Work

A Deep Dive into Transformers and Neural Networks The rise of Generative AI has been nothing short of revolutionary.

AI Regulation and Governance

The Battle Between Innovation and Control AI is advancing at breakneck speed, transforming industries, reshaping economies, and redefining the way we interact...

AI and Scientific Discoveries

AI and Scientific Discoveries: A Revolution Unfolding Science has always thrived on curiosity, innovation, and the relentless pursuit of knowledge.

AI Copyright and Intellectual Property

The Battle for Creativity AI-generated art, music, and writing have opened a Pandora’s box of legal and ethical questions.

AI in Warfare

The Rise of Lethal Autonomous Weapons and the Military’s Unchecked Power For decades, the idea of autonomous machines deciding who lives and who dies belonged...

Artificial General Intelligence and Existential Risk

Progress or Pandora’s Box? The idea of Artificial General Intelligence (AGI) has long danced on the edge of science fiction and reality.

Privacy and AI Surveillance

Balancing Security and Personal Freedoms Imagine walking through a city where every movement is tracked—every purchase, conversation, and glance analyzed in...

AI + Interdisciplinary Science

Why This Should Be Every Scientist’s Dream 👋 Ever feel like your research would go further if you just had more time—or ten more PhDs in different disciplines?...

Deepfakes and AI-Generated Misinformation

A Double-Edged Sword Imagine stumbling across a video of a world leader declaring war, only to find out later it was completely fake.

AI Ethics and Bias

Building a Fairer Future with AI AI is transforming industries at an unprecedented pace, making decisions that affect hiring, healthcare, law enforcement, and...

AI and Job Displacement

A New Era of Opportunity If history has taught us anything, it’s that technology changes the way we work—sometimes in ways we fear, but often in ways that lead...

AI-Driven Decision Making

Transforming Critical Industries for the Better Imagine a world where AI helps doctors diagnose diseases earlier than ever, ensures fairer financial decisions,...

Is Paying for Views/Advertising for Your YouTube Channel That Bad?

The Debate Over Paid Views and Advertising on YouTube: A Balanced Perspective YouTube is an ever-expanding universe of content, where millions of videos...

Emphasizing Developer Experience in DevOps

In the realm of DevOps, the focus has traditionally been on streamlining processes, automating workflows, and enhancing collaboration between development and...

Rise of Internal Developer Platforms

The Rise of Internal Developer Platforms: A Comprehensive Guide for DevOps Engineers In the dynamic realm of software development, the emergence of Internal...

The Hype About Platform Engineering: Echoes of the SRE Revolution

In the world of modern software development, buzzwords come and go, but some stick long enough to redefine the way we build and manage systems.

OpenShift vs. Kubernetes

OpenShift and Kubernetes are both popular container orchestration platforms used in the deployment and management of containerized applications.

Human biases in SRE

Human biases can have a negative impact on reliability in an IT organisation by influencing decision-making, problem-solving, and communication.

The Devaluation of SRE

The Devaluation of SRE: When Operations Gets a New Label In recent years, Site Reliability Engineering (SRE) has emerged as a transformative discipline,...

Building reliability

Building reliability into a microservices environment requires a comprehensive approach that encompasses various aspects of system design, infrastructure,...

Certification vs. Experience

The debate between certification and experience revolves around the question of what holds more value in the professional world.

SLO, SLI & SLA in SRE

In Site Reliability Engineering (SRE), Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) play critical...

OpenShift Concepts

OpenShift, being built on top of Kubernetes, extends the core concepts of Kubernetes and introduces additional features and concepts that enhance the platform.

Tuning Java code

Tuning Java code involves optimizing its performance, memory usage, and overall efficiency. Here are some techniques to consider when tuning Java code: 1.

Tuning Elasticsearch

Tuning an Elasticsearch database involves optimizing its performance, scalability, and resource usage. Here are some key considerations and techniques to tune...

Incident Management & DevOps

Incident Management & DevOps A lot of people say that if we do DevOps, we no longer need incident management. This is not correct.

The human in alerting

The speed at which someone wakes up and responds to an alert depends on the urgency and severity of the alert, as well as the established processes and...

Observability 2.0 tooling

This blog is also available as a video: https://youtu.be/k8xWIrwsLUg Observability has evolved significantly in recent years, particularly with the rise of...

Migrating to OpenTelemetry

This can also be found as a video on YouTube: https://youtu.be/Gs9FXEUEMZM Migrating to OpenTelemetry (OTEL) from a traditional pull-based monitoring system...

The Future of OpenTelemetry (OTEL)

The future of OpenTelemetry (OTEL) is a fascinating topic, as it continues to evolve as the de facto standard for observability in the cloud-native ecosystem.

The EU Cybersecurity Act: Transforming the IT Landscape

There was also a video created from this blog; please check it out: https://youtu.be/GCv0gBqD128 Introduction In an era characterized by digital...

History of OpenTelemetry

OpenTelemetry (OTEL) is one of the most significant projects in modern observability, offering a set of APIs, libraries, agents, and instrumentation that...

Introduction to Blockchain and Decentralized Systems

Please also look at the video that was created from this blog post: https://youtu.be/6501cfG8A84 Blockchain technology and decentralized systems are rapidly...

Unlocking Insights: The Power of OpenTelemetry

Please also check out the video that was produced from this blog post: https://youtu.be/9JtY9Y3j-4Q OpenTelemetry (OTEL) has quickly become the de facto...

Introduction to 5G Networks and Beyond

Welcome to this article; here is also a link to the video of this blog article: "The 5G Effect: How It's Changing Our World" https://youtu.

Exploring the Evolution of Observability: From 1.0 to 2.0 from an SRE Perspective

In the realm of Site Reliability Engineering (SRE), one of the most critical aspects of ensuring that systems remain available, performant, and resilient is...

Human behaviour and SRE

Human behaviour plays a significant role in determining the reliability of a DevOps organisation. Here are some ways in which human behavior can influence...

Alerting Best Practices

Alerting is a critical aspect of monitoring systems and applications. Here are some best practices for implementing effective alerting: 1.

PACT Testing

Pact testing is a technique used for testing the interactions between services in a distributed system. It focuses on the contract between the service consumer...

How a Software Development Team Should Incorporate Reliability

Incorporating reliability into software development requires a comprehensive and proactive approach. Here's an in-depth explanation of how a software...

Introducing SRE into a DevOps Organization

Introducing Site Reliability Engineering (SRE) into a DevOps organization involves a systematic approach that focuses on cultural transformation, process...

Monitoring Best Practices

Monitoring is crucial for maintaining the health, performance, and reliability of systems and applications. Here are some best practices for monitoring: 1.

Chain Reliability in a Microservices Environment

Creating chain reliability in a microservices environment involves ensuring that each microservice within the chain operates reliably and can handle failures...

A Site Reliability Engineering (SRE) Manifesto

A Site Reliability Engineering (SRE) Manifesto 1. Reliability is Our North Star: At the core of SRE is a relentless pursuit of...

What are Complicated-Subsystem Teams?

What are Complicated-Subsystem Teams? The previous articles taught us about Stream-Aligned Teams and Enabling Teams.

What are Enabling Teams?

What are Enabling Teams? An Enabling team is the second type of team under team topologies. Enabling teams are meant to support and elevate the kind of work...

Team Topologies (Stream-Aligned Teams)

What are Stream-Aligned Teams? According to Matthew Skelton, an organization has four types of teams. The first and perhaps the most important one is...

Team Topologies: Cognitive Load

What is Cognitive Load? In articles about team topologies, you will hear much talk about Cognitive Load. In this article, I am trying to explain what it is and...

What is Team Topologies?

A Beginner's Guide In today's era, where everything is moving rapidly, software development as a niche has progressed a long way.

Azure Monitor

Monitoring Monitoring is an essential aspect of cloud computing, as it helps evaluate and manage cloud-based services, applications, and infrastructure.

Azure Databases

Azure Database Before diving into what type of databases Azure provides, I want to talk to you about the different types of databases.

Azure Storage

Storage Storage is a means of computing technology to save digital data within a data storage device. It is a mechanism that enables a computer to retain data...

Azure Compute

Azure Compute a short overview Computing is the extensive use of computer technology to complete any goal-driven task.

Azure Networking Components

Azure Networking Components Computer networks comprise two or more computers that are connected to transmit, share, and exchange data and resources.

Azure Kubernetes Services (AKS)

Azure Kubernetes Services (AKS) Introduction Previously the IT industry used to work with virtual machines and VM Wares, but that turned out to be pretty...

Azure Active Directory

Azure Active Directory Azure Active Directory is Microsoft's identity and access management service that is cloud-based.

Azure Policy

Azure Policy Azure Policy is a service that allows an organization to set its standards and look at the compliance of the complete environment.

Azure Resource Manager and Resource Groups

Azure Resource Manager To help make the whole process of deployment, management, and security of Azure services seamless, Microsoft has developed Azure...

Azure Availability Zones

Azure Availability Zones are separate data center units that protect your applications and data from data center failures.

Azure Regions

Azure Regions Microsoft Azure is Microsoft's popular cloud computing platform. This comprehensive platform offers various cloud services, including computing,...

What is Cloud Computing?

In the simplest terms, cloud computing is the delivery of computing services over the internet.

SRE concepts part 9 (Stability versus Agility)

The ninth article in the series about SRE Concepts/Topics is about one topic, "Stability versus Agility". Stability versus Agility As soon Agile...

SRE concepts part 8 (Break your system & Test in Production)

The eighth article in the series about SRE Concepts/Topics is about two topics, "Break...

SRE concepts part 7 (White/Black Box Monitoring)

The seventh article in the series about SRE Concepts/Topics is about two topics "white-box" and "black-box" Monitoring.

SRE concepts part 6 (Automation & CB/CD)

The sixth article in the series about SRE Concepts/Topics is about two topics, "The Value of Automation" and "Continuous build and...

SRE concepts part 5 (Capacity Planning & Availability Monitoring)

The fifth article in the series about SRE Concepts/Topics is about two topics, Capacity Planning and "Time-based Versus Aggregated Availability" Capacity...

SRE concepts part 4 (RCA & Error Budget)

The fourth article in the series about SRE Concepts/Topics is about two topics, Root Cause Analysis, and Error budget.

SRE concepts part 3 (Risk / Toil)

In this third article in the series about SRE Concepts/Topics, I will discuss Risk and Toil. How to deal with Risk as an SRE? Site Reliability...

SRE concepts part 2 (SLI/SLO)

This is the second article in a series about SRE Concepts/Topics. In this article, I will discuss two topics that the following articles build on.

SRE Concepts series Part 1

I have been asked many times about certain concepts of SRE. So I will do a series about 15 topics that feature in the Google SRE book.

DevOps Automation with Chef

Chef Automate is one of the most popular automation tools for enterprises. It is a dashboard and analytics tool with cross-team collaboration features.

A Review of Terraform

Terraform is an excellent tool for changing, building, and versioning infrastructure. The advantage of using Terraform is that you can quickly shift into the...

Puppet Automation for DevOps

The core idea of DevOps is speed and resilience. The crucial difference between DevOps and regular software development is that DevOps uses the latest technology to make...

Ansible: what it is and what it is not

Ansible review Ansible is one of the most straightforward automation services to implement. Sponsored by Red Hat, Ansible managed to gain a foothold in the...

Update Your Monitoring

From time to time you will need to go through all your monitoring tooling and check what is outdated and what still works fine.

What to log

Quick overview. All applications that you write should have good logging. But what is good logging? Let’s start with a few no-brainers.

Decoupled Application Monitoring

What are we doing now? There are a lot of new monitoring tools out there. The tools are becoming more sophisticated and there are more of them.

Jenkins

Jenkins is a product that grew out of the concept of “Continuous Integration”. Continuous Integration; a tool that allows continuous development of...

High Availability: The Religion of the Nines

When you talk about high availability in uptime numbers, everybody talks about how many nines they need to have.

My road to AZ-104

Since I passed my AZ-104 “Microsoft Azure Administrator Associate” exam last week, I have received a lot of questions on how I did it and could I...

What to look for when selecting an AIOps partner / application

Introduction In one of my previous blogs I talked about what AIOps can do for you. Now I would like to talk about what AIOps tooling needs to have to be...

Alternative to Kubernetes: Nomad

The Application: Nomad In March 2013, a revolutionary invention appeared that changed application deployment for everyone, making it...

Alternative to Kubernetes: Rancher

The Application Rancher With the open-source solution Rancher, containers can be easily orchestrated across multiple cloud environments.

Alternative to Kubernetes: IronWorker

IronWorker Introduction to the tool with its main features: Software developers understand the requirements of individuals and businesses and provide them with...

Alternative to Kubernetes: Cloudify

The Application Cloudify Cloudify is an orchestration software that automates system management. Not only the deployment process such as server deployment and...

Alternative to Kubernetes: Docker Swarm

The Application Docker Swarm In recent years and months, a new trend has established itself in the IT world - the "containerization" of applications.

Alternative to Kubernetes: Apache Mesos

The Application Apache Mesos Apache Mesos was born as a research project at the University of California, Berkeley, and is written in C++.

Alternative to Kubernetes: Docker Compose

The Application Docker Compose Compose is a tool for defining and running multi-container Docker applications. With Compose, you can define a YAML file to...

Alternative to Kubernetes: Kontena

The Application Kontena Kontena offers support to companies that need to handle large-scale containers. Founded in March 2015, Kontena has developed an...

Alternative to Kubernetes: DOCKER?

The Application Docker Docker is open-source software that can be used to create and operate containers for virtualizing applications.

Alternative to Kubernetes: AWS Fargate?

Introduction to AWS Fargate – Run Containers Without Managing Infrastructure AWS Fargate is a serverless compute engine for containers that functions with...

Prometheus Query Language

What is a Query Language? Prometheus query language is a type of query language. Query languages refer to the languages in computer science that are used to...

Release Pipelines in Azure DevOps to Kubernetes

What is Microsoft Azure? Introduction: Azure DevOps Server was previously known by the names "Team Foundation Server" and "Visual Studio Team System".

Canary Release with Kubernetes

Introduction This method takes its name from the way canaries were once used in coal mines to warn miners.

Kibana/Elastic Query language

What is Query Language? A query language provides a way to ask a question. Query language refers to any computer programming language that requests and...

Build Pipelines in Azure DevOps for Docker

What is Microsoft Azure? It is a cloud computing service developed by Microsoft and first released in February 2010.

Java 11 and Docker

I know I was a little light on the Java 11 parts of the last series of posts. So I have written a separate blog post on Java 11 and Docker and one issue that...

Java 8/11 and Docker (Part 3)

This article was published on my blog (https://www.melomar-it.com/page/blog.php) on 19-Feb-2020 as part of a three-part blog post about Java and Docker.

Java 8/11 and Docker (Part 2)

This article was published on my blog (https://www.melomar-it.com/page/blog.php) on 14-Feb-2020 as part of a three-part blog post about Java and Docker.

Java 8/11 and Docker (Part 1)

This article was published on my blog (https://www.melomar-it.com/page/blog.php) on 10-Feb-2020 as part of a three-part blog post about Java and Docker.

The Influence of Big Data on Operations

Need practical SRE consulting for your engineering organisation?

MeloMar IT helps teams define meaningful SLOs, reduce toil, and build platform capabilities that actually support engineering teams.