From SLIs to SLAs: How Reliability Is Measured in Modern Systems

Hey everyone, and welcome to Infradiaries! If you’re anything like me, you love a good, reliable companion. Maybe it’s your favorite comfy armchair, your trusty old car, or, if you’re a dog lover like us, your furry best friend who’s always there, ready for a walk or a cuddle.

In the world of computers and websites, we want the same kind of reliability. We want our systems to be “always there” for our users. But how do we actually know if they are, and how do we talk about it without getting lost in jargon?

Think of it like this: if you want to know if your dog is healthy and happy, you don’t just guess, right? You look for signs, set goals, and maybe even have agreements with your vet.

Let’s break down how we do that for our tech, using our favorite four-legged friends as guides!

1. SLI (Service Level Indicator): Your Dog’s Health Check

An SLI is just a fancy way of saying “what we measure.” It’s like the specific things you check to see if your dog is doing well.

How much food did they eat? (Is our website processing enough requests?)
How long did it take them to bring back the ball? (How fast does our website respond?)
How many barks did they make today? (Are there too many error messages?)

These are our raw observations, the hard numbers that tell us what’s happening.

Imagine this: You’re timing your dog, Buster, when he fetches a ball. He usually brings it back in 5 seconds. That’s your SLI: “time to retrieve ball.”
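For the website side of the analogy, here is a minimal sketch of turning raw request records into two common SLIs. The `Request` record, the 300 ms threshold, and the sample data are assumptions made up for illustration; they are not from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    status: int


def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests answered within threshold_ms."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


# Made-up sample: three successful requests (one slow) and one server error.
sample = [Request(120, 200), Request(80, 200), Request(950, 200), Request(60, 500)]

print(availability_sli(sample))   # 0.75
print(latency_sli(sample, 300))   # 0.75
```

Both numbers are plain "good events divided by total events" ratios, which is a common way to express an SLI.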

Here’s an image of a happy dog fetching a ball.

2. SLO (Service Level Objective): Setting Goals for Your Pup

An SLO is the goal you set based on your measurements. It’s what you aim for.

So, if Buster usually brings the ball back in 5 seconds, your SLO might be: “Buster will retrieve the ball within 7 seconds, 95% of the time.” You’re not saying every single time (because sometimes he gets distracted by a squirrel!), but most of the time, you expect him to hit that target.

This is crucial because it helps us define what “good enough” looks like. We can’t expect perfection (more on that later!), but we can expect consistent, high-quality performance.
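Buster's SLO can be checked with a few lines of code. This is only a sketch: the function names and the list of fetch times are invented for the example, while the 7-second threshold and 95% target come from the SLO above.

```python
def fraction_within(fetch_times_s, threshold_s):
    """Share of fetches that finished within threshold_s seconds."""
    if not fetch_times_s:
        raise ValueError("need at least one fetch to evaluate the SLO")
    within = sum(1 for t in fetch_times_s if t <= threshold_s)
    return within / len(fetch_times_s)


def meets_slo(fetch_times_s, threshold_s=7.0, target=0.95):
    """True if at least `target` of the fetches hit the time threshold."""
    return fraction_within(fetch_times_s, threshold_s) >= target


# 20 made-up fetches: 19 quick ones and one squirrel incident (15 seconds).
fetches = [5, 4, 6, 5, 5, 6, 4, 5, 7, 5, 6, 5, 4, 5, 6, 5, 5, 6, 5, 15]

print(fraction_within(fetches, 7.0))  # 0.95
print(meets_slo(fetches))             # True
print(meets_slo(fetches + [20]))      # False: 19 of 21 is below 95%
```

One squirrel is fine, but a second slow fetch in the same batch pushes Buster below his target.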

The “Why” Behind SLOs:

Clear Expectations: Everyone knows what’s expected.
Actionable Alerts: If Buster starts taking 15 seconds consistently, you know something’s up and it’s time to check on him (or your website!).

Here’s an image of a dog looking determined, with a thought bubble above his head showing a medal or a target, representing an SLO.

3. SLA (Service Level Agreement): The Puppy Contract!

An SLA is where things get serious. It’s an official agreement – a contract – that says what will happen if you don’t meet your SLOs.

Think of it like an agreement with your dog walker. Their job is to walk Buster for 30 minutes, 5 days a week (that’s the SLO). If they only walk him for 10 minutes for a whole week, the SLA might say they have to give you a discount or even a free walk next time.

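In the tech world, if a company promises 99.9% uptime in its SLA and only delivers 90% for a month, the SLA might mean they have to give their customers a credit or a refund. Many teams keep their internal SLO a little stricter than the SLA promise, so they get an early warning before money is on the line.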
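To put numbers on that example, here is a rough sketch of how a monthly uptime figure could map to a service credit. The credit tiers are entirely hypothetical; real SLAs define their own tiers, measurement windows, and exclusions.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def uptime_percent(downtime_minutes, total_minutes=MINUTES_PER_30_DAY_MONTH):
    """Uptime as a percentage of the measurement window."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Hypothetical tiers for illustration only: (minimum uptime %, credit %).
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25)]
FALLBACK_CREDIT = 50  # hypothetical credit when uptime is below every tier


def service_credit_percent(uptime):
    """Return the credit owed for a given uptime percentage."""
    for minimum, credit in CREDIT_TIERS:
        if uptime >= minimum:
            return credit
    return FALLBACK_CREDIT


# 90% uptime over 30 days means 4,320 minutes (72 hours) of downtime.
bad_month = uptime_percent(4320)
print(bad_month)                          # 90.0
print(service_credit_percent(bad_month))  # 50
```

For comparison, a 99.9% promise leaves room for only about 43 minutes of downtime in a 30-day month.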

Here’s an image of a cute puppy “signing” a contract with a paw print.

Why This Is Dog-Gone Important (and Some Pro Tips!)

Defining these three things helps us move from just hoping our systems are reliable to knowing it and having a plan when they’re not.

Pro Tip 1: Don’t Just Measure the “Average” Fetch Time!

If Buster brings the ball back in 1 second sometimes, and 20 seconds other times (because he stopped to sniff every bush!), his average fetch time might look okay. But you know that those long waits are frustrating!

The same goes for websites. An “average” response time might hide the fact that 1 out of 10 users is waiting forever. Always look at the slow tail as well, using percentiles such as “99th percentile latency” (the response time that 99% of requests come in under), to understand the full user experience.
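Here is a small sketch of how an average can hide the slow tail. The latency numbers are made up, and the `percentile` helper uses the simple nearest-rank method, which is only one of several ways to define a percentile.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up data: 9 out of 10 requests are fast, 1 out of 10 is painfully slow.
latencies_ms = [100] * 90 + [3000] * 10

average = sum(latencies_ms) / len(latencies_ms)
print(average)                       # 390.0
print(percentile(latencies_ms, 50))  # 100
print(percentile(latencies_ms, 99))  # 3000
```

The 390 ms average describes nobody's real experience: most users see 100 ms, and the unlucky ones wait 3 seconds. The 99th percentile makes that slow group visible.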

Here’s an image showing a dog running happily for the ball, but then an overlay or thought bubble shows another dog getting distracted and taking a long time. The text on the image highlights “Average vs. Real Experience.”

Pro Tip 2: Don’t Aim for “Perfect” Reliability (It’s a Trap!)

You love your dog, but you don’t expect them to never have an accident, never chew on something they shouldn’t, or never bark at the mail carrier. Expecting perfection is unrealistic and exhausting!

It’s the same for our computer systems. Trying to make a system “100% perfect” is incredibly expensive and often unnecessary. A little bit of acceptable “unreliability” (called an “error budget”) is actually a good thing. If your SLO is 99.9%, your error budget is the remaining 0.1%. It frees up our teams to build new features instead of chasing impossible perfection.
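An error budget is simply whatever the SLO leaves over. The sketch below assumes a time-based SLO measured over a 30-day window; the function names and the 10.8 minutes of downtime are invented for the example.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of downtime the SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_used_fraction(slo_percent, downtime_minutes, window_days=30):
    """Share of the error budget already spent (1.0 means it is all gone).

    Note: a 100% SLO has a zero budget, so this would divide by zero.
    """
    return downtime_minutes / error_budget_minutes(slo_percent, window_days)


budget = error_budget_minutes(99.9)
print(f"Error budget: {budget:.1f} minutes per 30 days")  # Error budget: 43.2 minutes per 30 days

used = budget_used_fraction(99.9, downtime_minutes=10.8)
print(f"Budget used: {used:.0%}")  # Budget used: 25%
```

While there is budget left, the team can keep shipping features; once it runs out, that is a signal to slow down and focus on reliability.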

Here’s an image of a dog looking guilty next to a slightly chewed slipper, with a thought bubble saying “Oops, not 100% perfect!”
