Moar review
jfsmith-at-coveo committed Jul 17, 2024
1 parent ade586a commit c7e3a25
Showing 3 changed files with 9 additions and 1 deletion.
@@ -17,7 +17,15 @@ The site reliability engineering (SRE) team at Coveo is currently hard at work i

<!-- more -->

But first, a refresher. What is an SLO? A service-level objective is an acceptability threshold for the performance of a service. This concept has an inseparable twin, the service-level indicator or SLI. The SLI is a measurement of a service’s behavior expressed as the frequency of some successful state or result: for example, the number of requests that return HTTP 200 OK responses, or the number of jobs that complete within 5 minutes. A simple guideline for expressing your SLI in this conventional way is to make the measurement an unequivocal yes/no or true/false proposition. Did a response return a 200 OK? Did a job complete within 5 minutes? This lets you calculate the ratio of good versus bad events, which is your SLI measurement:

![SLI as the ratio of good / bad events](/images/2024-07-16-the-curious-case-of-a-service-level-objective/sli.png){:style="display:block; margin-left:auto; margin-right:auto; width:35%"}
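
As a rough illustration of the yes/no framing, here is a minimal Python sketch. It assumes the common convention that an SLI is the number of good events divided by the number of all valid events; the function name `sli` and the sample status codes are hypothetical and not taken from Coveo's implementation.

```python
def sli(outcomes):
    """Return the fraction of good events among all valid events.

    `outcomes` is an iterable of booleans, one per event, where True
    means the yes/no question was answered "yes" (a good event).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("an SLI needs at least one valid event")
    good = sum(1 for ok in outcomes if ok)
    return good / len(outcomes)


# Hypothetical sample: "Did a response return a 200 OK?"
status_codes = [200, 200, 500, 200]
availability = sli(code == 200 for code in status_codes)
print(availability)  # 3 good events out of 4 valid events: 0.75
```

The same function works for a latency indicator: feed it `duration <= 300` for each job to ask whether the job completed within 5 minutes.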

An *acceptable* value of this SLI, within a predetermined time window, is your SLO. The time window exists so that you can calculate an error budget (EB) and an error budget burn rate (EBBR). The EB represents the margin of error within which you allow your service to operate. An empty budget should always mark the moment when your customers begin to feel unhappy; a non-empty budget means you can afford to deploy application changes, or even experiment with them. The EBBR is then used to alert you when the budget is ailing because your service is going south (or because you are messing dangerously with it!). In short:

![Error budget and budget burn rate](/images/2024-07-16-the-curious-case-of-a-service-level-objective/eb-ebbr.png){:style="display:block; margin-left:auto; margin-right:auto; width:25%"}

In concrete terms, you could declare that 99.9% of your requests in the last 24 hours should return 200 OK, or that 95% of the jobs in the last 28 days should complete within 5 minutes. SLOs such as these are much more than mere monitoring redlines on a dashboard. They are, in essence, quality pledges to your customers.
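
To make the arithmetic concrete, here is a minimal Python sketch of an error budget and a burn rate for the 99.9% example. It assumes the widely used definitions (budget = 1 - SLO target; burn rate = observed error rate divided by the budget), which may be written differently in the figures above. All function names and the request counts are hypothetical.

```python
def error_budget(slo_target):
    """Fraction of events allowed to be bad, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_remaining(good, total, slo_target):
    """Fraction of the error budget still unspent over the window.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = error_budget(slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def burn_rate(good, total, slo_target):
    """Observed error rate relative to the rate the SLO tolerates.

    A value of 1.0 spends the budget exactly over the SLO window;
    values above 1.0 exhaust it early.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    error_rate = (total - good) / total
    return error_rate / error_budget(slo_target)


# Hypothetical 24-hour window: 1,000,000 requests, 500 of them not 200 OK.
total_requests = 1_000_000
good_requests = 999_500
remaining = budget_remaining(good_requests, total_requests, 0.999)
rate = burn_rate(good_requests, total_requests, 0.999)
print(f"budget remaining: {remaining:.2f}, burn rate: {rate:.2f}")
```

With these numbers the SLO tolerates about 1,000 bad requests, so 500 failures leave roughly half the budget and correspond to a burn rate of about 0.5. An alert would typically fire when the burn rate stays above some chosen threshold, which is a policy decision this sketch does not cover.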

Up until now, our implementation of SLOs at Coveo has leveraged Honeycomb, which uses [distributed tracing](https://docs.honeycomb.io/get-started/basics/observability/concepts/distributed-tracing/#what-is-a-trace) to propel request observability to impressive heights. With this data, setting up availability and latency SLOs is not only easy, but also quite appropriate. Because Honeycomb supports almost limitless cardinality, drilling down into traces and cross-associating multiple properties allows for very deep investigations.
