SRE Notes on SLO, SLI, and Error Budgets

SLOs, SLIs, and error budgets are key pillars of Site Reliability Engineering, as championed by Google.

What is the purpose of a service that works 100% of the time but doesn’t meet the needs of its users?
Or of a service that never gets new features because shipping them would risk a failure?

At the opposite extreme, what is the purpose of a service that has all the big-bang features its users want, but is unreliable and doesn’t work most of the time?

Neither service is any good!

What we need is balance: the yin and the yang need to be in harmony.

This is why SLOs, SLIs, and error budgets exist.

What is an SLO?

SLO stands for Service Level Objective. It is not to be confused with an SLA, a term better known to lawyers used to drafting contracts with penalty clauses.

Nobody likes dealing with lawyers, so that’s why SLOs were invented. OK, I am just joking.

A Service Level Agreement, more commonly known as an SLA, is not always something that can be easily measured, and it can be highly subjective.

An SLO, on the other hand, should be easy to measure with data and shouldn’t be open to dispute.

An example of an SLO is: “I want 95% of all requests to API X to be served successfully within 100ms.”
A more common kind of SLO is about availability: “I want 99.99% of all requests to API X to be successful.”

Additionally, an SLO should be measured across a window of time, for instance over a 30-day period.

So a Service Level Objective is a target that is measured by an SLI, which stands for Service Level Indicator. Without an SLI you don’t have an SLO. If you can’t measure it, an SLO is no more than an aspiration.
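To make the SLI/SLO relationship concrete, here is a minimal Python sketch that computes two SLIs from a batch of request records and compares them with the example targets above. The `Request` record shape, the function names, and the sample data are all assumptions made up for illustration; in practice the numbers would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    ok: bool           # True if the request succeeded
    latency_ms: float  # time taken to serve it, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests in the window that succeeded."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served successfully within threshold_ms."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(r.ok and r.latency_ms <= threshold_ms for r in requests)
    return good / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= target


# Made-up sample: 97 fast successes, 2 slow successes, 1 failure.
sample = (
    [Request(ok=True, latency_ms=40.0)] * 97
    + [Request(ok=True, latency_ms=250.0)] * 2
    + [Request(ok=False, latency_ms=30.0)]
)

print("latency SLO met:", meets_slo(latency_sli(sample, 100.0), 0.95))
print("availability SLO met:", meets_slo(availability_sli(sample), 0.9999))
```

With this sample, 97 of 100 requests are fast and successful, so the 95% latency target is met, while a single failure in 100 requests is already far too many for a 99.99% availability target.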

How is the SLO set?

One shouldn’t set the values of a service’s SLO at random.

A properly defined SLO strikes a delicate balance between having a reliable, fast service and the desire for constant improvement.

This is why defining SLOs that seek 100% uptime and availability is not a good idea.
Besides being very costly, near-perfect reliability prevents a service from being regularly updated with new features, because every change you ship carries some risk of bringing reliability down.

But if reliability goes down, you automatically fail to meet the SLO. That happens when the SLO is set too high and leaves no adequate buffer; this buffer is also known as the error budget.

Therefore, it is very important to set realistic SLOs.

Below are some examples of SLOs that you could set for a project:

  • 99.9% of all requests to service X are successful
  • Latency is below 100ms for 95% of all requests
  • 99.99% of all requests to add a product to the cart are successful

These SLOs are based on 30-day windows. However, you can also set them per quarter, or over whatever length of time you find appropriate for your project.
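The choice of window matters because it determines how much failure a target tolerates in absolute terms. The sketch below converts an availability target and a window into an amount of allowed downtime. It assumes the request-based target can be read as a time-based uptime target (roughly true when traffic is steady), and the function name and the 90-day approximation of a quarter are illustrative choices, not part of any standard.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Time a service may be unavailable in the window and still meet the target."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return window * (1.0 - target)


month = timedelta(days=30)
quarter = timedelta(days=90)  # assumption: a quarter approximated as 90 days

for target in (0.999, 0.9999):
    for name, window in (("30 days", month), ("90 days", quarter)):
        print(f"{target:.2%} over {name}: {allowed_downtime(target, window)}")
```

For a 30-day window, 99.9% works out to about 43 minutes of downtime, while 99.99% leaves only a little over 4 minutes, which shows how quickly each extra nine shrinks the room for change.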

It is no good to define SLOs that can’t be easily measured with SLIs, or that we don’t keep track of.

More often than not, Prometheus and Grafana are used to collect metrics and visualize SLOs in an enterprise. Below is an example of the kind of dashboard you might use to track compliance with an SLO:

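Often, the measured SLI is above the target SLO. The error budget is the amount of unreliability the SLO permits (100% minus the target), and the gap between the measured SLI and the target is the part of that budget that has not been spent yet.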

The error budget is the slack that a business has for making changes that may bring reliability down.

By keeping tabs on the SLO and the error budget, a business is able to control the rate of change.
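As a rough illustration of how an error budget can steer the rate of change, the following Python sketch tracks how much budget is left in the current window and applies a simple release gate. It assumes the budget is counted in failed requests, as the share of requests the SLO allows to fail. The function names and the 25% threshold are hypothetical; the right policy is something each team has to decide for itself.

```python
def error_budget(target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    return (1.0 - target) * total_requests


def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget(target, total_requests)
    if budget <= 0:
        raise ValueError("no error budget: the target is 100% or there was no traffic")
    return 1.0 - failed_requests / budget


def can_release(remaining: float, min_remaining: float = 0.25) -> bool:
    """Hypothetical policy: ship risky changes only while enough budget is left."""
    return remaining >= min_remaining


# Made-up numbers: a 99.9% target over 1,000,000 requests allows about 1,000 failures.
for failed in (400, 900, 1200):
    remaining = budget_remaining(0.999, 1_000_000, failed)
    print(f"{failed} failures: {remaining:.0%} of budget left, release ok: {can_release(remaining)}")
```

With 400 failures there is plenty of budget left and changes can keep flowing; at 900 failures the gate closes; at 1,200 the budget is overspent and the SLO has been missed for the window.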

