In modern Site Reliability Engineering (SRE), metrics are more than just numbers — they represent trust between systems, users, and teams.
Whether you manage a global SaaS platform or an on-prem VMware private cloud, understanding SLI, SLO, SLA, and Error Budgets is the foundation for measuring reliability in a scientific, data-driven way.
In this post, I’ll break down these concepts using examples from vCenter, vSAN, and NSX, followed by how the Four Golden Signals help you observe and maintain system health.
What Is an SLI (Service Level Indicator)?
An SLI is a measurable indicator of how reliable your service is.
It’s a metric that can be collected, monitored, and analyzed over time.
Example – vCenter:
99.95 % of API requests return a `200 OK` response within 1 second.
Example – vSAN:
Write latency stays below 10 ms for 99 % of requests.
In short, SLI = what you measure.
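To make "what you measure" concrete, here is a minimal sketch of how the vCenter-style SLI could be computed from a batch of request records. The `ApiRequest` type and the `api_sli` function are hypothetical names for illustration, not part of any VMware API; the only things taken from the example are the `200 OK` status and the 1-second limit.

```python
from dataclasses import dataclass


@dataclass
class ApiRequest:
    status: int        # HTTP status code of the response
    duration_s: float  # response time in seconds


def api_sli(requests, max_duration_s=1.0):
    """Percentage of requests that returned 200 within the time limit."""
    if not requests:
        return None  # no traffic in the window, so the SLI is undefined
    good = sum(
        1 for r in requests
        if r.status == 200 and r.duration_s <= max_duration_s
    )
    return 100.0 * good / len(requests)


sample = [
    ApiRequest(200, 0.12),
    ApiRequest(200, 0.45),
    ApiRequest(200, 1.80),  # succeeded, but too slow to count as good
    ApiRequest(503, 0.05),  # fast, but failed
]
print(api_sli(sample))  # 50.0
```

Note that a request only counts as "good" when it is both successful and fast enough, which is why the sample scores 50 % rather than 75 %.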
What Is an SLO (Service Level Objective)?
An SLO is a goal or target for that measurement.
It defines what “reliable enough” means for your users or internal stakeholders.
Example:
vCenter API success rate should be ≥ 99.95 % each month.
vSAN write latency should be < 10 ms for 99 % of requests.
The SLO gives engineering teams a clear boundary between acceptable and unacceptable performance.
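As a rough sketch, checking an SLO is just comparing the measured SLI against the target. The helper names below (`percent_below`, `meets_slo`) and the sample latencies are made up for illustration; the 10 ms threshold and 99 % target come from the vSAN example above.

```python
def percent_below(latencies_ms, threshold_ms):
    """Percentage of samples strictly below the latency threshold."""
    if not latencies_ms:
        raise ValueError("no samples in the window")
    fast = sum(1 for v in latencies_ms if v < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


def meets_slo(measured_percent, target_percent):
    """True when the measured SLI is at or above the SLO target."""
    return measured_percent >= target_percent


# 200 hypothetical write samples, 2 of them slower than 10 ms
writes_ms = [4.0] * 198 + [12.0, 25.0]
sli = percent_below(writes_ms, threshold_ms=10.0)
print(sli, meets_slo(sli, target_percent=99.0))  # 99.0 True
```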
What Is an SLA (Service Level Agreement)?
An SLA is the external promise you make to your customers.
It is contractual — if you fail to meet it, there might be penalties or service credits.
Example:
“Our private-cloud vCenter will maintain 99.9 % uptime per month.
If not, customers receive credit on their next billing cycle.”
Notice that the SLA is usually less strict than the SLO — this provides a small safety buffer.
What Is an Error Budget?
The error budget is the amount of unreliability you can accept over a period while still meeting your SLO.
It balances innovation and reliability.
Formula:
Error Budget = 100 % – SLO
Example:
If SLO = 99.95 %, then Error Budget = 0.05 %.
In a 30-day month (43 200 minutes), that’s about 21.6 minutes of allowable downtime.
If you exceed that, your error budget is “spent,” and the focus shifts from new deployments to stability improvements.
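The arithmetic is easy to script. This is a small sketch, assuming downtime is measured in minutes and the period is a fixed number of days; `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo_percent, days=30):
    """Allowed downtime in minutes for a given SLO over the period."""
    total_minutes = days * 24 * 60           # 43 200 for a 30-day month
    budget_fraction = (100.0 - slo_percent) / 100.0
    return total_minutes * budget_fraction


print(round(error_budget_minutes(99.95), 1))  # 21.6
print(round(error_budget_minutes(99.9), 1))   # 43.2
```

The second line shows why a 99.9 % target is looser than a 99.95 % one: it allows twice as many minutes of downtime in the same month.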
The Four Golden Signals of Monitoring
Defined originally by Google SRE, these signals tell you almost everything about a system’s health.
| Signal | Meaning | VMware Example |
|---|---|---|
| Latency | Time taken to serve a request | vSAN read/write latency, API response time |
| Traffic | Volume of demand | Number of vCenter API calls/sec, NSX packets/sec |
| Errors | Rate of failed requests | 5xx API errors, packet drops |
| Saturation | How “full” the system is | CPU %USED, memory swap, disk queue depth |
By tracking these, you can detect bottlenecks before users feel pain.
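Here is one possible way to collapse a window of raw request data into the four signals. Everything in it is an assumption for illustration: the `GoldenSignals` type, the `summarize` function, and the idea that saturation arrives as a CPU fraction supplied by the caller. It does not use any VMware API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GoldenSignals:
    latency_ms: float   # mean response time in the window
    traffic_rps: float  # requests per second
    error_rate: float   # fraction of requests with a 5xx status
    saturation: float   # fraction of capacity in use (supplied by caller)


def summarize(durations_ms, statuses, window_s, cpu_used_fraction):
    """Collapse one observation window into the four signals."""
    if not durations_ms or len(durations_ms) != len(statuses):
        raise ValueError("need one duration and one status per request")
    errors = sum(1 for s in statuses if 500 <= s <= 599)
    return GoldenSignals(
        latency_ms=mean(durations_ms),
        traffic_rps=len(statuses) / window_s,
        error_rate=errors / len(statuses),
        saturation=cpu_used_fraction,
    )


signals = summarize(
    durations_ms=[100.0, 200.0, 300.0, 400.0],
    statuses=[200, 200, 500, 200],
    window_s=2.0,
    cpu_used_fraction=0.7,
)
print(signals)
```

The sketch uses a mean for latency to stay short. In practice a percentile (such as p99) is usually more telling, because a mean can hide a slow tail.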
A Quick Example in Context
Service: vCenter
- SLI: API availability, measured as the percentage of successful API requests
- SLO: Keep at or above 99.95 % monthly
- SLA: Promise 99.9 % to customers
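- Error Budget: about 21.6 minutes of downtime allowed per 30-day month
If vCenter is unavailable for 15 minutes, you are still within budget.
If it’s down for 30 minutes, the SLO is breached (the 99.9 % SLA, which allows about 43.2 minutes, is still met): freeze risky releases and perform a root cause analysis (RCA).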
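The two scenarios can be expressed as a small decision function. This is a sketch under the numbers used in this example (99.95 % SLO, 99.9 % SLA, 30-day month); `classify_downtime` and its return strings are hypothetical.

```python
def classify_downtime(downtime_min, slo_percent=99.95, sla_percent=99.9, days=30):
    """Compare monthly downtime against the SLO and SLA allowances."""
    total_minutes = days * 24 * 60
    slo_budget = total_minutes * (100.0 - slo_percent) / 100.0  # ~21.6 min
    sla_budget = total_minutes * (100.0 - sla_percent) / 100.0  # ~43.2 min
    if downtime_min <= slo_budget:
        return "within error budget"
    if downtime_min <= sla_budget:
        return "SLO breached, SLA still met"
    return "SLA breached"


for minutes in (15, 30, 50):
    print(minutes, "->", classify_downtime(minutes))
```

The middle case is the safety buffer at work: the internal target has been missed, but the contractual promise to customers still holds.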
How the Golden Signals Tie Back to Reliability
Let’s take an example from vSAN storage:
- Latency rises from 8 ms to 25 ms.
- Traffic increases 10× due to backup jobs.
- Errors: 5 % of I/O requests time out.
- Saturation: SSD utilization reaches 85 %.
Together, these four signals tell a complete story:
“High traffic caused saturation → latency spike → I/O errors → user impact.”
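As a rough illustration, the same reasoning can be written as a comparison between a baseline and the current readings. The latency, error, and utilization readings come from the vSAN example; the baseline traffic and saturation figures, every limit except the 10 ms latency target, and the `flag_signals` function itself are invented for this sketch.

```python
# Readings before and during the incident (baseline traffic/saturation are made up)
baseline = {"latency_ms": 8.0, "traffic_iops": 1000.0, "error_rate": 0.0, "saturation": 0.40}
current = {"latency_ms": 25.0, "traffic_iops": 10000.0, "error_rate": 0.05, "saturation": 0.85}

# Hypothetical alert limits; only the 10 ms latency target comes from the post
LIMITS = {"latency_ms": 10.0, "error_rate": 0.01, "saturation": 0.80}


def flag_signals(baseline, current, limits, traffic_growth_limit=5.0):
    """Return the signals that look unhealthy, in cause-to-effect order."""
    flags = []
    if current["traffic_iops"] >= traffic_growth_limit * baseline["traffic_iops"]:
        flags.append("traffic")
    if current["saturation"] >= limits["saturation"]:
        flags.append("saturation")
    if current["latency_ms"] >= limits["latency_ms"]:
        flags.append("latency")
    if current["error_rate"] >= limits["error_rate"]:
        flags.append("errors")
    return flags


print(" -> ".join(flag_signals(baseline, current, LIMITS)))
# traffic -> saturation -> latency -> errors
```

The ordering in the output is just the order of the checks, chosen to mirror the story above. The code does not prove causation; it only shows that all four signals moved together.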
Those are the core SRE concepts to keep in mind if you want to manage reliability in a more data-driven way. Thank you!




