
Answer-first summary for fast verification
Answer: As the proportion of report generation requests that result in a successful response
An availability Service Level Indicator (SLI) quantifies the reliability of a service from the user's perspective, typically defined as the proportion of requests that result in a successful outcome. In this scenario, the issue involved HTTP 500 errors indicating server failures, which were correlated with high I/O wait times and resolved by resizing the persistent disk. This underscores that availability should be measured based on user-visible outcomes, not internal system metrics.

- **Option A (I/O wait times)**: This measures a system-level performance metric, not availability. While it correlated with failures in this case, it does not directly reflect whether users are experiencing successful requests and could be misleading if other factors cause high I/O without failures.
- **Option B (Proportion of successful responses)**: This is the correct choice. It directly defines availability as the ratio of report generation requests that succeed (e.g., returning HTTP 2xx codes) to total requests. This aligns with SRE best practices, as it captures the user experience and can detect issues like the HTTP 500 errors described.
- **Option C (Queue size comparison)**: This focuses on an internal queue metric, which is a leading indicator of potential problems but not a direct measure of availability. Queue size might grow without causing failures (e.g., during temporary spikes), and it doesn't quantify actual user success or failure rates.
- **Option D (PD throughput capacity comparison)**: This assesses resource capacity, not availability. While insufficient throughput contributed to the issue, this metric is about system limits rather than the observable service reliability for users. It could be used for capacity planning but not as an SLI for availability.

Therefore, Option B is the most appropriate SLI because it is user-centric, measurable, and directly tied to the service's functional reliability.
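The proportion-based definition in Option B can be illustrated with a minimal Python sketch. The `RequestRecord` type, the `availability_sli` function, and the sample window are hypothetical names and data introduced here for illustration. The sketch assumes that any HTTP 2xx status counts as a successful response; a real service may need a different rule (for example, excluding client errors from the total).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RequestRecord:
    """One report generation request (hypothetical log record)."""
    status_code: int


def availability_sli(records: Sequence[RequestRecord]) -> Optional[float]:
    """Return good requests / total requests, or None when there is no traffic."""
    total = len(records)
    if total == 0:
        return None  # the ratio is undefined for an empty window
    # Assumption: only HTTP 2xx responses count as successful.
    good = sum(1 for r in records if 200 <= r.status_code < 300)
    return good / total


# Illustrative window: 97 successful requests and 3 HTTP 500 failures.
window = [RequestRecord(200)] * 97 + [RequestRecord(500)] * 3
sli = availability_sli(window)
print(f"availability SLI: {sli:.2%}")
```

Because the ratio is computed only from user-visible outcomes, it would register the HTTP 500 failures regardless of whether the underlying cause was I/O wait, queue growth, or disk throughput.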
Author: LeetQuiz Editorial Team
You are responsible for the reliability of a high-volume enterprise application. Users report that a critical data-intensive reporting feature consistently fails with HTTP 500 errors. Investigation reveals a strong correlation between failures and the size of an internal queue used for report generation, traced to a reporting backend with high I/O wait times. After resolving the issue by resizing the backend's persistent disk (PD), you need to define an availability Service Level Indicator (SLI) for the report generation feature. How would you define it?
A
As the I/O wait times aggregated across all report generation backends
B
As the proportion of report generation requests that result in a successful response
C
As the application's report generation queue size compared to a known-good threshold
D
As the reporting backend PD throughput capacity compared to a known-good threshold