
Answer-first summary for fast verification
Answer: Use Stackdriver Logging to filter on the clusters in the affected region, and inspect error messages in the logs.
In Site Reliability Engineering (SRE) practices, the first step during an incident is to diagnose the problem before taking corrective actions. The issue reported is that users in a specific region cannot connect to the application, indicating a regional failure.
- Option A (rerouting traffic) is a mitigation tactic but should not be done first without understanding the root cause, as it might overload other regions or mask underlying issues.
- Option B (checking CPU/memory spikes via Stackdriver Monitoring) could identify resource constraints but is too narrow; connection failures might stem from network issues, configuration errors, or application bugs not reflected in CPU/memory.
- Option C (adding high-resource node pools) is a reactive scaling measure that addresses symptoms (e.g., resource exhaustion) but not the root cause and wastes resources if misdiagnosed.
- Option D (inspecting logs via Stackdriver Logging) is the best first step. Logs provide direct error messages, stack traces, and contextual clues (e.g., network timeouts, authentication failures, or regional misconfigurations) to pinpoint the exact failure. This aligns with SRE principles of data-driven diagnosis, minimizing guesswork, and ensuring efficient resolution.
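To make the log-filtering step in the answer concrete, here is a minimal sketch that builds a query string for Stackdriver Logging (now called Cloud Logging). It assumes the clusters write container logs under the `k8s_container` monitored resource type, which is the case for current GKE logging; older setups may use a different resource type. The helper name `build_gke_error_filter`, the region `us-east1`, and the cluster names are hypothetical and used only for illustration.

```python
def build_gke_error_filter(region, cluster_names=(), min_severity="ERROR"):
    """Build a Cloud Logging query for container logs from one region.

    Hypothetical helper: it only assembles a query string and does not
    call any Google Cloud API.
    """
    clauses = [
        'resource.type="k8s_container"',
        # Regional clusters report the region itself as their location;
        # zonal clusters report a zone such as "us-east1-b". The ":" operator
        # is a substring match, so the trailing "-" keeps the match to zones
        # inside this region.
        f'(resource.labels.location="{region}" OR '
        f'resource.labels.location:"{region}-")',
        f"severity>={min_severity}",
    ]
    if cluster_names:
        # Optionally narrow the search to specific clusters in the region.
        names = " OR ".join(
            f'resource.labels.cluster_name="{name}"' for name in cluster_names
        )
        clauses.append(f"({names})")
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Hypothetical region and cluster names.
    print(build_gke_error_filter("us-east1", ["game-cluster-1", "game-cluster-2"]))
```

The resulting string can be pasted into the Logs Explorer query box or passed as the filter argument to `gcloud logging read`. Starting at `severity>=ERROR` keeps the first pass focused on error messages, and the threshold can be lowered if nothing relevant turns up.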
Author: LeetQuiz Editorial Team
You manage a widely-used mobile game application running on Google Kubernetes Engine (GKE) across multiple Google Cloud regions, with each region containing several Kubernetes clusters. A report indicates that users in a specific region cannot connect to the application. Following Site Reliability Engineering (SRE) principles, what is the first action you should take to resolve this incident?
A. Reroute the user traffic from the affected region to other regions that don't report issues.
B. Use Stackdriver Monitoring to check for a spike in CPU or memory usage for the affected region.
C. Add an extra node pool that consists of high memory and high CPU machine type instances to the cluster.
D. Use Stackdriver Logging to filter on the clusters in the affected region, and inspect error messages in the logs.