© 2025 LeetQuiz All rights reserved.
Google Professional Cloud DevOps Engineer


You manage a widely used mobile game application running on Google Kubernetes Engine (GKE) across multiple Google Cloud regions, with each region containing several Kubernetes clusters. A report indicates that users in one specific region cannot connect to the application. Following Site Reliability Engineering (SRE) principles, what is the first action you should take to resolve this incident?


Explanation:

In Site Reliability Engineering (SRE) practices, the first step during an incident is to diagnose the problem before taking corrective actions. The issue reported is that users in a specific region cannot connect to the application, indicating a regional failure.

  • Option A (rerouting traffic) is a mitigation tactic, but according to this explanation it should not come first without understanding the root cause, as it might overload other regions or mask underlying issues.
  • Option B (checking for CPU/memory spikes via Stackdriver Monitoring, now Cloud Monitoring) could identify resource constraints but is too narrow; connection failures might stem from network issues, configuration errors, or application bugs that do not show up in CPU or memory metrics.
  • Option C (adding high-resource node pools) is a reactive scaling measure that addresses one possible symptom (e.g., resource exhaustion) rather than the root cause, and it wastes resources if the diagnosis is wrong.
  • Option D (inspecting logs via Stackdriver Logging, now Cloud Logging) is the best first step. Filtering the logs to the clusters in the affected region surfaces direct error messages, stack traces, and contextual clues (e.g., network timeouts, authentication failures, or regional misconfigurations) that help pinpoint the failure. This aligns with SRE principles of data-driven diagnosis, minimizing guesswork, and ensuring efficient resolution.
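
As a rough illustration of the log inspection described in Option D, the sketch below builds a Cloud Logging filter string that narrows container logs to clusters in one region and to entries of severity ERROR or higher. The function name `build_error_log_filter` and the region `europe-west1` are hypothetical and chosen only for this example; the filter assumes the workloads log under the `k8s_container` resource type, which may not match every setup.

```python
def build_error_log_filter(region: str, min_severity: str = "ERROR") -> str:
    """Return a Cloud Logging filter for container logs from GKE clusters in a region.

    Hypothetical helper for illustration; it only builds a string and calls no API.
    """
    clauses = [
        # GKE container logs are reported under the k8s_container resource type.
        'resource.type="k8s_container"',
        # The location label holds a region for regional clusters ("europe-west1")
        # and a zone for zonal clusters ("europe-west1-b"); the anchored regex
        # matches both without also matching a region such as "europe-west10".
        f'resource.labels.location=~"^{region}(-[a-z])?$"',
        # Keep only entries at or above the requested severity.
        f"severity>={min_severity}",
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Example region is an assumption, not taken from the question.
    print(build_error_log_filter("europe-west1"))
```

The resulting string could be pasted into the Logs Explorer query box or passed to `gcloud logging read` together with options such as `--freshness=1h` and `--limit=50` to look at recent errors first.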