You are a data engineer responsible for deploying a critical batch job to a production environment on Azure Databricks. The job processes large volumes of data nightly and must complete within a strict 6-hour window to meet business SLAs. Because the job is critical, you need to minimize the risk of failure, especially failure caused by node failures. Which of the following strategies would BEST ensure the job's success by providing comprehensive monitoring and proactive failure detection, while remaining cost-efficient and scalable? Choose one option.