A data engineering team has an ETL job that runs every night at midnight but fails intermittently because of one specific task, so someone has to rerun it manually each morning. This creates significant overhead. What approach can the team take to ensure the job completes every night while minimizing compute costs?