
Question 38
A data engineer has set up two Jobs that each run nightly. The first Job starts at 12:00 AM, and it usually completes in about 20 minutes. The second Job depends on the first Job, and it starts at 12:30 AM. Sometimes, the second Job fails when the first Job does not complete by 12:30 AM.
Which of the following approaches can the data engineer use to avoid this problem?
Explanation:
The correct answer is A because:
Combining both workloads as tasks in a single Job, with the second task depending on the first, ensures that the second task starts only after the first task completes successfully. This eliminates the timing issue in which the second Job starts before the first Job has finished.
Option B (cluster pools) might improve efficiency but doesn't solve the dependency timing problem.
Option C (retry policy) might help with transient failures, but a retry can still run before the first Job has completed, so it doesn't guarantee the correct execution order.
Option D (limit output size) addresses potential performance issues but doesn't solve the fundamental dependency timing problem.
Option E (streaming data) is not appropriate for batch jobs and doesn't address the job dependency issue.
By using multiple tasks within a single job with linear dependencies, the data engineer can ensure proper execution order without relying on fixed start times that may cause race conditions.
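As a rough illustration of the multi-task approach, the sketch below builds a job definition in the shape used by the Databricks Jobs API 2.1, where a task lists its upstream tasks under `depends_on`. The job name, task keys, and notebook paths are hypothetical placeholders, cluster settings are omitted for brevity, and nothing is sent to a workspace; the code only constructs and prints the payload.

```python
import json


def build_nightly_job(first_notebook: str, second_notebook: str) -> dict:
    """Build a job payload with two tasks that run in a fixed order.

    All names and paths here are illustrative assumptions, not values
    taken from the question.
    """
    return {
        "name": "nightly_pipeline",  # hypothetical job name
        "schedule": {
            # Quartz cron expression: once per day at 12:00 AM.
            "quartz_cron_expression": "0 0 0 * * ?",
            "timezone_id": "UTC",
        },
        "tasks": [
            {
                # First task has no upstream dependency.
                "task_key": "first_task",
                "notebook_task": {"notebook_path": first_notebook},
            },
            {
                # Second task is started only after first_task succeeds,
                # so no fixed 12:30 AM start time is needed.
                "task_key": "second_task",
                "depends_on": [{"task_key": "first_task"}],
                "notebook_task": {"notebook_path": second_notebook},
            },
        ],
    }


if __name__ == "__main__":
    payload = build_nightly_job("/Jobs/first", "/Jobs/second")
    print(json.dumps(payload, indent=2))
```

Only one schedule is defined, for the Job as a whole. The second task has no start time of its own, which is what removes the dependence on the first task finishing within 30 minutes.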