
Explanation:
The correct answer is D. A single job with three tasks lets Databricks enforce that Notebooks B and C start only after Notebook A completes, while still running B and C in parallel because neither depends on the other. This minimizes total runtime without risking that feature engineering starts before Notebook A has finished. Option A respects the dependency but runs Notebooks B and C one after the other, which adds unnecessary runtime. Options B and C start all three notebooks at the same time, so Notebooks B and C may run before Notebook A has completed; Option C also splits the pipeline across separate jobs with no dependency configured between them.
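The dependency layout in option D can be sketched as a task list in roughly the shape used by a Databricks multi-task job definition, where each task has a `task_key` and an optional `depends_on` list. This is a minimal illustration under stated assumptions, not a complete job configuration: the job name, task keys, and notebook paths are hypothetical, cluster settings are omitted, and nothing is sent to a workspace. The helper `execution_stages` is not part of Databricks; it only uses Python's standard `graphlib` module to show which tasks could run together.

```python
from graphlib import TopologicalSorter

# Hypothetical job settings: names and notebook paths are placeholders.
# Cluster configuration is intentionally left out of this sketch.
job_settings = {
    "name": "ml_pipeline",
    "tasks": [
        {
            "task_key": "notebook_a",
            "notebook_task": {"notebook_path": "/Pipelines/notebook_a"},
        },
        {
            "task_key": "notebook_b",
            # Starts only after notebook_a completes.
            "depends_on": [{"task_key": "notebook_a"}],
            "notebook_task": {"notebook_path": "/Pipelines/notebook_b"},
        },
        {
            "task_key": "notebook_c",
            # Same single dependency as notebook_b, so the two can overlap.
            "depends_on": [{"task_key": "notebook_a"}],
            "notebook_task": {"notebook_path": "/Pipelines/notebook_c"},
        },
    ],
}


def execution_stages(tasks):
    """Group task keys into stages; tasks in one stage have no unmet dependencies."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in tasks
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task that may start now
        stages.append(ready)
        sorter.done(*ready)
    return stages


print(execution_stages(job_settings["tasks"]))
# [['notebook_a'], ['notebook_b', 'notebook_c']]
```

Chaining `notebook_c` to `notebook_b` instead would reproduce option A and yield three single-task stages, while removing both `depends_on` entries would reproduce option B and place all three notebooks in one stage.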
A team of machine learning engineers is given three notebooks (Notebook A, Notebook B, and Notebook C) by a data scientist to establish a machine learning pipeline. Notebook A is used for exploratory data analysis, while Notebooks B and C are for feature engineering. Notebook A must be completed before Notebooks B and C can start, but Notebooks B and C can run independently of each other. What is the most efficient and reliable way to orchestrate this pipeline in Databricks? Choose the ONE best answer.
A
Set up a three-task job where each task runs a specific notebook, with each task depending on the completion of the previous one.
B
Create a three-task job where each task runs a distinct notebook, and all three tasks are executed in parallel.
C
Establish three single-task jobs, each running a unique notebook, all scheduled to run at the same time.
D
Configure a three-task job where each task runs a specific notebook. The last two tasks run simultaneously, each depending on the completion of the first task.