
Databricks Certified Data Engineer - Associate
You are tasked with building a robust, production-grade data pipeline in Databricks. The pipeline consists of several modular notebooks:
utils_notebook: Contains utility functions for logging, error notification, and data quality checks.
extract_notebook: Extracts data from multiple sources and performs initial validation.
transform_notebook: Applies complex business logic and transformations.
load_notebook: Loads the processed data into a Delta table.
Your requirements are:
Reusability: Utility functions from utils_notebook must be accessible in all other notebooks without code duplication.
Parameterization: Each notebook must accept parameters (e.g., source paths, table names, run dates) at runtime.
Isolation: Each notebook should run in its own execution context to avoid variable conflicts and ensure independent error handling.
Error Handling: If any notebook fails, the pipeline should log the error using a function from utils_notebook and halt further execution.
Scalability: The solution should be maintainable and scalable for a team of data engineers.
Which of the following approaches best satisfies all requirements? Select the best answer and explain why the other options are less suitable.
Correct Answer:
In each child notebook (extract_notebook, transform_notebook, load_notebook), use %run at the top to include utils_notebook; then, from a master orchestrator notebook, call each child with dbutils.notebook.run(), passing parameters and handling errors.
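As a minimal sketch of the child-notebook half of this pattern (the notebook path, widget names, and the validate_schema helper are illustrative assumptions, not part of the question), each child notebook might begin like this:

# Cell 1 of extract_notebook -- %run must sit in a cell by itself:
# %run ./utils_notebook

# Cell 2: read the runtime parameters passed in by dbutils.notebook.run().
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("run_date", "")
source_path = dbutils.widgets.get("source_path")
run_date = dbutils.widgets.get("run_date")

df = spark.read.format("json").load(source_path)
validate_schema(df)  # assumed helper defined in utils_notebook

# Return a status string to the orchestrator.
dbutils.notebook.exit("extract OK")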
Reference Explanation:
Reusability: Including utils_notebook with %run at the top of each child notebook (as in the sketch above) makes the utility functions available in that notebook's context without code duplication.
Parameterization: dbutils.notebook.run() passes parameters to each notebook at runtime; the child notebook reads them through widgets.
Isolation: Each notebook invoked with dbutils.notebook.run() executes in its own context, preventing variable conflicts and supporting independent error handling.
Error Handling: The master orchestrator wraps each dbutils.notebook.run() call in try/except, logs the failure with the utility function (available via %run utils_notebook), and re-raises to halt further execution, as sketched below.
Scalability: This modular approach keeps each stage small and independently testable, which stays maintainable as the pipeline and the team grow.
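The orchestrator side might look like the following sketch (the notebook paths, parameter values, and the log_error helper are illustrative assumptions):

# Master orchestrator notebook.
# Cell 1 -- make the shared helpers (e.g., log_error) available here too:
# %run ./utils_notebook

run_date = "2024-01-01"  # in practice, supplied by a job parameter or widget

pipeline = [
    ("./extract_notebook", {"source_path": "/mnt/raw/events", "run_date": run_date}),
    ("./transform_notebook", {"run_date": run_date}),
    ("./load_notebook", {"target_table": "prod.events", "run_date": run_date}),
]

for path, params in pipeline:
    try:
        # Each call runs the child notebook in its own isolated context;
        # 3600 is the timeout in seconds, and params arrive as string widgets.
        result = dbutils.notebook.run(path, 3600, params)
        print(f"{path} finished: {result}")
    except Exception as e:
        log_error(f"{path} failed: {e}")  # assumed helper from utils_notebook
        raise  # halt the pipeline on the first failure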
Why other options are less suitable:
A: Utility functions brought into the master notebook with %run are not visible inside child notebooks launched with dbutils.notebook.run(), because each child runs in its own isolated context (see the sketch below).
B: Running utils_notebook itself with dbutils.notebook.run() executes it in a separate context; its function definitions do not carry over to the other notebooks.
D: %run executes a notebook in the caller's context, so it provides neither isolation nor a parameter-passing mechanism, and coordinating stages through global variables is fragile and does not scale.
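To make the pitfall in option A concrete, consider this sketch (the paths and the log_error helper are again illustrative assumptions):

# Orchestrator with utils_notebook included ONLY here:
# %run ./utils_notebook   # defines log_error in the ORCHESTRATOR's context

# The child notebook runs in its own isolated context, so any call to
# log_error inside extract_notebook raises NameError. The %run include
# must appear in each child notebook, not just in the orchestrator.
dbutils.notebook.run("./extract_notebook", 3600, {"run_date": "2024-01-01"})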