
Answer-first summary for fast verification
Answer: Design a modular data pipeline using separate Databricks notebooks for each stage (ingestion, processing, storage), linked together via Databricks jobs. Each notebook includes specific data quality checks, error handling, and performance metrics logging, enabling scalability and easier maintenance.
Option B is the best approach because it leverages Databricks' native capabilities to create a scalable, maintainable, and compliant data pipeline. By modularizing the pipeline into distinct stages with dedicated notebooks and jobs, it ensures clear separation of concerns, facilitates data quality checks and error handling at each step, and supports comprehensive monitoring. This design aligns with the project's requirements for scalability, compliance, and high data quality, while avoiding the pitfalls of monolithic designs, external dependencies, or excessive custom development.
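Option B's notebook-per-stage design maps naturally onto a multi-task Databricks job, where each stage is a task that depends on the previous one. A minimal sketch of building such a job spec (shaped like a Databricks Jobs API 2.1 `jobs/create` payload; the notebook paths, job name, and cluster key are illustrative assumptions, not values from the question):

```python
# Sketch of a Databricks Jobs API 2.1-style job spec that links one
# notebook per pipeline stage. Paths and keys below are assumptions.
def modular_pipeline_spec(base_path: str) -> dict:
    stages = ["ingestion", "processing", "storage"]
    tasks = []
    for i, stage in enumerate(stages):
        task = {
            "task_key": stage,
            "notebook_task": {"notebook_path": f"{base_path}/{stage}"},
            "job_cluster_key": "pipeline_cluster",
        }
        if i > 0:  # each stage runs only after the previous stage succeeds
            task["depends_on"] = [{"task_key": stages[i - 1]}]
        tasks.append(task)
    return {
        "name": "modular_data_pipeline",
        "tasks": tasks,
        "job_clusters": [{"job_cluster_key": "pipeline_cluster"}],
    }

spec = modular_pipeline_spec("/Repos/team/pipeline")
```

The `depends_on` edges are what give the modular design its clear separation of concerns: a failure in ingestion stops processing and storage from running, and each stage can be retried or monitored independently.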
Author: LeetQuiz Editorial Team
In the context of designing a data pipeline for a Databricks project that ingests data from multiple sources, processes it, and stores the results in a data warehouse, consider the following scenario: The project must adhere to strict compliance standards, ensure high data quality, and be scalable to handle increasing data volumes. Additionally, the solution must provide comprehensive monitoring and error handling capabilities. Given these requirements, which of the following approaches BEST meets the project's needs? (Choose one option.)
A. Develop a monolithic Databricks notebook that combines data ingestion, processing, and storage operations, with inline data quality checks and error handling. This approach simplifies the pipeline structure but may lack scalability and detailed monitoring.
B. Design a modular data pipeline using separate Databricks notebooks for each stage (ingestion, processing, storage), linked together via Databricks jobs. Each notebook includes specific data quality checks, error handling, and performance metrics logging, enabling scalability and easier maintenance.
C. Leverage a third-party ETL tool for data ingestion and initial processing, then use Databricks notebooks for further processing and storage. While this approach may offer some built-in data quality features, it introduces additional complexity and potential compliance risks due to external dependencies.
D. Build a custom pipeline using Apache Spark jobs outside of Databricks, with a separate data quality framework and monitoring solution. This approach offers maximum flexibility but requires significant development effort and may not fully leverage Databricks' integrated features.
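Inside each stage notebook, option B calls for dedicated data quality checks, error handling, and performance metrics logging. A minimal, framework-free sketch of that per-stage pattern (the check rules, exception type, and metric names are assumptions for illustration; in a real notebook the rows would typically be a Spark DataFrame validated with aggregations):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline.quality")


class DataQualityError(ValueError):
    """Raised when a stage's output fails its quality checks."""


def run_stage_checks(name, rows, required_fields):
    """Validate a stage's output and log basic performance metrics.

    `rows` stands in for a stage's output records; raising on failure
    lets the Databricks job halt downstream tasks automatically.
    """
    start = time.monotonic()
    if not rows:
        raise DataQualityError(f"{name}: no rows produced")
    bad = [r for r in rows if any(r.get(f) is None for f in required_fields)]
    if bad:
        raise DataQualityError(f"{name}: {len(bad)} rows missing required fields")
    elapsed = time.monotonic() - start
    logger.info("%s: %d rows passed checks in %.3fs", name, len(rows), elapsed)
    return rows


clean = run_stage_checks("ingestion", [{"id": 1, "ts": "2024-01-01"}], ["id", "ts"])
```

Because each stage raises a typed exception on bad data, the orchestrating job can surface the failing stage directly in its run status, which is the monitoring and error-handling granularity the monolithic option A lacks.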