
Consider the following scenario: you are designing a data pipeline for a Databricks project that ingests data from multiple sources, processes it, and stores the results in a data warehouse. The project must adhere to strict compliance standards, ensure high data quality, and scale to handle increasing data volumes. The solution must also provide comprehensive monitoring and error-handling capabilities. Given these requirements, which of the following approaches BEST meets the project's needs? (Choose one option.)
A
Develop a monolithic Databricks notebook that combines data ingestion, processing, and storage operations, with inline data quality checks and error handling. This approach simplifies the pipeline structure but may lack scalability and detailed monitoring.
B
Design a modular data pipeline using separate Databricks notebooks for each stage (ingestion, processing, storage), linked together via a multi-task Databricks job. Each notebook includes its own data quality checks, error handling, and performance-metrics logging, enabling scalability and easier maintenance (illustrative sketches of this approach follow the options).
C
Leverage a third-party ETL tool for data ingestion and initial processing, then use Databricks notebooks for further processing and storage. While this approach may offer some built-in data quality features, it introduces additional complexity and potential compliance risks due to external dependencies.
D
Build a custom pipeline using Apache Spark jobs outside of Databricks, with a separate data quality framework and monitoring solution. This approach offers maximum flexibility but requires significant development effort and may not fully leverage Databricks' integrated features.
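
For option B, the orchestration layer can be expressed as a single multi-task Databricks job that chains the three stage notebooks and adds retries plus failure notifications for basic error handling and monitoring. The sketch below is a minimal illustration using the databricks-sdk Python package; the notebook paths, cluster ID, and notification address are hypothetical placeholders, not values from the question.

```python
# Minimal sketch: wire the three stage notebooks into one multi-task
# Databricks job. All paths, IDs, and addresses are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # credentials come from the environment or ~/.databrickscfg

created = w.jobs.create(
    name="warehouse-pipeline",
    email_notifications=jobs.JobEmailNotifications(
        on_failure=["data-team@example.com"],  # hypothetical alert address
    ),
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Pipelines/01_ingest"),
            existing_cluster_id="<cluster-id>",
            max_retries=2,  # retry transient ingestion failures
        ),
        jobs.Task(
            task_key="process",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Pipelines/02_process"),
            existing_cluster_id="<cluster-id>",
            max_retries=2,
        ),
        jobs.Task(
            task_key="store",
            depends_on=[jobs.TaskDependency(task_key="process")],
            notebook_task=jobs.NotebookTask(notebook_path="/Pipelines/03_store"),
            existing_cluster_id="<cluster-id>",
        ),
    ],
)
print(f"Created job {created.job_id}")
```

Because each task depends on the previous one, a failed quality check in an upstream notebook stops the downstream stages instead of propagating bad data, and the failure notification provides a monitoring hook.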
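
Within each stage notebook, the per-stage quality checks, error handling, and performance-metrics logging that option B describes could look roughly like the PySpark sketch below; the table names, null-rate threshold, and metrics schema are assumptions made only for illustration.

```python
# Minimal sketch of an in-notebook quality gate for the processing stage.
# Table names, the 1% threshold, and the metrics table are hypothetical.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
start = time.time()

df = spark.table("raw.orders")  # hypothetical source table

# Data quality checks: fail fast if the batch is empty or the null rate
# on the key column exceeds the threshold.
total = df.count()
null_ids = df.filter(F.col("order_id").isNull()).count()
if total == 0:
    raise ValueError("Quality check failed: empty input batch")
if null_ids / total > 0.01:
    raise ValueError(f"Quality check failed: {null_ids}/{total} rows have null order_id")

cleaned = df.dropDuplicates(["order_id"]).filter(F.col("amount") >= 0)
cleaned.write.mode("append").saveAsTable("curated.orders")

# Performance-metrics logging: append one row per run so pipeline runs
# can be monitored and audited later.
metrics = spark.createDataFrame(
    [("process", total, cleaned.count(), float(time.time() - start))],
    "stage STRING, rows_in LONG, rows_out LONG, duration_s DOUBLE",
)
metrics.write.mode("append").saveAsTable("ops.pipeline_metrics")
```

Raising an exception fails the task, which triggers the job-level retries and failure notification configured above, so error handling and monitoring work together across the two layers.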