When designing a data quality framework within a lakehouse architecture, which approach ensures comprehensive data validation, error handling, and cleansing without introducing significant processing overhead?
Explanation:
A (Periodic batch Spark jobs for validation/cleansing): Offloading validation and cleansing to scheduled batch jobs keeps the ingestion pipeline lightweight and minimizes the impact on primary workloads. Batch jobs can also be as complex and thorough as needed, covering validation, error handling, and cleansing in one place. This best satisfies the requirement for comprehensive data quality without significant processing overhead, making it the correct choice (a sketch of such a job follows this explanation).

B (Embed quality checks in streaming with side outputs for errors): Side outputs are a sound error-handling pattern, but, much like integrating a third-party tool, embedding checks directly in the streaming path adds complexity and per-record work that can still degrade streaming throughput and latency (see the streaming sketch below).

C (Third-party tool in real-time ingestion): Real-time validation catches errors early, but running an external tool inline with ingestion introduces significant processing overhead on the pipeline and can hurt performance.

D (Declarative rules with Delta Lake constraints at ingestion): Delta Lake supports constraints that block bad data, but enforcing them at write time slows ingestion and fails entire writes on a violation, which hurts performance at high data volumes. Constraints only reject records, so they are not flexible or comprehensive enough for cleansing, and they add latency (see the constraint example below).
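As a minimal sketch of approach A, assuming hypothetical table names, columns, and rules, the following PySpark batch job validates a bronze Delta table, quarantines failing rows, and publishes cleansed rows to a silver table, so no validation cost is paid on the ingestion path:

```python
# Hypothetical periodic batch validation/cleansing job (approach A).
# Table names, columns, and rules are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-batch-validation").getOrCreate()

# Read the latest raw data from a (hypothetical) bronze Delta table.
raw = spark.read.table("bronze.orders")

# Rule set: name -> boolean validity expression.
rules = {
    "non_null_id": F.col("order_id").isNotNull(),
    "positive_amount": F.col("amount") > 0,
    "known_country": F.col("country").isin("US", "DE", "JP"),
}

# A row is valid only if every rule passes.
is_valid = F.lit(True)
for expr in rules.values():
    is_valid = is_valid & expr

validated = raw.withColumn("_is_valid", is_valid)

# Error handling: quarantine failing rows for later inspection.
(validated.filter(~F.col("_is_valid")).drop("_is_valid")
    .write.format("delta").mode("append")
    .saveAsTable("quarantine.orders_rejected"))

# Cleansing: normalize passing rows and publish to the silver table.
(validated.filter(F.col("_is_valid")).drop("_is_valid")
    .withColumn("country", F.upper(F.trim(F.col("country"))))
    .write.format("delta").mode("overwrite")
    .saveAsTable("silver.orders_clean"))
```

A job like this would typically run on a schedule (for example, an hourly or nightly workflow), which is what keeps the validation and cleansing work off the ingestion pipeline itself.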
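For contrast, a hedged sketch of approach B: Spark Structured Streaming has no Flink-style side outputs, so a common substitute is splitting each micro-batch inside foreachBatch into a "main" and an "error" write. Table names and rules are again hypothetical; the point is that this work runs inline with ingestion on every micro-batch.

```python
# Hypothetical streaming ingestion with inline checks (approach B).
# The foreachBatch split stands in for side outputs and adds validation
# work directly to the ingestion path.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-streaming-checks").getOrCreate()

stream = spark.readStream.table("bronze.orders_stream")

def route(batch_df, batch_id):
    checked = batch_df.withColumn(
        "_is_valid",
        F.col("order_id").isNotNull() & (F.col("amount") > 0),
    )
    # "Main output": valid rows continue into the silver table.
    (checked.filter("_is_valid").drop("_is_valid")
        .write.format("delta").mode("append")
        .saveAsTable("silver.orders_stream"))
    # "Side output": invalid rows are appended to an error table.
    (checked.filter("NOT _is_valid").drop("_is_valid")
        .write.format("delta").mode("append")
        .saveAsTable("quarantine.orders_stream_rejected"))

query = (stream.writeStream
    .foreachBatch(route)
    .option("checkpointLocation", "/tmp/checkpoints/dq-streaming")
    .start())
```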
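Finally, a minimal sketch of approach D, assuming a hypothetical silver.orders_clean table: Delta Lake CHECK constraints are declared once on the table and then enforced on every write, so a single violating row fails the whole transaction instead of being cleansed.

```python
# Hypothetical Delta Lake CHECK constraints enforced at write time (approach D).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-delta-constraints").getOrCreate()

# Declarative rules attached to the table itself.
spark.sql("""
    ALTER TABLE silver.orders_clean
    ADD CONSTRAINT order_id_not_null CHECK (order_id IS NOT NULL)
""")
spark.sql("""
    ALTER TABLE silver.orders_clean
    ADD CONSTRAINT positive_amount CHECK (amount > 0)
""")

# Every subsequent write to silver.orders_clean is now validated; one
# violating row aborts the entire transaction, so bad batches are rejected
# outright (no cleansing) and ingestion latency increases.
```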