
Explanation:
Correct Answer: D
D is the most complete and accurate answer for a production-grade Databricks Data Engineering pipeline using Delta Lake best practices.
When moving data from a Bronze table (raw, unprocessed) to a Silver table (cleaned, deduplicated, query-ready), you need to address three things: deduplication, incremental processing, and data quality enforcement.
Key steps and Delta Lake features you would use (as per best practices for Databricks Certified Data Engineer - Professional):
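Bronze Layer (Ingestion)
- Set delta.enableChangeDataFeed = true (for CDC/incremental downstream reads).
- Handle schema changes with mergeSchema or overwriteSchema.
- Add CHECK constraints or GENERATED columns if needed.
Silver Layer (Cleansing + Deduplication)
- Incremental processing: MERGE INTO (or COPY INTO in some cases) with a watermark or high-watermark column (e.g., ingestion_timestamp, event_time, or a surrogate key).
- Deduplication: ROW_NUMBER() over a window partitioned by business keys and ordered by timestamp (to keep the latest record).
- Data quality: CHECK constraints (e.g., CHECK (amount >= 0), CHECK (customer_id IS NOT NULL)).
- In Delta Live Tables, a CONSTRAINT ... EXPECT clause with ON VIOLATION behavior (FAIL UPDATE or DROP ROW).
Delta Lake Specific Features Used: MERGE INTO, CHECK constraints, Change Data Feed, and schema enforcement/evolution (mergeSchema, overwriteSchema).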
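As a rough sketch of the table-level settings listed above: the table names bronze_table and silver_table are taken from the example in this explanation, while the constraint name and the starting version 1 are assumptions made only for illustration.

```sql
-- Enable Change Data Feed on the Bronze table so downstream jobs
-- can read row-level changes instead of rescanning the whole table.
ALTER TABLE bronze_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Reject NULL customer IDs on the Silver table.
-- Any write that violates a CHECK constraint fails as a whole.
-- (customer_id_present is a hypothetical constraint name.)
ALTER TABLE silver_table ADD CONSTRAINT customer_id_present CHECK (customer_id IS NOT NULL);

-- Read the changes recorded since a given table version.
-- Version 1 is an assumed value; it must not be earlier than the
-- version at which Change Data Feed was enabled.
SELECT * FROM table_changes('bronze_table', 1);
```

Note that ADD CONSTRAINT itself fails if existing rows already violate the condition, so bad rows need to be cleaned up first.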
Example code pattern (what the exam expects you to know):
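-- Silver table with quality constraints
ALTER TABLE silver_table ADD CONSTRAINT valid_amount CHECK (amount >= 0);
-- Deduplication + Incremental MERGE
MERGE INTO silver_table tgt
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ingestion_ts DESC) AS rn
    FROM bronze_table
    -- incremental: only rows newer than the Silver high-watermark;
    -- COALESCE covers the first run, when silver_table is still empty
    WHERE ingestion_ts > (SELECT COALESCE(max(ingestion_ts), TIMESTAMP '1970-01-01 00:00:00') FROM silver_table)
  ) ranked
  -- keep only the latest row per id BEFORE merging, so that each
  -- target row is matched by at most one source row
  WHERE rn = 1
) src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
;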
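After a merge like the one above, a quick check can confirm that deduplication held. This is a minimal sketch that assumes id is the only business key, as in the example.

```sql
-- Any row returned here is an id that still appears more than once in Silver.
SELECT id, COUNT(*) AS copies
FROM silver_table
GROUP BY id
HAVING COUNT(*) > 1;

-- Inspect the most recent operation on the table (for a MERGE, the
-- operationMetrics column reports how many rows were inserted or updated).
DESCRIBE HISTORY silver_table LIMIT 1;
```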
A: Incorrect. Working "without any specific Delta Lake features" misses the point of using Delta Lake for reliable pipelines. Spark SQL over plain (non-Delta) files does not provide ACID guarantees, schema enforcement, or an efficient incremental MERGE.
B: Partially correct but incomplete.
- MERGE INTO is good for deduplication and incremental processing.
- Auto-compaction is only a file-level optimization (it helps performance, not data quality).
- Nothing in this option enforces data quality, which the question explicitly asks for.
C: Worst option. Manual checks plus a custom Python script are neither scalable nor reliable, and they go against Delta Lake / Databricks best practices. The certification strongly favors declarative approaches (Delta features, DLT, constraints) over imperative custom scripts.
In Delta Live Tables (DLT), @expect / @expect_or_fail is often the preferred way to enforce quality.
This question tests whether you understand Medallion Architecture (Bronze → Silver) + Delta Lake capabilities for quality and incremental loads.
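The @expect decorators mentioned above belong to the DLT Python API; the same expectations can also be declared in DLT SQL. The following is a sketch only: it assumes a DLT pipeline that already defines a bronze_table dataset, and the constraint names are hypothetical.

```sql
-- Runs only inside a Delta Live Tables pipeline, not as an ad hoc query.
CREATE OR REFRESH STREAMING TABLE silver_table (
  -- Drop rows with a negative amount and record the count in pipeline metrics.
  CONSTRAINT valid_amount EXPECT (amount >= 0) ON VIOLATION DROP ROW,
  -- Stop the update entirely if a customer_id is missing.
  CONSTRAINT customer_id_present EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_table);
```

Unlike a CHECK constraint, which always fails the write, an expectation lets you choose per rule whether a violation is only recorded, drops the row, or fails the update.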
You are tasked with designing a data pipeline that processes raw data from a bronze table to a silver table. The raw data contains duplicates and requires incremental processing. Describe the steps you would take to ensure data quality and deduplication. Include how you would enforce data quality at each stage and the specific Delta Lake features you would use.
A
Use Spark SQL to filter duplicates and apply incremental processing without any specific Delta Lake features.
B
Implement a Delta Lake table with auto-compaction and use MERGE INTO for deduplication and incremental processing.
C
Manually check for duplicates and apply incremental processing using a custom Python script.
D
Use a combination of Spark and Delta Lake to filter duplicates, apply incremental processing, and enforce data quality using CHECK constraints.