Databricks Certified Data Engineer - Professional

Databricks Certified Data Engineer - Professional

Get started today

Ultimate access to all questions.


You are designing a Delta Lake table in Databricks to store 5 years of sales data. The dataset contains the following columns:

sale_date (date of sale) region (geographic region, ~10 distinct values) customer_id (~10 million distinct values) amount (sale amount)

Your queries often: Filter by sale_date and region Occasionally filter or group by customer_id

Which of the following table definitions BEST balances performance and storage efficiency?




Explanation:

Explanation: PARTITIONED BY (sale_date, region) → Both are low-cardinality and frequently used in filters, enabling partition pruning. CLUSTER BY (customer_id) → High-cardinality column, so clustering improves data skipping without creating excessive small files. Option A: Bad — customer_id has high cardinality → too many partitions → small files problem. Option B: Good for partition pruning, but misses optimization for customer_id queries. Option D: Clustering on low-cardinality columns is less effective than partitioning them.