
Databricks Certified Data Engineer - Professional
Get started today
Ultimate access to all questions.
You are designing a Delta Lake table in Databricks to store 5 years of sales data.
The dataset contains the following columns:
sale_date (date of sale)
region (geographic region, ~10 distinct values)
customer_id (~10 million distinct values)
amount (sale amount)
Your queries often:
Filter by sale_date and region
Occasionally filter or group by customer_id
Which of the following table definitions BEST balances performance and storage efficiency?
You are designing a Delta Lake table in Databricks to store 5 years of sales data. The dataset contains the following columns:
sale_date (date of sale) region (geographic region, ~10 distinct values) customer_id (~10 million distinct values) amount (sale amount)
Your queries often: Filter by sale_date and region Occasionally filter or group by customer_id
Which of the following table definitions BEST balances performance and storage efficiency?
Explanation:
Explanation: PARTITIONED BY (sale_date, region) → Both are low-cardinality and frequently used in filters, enabling partition pruning. CLUSTER BY (customer_id) → High-cardinality column, so clustering improves data skipping without creating excessive small files. Option A: Bad — customer_id has high cardinality → too many partitions → small files problem. Option B: Good for partition pruning, but misses optimization for customer_id queries. Option D: Clustering on low-cardinality columns is less effective than partitioning them.