
Answer-first summary for fast verification
Answer: ```sql CREATE TABLE sales USING DELTA PARTITIONED BY (sale_date, region) CLUSTER BY (customer_id) AS SELECT * FROM raw_sales; ```
Explanation: PARTITIONED BY (sale_date, region) → Both are low-cardinality and frequently used in filters, enabling partition pruning. CLUSTER BY (customer_id) → High-cardinality column, so clustering improves data skipping without creating excessive small files. Option A: Bad — customer_id has high cardinality → too many partitions → small files problem. Option B: Good for partition pruning, but misses optimization for customer_id queries. Option D: Clustering on low-cardinality columns is less effective than partitioning them.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
You are designing a Delta Lake table in Databricks to store 5 years of sales data. The dataset contains the following columns:
sale_date (date of sale) region (geographic region, ~10 distinct values) customer_id (~10 million distinct values) amount (sale amount)
Your queries often: Filter by sale_date and region Occasionally filter or group by customer_id
Which of the following table definitions BEST balances performance and storage efficiency?
A
CREATE TABLE sales USING DELTA PARTITIONED BY (customer_id) AS SELECT * FROM raw_sales;
CREATE TABLE sales USING DELTA PARTITIONED BY (customer_id) AS SELECT * FROM raw_sales;
B
CREATE TABLE sales USING DELTA PARTITIONED BY (sale_date, region) AS SELECT * FROM raw_sales;
CREATE TABLE sales USING DELTA PARTITIONED BY (sale_date, region) AS SELECT * FROM raw_sales;
C
CREATE TABLE sales USING DELTA PARTITIONED BY (sale_date, region) CLUSTER BY (customer_id) AS SELECT * FROM raw_sales;
CREATE TABLE sales USING DELTA PARTITIONED BY (sale_date, region) CLUSTER BY (customer_id) AS SELECT * FROM raw_sales;
D
CREATE TABLE sales USING DELTA CLUSTER BY (sale_date, region) AS SELECT * FROM raw_sales;
CREATE TABLE sales USING DELTA CLUSTER BY (sale_date, region) AS SELECT * FROM raw_sales;