Databricks Certified Data Engineer - Professional


You are designing a Delta Lake table in Databricks to store 5 years of sales data. The dataset contains the following columns:

sale_date (date of sale)
region (geographic region, ~10 distinct values)
customer_id (~10 million distinct values)
amount (sale amount)

Your queries often filter by sale_date and region, and occasionally filter or group by customer_id.

Which of the following table definitions BEST balances performance and storage efficiency?


CREATE TABLE sales USING DELTA PARTITIONED BY (customer_id) AS SELECT * FROM raw_sales;


0.0%

CREATE TABLE sales USING DELTA PARTITIONED BY (sale_date, region) AS SELECT * FROM raw_sales;


16.7%

CREATE TABLE sales USING DELTA PARTITIONED BY (sale_date, region) CLUSTER BY (customer_id) AS SELECT * FROM raw_sales;


75.0%

CREATE TABLE sales USING DELTA CLUSTER BY (sale_date, region) AS SELECT * FROM raw_sales;


8.3%
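
For a sense of scale, the cardinalities stated in the question can be turned into a rough upper bound on the number of partitions each PARTITIONED BY choice would create. This is only a back-of-the-envelope sketch: it assumes 365 days per year and that every date and region combination actually occurs in the data, neither of which the question states. The helper name partition_count is hypothetical.

```python
# Rough partition-count estimate from the cardinalities given in the question.
# Assumptions (not from the source): 365 days per year, and every
# (sale_date, region) combination is present in the data.
YEARS = 5
DAYS_PER_YEAR = 365
REGIONS = 10          # "~10 distinct values"
CUSTOMERS = 10_000_000  # "~10 million distinct values"


def partition_count(*cardinalities: int) -> int:
    """Upper bound on partitions: product of the partition columns' distinct counts."""
    total = 1
    for c in cardinalities:
        total *= c
    return total


by_customer = partition_count(CUSTOMERS)
by_date_region = partition_count(YEARS * DAYS_PER_YEAR, REGIONS)

print(f"PARTITIONED BY (customer_id):       up to {by_customer:,} partitions")
print(f"PARTITIONED BY (sale_date, region): up to {by_date_region:,} partitions")
```

Each partition is stored as at least one directory with its own data files, so a larger partition count generally means more and smaller files for the same volume of data.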