Databricks Certified Data Engineer - Professional

You are a data engineer working on optimizing a large dataset stored in Delta Lake to improve query performance for analytical workloads. The dataset contains sales transactions with columns including 'date', 'region', 'user_id', 'product_id', and 'amount'. The queries frequently filter on 'date' and 'region', and often perform equality lookups on 'user_id'. Given the need to minimize query latency while considering cost and scalability, which of the following indexing and partitioning strategies would you implement? Choose the best option.

Partition the dataset solely on the 'date' column to simplify the partitioning scheme.

4.6%

Apply Z-ordering across all columns to distribute the data evenly, without specific consideration for query patterns.

Implement partitioning on both the 'date' and 'region' columns to reduce the data scanned per query, and apply a Bloom filter index on 'user_id' to accelerate equality lookups.

82.3%
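
As a rough illustration of the most-voted option, the sketch below builds the two Databricks SQL statements it implies: a Delta table partitioned by 'date' and 'region', and a Bloom filter index on 'user_id'. The table name `sales_transactions`, the column types, and the `fpp` and `numItems` values are assumptions chosen for illustration and do not come from the question; the helper `build_statements` is hypothetical.

```python
def build_statements(table="sales_transactions", fpp=0.1, num_items=50_000_000):
    """Return Databricks SQL statements for the partition + Bloom filter layout.

    All defaults are illustrative assumptions: the table name, the column
    types, the false-positive probability (fpp), and the expected number of
    distinct user_id values (numItems).
    """
    # Partitioning by date and region lets queries that filter on those
    # columns skip whole partitions instead of scanning the full table.
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "  date DATE,\n"
        "  region STRING,\n"
        "  user_id STRING,\n"
        "  product_id STRING,\n"
        "  amount DECIMAL(18, 2)\n"
        ") USING DELTA\n"
        "PARTITIONED BY (date, region)"
    )
    # The Bloom filter index helps skip files that cannot contain a given
    # user_id, which targets the equality lookups described in the question.
    bloom_index = (
        f"CREATE BLOOMFILTER INDEX ON TABLE {table}\n"
        f"FOR COLUMNS (user_id OPTIONS (fpp = {fpp}, numItems = {num_items}))"
    )
    return [create_table, bloom_index]


if __name__ == "__main__":
    for statement in build_statements():
        # On a Databricks cluster one would typically pass each statement
        # to spark.sql(statement); here they are only printed.
        print(statement + ";\n")
```

The index is created right after the table because a Bloom filter index on Databricks covers data written after the index exists, not files already in the table. Partitioning on both columns is only reasonable when the number of date and region combinations stays moderate; with many small partitions the file count can hurt performance more than pruning helps.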