Databricks Certified Data Engineer - Professional


You are a data engineer working on optimizing a large dataset stored in Delta Lake to improve query performance for analytical workloads. The dataset contains sales transactions with columns including 'date', 'region', 'user_id', 'product_id', and 'amount'. The queries frequently filter on 'date' and 'region', and often perform equality lookups on 'user_id'. Given the need to minimize query latency while considering cost and scalability, which of the following indexing and partitioning strategies would you implement? Choose the best option.




Explanation:

The correct answer is C because it matches the query patterns. Partitioning on 'date' and 'region', which appear in most filters and have relatively low cardinality, lets the engine prune whole partitions and so reduces the amount of data scanned. (Partition columns should be low-cardinality; partitioning on a high-cardinality column such as 'user_id' would produce a very large number of small files.) The Bloom filter index on the high-cardinality 'user_id' column speeds up the common equality lookups by letting the engine skip files that cannot contain the requested value. Option A is too narrow, potentially leading to large partitions that don't significantly reduce scan sizes. Option B is inefficient because Z-ordering is not optimal for every data type and distribution. Option D overlooks the importance of tailoring file sizes to the data's nature and query patterns, which can adversely affect performance.
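
The strategy described for option C could be sketched roughly as follows. This is a minimal illustration, not the exam's reference solution: the table name `sales_transactions`, the column types, and the `fpp` and `numItems` values are assumptions chosen for the example. The helper only builds the SQL text, so it runs without a cluster; on Databricks each statement could then be passed to `spark.sql(...)`.

```python
from typing import List


def build_statements(table: str = "sales_transactions",  # hypothetical table name
                     fpp: float = 0.1,                   # assumed false-positive rate
                     num_items: int = 50_000_000) -> List[str]:  # assumed distinct user_ids
    """Return the SQL statements for a partitioned Delta table plus a Bloom filter index."""
    # Partition on the columns used in most filters; column types are assumptions.
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "  `date` DATE,\n"
        "  region STRING,\n"
        "  user_id BIGINT,\n"
        "  product_id BIGINT,\n"
        "  amount DECIMAL(18, 2)\n"
        ") USING DELTA\n"
        "PARTITIONED BY (`date`, region)"
    )
    # Databricks SQL syntax for a Bloom filter index on the equality-lookup column.
    bloom_index = (
        f"CREATE BLOOMFILTER INDEX ON TABLE {table}\n"
        f"FOR COLUMNS (user_id OPTIONS (fpp = {fpp}, numItems = {num_items}))"
    )
    return [create_table, bloom_index]


if __name__ == "__main__":
    for statement in build_statements():
        print(statement)
        print()
```

A query such as `WHERE date = ... AND region = ... AND user_id = ...` would then be able to prune partitions on the first two predicates and use the Bloom filter to skip files for the third. A lower `fpp` gives fewer false positives at the cost of a larger index, so both options are worth tuning against the real data.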