
Answer-first summary for fast verification
Answer: Implement partitioning on both 'date' and 'region' columns to reduce the data scanned per query and apply a bloom filter index on 'user_id' to accelerate equality lookups.
The correct answer is C because it matches the query patterns directly. Partitioning on 'date' and 'region', the two columns most often used in filters, lets the engine skip every partition that does not match, which reduces the amount of data scanned. Both columns have low enough cardinality to work as partition columns, whereas 'user_id' is high-cardinality and would produce far too many small partitions; a bloom filter index on 'user_id' instead lets equality lookups skip files that cannot contain the requested value. Option A is too narrow: partitioning on 'date' alone still forces a scan of every region within each date partition. Option B is inefficient because Z-ordering loses effectiveness with each additional column, so applying it to all columns without regard to query patterns provides little data skipping. Option D only fixes the file size at 128MB and ignores the data's characteristics and access patterns, so it does nothing to reduce the data read by filtered queries.
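To make the partition-pruning argument concrete, here is a minimal pure-Python sketch. It is not Delta Lake code: the sample rows and the helper names build_partitions and query are hypothetical, and a dictionary keyed by (date, region) stands in for the partition directories.

```python
from collections import defaultdict

# Hypothetical sample rows: (date, region, user_id, amount)
ROWS = [
    ("2024-01-01", "EU", "u1", 10.0),
    ("2024-01-01", "US", "u2", 20.0),
    ("2024-01-02", "EU", "u3", 30.0),
    ("2024-01-02", "US", "u1", 40.0),
]


def build_partitions(rows):
    """Group rows by (date, region), like one directory per value pair."""
    partitions = defaultdict(list)
    for date, region, user_id, amount in rows:
        partitions[(date, region)].append((user_id, amount))
    return partitions


def query(partitions, date, region):
    """Read only the partition that matches the filter.

    Returns the matching rows and the number of partitions that were read.
    """
    key = (date, region)
    if key not in partitions:
        return [], 0
    return list(partitions[key]), 1


partitions = build_partitions(ROWS)
rows, scanned = query(partitions, "2024-01-02", "EU")
print(f"read {scanned} of {len(partitions)} partitions: {rows}")
```

With a filter on both columns, only one of the four partitions is read. Partitioning on 'date' alone (Option A) would leave two regions' worth of rows to scan for the same query.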
Author: LeetQuiz Editorial Team
You are a data engineer working on optimizing a large dataset stored in Delta Lake to improve query performance for analytical workloads. The dataset contains sales transactions with columns including 'date', 'region', 'user_id', 'product_id', and 'amount'. The queries frequently filter on 'date' and 'region', and often perform equality lookups on 'user_id'. Given the need to minimize query latency while considering cost and scalability, which of the following indexing and partitioning strategies would you implement? Choose the best option.
A
Partition the dataset solely on the 'date' column to simplify the partitioning scheme.
B
Apply z-ordering across all columns to evenly distribute the data without specific consideration for query patterns.
C
Implement partitioning on both 'date' and 'region' columns to reduce the data scanned per query and apply a bloom filter index on 'user_id' to accelerate equality lookups.
D
Set a uniform file size of 128MB for all data files, ignoring the specific characteristics and access patterns of the data.
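The bloom filter in option C can be illustrated with a toy example. This is a simplified sketch, not the implementation Delta Lake uses: the BloomFilter class, its sizing parameters, and the file names are all hypothetical. It shows the key property that a bloom filter never misses a value that was added, so any file whose filter says "no" can be skipped safely.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter; num_bits and num_hashes are illustrative defaults."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical data files and the user_id values each one holds.
files = {
    "part-0001": ["u1", "u2", "u3"],
    "part-0002": ["u4", "u5", "u6"],
}

# Build one filter per file, as a per-file index would.
filters = {}
for name, user_ids in files.items():
    bloom = BloomFilter()
    for user_id in user_ids:
        bloom.add(user_id)
    filters[name] = bloom

# An equality lookup only needs to open files whose filter might match.
candidates = [name for name, bloom in filters.items() if bloom.might_contain("u5")]
print(candidates)
```

The file that holds "u5" is always among the candidates. Other files are usually excluded, although a false positive can occasionally cause an unnecessary read; it never causes a wrong result.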