
Answer-first summary for fast verification
Answer: `date`
Partitioning Delta tables by a `date` column is a standard best practice for time-series or event-based data in a Lakehouse architecture.

**Why `date` is correct:**

* **Efficient Data Skipping**: Most analytical workloads filter by calendar ranges (e.g., 'last 30 days'). Queries filtering on these dates can skip entire partitions at the file-listing level, dramatically reducing I/O.
* **Manageable Partition Count**: A daily grain results in a predictable number of partitions (~365 per year), allowing Spark to plan queries efficiently without overwhelming the driver with metadata.
* **Optimization**: It aligns with common data retention and vacuuming patterns.

**Why the other options are unsuitable:**

* **`post_time`**: A full timestamp has extremely high cardinality. Partitioning by second or millisecond would create millions of tiny partitions, severely degrading performance.
* **`latitude`**: Geographic coordinates are continuous values. Partitioning on `latitude` creates too many fragmented partitions and rarely matches standard query filter patterns.
* **`post_id`**: This is a unique identifier per row. Partitioning here would result in one partition per post, yielding no pruning benefit and creating massive small-file overhead.
* **`user_id`**: While queries may filter by user, the high cardinality (potentially millions of users) makes it unsuitable for partition pruning, which performs best with low-to-moderate cardinality.
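The cardinality argument above can be illustrated with a small sketch. It uses plain Python and synthetic data rather than Spark or Delta Lake, so it only models the idea: one partition per distinct column value, and a range query that touches only the matching partitions. The row count, user count, date window, and helper name `partition_count` are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
import random

# Synthetic post metadata; every value here is invented for illustration.
random.seed(0)
START = datetime(2024, 1, 1)
NUM_POSTS = 10_000
NUM_USERS = 2_000
DAYS = 90  # posts fall between 2024-01-01 and 2024-03-30

posts = []
for post_id in range(NUM_POSTS):
    post_time = START + timedelta(seconds=random.randrange(DAYS * 86_400))
    posts.append({
        "post_id": post_id,
        "user_id": random.randrange(NUM_USERS),
        "date": post_time.date(),
        "latitude": round(random.uniform(-90.0, 90.0), 6),
        "post_time": post_time,
    })


def partition_count(rows, column):
    """Distinct values of `column`, i.e. how many partitions it would produce."""
    return len({row[column] for row in rows})


cardinality = {
    col: partition_count(posts, col)
    for col in ("post_id", "user_id", "date", "latitude", "post_time")
}

# Model partition pruning: group rows by date, then answer a 30-day range
# query by reading only the partitions whose key falls inside the range.
partitions = defaultdict(list)
for row in posts:
    partitions[row["date"]].append(row)

range_start = date(2024, 3, 1)
range_end = date(2024, 3, 30)
scanned = [d for d in partitions if range_start <= d <= range_end]
rows_read = sum(len(partitions[d]) for d in scanned)

for col, n in sorted(cardinality.items(), key=lambda kv: kv[1]):
    print(f"{col:>10}: {n} partitions")
print(
    f"range query scans {len(scanned)} of {len(partitions)} date partitions, "
    f"{rows_read} of {len(posts)} rows"
)
```

In this model `date` yields at most one partition per day, while `post_id` yields one per row, and the range query reads only the partitions inside the requested window. A real Delta table adds file-level statistics and other optimizations that this sketch does not attempt to represent.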
Author: LeetQuiz Editorial Team
When designing a Delta Lake table to store metadata for user content posts, which of the following columns would be the most effective choice for partitioning to optimize query performance and data skipping?
A. `post_id`
B. `user_id`
C. `date`
D. `latitude`
E. `post_time`