
Explanation:
Partitioning Delta tables by a date column is a standard best practice for time-series or event-based data in a Lakehouse architecture.
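Why date is correct:
A date column has low-to-moderate cardinality (one partition per day), and queries on event data typically filter by date or date range, so whole partitions outside the filter can be pruned without being read.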
Why the other options are unsuitable:
post_time: A full timestamp has extremely high cardinality. Partitioning by second or millisecond would create millions of tiny partitions, severely degrading performance.
latitude: Geographic coordinates are continuous values. Partitioning on latitude creates too many fragmented partitions and rarely matches standard query filter patterns.
post_id: This is a unique identifier per row. Partitioning here would result in one partition per post, yielding no pruning benefit and creating massive small-file overhead.
user_id: While queries may filter by user, the high cardinality (potentially millions of users) makes it unsuitable for partition pruning, which performs best with low-to-moderate cardinality.
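The cardinality argument above can be checked with a small simulation. The following is a minimal sketch in plain Python, not Delta Lake code: it generates synthetic post metadata and counts how many partitions each candidate column would produce. The row counts, user counts, date range, and the helper names `make_posts` and `partition_stats` are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from datetime import datetime, timedelta


def make_posts(n_posts=10_000, n_users=2_000, n_days=30, seed=0):
    """Generate synthetic post metadata rows (hypothetical data, illustration only)."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    posts = []
    for post_id in range(n_posts):
        # Random timestamp with microsecond precision inside the n_days window.
        post_time = start + timedelta(
            seconds=rng.randrange(n_days * 86_400),
            microseconds=rng.randrange(1_000_000),
        )
        posts.append({
            "post_id": post_id,                    # unique per row
            "user_id": rng.randrange(n_users),     # many distinct users
            "post_time": post_time,                # full timestamp
            "date": post_time.date(),              # timestamp truncated to the day
            "latitude": rng.uniform(-90.0, 90.0),  # continuous value
        })
    return posts


def partition_stats(posts, column):
    """Return (partition count, average rows per partition) for a candidate column."""
    counts = Counter(row[column] for row in posts)
    return len(counts), len(posts) / len(counts)


posts = make_posts()
stats = {
    col: partition_stats(posts, col)
    for col in ("post_id", "user_id", "date", "latitude", "post_time")
}
for col, (n_partitions, avg_rows) in stats.items():
    print(f"{col:10s} partitions={n_partitions:6d} avg_rows_per_partition={avg_rows:10.1f}")
```

With these assumed sizes, `date` yields at most 30 partitions holding hundreds of rows each, while `post_id` yields one partition per row. The other columns fall near the one-row-per-partition end, which is the small-file pattern the explanation warns against.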
When designing a Delta Lake table to store metadata for user content posts, which of the following columns would be the most effective choice for partitioning to optimize query performance and data skipping?
A. post_id
B. user_id
C. date
D. latitude
E. post_time