Databricks Certified Data Engineer - Professional

Explanation:

Partitioning Delta tables by a date column is a standard best practice for time-series or event-based data in a Lakehouse architecture.

Why date is correct:

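Efficient Partition Pruning: Most analytical workloads filter by calendar ranges (e.g., 'last 30 days'). A query that filters on the partition column lets the engine eliminate every non-matching partition during planning, using the partition values recorded in the Delta transaction log, so files in those partitions are never read. This dramatically reduces I/O.
Manageable Partition Count: A daily grain results in a predictable number of partitions (~365 per year), allowing Spark to plan queries efficiently without overwhelming the driver with metadata.
Retention: Date partitions align with common data retention patterns. Deleting data older than a cutoff date removes whole partitions without rewriting data files, and a later VACUUM physically removes the unreferenced files once the retention threshold has passed.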
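
The pruning argument can be illustrated with a toy model in plain Python. This is only a sketch and not Spark or Delta code: the file names, the figure of 4 files per day, and the helpers `build_partitions` and `prune` are hypothetical, chosen to show how a 'last 30 days' filter on the partition column narrows the set of files that would need to be read.

```python
from datetime import date, timedelta


def build_partitions(start, days, files_per_day=4):
    """Map each partition value (a date) to hypothetical data file names."""
    partitions = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        partitions[day] = [
            f"date={day.isoformat()}/part-{i:05d}.parquet"
            for i in range(files_per_day)
        ]
    return partitions


def prune(partitions, lo, hi):
    """Keep only the files whose partition value falls in [lo, hi]."""
    return [
        name
        for day, files in partitions.items()
        if lo <= day <= hi
        for name in files
    ]


# One year of daily partitions (assumed: 4 files per day).
partitions = build_partitions(date(2025, 1, 1), days=365)
total_files = sum(len(files) for files in partitions.values())

# "Last 30 days" of the year: 2025-12-02 through 2025-12-31 inclusive.
selected = prune(partitions, date(2025, 12, 2), date(2025, 12, 31))

print(f"{len(selected)} of {total_files} files need to be read")
```

In this model the filter is evaluated against partition values alone, so the remaining 335 partitions are discarded without opening a single file in them.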

Why the other options are unsuitable:

post_time: A full timestamp has extremely high cardinality. Partitioning by second or millisecond would create millions of tiny partitions, severely degrading performance.
latitude: Geographic coordinates are continuous values. Partitioning on latitude creates too many fragmented partitions and rarely matches standard query filter patterns.
post_id: This is a unique identifier per row. Partitioning here would result in one partition per post, each holding a single tiny file, so the small-file and metadata overhead would far outweigh any pruning benefit for point lookups.
user_id: While queries may filter by user, the high cardinality (potentially millions of users) makes it unsuitable for partition pruning, which performs best with low-to-moderate cardinality.
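
The cardinality comparison above can be checked on synthetic data. The following is a minimal sketch in plain Python, assuming made-up values: 10,000 posts spread over 30 days, 2,000 users, and uniformly random latitudes. None of these figures come from the question itself, and `distinct_counts` is a hypothetical helper. Each distinct value of a partition column becomes one partition, so the distinct count is the partition count.

```python
import random
from datetime import datetime, timedelta

random.seed(0)  # fixed seed so repeated runs produce the same data

start = datetime(2025, 1, 1)
rows = []
for post_id in range(10_000):
    # Random timestamp within a 30-day window, at microsecond resolution.
    post_time = start + timedelta(
        seconds=random.randrange(30 * 24 * 3600),
        microseconds=random.randrange(1_000_000),
    )
    rows.append(
        {
            "post_id": post_id,
            "user_id": random.randrange(2_000),
            "post_time": post_time,
            "date": post_time.date(),
            "latitude": random.uniform(-90.0, 90.0),
        }
    )


def distinct_counts(rows):
    """Count distinct values per column, i.e. partitions if partitioned by it."""
    return {column: len({row[column] for row in rows}) for column in rows[0]}


counts = distinct_counts(rows)
for column, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{column:>10}: {n} distinct values")
```

Under these assumptions only date stays at a few dozen partitions; every other candidate column produces thousands of partitions for the same 10,000 rows.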

Last updated: January 19, 2026 at 14:03

post_id: 11.1%
user_id: 11.1%
date: 71.1%
latitude: (no percentage shown)
post_time: 6.7%