
Explanation:
Partitioning a Delta Lake table works best on a low-cardinality column that appears often in query filters, so the data splits into a manageable number of reasonably large segments instead of many small files. The date column (E) is the strongest candidate: it yields one partition per day, a common layout for time-series data that lets queries filtering on a date range skip irrelevant partitions. post_time (A) is a timestamp, so nearly every row would land in its own partition and produce excessive small files. latitude (B) is a continuous float and user_id (D) has one value per user, so both have high cardinality and would create too many partitions. post_id (C) is unique to each post, which would mean one partition per row.
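The cardinality argument can be checked with a small simulation. The following is a minimal sketch in plain Python, not Delta Lake code: it generates synthetic rows shaped like the quiz schema and counts distinct values per candidate column, since partitioning on a column creates one partition per distinct value. The helper names (make_posts, distinct_counts) and the sizes (10,000 posts, 2,000 users, 30 days) are assumptions chosen for illustration only.

```python
import random
from datetime import datetime, timedelta

random.seed(0)  # fixed seed so the synthetic sample is reproducible


def make_posts(n_posts=10_000, n_users=2_000, n_days=30):
    """Generate hypothetical post-metadata rows shaped like the quiz schema."""
    start = datetime(2024, 1, 1)
    posts = []
    for i in range(n_posts):
        # Random second within the n_days window starting at `start`.
        post_time = start + timedelta(seconds=random.randrange(n_days * 86_400))
        posts.append({
            "user_id": random.randrange(n_users),
            "post_id": f"post-{i}",              # unique per post
            "latitude": random.uniform(-90.0, 90.0),
            "post_time": post_time,
            "date": post_time.date(),            # post_time truncated to the day
        })
    return posts


def distinct_counts(rows, columns):
    """Count distinct values per column; each one would become a partition."""
    return {c: len({r[c] for r in rows}) for c in columns}


posts = make_posts()
counts = distinct_counts(
    posts, ["post_time", "latitude", "post_id", "user_id", "date"]
)

# Print columns from fewest to most distinct values.
for column, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{column:<10} {n:>6} distinct values")
```

Under these assumptions, date produces at most 30 distinct values, while post_id produces one per row and the other columns fall in between at the high end. Real data will differ in scale, but the ordering is what drives the choice of partition column.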
Given a Delta Lake table with the following schema for user content post metadata:
user_id LONG,
post_text STRING,
post_id STRING,
longitude FLOAT,
latitude FLOAT,
post_time TIMESTAMP,
date DATE
Which column would be the most suitable for partitioning the Delta table?
A. post_time
B. latitude
C. post_id
D. user_id
E. date