
Explanation:
Delta Lake implements data skipping by automatically recording file-level statistics (minimum and maximum values and null counts) in the transaction log (_delta_log), by default for the first 32 columns of a table.
When a query includes a filter (e.g., latitude > 66.3), the engine consults these pre-computed statistics in the log's AddFile actions or checkpoint files. This allows the engine to identify and skip Parquet files that cannot possibly contain relevant data without ever having to touch the data files themselves.
Parquet file footers are instead read by operations such as CONVERT TO DELTA to bootstrap the initial log.
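The skipping decision described in this explanation can be illustrated with a minimal Python sketch. It is not the Delta engine's implementation: the log entries, file paths, and the helper name files_to_read are all hypothetical, and the sketch only assumes that each add action carries a JSON-encoded stats string with minValues, maxValues, and nullCount.

```python
import json

# Hypothetical, simplified actions as they might appear in a _delta_log commit.
# File names and statistic values are invented for illustration only.
LOG_LINES = [
    json.dumps({"metaData": {"partitionColumns": ["date"]}}),
    json.dumps({"add": {
        "path": "date=2024-01-01/part-0000.parquet",
        "stats": json.dumps({
            "numRecords": 100,
            "minValues": {"latitude": -12.5},
            "maxValues": {"latitude": 40.1},
            "nullCount": {"latitude": 0},
        }),
    }}),
    json.dumps({"add": {
        "path": "date=2024-01-02/part-0001.parquet",
        "stats": json.dumps({
            "numRecords": 80,
            "minValues": {"latitude": 60.0},
            "maxValues": {"latitude": 71.2},
            "nullCount": {"latitude": 0},
        }),
    }}),
    # An add action without statistics: it can never be skipped safely.
    json.dumps({"add": {"path": "date=2024-01-03/part-0002.parquet"}}),
]


def files_to_read(log_lines, column, lower_bound):
    """Return paths of files that might contain rows with column > lower_bound."""
    selected = []
    for line in log_lines:
        add = json.loads(line).get("add")
        if add is None:
            continue  # ignore non-add actions such as metaData
        raw_stats = add.get("stats")
        if raw_stats is None:
            selected.append(add["path"])  # no stats: must read the file
            continue
        max_value = json.loads(raw_stats).get("maxValues", {}).get(column)
        # Keep the file unless its maximum proves that no row can match.
        # A missing maximum is treated conservatively as "cannot skip".
        if max_value is None or max_value > lower_bound:
            selected.append(add["path"])
    return selected


if __name__ == "__main__":
    for path in files_to_read(LOG_LINES, "latitude", 66.3):
        print(path)
```

In this sketch the first file is skipped because its maximum latitude (40.1) is below 66.3, even though the table is partitioned by date rather than latitude. Only the log entries are inspected; no Parquet file is opened.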
A Delta table containing weather records is partitioned by the date column. The schema includes date, device_id, temp, latitude, and longitude. A data engineer executes a query to retrieve records from the Arctic Circle using the filter latitude > 66.3.
What mechanism does the Delta engine use to identify which specific Parquet files must be loaded to satisfy this query?
A. The Delta engine scans the Delta transaction log (_delta_log) to retrieve the minimum and maximum statistics for the latitude column.
B. The Delta engine scans the footers of all Parquet files in the table to extract the minimum and maximum statistics for the latitude column.
C. The Delta engine retrieves the minimum and maximum statistics for the latitude column from the Hive metastore's partition metadata.
D. All records are first cached to the cluster's attached storage, and the filter condition is subsequently applied to the cached data.
E. All records are first loaded into an operational database where the filter is applied as a post-processing step.