
Answer-first summary for fast verification
Answer: The Delta engine scans the Delta transaction log (`_delta_log`) to retrieve the minimum and maximum statistics for the `latitude` column.
Delta Lake implements **data skipping** by automatically recording file-level statistics (minimum, maximum, and null counts) for the first 32 columns of a table in the transaction log (`_delta_log`). When a query includes a filter (e.g., `latitude > 66.3`), the engine consults these pre-computed statistics in the log's `AddFile` actions or checkpoint files. This lets the engine identify and skip Parquet files that cannot possibly contain matching data, without ever opening the data files themselves.

* **Option B is incorrect**: reading Parquet footers on every query would be inefficient; footers are generally read only during operations such as `CONVERT TO DELTA`, to bootstrap the initial log.
* **Option C is incorrect**: the Hive metastore tracks table locations and partition information but does not store granular file-level column statistics.
* **Options D & E are incorrect**: caching occurs after the engine has determined which files to read; it is not the mechanism used for initial file pruning.
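The pruning logic above can be sketched in plain Python. This is a minimal illustration, not the Delta engine's implementation: the file names and statistics below are invented, but the layout mirrors real `_delta_log` commits, where each line is a JSON action and the `stats` field of an `add` action is itself a JSON-encoded string.

```python
import json

# Hypothetical _delta_log commit entries (one JSON action per line, as Delta writes them).
# File paths and latitude ranges are invented for illustration.
commit_lines = [
    json.dumps({"add": {"path": "part-000.parquet",
                        "stats": json.dumps({"numRecords": 100,
                                             "minValues": {"latitude": -33.9},
                                             "maxValues": {"latitude": 45.2}})}}),
    json.dumps({"add": {"path": "part-001.parquet",
                        "stats": json.dumps({"numRecords": 80,
                                             "minValues": {"latitude": 58.1},
                                             "maxValues": {"latitude": 71.4}})}}),
]

def files_to_scan(log_lines, column, threshold):
    """Return paths whose max(column) exceeds threshold -- the only files that
    *might* hold matching rows. Every other file is skipped without being opened."""
    survivors = []
    for line in log_lines:
        action = json.loads(line)
        add = action.get("add")
        if not add:
            continue  # other action types (commitInfo, metaData, ...) carry no file stats
        stats = json.loads(add["stats"])  # stats are stored as an embedded JSON string
        if stats["maxValues"][column] > threshold:
            survivors.append(add["path"])
    return survivors

print(files_to_scan(commit_lines, "latitude", 66.3))  # → ['part-001.parquet']
```

For the filter `latitude > 66.3`, only `part-001.parquet` (max latitude 71.4) survives; `part-000.parquet` (max latitude 45.2) is pruned using the log alone, which is exactly why Option A is correct.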
Author: LeetQuiz Editorial Team
A Delta table containing weather records is partitioned by the `date` column. The schema includes `date`, `device_id`, `temp`, `latitude`, and `longitude`. A data engineer executes a query to retrieve records from the Arctic Circle using the filter `latitude > 66.3`.
What mechanism does the Delta engine use to identify which specific Parquet files must be loaded to satisfy this query?
A
The Delta engine scans the Delta transaction log (_delta_log) to retrieve the minimum and maximum statistics for the latitude column.
B
The Delta engine scans the footers of all Parquet files in the table to extract the minimum and maximum statistics for the latitude column.
C
The Delta engine retrieves the minimum and maximum statistics for the latitude column from the Hive metastore's partition metadata.
D
All records are first cached to the cluster's attached storage, and the filter condition is subsequently applied to the cached data.
E
All records are first loaded into an operational database where the filter is applied as a post-processing step.