How does the Delta engine determine which files to load when querying a Delta table partitioned by date with the schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT, using the filter condition latitude > 66.3 to find records within the Arctic Circle?
Explanation:
Delta Lake optimizes query performance with data skipping, which relies on per-file minimum and maximum column statistics. These statistics are recorded in the Delta transaction log when each data file is written, by default for the first 32 columns, which includes latitude here. When a query with a filter such as latitude > 66.3 runs, the Delta engine compares the filter against these statistics and skips any file whose maximum latitude does not exceed 66.3. Because the table is partitioned by date, not latitude, partition pruning does not apply, so files from every partition remain candidates and data skipping alone narrows the list. Options A and C are incorrect because Delta does not cache or load all records before applying the filter. The option that scans Parquet file footers is incorrect because that would require opening every file, while the same min and max values are already available in the Delta log; this makes option D, which scans the Delta log, the correct answer. Option E is incorrect because the Hive metastore does not store per-file column statistics.
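
The skipping decision itself can be illustrated with a minimal Python sketch. It is not Delta's implementation: the file paths and statistic values are made up for illustration, and the helper files_to_read is hypothetical. The field names (numRecords, minValues, maxValues) mirror the per-file statistics JSON described in the Delta protocol. A file is kept only if its maximum latitude could satisfy latitude > 66.3, or if no statistics are available for it.

```python
import json

# Hypothetical per-file entries; paths and values are invented for illustration.
FILES = [
    {"path": "date=2024-01-01/part-0000.parquet",
     "stats": json.dumps({"numRecords": 1000,
                          "minValues": {"latitude": -12.5},
                          "maxValues": {"latitude": 48.9}})},
    {"path": "date=2024-01-01/part-0001.parquet",
     "stats": json.dumps({"numRecords": 800,
                          "minValues": {"latitude": 60.1},
                          "maxValues": {"latitude": 71.4}})},
    {"path": "date=2024-01-02/part-0000.parquet",
     "stats": json.dumps({"numRecords": 500,
                          "minValues": {"latitude": 66.3},
                          "maxValues": {"latitude": 66.3}})},
    # A file without statistics can never be skipped safely.
    {"path": "date=2024-01-02/part-0001.parquet", "stats": None},
]


def files_to_read(files, column, threshold):
    """Return paths of files that might contain rows with column > threshold."""
    selected = []
    for entry in files:
        if entry["stats"] is None:
            selected.append(entry["path"])  # no stats: must read the file
            continue
        stats = json.loads(entry["stats"])
        max_value = stats.get("maxValues", {}).get(column)
        # Keep the file if the max is unknown or could satisfy the filter.
        if max_value is None or max_value > threshold:
            selected.append(entry["path"])
    return selected


if __name__ == "__main__":
    for path in files_to_read(FILES, "latitude", 66.3):
        print(path)
```

Note that files from both date partitions are examined, since the filter says nothing about date; only the min and max values decide which files survive.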