
Answer-first summary for fast verification
Answer: Files statistics in the Delta transaction log
Delta Lake records statistics for each data file in the transaction log: the total number of records, plus the minimum value, maximum value, and null count for each column (by default, only for the first 32 columns). When a query with a selective filter runs, the query optimizer uses these statistics to skip data files that cannot contain matching records and loads only the files that may contain them. For the query in question, the optimizer checks the min and max statistics of the price column in the transaction log, so any file whose maximum price is 30.5 or lower is skipped without being read. Reference: [Delta Lake Data Skipping](https://docs.databricks.com/delta/data-skipping.html).
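The skipping decision described above can be sketched in a few lines of Python. This is only an illustration, not Delta Lake's actual implementation: the file names and statistic values are invented, and the helper `files_to_read` is hypothetical. The entries mimic the shape of `add` actions in the transaction log, where per-file statistics are stored as a JSON string with `numRecords`, `minValues`, `maxValues`, and `nullCount`.

```python
import json

# Invented example entries shaped like "add" actions in a Delta transaction log.
# The paths and numbers are assumptions made up for this illustration.
log_entries = [
    {"add": {"path": "part-0000.parquet", "stats": json.dumps({
        "numRecords": 100,
        "minValues": {"price": 1.0}, "maxValues": {"price": 25.0},
        "nullCount": {"price": 0}})}},
    {"add": {"path": "part-0001.parquet", "stats": json.dumps({
        "numRecords": 80,
        "minValues": {"price": 20.0}, "maxValues": {"price": 45.5},
        "nullCount": {"price": 2}})}},
    {"add": {"path": "part-0002.parquet", "stats": json.dumps({
        "numRecords": 60,
        "minValues": {"price": 50.0}, "maxValues": {"price": 99.9},
        "nullCount": {"price": 0}})}},
    # A file with no statistics for the price column.
    {"add": {"path": "part-0003.parquet", "stats": json.dumps({
        "numRecords": 10, "minValues": {}, "maxValues": {}, "nullCount": {}})}},
]


def files_to_read(entries, column, lower_bound):
    """Hypothetical helper: paths of files that may hold rows with column > lower_bound."""
    selected = []
    for entry in entries:
        stats = json.loads(entry["add"]["stats"])
        max_value = stats.get("maxValues", {}).get(column)
        # A file without a max statistic cannot be ruled out, so it is kept.
        if max_value is None or max_value > lower_bound:
            selected.append(entry["add"]["path"])
    return selected


# Equivalent of: SELECT * FROM products WHERE price > 30.5
candidates = files_to_read(log_entries, "price", 30.5)
print(candidates)
```

Here `part-0000.parquet` is skipped because its maximum price (25.0) is below the threshold, while the file without statistics is still read because nothing in the log rules it out.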
Author: LeetQuiz Editorial Team
A Delta table named ‘products’ has the schema: name (STRING), category (STRING), expiration_date (DATE), price (FLOAT). When the query SELECT * FROM products WHERE price > 30.5 is executed, which of the following mechanisms does the query optimizer use to identify the data files to load?
A
Columns statistics in the Hive metastore
B
Files statistics in the Delta transaction log
C
Columns statistics in the metadata of Parquet files
D
Files statistics in the Hive metastore
E
None of the above. All data files are fully scanned to identify the ones to load