
Answer-first summary for fast verification
Answer: Delta Lake's data-skipping statistics are not optimized for high-cardinality, free-form text fields like the `review` column.
### Correct Answer: B

Delta Lake's built-in data-skipping mechanism relies on per-file **min/max statistics** stored in the transaction log. For high-cardinality, free-form text (like long reviews), these statistics are often truncated to a prefix of each string and offer negligible selectivity. A predicate with a leading wildcard (e.g., `WHERE review LIKE '%keyword%'`) cannot be evaluated against a min/max range at all, because a keyword can appear inside any string regardless of where that string sorts. Consequently, converting a Parquet table to Delta Lake will not, by itself, accelerate keyword searches.

### Why the other options are incorrect:

* **ZORDER (A)**: Z-Ordering co-locates rows with similar values in the same files so that per-file min/max ranges become narrower, and it still depends on those statistics for skipping. Because min/max ranges on free-form text are not meaningful for keyword predicates (due to truncation and high variance), Z-ordering by `review` adds significant compute cost when the data is rewritten without improving keyword-search performance.
* **Term Matrix (C)**: The Delta transaction log does not generate term matrices or full-text indices. It only tracks file metadata and basic statistics.
* **Column Statistics Limit (D)**: By default, Delta Lake collects statistics for the first **32 columns** of a table, not 4, so the `review` column (the 5th column) is not ignored. This behavior can be adjusted via the table property `delta.dataSkippingNumIndexedCols`.
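The reasoning behind answer B can be illustrated with a small, self-contained simulation. This is only a sketch of the idea in plain Python, not Delta Lake's actual implementation: the `DataFile` class, the file names, and the sample rows are all hypothetical. It shows that a min/max range lets an engine skip files for an equality predicate on `item_id`, while the min/max values of `review` reveal nothing about whether a keyword occurs somewhere inside a file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file plus its per-file min/max stats."""
    name: str
    rows: List[dict]

    def min_max(self, column: str):
        values = [row[column] for row in self.rows]
        return min(values), max(values)


def files_for_equality(files, column, value):
    """Safe skipping: keep only files whose [min, max] range could contain value."""
    kept = []
    for f in files:
        lo, hi = f.min_max(column)
        if lo <= value <= hi:
            kept.append(f.name)
    return kept


def naive_keyword_prune(files, column, keyword):
    """UNSAFE on purpose: guesses from the min/max strings alone."""
    kept = []
    for f in files:
        lo, hi = f.min_max(column)
        if keyword in lo.lower() or keyword in hi.lower():
            kept.append(f.name)
    return kept


def files_actually_matching(files, column, keyword):
    """Ground truth: scan every row of every file for the keyword."""
    return [
        f.name
        for f in files
        if any(keyword in row[column].lower() for row in f.rows)
    ]


# Made-up sample data for illustration only.
files = [
    DataFile("part-0", [
        {"item_id": 1, "review": "Arrived late but works"},
        {"item_id": 2, "review": "Battery died in a week"},
        {"item_id": 3, "review": "Great value for the price"},
    ]),
    DataFile("part-1", [
        {"item_id": 4, "review": "Awful packaging"},
        {"item_id": 5, "review": "Love the color"},
        {"item_id": 6, "review": "Would buy again"},
    ]),
]

print(files_for_equality(files, "item_id", 5))             # ['part-1']
print(naive_keyword_prune(files, "review", "battery"))      # []
print(files_actually_matching(files, "review", "battery"))  # ['part-0']
```

The equality predicate skips `part-0` safely, but the keyword lives in a row that sorts between the file's minimum and maximum strings, so any pruning based on those two values would drop a file that contains a match. The only safe choice for a substring predicate is to read every file, which is why the conversion alone does not speed up keyword searches.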
Author: LeetQuiz Editorial Team
A data science team needs to optimize queries on a free-form text column (`review`) within a Parquet table. The current schema includes `item_id INT`, `user_id INT`, `review_id INT`, `rating FLOAT`, and `review STRING`. The team frequently searches for specific keywords within the `review` field. A junior data engineer suggests that migrating to Delta Lake will automatically improve query performance for these keyword searches. How should you evaluate this suggestion?
A
Applying ZORDER BY review will reorganize the data to provide significant performance gains for keyword-based predicates.
B
Delta Lake's data-skipping statistics are not optimized for high-cardinality, free-form text fields like the review column.
C
The Delta transaction log generates a term matrix for string fields, enabling highly selective filtering for specific keywords.
D
Delta Lake only collects file-level statistics for the first 4 columns of a table by default, meaning the review column (the 5th column) will be ignored.