
Explanation:
Delta Lake's built-in data-skipping mechanism relies on per-file min/max statistics stored in the transaction log. For high-cardinality, free-form text (like long reviews), these statistics are often truncated and offer negligible selectivity. In addition, a leading-wildcard predicate such as WHERE review LIKE '%keyword%' cannot be checked against min/max bounds at all, because the keyword may appear anywhere inside the string. Consequently, converting a Parquet table to Delta Lake will not, by itself, accelerate keyword searches.
The number of leading columns for which statistics are collected is controlled by the table property delta.dataSkippingNumIndexedCols, which defaults to 32.
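The reasoning in the explanation can be illustrated with a small, pure-Python sketch. It is a simplified model of min/max file pruning, not Delta Lake's actual implementation: the `FileStats` class, the helper functions, the file names, and the sample values are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for one file's min/max entry in the transaction log."""
    name: str
    min_value: str
    max_value: str


def files_for_equality(stats, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [s.name for s in stats if s.min_value <= value <= s.max_value]


def files_for_substring(stats, keyword):
    """Model a LIKE '%keyword%' predicate.

    The keyword may appear anywhere inside a string, so lexicographic
    min/max bounds cannot exclude any file: all of them remain candidates.
    """
    return [s.name for s in stats]


# Made-up statistics for three files of a review column.
review_stats = [
    FileStats("part-0.parquet", "A great product", "Works well so far"),
    FileStats("part-1.parquet", "Bad packaging", "Would not buy again"),
    FileStats("part-2.parquet", "Xylophone arrived broken", "Zero stars"),
]

# An exact-match predicate can prune part-2, whose range starts after "Excellent".
equality_candidates = files_for_equality(review_stats, "Excellent")

# A substring predicate prunes nothing, so every file has to be scanned.
substring_candidates = files_for_substring(review_stats, "broken")

print(equality_candidates)
print(substring_candidates)
```

Even the exact-match case skips only one file here, because free-form text tends to produce wide, overlapping min/max ranges. The substring case skips none, which is why option B is the sound evaluation.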
A data science team needs to optimize queries on a free-form text column (review) within a Parquet table. The current schema includes item_id INT, user_id INT, review_id INT, rating FLOAT, and review STRING. The team frequently searches for specific keywords within the review field. A junior data engineer suggests that migrating to Delta Lake will automatically improve query performance for these keyword searches. How should you evaluate this suggestion?
A
Applying ZORDER BY review will reorganize the data to provide significant performance gains for keyword-based predicates.
B
Delta Lake's data-skipping statistics are not optimized for high-cardinality, free-form text fields like the review column.
C
The Delta transaction log generates a term matrix for string fields, enabling highly selective filtering for specific keywords.
D
Delta Lake only collects file-level statistics for the first 4 columns of a table by default, meaning the review column (the 5th column) will be ignored.