
Answer-first summary for fast verification
Answer: Delta Lake statistics are ineffective for high-cardinality free-text fields like reviews, providing limited performance benefits for keyword-based substring searches.
Simply converting a Parquet table to Delta Lake will not significantly improve keyword search performance for free-form text. Delta Lake's primary mechanism for performance optimization is **data skipping**, which relies on file-level statistics like minimum and maximum values. For high-cardinality string columns where users perform substring searches (e.g., `LIKE '%keyword%'`), these statistics are largely useless because the filter does not compare a value against a range. Furthermore, Delta Lake often truncates long strings when collecting statistics, making the min/max values even less meaningful for searching content within the text. **Why the other options are incorrect:** * **B:** Delta Lake fully supports the STRING data type. * **C:** Z-Ordering is designed to improve performance for equality or range predicates by clustering similar data. It still relies on min/max statistics, which do not benefit substring searches in varied text fields. * **D:** The Delta transaction log contains metadata and file paths, not a full-text search index. * **E:** By default, Delta Lake collects statistics for the first **32** columns of a table, not four, and this limit is configurable via table properties.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
A data science team is experiencing slow query performance when searching for specific keywords within a high-cardinality review column (STRING) stored in a Parquet table. A junior data engineer suggests migrating the table to Delta Lake to resolve the performance bottleneck.
How should you respond to this suggestion?
A
Delta Lake statistics are ineffective for high-cardinality free-text fields like reviews, providing limited performance benefits for keyword-based substring searches.
B
Delta Lake tables do not support the STRING data type, which would require the team to restructure the review data into a different format.
C
While Delta Lake offers performance benefits, significant gains for keyword searches would only be achieved by applying Z-ORDER optimization to the review column.
D
The Delta Lake transaction log automatically generates a structured index for text fields, enabling efficient keyword filtering without full table scans.
E
Delta Lake only collects metadata statistics for the first four columns in a table schema, making it ineffective for the review column in this specific table.