
Explanation:
Simply converting a Parquet table to Delta Lake will not significantly improve keyword search performance for free-form text. Delta Lake's primary mechanism for performance optimization is data skipping, which relies on file-level statistics like minimum and maximum values.
For high-cardinality string columns where users perform substring searches (e.g., LIKE '%keyword%'), these statistics are of little use: a leading-wildcard pattern can match anywhere inside a string, so a file's minimum and maximum values cannot rule that file out. Furthermore, Delta Lake often truncates long strings when collecting statistics, which makes the min/max values even less meaningful for searching content within the text.
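The reasoning above can be illustrated with a minimal sketch in plain Python. It does not use Delta Lake itself: `FileStats`, the file names, and the sample review strings are hypothetical stand-ins for per-file min/max statistics, chosen only to show why a range check helps an equality filter but not a substring filter.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for the min/max statistics kept per data file."""
    name: str
    min_value: str
    max_value: str


def files_for_equality(files, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [f.name for f in files if f.min_value <= value <= f.max_value]


def files_for_substring(files, keyword):
    """A keyword may appear inside any string, wherever that string sorts,
    so min/max bounds give no basis for excluding a file: all are scanned."""
    return [f.name for f in files]


# Illustrative statistics for three files of a review column (assumed data).
files = [
    FileStats("part-0", "Amazing battery life", "Great value overall"),
    FileStats("part-1", "Horrible packaging", "Not worth the price"),
    FileStats("part-2", "Okay for the money", "Would buy again"),
]

# Equality filter: only one file's range can contain the value.
equality_candidates = files_for_equality(files, "Poor sound quality")

# Substring filter (like LIKE '%battery%'): nothing can be skipped.
substring_candidates = files_for_substring(files, "battery")

print(equality_candidates)
print(substring_candidates)
```

In this sketch the equality filter narrows the scan to a single file, while the substring filter still has to read every file, which is the behaviour the explanation describes.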
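Why the other options are incorrect:
B: Delta Lake supports the STRING data type, so no restructuring of the review data is required.
C: Z-ORDER co-locates rows with similar values so that per-file min/max ranges become narrower. A substring search cannot use those ranges, so Z-ordering the review column does not help.
D: The transaction log records file-level metadata and statistics. It does not build a text index.
E: By default, Delta Lake collects statistics for the first 32 columns of a table, not the first four.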
A data science team is experiencing slow query performance when searching for specific keywords within a high-cardinality review column (STRING) stored in a Parquet table. A junior data engineer suggests migrating the table to Delta Lake to resolve the performance bottleneck.
How should you respond to this suggestion?
A
Delta Lake statistics are ineffective for high-cardinality free-text fields like reviews, providing limited performance benefits for keyword-based substring searches.
B
Delta Lake tables do not support the STRING data type, which would require the team to restructure the review data into a different format.
C
While Delta Lake offers performance benefits, significant gains for keyword searches would only be achieved by applying Z-ORDER optimization to the review column.
D
The Delta Lake transaction log automatically generates a structured index for text fields, enabling efficient keyword filtering without full table scans.
E
Delta Lake only collects metadata statistics for the first four columns in a table schema, making it ineffective for the review column in this specific table.