
Answer-first summary for fast verification
Answer: Delta Lake statistics are not optimized for free text fields with high cardinality.
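Delta Lake speeds up queries through data skipping: it records per-file statistics, such as the minimum and maximum value of each column, and uses them to skip files that cannot match a filter. This works well for range and equality predicates on columns whose values cluster within files. A free-form text field like 'review' has very high cardinality, and a keyword search is a substring match, so a file's min/max strings say almost nothing about whether a keyword appears in it. Option A correctly states this limitation. Option B is incorrect because Delta collects statistics on the first 32 columns by default, not 4. Option D is incorrect because the Delta log does not create term matrices. Option C is incorrect because ZORDER only co-locates rows by whole column values, which still leaves min/max statistics unable to prune files for substring matches. Converting to Delta Lake alone therefore will not accelerate keyword searches in this scenario.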
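The data-skipping argument above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Delta Lake's implementation: the `FileStats` class, the two sample "files", and all function names are hypothetical and chosen for illustration. It shows that per-file min/max values can safely rule out files for a numeric range filter, while the same kind of check on a text column would wrongly skip a file that does contain the keyword.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for per-file min/max column statistics."""
    name: str
    rating_min: float
    rating_max: float
    review_min: str
    review_max: str


def build_stats(name, rows):
    # rows are (rating, review) tuples; strings compare lexicographically.
    ratings = [rating for rating, _ in rows]
    reviews = [review for _, review in rows]
    return FileStats(name, min(ratings), max(ratings), min(reviews), max(reviews))


# Two made-up data files, each holding a couple of reviews.
files = {
    "part-0": [(1.0, "awful battery, broke in a week"), (2.0, "too small and flimsy")],
    "part-1": [(4.5, "great battery life"), (5.0, "works well, zero complaints")],
}
stats = [build_stats(name, rows) for name, rows in files.items()]


def files_for_rating_at_least(stats, threshold):
    # Safe pruning: a file whose maximum rating is below the threshold
    # cannot contain a matching row.
    return [s.name for s in stats if s.rating_max >= threshold]


def naive_range_prune(stats, keyword):
    # Unsafe for substring search: treats the keyword as if it were a whole
    # column value that must fall between the file's min and max strings.
    return [s.name for s in stats if s.review_min <= keyword <= s.review_max]


def files_actually_containing(files, keyword):
    # Ground truth: scan every review in every file.
    return [
        name
        for name, rows in files.items()
        if any(keyword in review for _, review in rows)
    ]


print(files_for_rating_at_least(stats, 4.0))          # ['part-1']
print(naive_range_prune(stats, "battery"))            # ['part-0']
print(files_actually_containing(files, "battery"))    # ['part-0', 'part-1']
```

The range check on the text column drops part-1 even though one of its reviews mentions "battery", so an engine that wants correct results for a keyword search cannot use min/max statistics to skip files and has to scan all of them.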
Author: LeetQuiz Editorial Team
The data science team needs help optimizing queries on unstructured text from user reviews, stored in Parquet with this schema:
item_id INT,
user_id INT,
review_id INT,
rating FLOAT,
review STRING
The review column contains full user review text, and the team wants to detect occurrences of 30 specific keywords in this field.
A junior data engineer proposes that converting this data to Delta Lake will enhance query performance.
What is the accurate response to this suggestion?
A. Delta Lake statistics are not optimized for free text fields with high cardinality.
B. Delta Lake statistics are only collected on the first 4 columns in a table.
C. ZORDER ON review will need to be run to see performance gains.
D. The Delta log creates a term matrix for free text fields to support selective filtering.