
Explanation:
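Delta Lake speeds up queries through data skipping: the transaction log stores per-file statistics such as min/max values, and a file is skipped when a filter value cannot fall inside its range. This works well for numeric, date, and short categorical columns, especially when the data is clustered on them. It does not work for free-form text such as 'review'. Almost every value is unique, and the lexicographic min/max of long strings says nothing about which words appear inside the text, so a keyword search still has to read every file. Option A correctly states this limitation. Option B is incorrect because Delta collects statistics on the first 32 columns by default, not 4. Option D is incorrect because the Delta log does not build a term matrix. ZORDER (Option C) does not help either: clustering files by the sort order of the review text still leaves min/max ranges that cannot rule out a keyword appearing mid-string. Converting to Delta Lake alone will therefore not accelerate keyword searches in this scenario.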
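The reasoning above can be illustrated with a small pure-Python sketch. It is not Delta Lake code: the file names, rows, and helper functions are hypothetical, and the statistics are a simplified stand-in for the per-file min/max values a Delta log records. It shows that a range filter on `rating` can rule out a file from its statistics alone, while the min/max of `review` gives no reliable signal about a keyword.

```python
# Hypothetical "files", each holding (rating, review) rows.
files = {
    "part-0": [(1.0, "awful battery"), (2.0, "broke after a week"), (1.5, "zipper stuck")],
    "part-1": [(4.5, "amazing sound"), (5.0, "love the battery life"), (4.0, "works well")],
}


def build_stats(name, rows):
    """Compute simplified per-file min/max statistics for both columns."""
    ratings = [rating for rating, _ in rows]
    reviews = [review for _, review in rows]
    return {
        "name": name,
        "min_rating": min(ratings),
        "max_rating": max(ratings),
        "min_review": min(reviews),  # lexicographic minimum
        "max_review": max(reviews),  # lexicographic maximum
    }


stats = [build_stats(name, rows) for name, rows in files.items()]


def can_skip_for_rating(file_stats, threshold):
    """For the filter rating > threshold, a file is skippable if its max is too low."""
    return file_stats["max_rating"] <= threshold


# Range filter: statistics alone prove that part-0 holds no rating above 3.0.
skipped = [s["name"] for s in stats if can_skip_for_rating(s, 3.0)]

# Keyword filter: looking only at min/max strings misses part-1, whose
# matching review sorts between the file's min and max values.
keyword = "battery"
hinted_by_stats = [
    s["name"] for s in stats
    if keyword in s["min_review"] or keyword in s["max_review"]
]
actually_matching = [
    name for name, rows in files.items()
    if any(keyword in review for _, review in rows)
]

print("skipped for rating > 3.0:", skipped)
print("files hinted by min/max strings:", hinted_by_stats)
print("files that really contain the keyword:", actually_matching)
```

Because the statistics cannot safely exclude any file for the keyword filter, an engine has to scan all of them, which is the behavior Option A describes.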
The data science team needs help optimizing queries on unstructured text from user reviews, stored in Parquet with this schema:
item_id INT,
user_id INT,
review_id INT,
rating FLOAT,
review STRING
The review column contains full user review text, and the team wants to detect occurrences of 30 specific keywords in this field.
A junior data engineer proposes that converting this data to Delta Lake will enhance query performance.
What is the accurate response to this suggestion?
A. Delta Lake statistics are not optimized for free text fields with high cardinality.
B. Delta Lake statistics are only collected on the first 4 columns in a table.
C. ZORDER ON review will need to be run to see performance gains.
D. The Delta log creates a term matrix for free text fields to support selective filtering.