
Answer-first summary for fast verification
Answer: Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
The correct approach is E. Because the nightly job overwrites the entire table, every row is rewritten each night and individual record changes cannot be tracked. Replacing the overwrite with a merge statement means only records that have actually changed are modified, and Delta Lake's Change Data Feed (CDF) then records those row-level changes, letting the ML team efficiently identify records modified in the past 24 hours. The other options fall short: A and B still run the model over all records; D stamps every row with a new timestamp on each overwrite, so the field cannot distinguish changed records; and C compares against previous predictions, which only finds customers absent from the last run and misses updates to existing customers. Option E directly addresses the problem by combining merge with CDF to track and surface changed records.
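To make the merge-plus-CDF pattern concrete, here is a minimal Delta SQL sketch. The staging source `upstream_updates`, the key column `customer_id`, and the change-detection column `params_hash` are all hypothetical names chosen for illustration; they do not appear in the question.

```sql
-- One-time setting: enable the change data feed on the target table.
ALTER TABLE customer_churn_params
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Nightly refresh: merge the latest upstream values instead of overwriting,
-- so only rows whose values actually differ are rewritten (and recorded in the CDF).
MERGE INTO customer_churn_params AS target
USING upstream_updates AS source              -- hypothetical staging view of upstream values
ON target.customer_id = source.customer_id    -- hypothetical unique-customer key
WHEN MATCHED AND target.params_hash <> source.params_hash  -- hypothetical change check
  THEN UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *;

-- ML side: read only rows changed in the last 24 hours from the change data feed.
SELECT *
FROM table_changes('customer_churn_params', current_timestamp() - INTERVAL 24 HOURS)
WHERE _change_type IN ('insert', 'update_postimage');
```

The `WHEN MATCHED AND ...` predicate matters: without it, the merge would rewrite every matched row and the CDF would log spurious updates, defeating the purpose of incremental prediction.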
Author: LeetQuiz Editorial Team
A table named customer_churn_params in the Lakehouse is utilized for churn prediction by the machine learning team. This table contains customer information aggregated from multiple upstream sources. Currently, the data engineering team refreshes this table nightly by completely overwriting it with the latest valid values from upstream sources.
The ML team's churn prediction model is stable in production, and they only need to generate predictions for records that have been modified within the last 24 hours.
What approach would streamline the process of identifying these recently changed records?
A
Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
B
Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
C
Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
D
Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
E
Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.