
A Delta Lake table named customer_churn_params with Change Data Feed (CDF) enabled is used for churn prediction in a Lakehouse environment. This table contains customer data aggregated from multiple upstream sources. Currently, the data engineering team refreshes this table nightly by fully overwriting it with the latest valid values from upstream sources.
The machine learning team's churn prediction model is stable in production and only needs to process records that have changed within the last 24 hours.
What approach would most efficiently identify these recently changed records?
A
Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
B
Convert the batch job to a Structured Streaming job using the complete output mode; configure the job to read from the customer_churn_params table and incrementally predict against the churn model.
C
Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed (a sketch of this pattern follows the options).
D
Modify the overwrite logic to include a field populated by calling pyspark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
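
For discussion, option C is the pattern that most efficiently isolates the recently changed records. A minimal PySpark sketch of it is below, assuming a customer_id key column and an upstream_latest staging table (both illustrative; only customer_churn_params comes from the question):

```python
from datetime import datetime, timedelta, timezone

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Upsert the latest upstream values instead of overwriting the whole table,
#    so the Change Data Feed records only rows that actually changed.
#    `upstream_latest` and `customer_id` are illustrative names.
spark.read.table("upstream_latest").createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO customer_churn_params AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# 2. Read only the rows that changed in the last 24 hours from the Change Data
#    Feed (requires delta.enableChangeDataFeed = true, which the question
#    states is already set on this table).
start_ts = (datetime.now(timezone.utc) - timedelta(hours=24)) \
    .strftime("%Y-%m-%d %H:%M:%S")
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", start_ts)
    .table("customer_churn_params")
    # Keep inserts and post-update images; skip pre-images and deletes.
    .filter(F.col("_change_type").isin("insert", "update_postimage"))
)

# `changes` now holds only the recently changed rows; hand it to the churn
# model for scoring (model invocation omitted here).
```

Because the merge writes only rows whose values actually changed, the feed stays small and the model scores a 24-hour delta rather than the full table. If the job tracks the last commit it processed, startingVersion could be used in place of startingTimestamp.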