
Answer-first summary for fast verification

Answer: Replace the current overwrite logic with a `MERGE` statement and enable the Delta Lake Change Data Feed (CDF) to identify and process only those records that have been inserted or updated.

The most efficient approach for incremental processing in Databricks is to leverage **Delta Lake Change Data Feed (CDF)** combined with **MERGE** statements.

**Why this works:** Delta Lake's CDF records every insert, update, and delete produced by write operations such as `MERGE`. Once CDF is enabled, the machine learning pipeline can query only the changed rows, either by setting the `readChangeFeed` read option or by calling the `table_changes()` SQL function. This lets the model score only new or updated records, eliminating the compute waste of rescanning unchanged data.

**Analysis of the other options:**

* **Manual diffs:** Performing a manual difference calculation requires a costly join between two large tables, which is redundant when CDF is available.
* **Timestamps with overwrite:** Because an overwrite replaces the entire table, every row receives the same timestamp, making it impossible to isolate which records actually changed.
* **Complete output mode:** In Structured Streaming, `complete` mode rewrites the entire result set to the sink on every trigger, which scales poorly and defeats the purpose of incremental scoring.
* **Filtering post-scoring:** Applying the model to all rows and then discarding unchanged results still incurs the full compute cost of running the ML model on the entire dataset.
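The pattern above can be sketched in Databricks SQL. This is a minimal illustration, not the question's actual pipeline: the staging table `customer_churn_params_staging`, the `customer_id` key, and the starting version `5` are all assumed for the example.

```sql
-- Enable the Change Data Feed on the target table (one-time setup).
ALTER TABLE customer_churn_params
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Replace the overwrite with an upsert from a staging snapshot.
-- (Staging table and join key are illustrative assumptions.)
MERGE INTO customer_churn_params AS target
USING customer_churn_params_staging AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Downstream, the scoring job reads only rows changed since a
-- given table version, keeping inserts and post-update images.
SELECT *
FROM table_changes('customer_churn_params', 5)
WHERE _change_type IN ('insert', 'update_postimage');
```

Filtering on `_change_type` matters because CDF also emits `update_preimage` and `delete` rows, which the churn model should not re-score.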
Author: LeetQuiz Editorial Team
The machine learning team needs to optimize the workflow for identifying changed records in the `customer_churn_params` table to trigger updates for their churn prediction model. Which method would most effectively streamline the identification of these records for incremental processing?
A
Calculate the difference between the previous model predictions and the current `customer_churn_params` using a unique customer key before making new predictions, only processing customers not found in the previous set.
B
Modify the overwrite logic to include a field populated by `spark.sql.functions.current_timestamp()` during the write process, then use this field to filter for records written on a specific date.
C
Replace the current overwrite logic with a `MERGE` statement and enable the Delta Lake Change Data Feed (CDF) to identify and process only those records that have been inserted or updated.
D
Convert the batch job to a Structured Streaming job using `complete` output mode to read from the `customer_churn_params` table and incrementally predict against the churn model.
E
Apply the churn model to all rows in the `customer_churn_params` table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.