
Answer-first summary for fast verification
Answer: Adjust the processing time to 20 minutes and incorporate `dropDuplicates()` in the streaming query.
Let's evaluate each option:

- **A**: Applying `dropDuplicates` directly on the target table every 20 minutes means running a separate batch job over data that has already been written; it removes duplicates only after the fact and adds recurring compute cost, so it is not cost-effective for a streaming pipeline.
- **B**: Staging the stream in a temporary table, deduplicating it every 20 minutes, and then copying the data to the original downstream table is feasible, but the extra table increases storage costs and adds an additional write step.
- **C**: Correct. Setting a processing-time trigger that matches the upstream emission interval (20 minutes) and calling `dropDuplicates()` in the streaming query removes duplicates within each micro-batch before the data reaches the delta table, with no extra tables or post-hoc cleanup jobs.
- **D**: Calling `dropDuplicates()` in the streaming query does not retroactively clean rows already written to the target table by earlier batches, so this statement overstates what the method accomplishes on its own.
- **E**: `withWatermark()` requires an event-time column in the streaming source and a delay threshold; on its own, without a subsequent `dropDuplicates()`, it only bounds state and does not remove the duplicates within the 20-minute window.

Additional information: aligning a processing-time trigger with the upstream emission interval keeps each micro-batch in step with one upstream emission, so in-batch deduplication catches the duplicates where they are produced.
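A minimal sketch of option C is shown below. It assumes a Kafka topic named `events`, a broker address, checkpoint and table paths, and a key column `event_id`; all of these names are illustrative assumptions, not details from the question.

```python
# Sketch only: requires a running Kafka broker and the Delta Lake package,
# so it is a configuration illustration rather than a runnable test.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-stream").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
)

# Project a key out of the Kafka payload, then drop duplicates within
# the micro-batch before the write. Deduplication key is an assumption.
deduped = (
    raw.selectExpr("CAST(value AS STRING) AS event_id")
       .dropDuplicates(["event_id"])
)

query = (
    deduped.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/dedup-stream")  # hypothetical path
    .trigger(processingTime="20 minutes")  # match the upstream emission interval
    .start("/tmp/tables/downstream")       # hypothetical delta table path
)
```

With the trigger set to 20 minutes, each micro-batch corresponds to one upstream emission, so `dropDuplicates()` handles exactly the duplicates produced within that emission.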
Author: LeetQuiz Editorial Team
In an ETL framework, a Kafka stream acting as an upstream system frequently produces duplicate values within a batch. The streaming query reads from this source and writes to a downstream delta table using the default trigger interval. Given that the upstream system emits data every 20 minutes, which strategy effectively removes duplicates before saving to the downstream delta table while minimizing costs?
A
Apply the dropDuplicates method directly on the target table every 20 minutes.
B
Modify the downstream table to a temporary table within the streaming query, eliminate duplicates from this temporary table every 20 minutes, then transfer the data to the original downstream table.
C
Adjust the processing time to 20 minutes and incorporate dropDuplicates() in the streaming query.
D
Including dropDuplicates() in the streaming query will eliminate duplicates from all prior batches of data.
E
Implement the withWatermark method in the streaming query, specifying 20 minutes as the argument.
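For context on option E: in Spark Structured Streaming, `withWatermark()` is normally paired with `dropDuplicates()`, where the watermark bounds how long deduplication state is kept rather than performing the deduplication itself. A small sketch using the built-in `rate` source (column names come from that source; the pairing shown is an assumption about how E would need to be completed, not part of the question):

```python
# Sketch: watermark + dropDuplicates, using the self-generating "rate"
# source so no external system is needed. Not started, illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("watermark-dedup").getOrCreate()

# The rate source emits (timestamp, value) rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# The watermark lets Spark discard dedup state older than 20 minutes;
# dropDuplicates() is what actually removes the duplicate rows.
deduped = (
    stream.withWatermark("timestamp", "20 minutes")
          .dropDuplicates(["value", "timestamp"])
)
```

This is why E alone does not solve the problem: without the `dropDuplicates()` call, the watermark only manages state.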