
In an ETL pipeline, a Kafka stream acting as the upstream system frequently produces duplicate records within a batch. A streaming query reads from this source and writes to a downstream Delta table using the default trigger interval. Given that the upstream system emits data every 20 minutes, which strategy effectively removes duplicates before writing to the downstream Delta table while minimizing cost?
A
Apply the dropDuplicates() method directly to the target table every 20 minutes.
B
Change the sink of the streaming query to a temporary table, remove duplicates from that temporary table every 20 minutes, and then copy the data into the original downstream table.
C
Set the processing-time trigger to 20 minutes and include dropDuplicates() in the streaming query.
D
Include dropDuplicates() in the streaming query so that duplicates are eliminated from all prior batches of data.
E
Use the withWatermark() method in the streaming query, specifying 20 minutes as the argument.
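
For context, here is a minimal PySpark sketch of a streaming query that drops duplicates within each micro-batch and fires on a 20-minute processing-time trigger; the Kafka broker, topic, checkpoint path, and table name are illustrative assumptions, not part of the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from the upstream Kafka topic (broker address and topic name are illustrative).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

# Cast the Kafka payload to strings and drop duplicates within the micro-batch.
deduped = (
    raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .dropDuplicates(["key", "value"])
)

# Trigger every 20 minutes to match the upstream emission cadence,
# so the query is not processing empty micro-batches in between.
query = (
    deduped.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders_dedup")  # illustrative path
    .trigger(processingTime="20 minutes")
    .toTable("orders_deduped")  # illustrative downstream Delta table
)
```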