A junior data engineer is using the following code to de-duplicate records from a raw stream and insert them into a target Delta table.
(spark.readStream.table("orders_raw")
    .dropDuplicates(["order_id", "order_timestamp"])
    .writeStream
    .option("checkpointLocation", "dbfs:/checkpoints")
    .table("orders_unique"))
However, a senior data engineer points out that this approach may not suffice for ensuring distinct records in the target table, especially with late-arriving duplicates. What could explain the senior engineer's concern?
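For reference, the concern comes from how streaming dropDuplicates works: it only removes duplicates against the state the stream currently retains, and it never checks rows already committed to orders_unique. A duplicate arriving after its key has been evicted from state (or in a later run against an existing target) is appended anyway. A common remedy is to bound the de-duplication state with a watermark and write each micro-batch through an insert-only MERGE using foreachBatch. The sketch below assumes spark is the active SparkSession as in the original snippet, and the 30-minute watermark and the helper name upsert_batch are illustrative choices, not from the original.

from pyspark.sql import DataFrame

def upsert_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Insert-only merge: rows whose (order_id, order_timestamp) pair
    # already exists in the target are skipped, which also catches
    # late-arriving duplicates that streaming state no longer remembers.
    batch_df.createOrReplaceTempView("batch_updates")
    batch_df.sparkSession.sql("""
        MERGE INTO orders_unique AS t
        USING batch_updates AS s
        ON t.order_id = s.order_id
           AND t.order_timestamp = s.order_timestamp
        WHEN NOT MATCHED THEN INSERT *
    """)

(spark.readStream.table("orders_raw")
    # The watermark bounds how long de-duplication state is kept;
    # anything later than 30 minutes falls through to the merge above.
    .withWatermark("order_timestamp", "30 minutes")
    .dropDuplicates(["order_id", "order_timestamp"])
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "dbfs:/checkpoints")
    .start())

Without the watermark, dropDuplicates must keep every key seen so far in state, which grows without bound; with it, state stays bounded but duplicates older than the watermark can slip through, which is exactly why the target-side merge is needed as a second line of defense.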