
Answer-first summary for fast verification
Answer: B. A batch job will overwrite the stream_data_stage table with the deduplicated records calculated from all 'recent' items in the stream_sink table.
When you read a Delta table with `spark.table()`, Spark treats it as a static (batch) source rather than a streaming one. Every time the query runs, it reads all records in the current version of the `stream_sink` table, keeps those where `recent = true`, and deduplicates them on `item_id` and `item_timestamp`. It then writes the result in `overwrite` mode to the `stream_data_stage` table, completely replacing that table's contents on each execution. An incremental job would typically need a streaming read, such as `spark.readStream.table()` with a checkpoint, which this query does not use. Note that `spark.table()` and `spark.read.table()` are functionally equivalent: both return a batch DataFrame for the named table. For more details, refer to the [Spark documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.table.html).
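To make the batch versus incremental distinction concrete, here is a minimal sketch that places the two read styles side by side. The function names `run_batch_overwrite` and `run_incremental_append` and the `checkpoint_path` parameter are hypothetical, introduced only for illustration. The batch writer uses `saveAsTable`, and the streaming variant assumes Spark 3.3 or later (for `availableNow`) and an append-only source table. Neither function is taken from the original question.

```python
def run_batch_overwrite(spark, source="stream_sink", target="stream_data_stage"):
    """Batch job: reads the full current snapshot of `source` on every run."""
    (spark.table(source)                       # static read, no checkpoint involved
        .filter("recent = true")
        .dropDuplicates(["item_id", "item_timestamp"])
        .write
        .mode("overwrite")                     # replaces the target's contents
        .saveAsTable(target))


def run_incremental_append(spark, checkpoint_path,
                           source="stream_sink", target="stream_data_stage"):
    """Incremental job (for contrast): processes only rows not yet checkpointed."""
    return (spark.readStream.table(source)     # streaming read of the same table
        .filter("recent = true")
        .dropDuplicates(["item_id", "item_timestamp"])
        .writeStream
        .option("checkpointLocation", checkpoint_path)  # tracks what was already read
        .trigger(availableNow=True)            # process the backlog, then stop
        .toTable(target))                      # appends by default; returns a StreamingQuery
```

In a notebook where `spark` is already defined, calling `run_batch_overwrite(spark)` twice against an unchanged `stream_sink` should leave `stream_data_stage` with the same rows both times, because each run recomputes the result from scratch. Only the streaming variant remembers what it processed on the previous run.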
Author: LeetQuiz Editorial Team
Given the following query:
    (spark.table("stream_sink")
        .filter("recent = true")
        .dropDuplicates(["item_id", "item_timestamp"])
        .write
        .mode("overwrite")
        .saveAsTable("stream_data_stage"))
Which statement accurately describes the outcome of executing this query?
A
An incremental job will overwrite the stream_sink table with the deduplicated records from stream_data_stage that have been added since the last time the job was run.
B
A batch job will overwrite the stream_data_stage table with the deduplicated records calculated from all 'recent' items in the stream_sink table.
C
An incremental job will overwrite the stream_data_stage table with the deduplicated records from stream_sink that have been added since the last time the job was run.
D
A batch job will overwrite the stream_sink table with the deduplicated records calculated from all 'recent' items in the stream_data_stage table.
E
A batch job will overwrite the stream_data_stage table with the deduplicated records from stream_sink that have been added since the last time the job was run.