
A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:
transactions_df = (spark.read
.schema(schema)
.format("delta")
.table("transactions")
)
Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?
A. Replace predict with a stream-friendly prediction function
B. Replace schema(schema) with option("maxFilesPerTrigger", 1)
C. Replace "transactions" with the path to the location of the Delta table
D. Replace format("delta") with format("stream")
E. Replace spark.read with spark.readStream
Explanation:
The correct answer is E. To read from a Delta table as a streaming source in Databricks, you use spark.readStream instead of spark.read; this is the fundamental difference between batch and streaming reads in Spark Structured Streaming.
Key points:
- spark.read is used for batch processing: it reads the data available at query time all at once.
- spark.readStream is used for streaming processing: it reads data incrementally as it arrives.
- There is no predict function in the code block, so option A does not apply.
- maxFilesPerTrigger can be used to control micro-batch size in streaming, but it is not required to make the code work as a stream source, so option B is not the answer (see the sketch after the corrected code below).
The correct streaming code would look like:
transactions_df = (spark.readStream
.schema(schema)
.format("delta")
.table("transactions")
)
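For context on option B, maxFilesPerTrigger is an optional rate-limiting setting rather than a requirement. A minimal sketch of how it could be combined with the streaming read is shown below; the value of 1 is only an illustration and is not part of the original question:
throttled_df = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1)  # limit each micro-batch to at most one newly committed file
    .table("transactions")
)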
This change enables the pipeline to process data incrementally as new data arrives in the Delta table, which is essential for streaming applications.
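To complete the picture, an incremental pipeline also needs a streaming write with a checkpoint. The sketch below is illustrative only: the target table name, checkpoint path, and availableNow trigger are assumptions, not part of the original question (availableNow requires a recent Spark/Databricks Runtime):
query = (transactions_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/transactions_bronze")  # hypothetical checkpoint path
    .trigger(availableNow=True)  # process all currently available data, then stop
    .toTable("transactions_bronze")  # hypothetical target table
)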