
A data engineer is using the following code block as part of a batch ingestion pipeline to read from a table:
transactions_df = (spark.read
    .schema(schema)
    .format("delta")
    .table("transactions")
)
Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?
A. Replace predict with a stream-friendly prediction function
B. Replace schema(schema) with option("maxFilesPerTrigger", 1)
C. Replace "transactions" with the path to the location of the Delta table
D. Replace format("delta") with format("stream")
E. Replace spark.read with spark.readStream
Explanation:
The correct answer is E. To read from a Delta table as a streaming source in Databricks, you need to use spark.readStream instead of spark.read. The spark.readStream API is specifically designed for streaming data sources and provides the necessary functionality for incremental processing.
spark.readStream vs spark.read:
spark.read is for batch processing (reading data once).
spark.readStream is for streaming processing (reading data incrementally).
Other options analysis:
A. There is no predict function in the code block, so this option is unrelated to the question.
B. maxFilesPerTrigger is an optional configuration for controlling micro-batch size, not a requirement for basic streaming (see the sketch after the corrected code below); it would not replace schema(schema) in any case.
C. A Delta table can be read as a stream by name (table("transactions")) without needing the path.
D. format("delta") is correct for Delta tables; there is no format("stream").
Correct streaming code:
transactions_df = (spark.readStream
    .schema(schema)
    .format("delta")
    .table("transactions")
)
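For context on option B: maxFilesPerTrigger is a tuning knob layered onto a streaming read, not a replacement for anything. A minimal sketch of how it could be added to the same read, assuming the transactions table exists as above (Delta supplies the schema from the table's transaction log, so no explicit schema is needed here):

transactions_df = (spark.readStream
    .format("delta")
    # Optional tuning: cap each micro-batch at 1 new file from the source table
    .option("maxFilesPerTrigger", 1)
    .table("transactions")
)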
Switching to spark.readStream enables the code to read from the Delta table as a streaming source, processing new data incrementally as it arrives.
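Note that a streaming read on its own only defines the source; nothing runs until the stream is written to a sink with a checkpoint. A minimal sketch of completing the pipeline, assuming a hypothetical target table transactions_bronze and a placeholder checkpoint path:

(transactions_df.writeStream
    .format("delta")
    # Checkpoint tracks streaming progress so the query can restart safely
    .option("checkpointLocation", "/tmp/checkpoints/transactions")  # placeholder path
    .outputMode("append")
    .toTable("transactions_bronze")  # hypothetical target table; starts the query
)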