
A data engineer is using the following code block as part of a batch ingestion pipeline to read from a table:
transactions_df = (spark.read
    .schema(schema)
    .format("delta")
    .table("transactions"))
Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?
A
Replace predict with a stream-friendly prediction function
B
Replace schema(schema) with option("maxFilesPerTrigger", 1)
C
Replace "transactions" with the path to the location of the Delta table
D
Replace format("delta") with format("stream")
E
Replace spark.read with spark.readStream
Correct Answer: E

Explanation:
When working with streaming data in Databricks, you need to use spark.readStream instead of spark.read to create a streaming DataFrame. The key differences are:
spark.readStream creates a streaming DataFrame that continuously reads data as it arrives, while spark.read creates a static DataFrame that reads all data at once.
The original code uses batch processing (spark.read), which is suitable for one-time data ingestion but not for streaming sources.
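As a quick illustration of that difference, here is a minimal sketch assuming the spark session and the registered transactions table from the question:

# Batch: reads the table's current contents once; the result is a static DataFrame.
batch_df = spark.read.table("transactions")
print(batch_df.isStreaming)   # False

# Streaming: picks up new data incrementally as it is committed to the table.
stream_df = spark.readStream.table("transactions")
print(stream_df.isStreaming)  # True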
The other options are incorrect:
A: There is no predict function in the code block, so this option does not apply.
B: maxFilesPerTrigger is a valid streaming read option, but it should be added as an extra .option(...) call alongside schema(schema), not substituted for the schema definition.
C: A Delta table registered in the metastore can be read as a stream source by name, so replacing "transactions" with the table's path is unnecessary.
D: format("stream") is not a valid format; format("delta") should be kept for Delta tables.

After replacing spark.read with spark.readStream, the code block looks like:
transactions_df = (spark.readStream
    .schema(schema)
    .format("delta")
    .table("transactions"))
This creates a streaming DataFrame that can process data incrementally as it arrives in the Delta table.
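To make the incremental processing concrete, a minimal sketch of how such a stream might be written out is shown below; the checkpoint path and target table name are illustrative assumptions, not part of the question:

# Incrementally write transactions_df (from the corrected code above) to a Delta sink.
query = (transactions_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/transactions")  # assumed checkpoint path
    .toTable("bronze_transactions"))  # assumed target table name

The checkpoint location lets the query resume from where it left off, which is what makes the incremental, exactly-once processing of new data in the transactions table possible.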