
Question 25
A data engineer has developed a code block to perform a streaming read on a data source. The code block is below:
(spark
    .read
    .schema(schema)
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(dataSource)
)
The code block is returning an error.
Which of the following changes should be made to the code block to configure the block to successfully perform a streaming read?
Explanation:
In Apache Spark Structured Streaming, to perform a streaming read (as opposed to a batch read), you need to use .readStream instead of .read.
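One way to catch this mistake early is to check the isStreaming property that PySpark DataFrames expose: it is False for a DataFrame built with spark.read and True for one built with spark.readStream. The helper below is a minimal sketch of such a guard; the name require_streaming is hypothetical and not part of Spark.

```python
def require_streaming(df):
    """Return df unchanged if it is a streaming DataFrame, otherwise raise.

    Hypothetical helper: it relies only on the DataFrame.isStreaming property.
    """
    if not df.isStreaming:
        raise TypeError(
            "expected a streaming DataFrame; build it with spark.readStream, "
            "not spark.read"
        )
    return df
```

Wrapping the result of .load(dataSource) in require_streaming(...) would make a batch read fail with a clear message before any downstream streaming logic runs.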
Key Points:
- .read is used for batch processing.
- .readStream is used for streaming processing.
- .format("cloudFiles") is correct for reading from cloud storage with Auto Loader.
- .option("cloudFiles.format", "json") is correct for specifying the JSON format.
- .schema(schema) and .load(dataSource) are properly configured.
Why other options are incorrect:
- There is no .stream method in Spark's DataFrameReader API.
- format("stream") is not a valid format; "cloudFiles" is the correct format for Auto Loader.
- There is no .stream method that can be added after spark.
- Adding .stream after .load(dataSource) would also fail, because the DataFrame returned by .load has no .stream method.
The corrected code should be:
(spark
    .readStream
    .schema(schema)
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(dataSource)
)
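A streaming read on its own does not process any data until a streaming write is started. The sketch below shows one way the corrected read might be paired with a write. It assumes a Databricks runtime where Auto Loader is available and a Spark version that supports trigger(availableNow=True). The function name start_json_ingest and the parameters checkpoint_path and target_table are hypothetical and do not appear in the question.

```python
def start_json_ingest(spark, data_source, schema, checkpoint_path, target_table):
    """Start an Auto Loader JSON ingest and return the streaming query.

    Hypothetical helper: checkpoint_path and target_table are assumptions
    introduced for illustration only.
    """
    # Streaming read: note .readStream rather than .read.
    stream_df = (
        spark.readStream
        .schema(schema)
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(data_source)
    )
    # Streaming write: the checkpoint location lets the query track progress.
    return (
        stream_df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process the files available now, then stop
        .toTable(target_table)
    )
```

Passing spark in as an argument keeps the function easy to exercise with a stand-in session object.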