
A data engineer has developed a code block to perform a streaming read on a data source. The code block is below:
(spark
.read
.schema(schema)
.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(dataSource)
)
The code block is returning an error.
Which of the following changes should be made to the code block to configure the block to successfully perform a streaming read?
A. The .read line should be replaced with .readStream.
B. A new .stream line should be added after the .read line.
C. The .format("cloudFiles") line should be replaced with .format("stream").
D. A new .stream line should be added after the spark line.
E. A new .stream line should be added after the .load(dataSource) line.
Correct Answer: A
Explanation:
In Apache Spark Structured Streaming, to perform a streaming read from a data source, you must use .readStream instead of .read. The .read method is used for batch processing, while .readStream is specifically designed for streaming operations.
.readStream vs .read:
.read: returns a DataFrameReader for batch processing.
.readStream: returns a DataStreamReader for streaming processing.
Correct code structure:
(spark
.readStream # Changed from .read
.schema(schema)
.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(dataSource)
)
Why the Other Options Are Incorrect:
B, D, E: There is no .stream method in Spark's DataFrameReader API, so inserting a .stream line at any position is not valid Spark syntax.
C: Replacing .format("cloudFiles") with .format("stream") would break the format specification; "stream" is not a valid source format.
CloudFiles Format: The .format("cloudFiles") line is correct as written. It invokes Auto Loader for reading files from cloud storage, and when combined with .readStream it enables incremental file processing as new files arrive.
This change is essential because Spark Structured Streaming requires explicit declaration of streaming operations through the .readStream method to properly handle incremental data processing, state management, and trigger configurations.
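To make the corrected read concrete, here is a hedged end-to-end sketch that pairs the .readStream source with a .writeStream sink (a streaming DataFrame does nothing until a query is started). The schema, input path, checkpoint location, and output path are illustrative placeholders, not values from the original question, and the cloudFiles source itself requires a Databricks runtime rather than open-source Spark:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("autoloader-sketch").getOrCreate()

# Placeholder schema and input path for illustration only.
schema = StructType([StructField("id", StringType(), True)])
dataSource = "/mnt/landing/json/"

df = (spark
      .readStream                        # streaming read, not .read
      .schema(schema)
      .format("cloudFiles")              # Auto Loader source (Databricks)
      .option("cloudFiles.format", "json")
      .load(dataSource))

# The stream must be started with writeStream; a checkpoint location is
# required so ingestion progress survives restarts.
query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/json_ingest")
         .outputMode("append")
         .start("/mnt/bronze/json_ingest"))
```

A common variant is to add .trigger(availableNow=True) before .start() to process all currently available files and then stop, giving batch-like behavior while keeping the streaming API.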