
Answer-first summary for fast verification
Answer: Set the trigger interval to 10 minutes; each microbatch calls APIs in the source storage account, so reducing the trigger frequency to the longest interval the latency requirement allows minimizes this cost.
With the default trigger (a processing-time interval of 0 seconds), a new microbatch starts as soon as the previous one finishes, so the stream polls source storage continuously — including the 12+ zero-record microbatches per minute noted in the question, each of which still issues storage API calls and incurs cost. Because records only need to be processed within 10 minutes, Option C is correct: a 10-minute trigger interval cuts polling from many batches per minute to one batch every 10 minutes while still satisfying the latency requirement. Option A's rationale is wrong: empty microbatches cannot spill to disk, and a 3-second trigger still polls the source 20 times per minute, so it barely reduces API costs. Option B depends on an external job scheduler and instance pools shared with many other jobs, so queueing or cluster-acquisition delays risk violating the 10-minute latency bound. Option D polls the source twice per second, which makes the cost problem worse, not better.
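To see the scale of the difference, a quick back-of-the-envelope calculation in plain Python compares the polling rates implied by each candidate trigger. The one-API-call-per-microbatch assumption is ours, not the question's; real source connectors may issue several calls per batch, which only widens the gap:

```python
# Rough comparison of source-storage polling rates per trigger setting.
# Assumption (not stated in the question): each microbatch issues at
# least one list/read API call against the source storage account.

SECONDS_PER_DAY = 24 * 60 * 60

def batches_per_day(trigger_interval_s: float) -> int:
    """Microbatches started per day at a fixed trigger interval."""
    return int(SECONDS_PER_DAY / trigger_interval_s)

# Default trigger: the next batch starts as soon as the last finishes.
# The question says batches complete in under 3 s, so ~3 s per cycle is
# a conservative lower bound on the default polling frequency.
default_like = batches_per_day(3)        # >= 28,800 batches/day
option_a     = batches_per_day(3)        # 28,800 batches/day (no better)
option_c     = batches_per_day(10 * 60)  # 144 batches/day
option_d     = batches_per_day(0.5)      # 172,800 batches/day (worse)

print(default_like, option_a, option_c, option_d)
print(f"Option C cuts API calls by ~{default_like / option_c:.0f}x")
```

Even under the generous one-call-per-batch assumption, the 10-minute trigger reduces polling by roughly two orders of magnitude, while a 3-second trigger leaves it essentially unchanged and a 500 ms trigger makes it worse.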
Author: LeetQuiz Editorial Team
A production-deployed Structured Streaming job is incurring higher-than-expected cloud storage costs. Currently, each microbatch processes in under 3 seconds during normal execution, with at least 12 microbatch executions per minute containing zero records. The streaming write uses default trigger settings. The job runs in a workspace with instance pools provisioned to minimize startup time for batch jobs, alongside many other Databricks jobs.
Assuming all other variables remain constant and records must be processed within 10 minutes, which configuration adjustment will meet this requirement while addressing the cost issue?
A
Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
B
Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
C
Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
D
Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
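For reference, the winning configuration is a one-line change on the streaming write. This is a minimal sketch, assuming an existing streaming DataFrame `df` on a running Spark cluster; the format, checkpoint location, and output path are placeholders, not details from the question:

```python
# Sketch: fixed 10-minute trigger on a Structured Streaming write.
# `df` is an assumed streaming DataFrame; paths below are placeholders.
query = (
    df.writeStream
      .format("delta")
      .option("checkpointLocation", "/tmp/checkpoints/example")  # placeholder
      .trigger(processingTime="10 minutes")  # poll the source once per 10 min
      .outputMode("append")
      .start("/tmp/output/example")  # placeholder
)
```

With `processingTime="10 minutes"`, Spark starts one microbatch per interval instead of back-to-back microbatches, so the source storage account is queried far less often while records still land within the 10-minute latency budget.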