
Answer-first summary for fast verification
Answer: Schedule a job to execute the pipeline every hour using a new job cluster.
To meet an hourly refresh requirement at the lowest cost, scheduling a job on a **new job cluster** is the optimal choice for the following reasons:

* **Cost Efficiency of Job Clusters**: Job clusters (automated compute) are priced at a significantly lower DBU rate than all-purpose (interactive) clusters.
* **Ephemeral Lifecycle**: Job clusters start, execute the task, and terminate automatically upon completion. Since the ETL process only lasts 10 minutes, the cluster is only billed for that duration plus startup time, rather than the full hour.
* **Comparison with Streaming**: A Structured Streaming job with a 60-minute trigger (Option C) typically requires the cluster to stay running to poll for data, leading to unnecessary idle costs between micro-batches.
* **Interactive Compute**: Using an all-purpose cluster (Option A) for production ETL is more expensive due to higher DBU rates and the potential for the cluster to remain active (and billable) while waiting for the next scheduled run.
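The cost argument above can be made concrete with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the DBU rates, the cluster size, and the 5-minute startup time are hypothetical placeholders rather than published prices, and cloud VM charges are ignored. Only the 10-minute run time and the hourly schedule come from the question.

```python
# Rough daily cost comparison: ephemeral job cluster vs. always-on all-purpose cluster.
# Every rate and size below is a HYPOTHETICAL placeholder, not a real price.
JOB_DBU_RATE = 0.15          # assumed $/DBU for job (automated) compute
ALL_PURPOSE_DBU_RATE = 0.55  # assumed $/DBU for all-purpose (interactive) compute
DBUS_PER_HOUR = 4.0          # assumed DBU consumption of the cluster per hour

RUN_MINUTES = 10             # ETL duration, from the question
STARTUP_MINUTES = 5          # assumed cluster startup time
RUNS_PER_DAY = 24            # one run per hour, from the question


def daily_cost(billed_hours: float, dbu_rate: float) -> float:
    """Return the DBU cost in dollars for the given number of billed hours."""
    return billed_hours * DBUS_PER_HOUR * dbu_rate


# A job cluster is billed only while it starts up and runs the pipeline.
job_hours = RUNS_PER_DAY * (RUN_MINUTES + STARTUP_MINUTES) / 60

# An always-on all-purpose cluster is billed around the clock.
always_on_hours = 24.0

job_cluster_cost = daily_cost(job_hours, JOB_DBU_RATE)
all_purpose_cost = daily_cost(always_on_hours, ALL_PURPOSE_DBU_RATE)

print(f"Job cluster:         {job_hours:.1f} billed h/day, ${job_cluster_cost:.2f}")
print(f"All-purpose cluster: {always_on_hours:.1f} billed h/day, ${all_purpose_cost:.2f}")
```

Under these assumptions the job cluster is billed for a quarter of the day at a lower rate, so both effects push in the same direction. The exact ratio depends on real pricing and startup times, which vary by cloud, region, and plan.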
Author: LeetQuiz Editorial Team
A business reporting team requires their dashboard data to be refreshed once every hour. The ETL pipeline responsible for data extraction, transformation, and loading typically takes 10 minutes to complete.
Under normal conditions, which configuration would best meet this service-level agreement (SLA) while minimizing operational costs?
A. Schedule the pipeline to run every hour on a dedicated, always-on interactive (all-purpose) cluster.
B. Configure a job to trigger automatically whenever new data files arrive in a specific cloud storage directory.
C. Use a Structured Streaming job with a 60-minute trigger interval on a running cluster.
D. Schedule a job to execute the pipeline every hour using a new job cluster.