
Answer-first summary for fast verification
Answer: Event Timeline, stage summaries (max/median durations, spill metrics, shuffle stats), and per-task details expose stragglers, data skew (max duration >> 75th percentile), spills, uneven partitions, and driver issues. These enable targeted optimizations like enabling AQE, salting keys, tuning shuffle partitions, strategic repartition/coalesce, broadcast joins, or Delta OPTIMIZE with ZORDER — lowering runtime/costs while preserving audit logs.
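The "max duration >> 75th percentile" heuristic from the summary can be sketched without Spark at all. A minimal, Spark-free illustration (the 3x threshold is an assumption for demonstration, not a Spark default):

```python
# Illustrative skew check mirroring the Stages-tab heuristic:
# compare a stage's max task duration to its 75th percentile.
from statistics import quantiles

def looks_skewed(task_durations_s, ratio=3.0):
    """Flag a stage as skewed when the max task duration far exceeds p75."""
    p75 = quantiles(task_durations_s, n=4)[2]  # 75th percentile
    return max(task_durations_s) > ratio * p75

# 19 well-behaved tasks plus one straggler, as you would see in the Stages tab
durations = [12.0] * 19 + [240.0]
looks_skewed(durations)  # the single straggler dominates the max
```

In the real Spark UI these numbers come from the stage's summary metrics table (min/median/75th/max per task); the same comparison applied there is what surfaces stragglers.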
A: Incorrect. Timestamps do support audits, but the Event Timeline's core purpose is performance diagnostics: it reveals scheduling delays, executor parallelism over time, idle periods, and stage dependencies, directly highlighting inefficiencies for cost and runtime fixes.
B: Incorrect. The Spark UI delivers rich task-level data in the Stages tab: individual task durations, sizes, spills, and GC time; summary statistics (min/median/75th/max) flag skew; the stage timeline visualizes stragglers. Ganglia shows cluster-level metrics but not this native per-task detail.
C: Correct. This accurately reflects the Spark UI's strengths. The timeline shows flow and delays; stage summaries detect skew, spills, and shuffle costs; task views pinpoint imbalance. These insights drive proven fixes (AQE for automatic optimizations, salting and broadcast joins to cut shuffles, partition tuning, Delta ZORDER for better data layout), achieving faster execution, lower DBUs, reliable scaling, and intact compliance (timestamps and logs unchanged).
D: Incorrect. The Spark UI is useful well beyond failure analysis: it proactively surfaces skew, spills, stragglers, shuffle overhead, and underutilization in successful runs, making it central to ongoing tuning and cost management.
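The AQE-related fixes mentioned under option C map to a handful of standard Spark 3.x settings. A hedged configuration sketch (assumes an active `SparkSession` named `spark`; the partition count is illustrative, not a recommendation):

```python
# Standard Spark 3.x AQE knobs; values here are illustrative only.
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # turn AQE on
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", "400")                    # starting point AQE can coalesce from
```

With these enabled, Spark rewrites skewed joins and post-shuffle partitioning at runtime, which is why option C lists AQE first among the remediations.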
Author: LeetQuiz Editorial Team
You are troubleshooting a production ETL job on an Azure Databricks all-purpose cluster (autoscaling enabled, Delta Lake tables) processing terabytes of event data with wide transformations, high-cardinality joins, and aggregations. The job shows inconsistent durations, occasional timeouts, and elevated DBU costs despite scaling. No task failures are present. You analyze the Spark UI (Jobs tab with Event Timeline, Stages tab with summary/task metrics and shuffle stats, Executors tab). Which option best describes the key insights available and their primary application to optimize runtime, reduce DBU consumption, improve scalability predictability, and maintain audit compliance?
A
The Event Timeline mainly logs stage timestamps for compliance audits but provides minimal value for diagnosing inefficiencies or guiding performance tuning.
B
Stage metrics (input sizes, shuffle bytes) identify broad shuffle bottlenecks, but task-level skew, stragglers, or partition imbalance require Ganglia or custom logging since Spark UI lacks per-task granularity.
C
Event Timeline, stage summaries (max/median durations, spill metrics, shuffle stats), and per-task details expose stragglers, data skew (max duration >> 75th percentile), spills, uneven partitions, and driver issues. These enable targeted optimizations like enabling AQE, salting keys, tuning shuffle partitions, strategic repartition/coalesce, broadcast joins, or Delta OPTIMIZE with ZORDER — lowering runtime/costs while preserving audit logs.
D
Spark UI views are primarily for post-failure analysis (exceptions, OOM, fetch failures) and offer limited proactive insight for tuning slow but successful runs or controlling costs.
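The key-salting remediation named in option C can be illustrated without a cluster. A Spark-free sketch of the idea (names like `salt_key` and the fan-out of 8 are illustrative assumptions): the skewed side appends a random salt to the hot key, and the other side replicates each key once per salt, so the hot key's rows spread across several shuffle partitions while every row still finds its join match.

```python
# Spark-free sketch of key salting for a skewed join.
import random

N_SALTS = 8  # illustrative fan-out; tune to the observed skew

def salt_key(key, salts=N_SALTS):
    """Skewed side: append a random salt so the hot key spreads across buckets."""
    return f"{key}#{random.randrange(salts)}"

def explode_key(key, salts=N_SALTS):
    """Other side: replicate each key once per salt so every salted row matches."""
    return [f"{key}#{i}" for i in range(salts)]

# After salting, rows for "hot" hash to up to N_SALTS partitions instead of one,
# and each salted variant still joins against the replicated side.
```

In actual PySpark this corresponds to adding a salt column (e.g. via `rand()`) on the large side and a generated salt array exploded on the small side, then joining on the composite key; AQE's skew-join handling often makes manual salting unnecessary on Spark 3.x.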