
Answer-first summary for fast verification
Answer: Skew caused by more data being assigned to a subset of Spark partitions.
When the Min and Median task durations are approximately equal, most tasks are processing similar amounts of data and finishing in roughly the same time. A Max duration about 100 times longer points to a small number of outlier tasks, which is the classic signature of data skew (Option D): a subset of partitions holds disproportionately more data, so the tasks processing them run far longer, and the stage cannot complete until they do. The other options are less likely: task queueing (A) would show up as scheduling delay rather than long execution time; spill (B) from undersized storage would tend to slow many tasks, not a single extreme outlier; network latency (C) could add variability, but not a 100x gap confined to a few tasks.
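The effect is easy to reproduce outside Spark. Below is a minimal pure-Python sketch (not actual Spark code) of hash partitioning with a hypothetical "hot" key: one partition ends up holding nearly all the records, just as one Spark partition would, and its task would run far longer than the rest. It also sketches key salting, a common mitigation, in which a random suffix splits the hot key into several sub-keys that hash to different partitions.

```python
import random
from collections import Counter

random.seed(0)

NUM_PARTITIONS = 8

# Hypothetical skewed dataset: 90% of records share one hot key.
records = ["hot_key"] * 9000 + [f"key_{i}" for i in range(1000)]

# Hash partitioning, as a shuffle would do: every record with the
# same key lands in the same partition, so the hot key's 9000
# records all pile into one partition.
sizes = Counter(hash(k) % NUM_PARTITIONS for k in records)
print("partition sizes without salting:", sorted(sizes.values(), reverse=True))

# Salting: append a random suffix to the hot key only, spreading its
# records across several sub-keys and hence across partitions.
salted = [
    f"{k}#{random.randrange(NUM_PARTITIONS)}" if k == "hot_key" else k
    for k in records
]
salted_sizes = Counter(hash(k) % NUM_PARTITIONS for k in salted)
print("partition sizes with salting:   ", sorted(salted_sizes.values(), reverse=True))
```

In Spark itself the same idea applies to skewed join or aggregation keys; on Spark 3.x, Adaptive Query Execution (`spark.sql.adaptive.skewJoin.enabled`) can also split oversized partitions automatically.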
Author: LeetQuiz Editorial Team
A Spark job is running slower than anticipated. While analyzing the Spark UI, a data engineer observes that for tasks in a specific stage, the Min and Median task durations are approximately equal, but the Max task duration is about 100 times longer than the Min.
What issue is causing the overall job to take longer to complete?
A
Task queueing resulting from improper thread pool assignment.
B
Spill resulting from attached volume storage being too small.
C
Network latency due to some cluster nodes being in different regions from the source data.
D
Skew caused by more data being assigned to a subset of Spark partitions.