
Answer-first summary for fast verification
Answer: Skew caused by more data being assigned to a subset of Spark partitions (Option D).
The scenario described, where the Min and Median task durations are approximately equal but the Max duration is about 100 times the Min, is a classic signature of data skew. Data skew occurs when a disproportionate amount of data lands in a few partitions, so the tasks processing those partitions take far longer than the rest. Because a stage cannot finish until its slowest task finishes, these straggler tasks directly inflate the overall job duration. Options A, B, C, and E describe issues that can slow tasks down, but they would tend to affect many tasks broadly rather than produce this pronounced gap between the Min/Median and the Max. The most likely cause is therefore skew from uneven partition sizes (Option D).
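The duration pattern described above can be checked programmatically. Below is a minimal, illustrative sketch in plain Python (no Spark required); the function name `diagnose_skew` and the `ratio_threshold` parameter are hypothetical choices for this example, not part of any Spark API:

```python
# Illustrative sketch: given per-task durations from a Spark UI stage,
# flag the data-skew signature (median close to min, max far above median).
from statistics import median

def diagnose_skew(task_durations_s, ratio_threshold=10.0):
    """Return True if the duration profile matches the data-skew signature.

    ratio_threshold is an assumed tunable: how many times larger the max
    must be than the median (while the median stays near the min) before
    we call it skew.
    """
    lo = min(task_durations_s)
    mid = median(task_durations_s)
    hi = max(task_durations_s)
    # Skew signature: most tasks finish in similar time (median ~ min),
    # but a few straggler tasks take far longer (max >> median).
    return mid < ratio_threshold * lo and hi > ratio_threshold * mid

# 199 well-balanced tasks plus one straggler holding the "hot" partition,
# mirroring the question's 100x Max-vs-Min gap.
durations = [1.0] * 199 + [100.0]
print(diagnose_skew(durations))  # True
```

In practice you would read these durations from the Spark UI's stage summary (Min/Median/Max row); common mitigations include repartitioning on a better-distributed key, key salting, or enabling adaptive query execution's skew-join handling.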
Author: LeetQuiz Editorial Team
A Spark job is running slower than anticipated. While analyzing the Spark UI, a data engineer observes that for tasks in a specific stage, the Min and Median Durations are approximately equal, but the Max Duration is about 100 times longer than the minimum.
What issue is causing the overall job to take longer to complete?
A
Task queueing resulting from improper thread pool assignment.
B
Spill resulting from attached volume storage being too small.
C
Network latency due to some cluster nodes being in different regions from the source data.
D
Skew caused by more data being assigned to a subset of Spark partitions.
E
Credential validation errors while pulling data from an external system.