
Answer-first summary for fast verification
Answer: Data skew resulting from uneven distribution, where certain partitions contain significantly more records than others.
### Explanation

The scenario described, where the **minimum** and **median** task durations are low and consistent but the **maximum** duration is an extreme outlier (100x higher), is a classic signature of **Data Skew**.

* **Data Skew (Correct):** This occurs when a few partitions contain significantly more data than the rest. Since Spark processes partitions in parallel, the 'straggler' tasks handling these large partitions become a bottleneck for the entire stage.
* **Task Queuing:** If thread pool configuration were the issue, you would likely see tasks waiting to start despite having available cores. It would not typically cause a single task to execute 100x longer than others once it has already started.
* **Spillover:** While spilling to disk (due to memory pressure) does slow down tasks, a 100x duration difference is most commonly rooted in the sheer volume of data in a single partition (skew) rather than just the mechanism of writing to disk.
* **Network Delays:** Latency from cross-region data transfers would generally increase the duration for all tasks performing reads, leading to an elevated median duration rather than a single extreme outlier.
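The skew signature can be reproduced without a cluster. The following is a minimal sketch in plain Python, not Spark code: it assumes simple hash partitioning (`hash(key) % num_partitions`) and assumes task duration is roughly proportional to the number of records in a partition. The helper names `partition_sizes` and `skew_ratio`, the key values, and the record counts are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from statistics import median


def partition_sizes(keys, num_partitions):
    """Count records per partition under simple hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(values):
    """Ratio of the largest value to the median; close to 1 means balanced."""
    return max(values) / median(values)


# Hypothetical dataset: eight integer keys with 100 records each,
# plus one "hot" key (3) that carries 10,000 extra records.
keys = [k for k in range(8) for _ in range(100)]
keys += [3] * 10_000

sizes = partition_sizes(keys, num_partitions=8)

# Under the proportionality assumption, these stand in for task durations.
print("min:", min(sizes), "median:", median(sizes), "max:", max(sizes))
print("max / median:", skew_ratio(sizes))
```

Seven of the eight partitions hold the same small number of records, so the minimum and median agree, while the partition that receives the hot key is about two orders of magnitude larger. That is the same shape as the task-duration summary described in the question.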
Author: LeetQuiz Editorial Team
A Spark job is executing much slower than expected. Upon examining the Spark UI, a data engineer notices that within a specific stage, the minimum and median task durations are nearly identical, yet the maximum task duration is approximately 100 times longer than the minimum. What is the most likely cause for this performance bottleneck?
A. Disk spillover caused by insufficient attached volume storage for temporary data.
B. Data skew resulting from uneven distribution, where certain partitions contain significantly more records than others.
C. Task queuing delays resulting from an incorrectly configured thread pool.
D. Network latency caused by cluster nodes residing in different geographic regions from the source data.