
Explanation:
The scenario described, where most tasks (minimum and median) finish quickly but one or a few tasks (maximum) take orders of magnitude longer, is a classic symptom of data skew.
Data skew occurs when the underlying data is not distributed evenly across partitions. The tasks responsible for the 'heavy' partitions become stragglers: they process far more records than their peers, and because a stage finishes only when its slowest task does, they hold up the entire stage.
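To make the symptom concrete, here is a minimal sketch in plain Python (no Spark required) that mimics how records land in partitions when one key is far more frequent than the rest. The dataset, the key values, and the helper names `partition_sizes` and `skew_summary` are hypothetical, and the modulo assignment is a simplification of hash partitioning rather than Spark's actual implementation.

```python
from collections import Counter
from statistics import median


def partition_sizes(keys, num_partitions):
    """Count the records assigned to each partition under modulo partitioning."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_summary(sizes):
    """Return (min, median, max) partition sizes, similar to a per-stage task summary."""
    return min(sizes), median(sizes), max(sizes)


# Hypothetical data: keys 0-7 have 100 records each; the hot key 8 has 10,000.
keys = [k for k in range(8) for _ in range(100)] + [8] * 10_000

sizes = partition_sizes(keys, num_partitions=8)
low, mid, high = skew_summary(sizes)

print(sizes)
print(f"min={low} median={mid} max={high} max/min ratio={high / low:.0f}")
```

In this toy run the minimum and median partition sizes are identical while the maximum is roughly 100 times larger, which matches the task-duration pattern in the question if task time scales with record count.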
While monitoring a Spark job, you observe in the Spark UI that for a specific stage, the minimum and median task durations are nearly identical. However, the maximum task duration is approximately 100 times longer than the minimum. What is the most likely cause of this performance bottleneck?
A
Network latency caused by cluster nodes residing in different geographic regions from the source data.
B
Task queuing delays stemming from an incorrectly configured executor thread pool.
C
Data skew resulting from uneven distribution, where certain Spark partitions contain significantly more records than others.
D
Disk spillover caused by insufficient attached volume storage for intermediate data processing.