
Answer-first summary for fast verification
Answer: Data skew resulting from a disproportionate amount of data being allocated to a small number of Spark partitions.
The metrics described, where the `min` and `median` task durations are consistent but the `max` is a significant outlier, are the classic signature of **data skew**.

* **Why D is correct:** Data skew occurs when certain partitions contain significantly more records than others. Because Spark runs one task per partition, the tasks assigned to these 'heavy' partitions take much longer than the tasks working on evenly sized partitions. The stage cannot finish until its slowest task does, so a few heavy partitions become the bottleneck.
* **Why others are incorrect:**
    * **Network latency (A)** and **Credential errors (B)** would generally affect all tasks uniformly, raising the median duration rather than creating a single outlier.
    * **Task queuing (E)** would delay the start times of many tasks, shifting the median higher.
    * **Spillage (C)** typically occurs when executor memory is exhausted during shuffles. While it can cause slowness, it is often a symptom of skew, or it would affect multiple tasks involved in the shuffle. If local storage were actually too small for the shuffle data, the job would likely fail with a disk-space error (for example `No space left on device`) rather than just running slowly.

**Recommendation:** To mitigate this, investigate the join or group-by keys for values that dominate the data, and consider enabling **Adaptive Query Execution (AQE)** with skew join optimization when the skewed stage is a join.
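The recommendation to enable AQE skew join optimization can be sketched as a small set of Spark SQL configuration keys. This is a minimal sketch, not a tuned configuration: the helper name `apply_aqe_skew_settings` is hypothetical, the factor and threshold values are illustrative, and `spark` is assumed to be an existing `SparkSession` (or any object exposing `conf.set`).

```python
# Illustrative values only; tune the factor and threshold for the workload.
AQE_SKEW_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # A partition is treated as skewed when it exceeds both this multiple of
    # the median partition size and the byte threshold below.
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def apply_aqe_skew_settings(spark, settings=AQE_SKEW_SETTINGS):
    """Apply each setting through spark.conf.set (hypothetical helper)."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark
```

With a live session this would be called as `apply_aqe_skew_settings(spark)` before running the query. Note that these settings target skewed joins. A skewed group-by usually needs a different approach, such as salting the key, which is not shown here.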
Author: LeetQuiz Editorial Team
A data engineer is analyzing a Spark job that is taking significantly longer than expected. Upon reviewing the Spark UI, they notice that the minimum and median completion times for tasks in a specific stage are nearly identical. However, the maximum task duration is approximately 100 times longer than the minimum.
What is the most likely cause of this performance discrepancy?
A. Network latency resulting from cluster nodes being deployed in a different region than the source data storage.
B. Credential validation delays occurring during the retrieval of data from an external system, leading to authentication retries.
C. Disk spillage caused by local storage volumes being sized incorrectly for the amount of shuffle data being processed.
D. Data skew resulting from a disproportionate amount of data being allocated to a small number of Spark partitions.
E. Task queuing delays caused by an improper assignment of threads within the executor thread pool.
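
The pattern in the question (minimum and median nearly identical, maximum about 100 times larger) can be reproduced in miniature without a cluster. The sketch below is plain Python, not Spark: it assumes simple hash partitioning on integer keys and assumes task duration is roughly proportional to the number of records in a partition. The function names and the `factor=5` cutoff are hypothetical choices for illustration.

```python
import statistics
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count records per partition under simple hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def looks_skewed(sizes, factor=5):
    """Heuristic: the largest partition dwarfs the median partition."""
    return max(sizes) > factor * statistics.median(sizes)


# 8,000 distinct keys spread evenly, plus one hot key repeated 99,000 times.
even_keys = list(range(8000))
hot_keys = [3] * 99_000
sizes = partition_sizes(even_keys + hot_keys, num_partitions=8)

print(min(sizes), statistics.median(sizes), max(sizes))  # 1000 1000.0 100000
print(looks_skewed(sizes))  # True
```

Seven partitions hold 1,000 records each while the partition receiving the hot key holds 100,000, so the minimum and median match and the maximum is 100 times larger, which is the shape the Spark UI shows for a skewed stage.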