
Explanation:
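The metrics described, where the min and median task durations are consistent but the max is a significant outlier, are the classic signature of data skew. Most partitions hold a similar amount of data and finish at about the same time, while one oversized partition keeps a single task running far longer than the rest.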
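As a rough illustration of that signature, the sketch below compares the maximum task duration with the median. The function name `skew_ratio` and the sample durations are hypothetical and are not taken from the Spark UI or any Spark API; real values would be read from the stage's task summary.

```python
from statistics import median


def skew_ratio(durations_s):
    """Return max / median task duration; a large ratio hints at skew."""
    if not durations_s:
        raise ValueError("need at least one task duration")
    med = median(durations_s)
    if med <= 0:
        raise ValueError("median duration must be positive")
    return max(durations_s) / med


# Hypothetical stage: 199 tasks take about 2 s, one straggler takes 200 s.
durations = [2.0] * 199 + [200.0]
ratio = skew_ratio(durations)
print(f"min={min(durations)}s median={median(durations)}s "
      f"max={max(durations)}s ratio={ratio:.0f}x")
```

A ratio close to 1 suggests evenly sized partitions. A ratio like the one in this question points to one or a few partitions doing most of the work.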
Undersized local storage volumes (option C) would more likely cause an Out of Disk Space error rather than just running slowly.
Recommendation: To mitigate this, investigate the join or group-by keys and consider enabling Adaptive Query Execution (AQE) with skew join optimization.
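The recommendation might look like the following sketch. The configuration keys are Spark 3.x SQL settings, but the values are only illustrative and should be checked against the defaults for your Spark version. The helpers `apply_aqe_skew_conf` and `top_key_share` are hypothetical names introduced here, and `spark` is assumed to be an existing `SparkSession`.

```python
from collections import Counter

# AQE settings related to skewed joins (Spark 3.x configuration keys).
# The factor and threshold values are illustrative, not tuned advice.
AQE_SKEW_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def apply_aqe_skew_conf(spark, conf=AQE_SKEW_CONF):
    """Apply each setting through the session's runtime config."""
    for key, value in conf.items():
        spark.conf.set(key, value)


def top_key_share(keys):
    """Return the most frequent key and its share of all rows.

    `keys` is assumed to be a sample of join or group-by key values
    collected to the driver, not a full distributed dataset.
    """
    counts = Counter(keys)
    key, count = counts.most_common(1)[0]
    return key, count / len(keys)
```

For example, `top_key_share(["a", "a", "a", "b"])` reports that key `"a"` accounts for 75% of the sampled rows. One key dominating the sample in this way is the kind of imbalance that skew join optimization is meant to split up.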
A data engineer is analyzing a Spark job that is taking significantly longer than expected. Upon reviewing the Spark UI, they notice that the minimum and median completion times for tasks in a specific stage are nearly identical. However, the maximum task duration is approximately 100 times longer than the minimum.
What is the most likely cause of this performance discrepancy?
A. Network latency resulting from cluster nodes being deployed in a different region than the source data storage.
B. Credential validation delays occurring during the retrieval of data from an external system, leading to authentication retries.
C. Disk spillage caused by local storage volumes being sized incorrectly for the amount of shuffle data being processed.
D. Data skew resulting from a disproportionate amount of data being allocated to a small number of Spark partitions.
E. Task queuing delays caused by an improper assignment of threads within the executor thread pool.