
A data engineer is analyzing a Spark job that is taking significantly longer than expected. Upon reviewing the Spark UI, they notice that the minimum and median completion times for tasks in a specific stage are nearly identical. However, the maximum task duration is approximately 100 times longer than the minimum.
What is the most likely cause of this performance discrepancy?
A. Network latency resulting from cluster nodes being deployed in a different region than the source data storage.
B. Credential validation delays occurring during the retrieval of data from an external system, leading to authentication retries.
C. Disk spillage caused by local storage volumes being sized incorrectly for the amount of shuffle data being processed.
D. Data skew resulting from a disproportionate amount of data being allocated to a small number of Spark partitions.
E. Task queuing delays caused by an improper assignment of threads within the executor thread pool.
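
The task-time pattern in the question (minimum and median nearly identical, maximum about 100 times larger) can be reproduced with a small simulation of how records are spread across partitions. The sketch below is plain Python, not Spark. The names `partition_sizes` and `summarize`, the key distribution, and the use of `key % num_partitions` as a stand-in for a hash partitioner are all assumptions made for illustration, and it assumes task duration is roughly proportional to the number of records in a partition.

```python
from collections import Counter
from statistics import median


def partition_sizes(keys, num_partitions):
    """Count the records that land in each partition.

    Assumption: `key % num_partitions` stands in for a hash partitioner;
    this is not Spark's actual hashing.
    """
    counts = Counter(key % num_partitions for key in keys)
    return [counts[p] for p in range(num_partitions)]


def summarize(sizes):
    """Return (min, median, max), mirroring the summary in the question."""
    return min(sizes), median(sizes), max(sizes)


# Hypothetical data: 8,000 distinct keys seen once each, plus one "hot" key
# (key 3) that appears 99,000 more times.
keys = list(range(8000)) + [3] * 99000

sizes = partition_sizes(keys, num_partitions=8)
low, mid, high = summarize(sizes)

print("records per partition:", sizes)
print("min:", low, "median:", mid, "max:", high)
print("max / min ratio:", high / low)
```

In this toy setup, seven of the eight partitions hold the same number of records, so the minimum and median match, while the single partition that receives the hot key is 100 times larger. Under the proportionality assumption above, that one partition would correspond to the long-running task.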