
Answer-first summary for fast verification
Answer: Configuring `spark.default.parallelism` to match the number of cores in the cluster.
To balance the load across all nodes in a Spark job that processes skewed data, the workload must be spread evenly over the available resources. Setting `spark.default.parallelism` to match the number of cores in the cluster ensures every core receives work, maximizing parallelism and spreading tasks uniformly across the cluster. Alternatives such as enabling adaptive skew-join optimization (`spark.sql.adaptive.skewJoin.enabled`) or tuning `spark.sql.shuffle.partitions` can help in specific cases, but they do not balance the load across all nodes as directly as matching the parallelism to the cluster's core count.
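As a minimal sketch of how this setting would be derived, assume a hypothetical cluster of 4 executors with 8 cores each (these numbers are illustrative, not from the question). The total core count becomes the value passed for `spark.default.parallelism`:

```python
# Illustrative sketch: derive spark.default.parallelism from an assumed
# cluster shape (4 executors x 8 cores each -- hypothetical numbers).
num_executors = 4
cores_per_executor = 8
total_cores = num_executors * cores_per_executor  # 32

# The resulting configuration, as it would be passed to SparkConf
# or to spark-submit via --conf.
conf = {
    "spark.default.parallelism": str(total_cores),
}
print(conf["spark.default.parallelism"])  # -> 32
```

In a real job this value would be set with `SparkConf.set("spark.default.parallelism", ...)` or `--conf spark.default.parallelism=32` on `spark-submit`.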
Author: LeetQuiz Editorial Team
When tuning a Spark job that processes a dataset with uneven data distribution (skewed data), which configuration setting is most effective for ensuring the workload is evenly distributed across all cluster nodes?
A. Setting `spark.speculation` to `true` to re-run slow tasks speculatively.
B. Setting `spark.sql.adaptive.skewJoin.enabled` to `true`.
C. Adjusting `spark.sql.shuffle.partitions` to a lower number than the default.
D. Configuring `spark.default.parallelism` to match the number of cores in the cluster.