When dealing with data skew in Spark joins, where one join key has significantly more data than the others, which method is most effective at reducing the performance impact?
A
Convert DataFrames to RDDs and apply a manual join operation to control data distribution.
B
Salt both DataFrames on the join key to distribute the skewed data more evenly across partitions.
C
Use broadcast join to force Spark to broadcast the smaller DataFrame.
D
Increase the number of partitions specifically for the skewed key using custom partitioning.
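Salting (option B) is the standard remedy for join skew: append a random salt to the hot key on the skewed side so it fans out into several distinct keys, and replicate the other side once per salt value so every salted key still finds its match. A minimal pure-Python sketch of the idea follows; `NUM_SALTS`, `salt_left`, and `explode_right` are illustrative names, not Spark APIs, and in a real job the same transforms would be expressed on DataFrame columns before the join.

```python
import random

NUM_SALTS = 4  # hypothetical salt range; tune to the degree of skew


def salt_left(rows):
    """Append a random salt to each join key on the skewed side,
    so one hot key becomes up to NUM_SALTS distinct composite keys."""
    return [((key, random.randrange(NUM_SALTS)), value) for key, value in rows]


def explode_right(rows):
    """Replicate every row on the other side once per salt value,
    so each salted left key still finds its matching row."""
    return [((key, s), value) for key, value in rows for s in range(NUM_SALTS)]


def join(left, right):
    """Plain hash join on the (key, salt) composite key."""
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


# One hot key ("a") dominates the left side; salting splits its rows
# across NUM_SALTS composite keys instead of piling them on one partition.
left = [("a", i) for i in range(1000)] + [("b", 0)]
right = [("a", "dim_a"), ("b", "dim_b")]

joined = join(salt_left(left), explode_right(right))
```

The trade-off is that the non-skewed side grows by a factor of `NUM_SALTS`, which is why salting is applied selectively to hot keys rather than to every join.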