
When a Spark join between a large fact table and a smaller dimension table suffers from data skew (one join key has far more records than the others), which method is most effective at reducing the performance problems caused by shuffle skew?
A. Convert the DataFrames to RDDs and apply a manual join operation to control data distribution.
B. Salt both DataFrames on the join key to distribute the skewed data more evenly across partitions.
C. Use a broadcast join to force Spark to broadcast the smaller DataFrame.
D. Increase the number of partitions specifically for the skewed key using custom partitioning.
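
For illustration, below is a minimal Scala sketch of the salting technique named in option B. The table contents, column names, and the `saltBuckets` fan-out factor are all hypothetical: the fact side appends a random bucket id to each key, and the dimension side is replicated once per bucket so every salted key still finds its match.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SaltedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("salted-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val saltBuckets = 8 // hypothetical fan-out; tune to the observed skew

    // Hypothetical skewed fact table: the key "hot" dominates.
    val fact = Seq.fill(1000)(("hot", 1)).toDF("key", "amount")
      .union(Seq(("cold", 2), ("warm", 3)).toDF("key", "amount"))

    // Hypothetical small dimension table.
    val dim = Seq(("hot", "Hot Item"), ("cold", "Cold Item"), ("warm", "Warm Item"))
      .toDF("key", "name")

    // Salt the fact side: append a random bucket id to each key so the
    // hot key is spread across saltBuckets shuffle partitions.
    val saltedFact = fact.withColumn(
      "salted_key",
      concat($"key", lit("_"), (rand() * saltBuckets).cast("int"))
    )

    // Replicate each dimension row once per bucket so every salted
    // fact key has a matching salted dimension key.
    val saltedDim = dim
      .withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))
      .withColumn("salted_key", concat($"key", lit("_"), $"salt"))

    // Join on the salted key; the shuffle now distributes "hot" evenly.
    val joined = saltedFact.join(saltedDim.drop("key"), Seq("salted_key"))

    joined.groupBy("name").count().show()
    spark.stop()
  }
}
```

By contrast, option C sidesteps the shuffle entirely when the dimension table fits in executor memory: `fact.join(broadcast(dim), "key")` ships the small table to every executor instead of repartitioning the skewed fact table.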