
Answer-first summary for fast verification
Answer: It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
The `pyspark.sql.functions.broadcast` function hints to Spark that a DataFrame is small enough to be broadcast during a join. Spark then sends a full copy of that DataFrame to every executor and holds it in memory there, so the larger DataFrame can be joined in place without being shuffled across the network. Option D describes this accurately. Options A and B are incorrect because `broadcast` applies to a DataFrame, not a column. Options C and E are incorrect because they describe persistent caching (on attached storage volumes across a workspace, or on all nodes for the cluster lifetime), whereas a broadcast is temporary and specific to the join in which it is used.
Author: LeetQuiz Editorial Team
Which statement accurately describes the proper usage of `pyspark.sql.functions.broadcast`?
A
It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
B
It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
C
It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
D
It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
E
It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.