Databricks Certified Data Engineer - Professional

To optimize a join operation in Databricks by ensuring the smaller DataFrame is sent to all executor nodes in the cluster, which function should a data engineer use to mark the DataFrame as small enough to fit in memory on all executors?
Answer: pyspark.sql.functions.broadcast
Explanation:

The pyspark.sql.functions.broadcast function marks a DataFrame as small enough to fit in memory on every executor, hinting Spark to use a broadcast join: a copy of the marked DataFrame is shipped to all executor nodes, so the larger DataFrame can be joined locally without being shuffled across the cluster. Avoiding that shuffle of the large side is what makes this optimization effective for joins between a large table and a small one. Reference: Apache Spark Documentation
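As a minimal sketch of the technique (the orders and countries DataFrames, their contents, and column names below are illustrative assumptions, not part of the question), a broadcast join might look like this:

# A minimal sketch of a broadcast join in PySpark; the DataFrames and
# column names are hypothetical, chosen only to illustrate the hint.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

# Large fact table and a small dimension table (hypothetical data).
orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 250.0), (3, "US", 75.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"],
)

# Marking the small DataFrame with broadcast() hints Spark to ship a
# copy of it to every executor, so the large side is joined locally
# instead of being shuffled.
joined = orders.join(broadcast(countries), on="country_code")
joined.show()

Note that Spark also broadcasts a table automatically when its estimated size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default); calling broadcast() applies the hint explicitly, regardless of that size estimate.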