
Answer-first summary for fast verification
Answer: Add `data_utils` to the cluster's library dependencies using the `spark.conf` settings.
The most efficient and secure way to make the `data_utils` library available to all PySpark jobs on a Databricks cluster is to add it to the cluster's library dependencies through the `spark.conf` settings. A library attached at the cluster level is accessible from every notebook and job on that cluster, with no per-notebook installation or environment-variable changes. The other options fall short: `%pip install` creates a notebook-scoped installation, so running it once in one notebook does not cover the others; setting `PYTHONPATH` only changes where Python looks for modules and is unnecessarily complex to maintain; switching to the Databricks Runtime for Data Engineering does not add a custom library to the cluster; and option E is wrong because cluster-level installation is possible.
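As a rough illustration of what a cluster-level library dependency can look like in practice, the sketch below builds the JSON body for the Databricks Libraries REST API install endpoint (`/api/2.0/libraries/install`). This is a different mechanism from the `spark.conf` wording in the answer and is shown only as one way cluster-scoped libraries are commonly attached. The cluster ID is hypothetical, and the sketch assumes `data_utils` is published as a PyPI-style package, which the question does not state.

```python
import json


def build_install_payload(cluster_id: str, package: str) -> dict:
    """Build the request body for a cluster-scoped library install."""
    return {
        "cluster_id": cluster_id,
        # One entry per library; "pypi" assumes a PyPI-style package.
        "libraries": [{"pypi": {"package": package}}],
    }


# Hypothetical cluster ID, used for illustration only.
payload = build_install_payload("0123-456789-abcde123", "data_utils")
print(json.dumps(payload, indent=2))
```

The sketch stops short of any network call. In a real workspace the body would be sent as an authenticated POST to the install endpoint, after which the library is available to every notebook attached to that cluster.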
Author: LeetQuiz Editorial Team
How can a data engineering team efficiently make the Python library data_utils available to PySpark jobs across multiple notebooks on a Databricks cluster?
A. Run `%pip install data_utils` once on any notebook attached to the cluster.
B. Edit the cluster to use the Databricks Runtime for Data Engineering.
C. Set the `PYTHONPATH` variable in the cluster configuration to include the path to `data_utils`.
D. Add `data_utils` to the cluster's library dependencies using the `spark.conf` settings.
E. There is no way to make the `data_utils` library available to PySpark jobs on a cluster.