
The following code has been migrated to a Databricks notebook from a legacy workload:
%sh
git clone https://github.com/foo/data_loader;
python ./data_loader/run.py;
mv ~/output /dbfs/mnt/new_data
The code executes successfully and produces logically correct results, but it takes over 20 minutes to extract and load approximately 1 GB of data.
Which statement could explain this performance issue?
A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
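
The options turn on the difference between work that runs in a single process on the driver and work that Spark spreads across the worker nodes. As a rough contrast with the shell commands in the question, the sketch below shows what a Spark-native load might look like. It is only an illustration under stated assumptions: the question does not say what run.py does, so the function name `load_with_spark`, the JSON source format, the Delta output format, and the `/mnt/raw_data` path are all hypothetical.

```python
def load_with_spark(spark, source_path, target_path, source_format="json"):
    """Read source files as a distributed DataFrame and write them back out.

    Hypothetical helper: `spark` is an existing SparkSession (Databricks
    notebooks predefine one under that name). The default source format
    and the Delta output format are assumptions, not details from run.py.
    """
    # The read is planned on the driver, but the files are scanned by
    # tasks running on the executors, one or more partitions each.
    df = spark.read.format(source_format).load(source_path)

    # Each task writes its own partitions, so the load is not funneled
    # through a single process the way a shell `mv` on the driver is.
    df.write.format("delta").mode("append").save(target_path)
    return df
```

In a notebook this might be called as `load_with_spark(spark, "/mnt/raw_data", "/mnt/new_data")`, where the target mirrors the `/dbfs/mnt/new_data` location from the question and the source path is made up for the example.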