
Answer-first summary for fast verification
Answer: %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
The code uses %sh to execute shell commands, which run solely on the driver node. This approach does not distribute the workload across worker nodes or leverage Spark's distributed processing. The Python script (run.py) most likely processes the ~1 GB of data sequentially on the driver, which explains the long runtime. Option E correctly identifies that the code uses neither the worker nodes nor Databricks-optimized Spark. The other options are incorrect: A (%sh does not trigger a cluster restart), B (pip install does not parallelize execution; the code runs a cloned script, not an installed package), C (replacing mv with %fs would not distribute the work, since both run a single file operation), and D (Python vs. Scala is irrelevant when Spark is not used at all).
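As a rough sketch of the distributed alternative the explanation alludes to, the extract-and-load step could be expressed as Spark DataFrame operations so it runs across the worker nodes. This is a hypothetical rewrite, not the original run.py logic: the source path, format, and schema options below are assumptions, and in a Databricks notebook the `spark` session is already predefined.

```python
# Hypothetical distributed rewrite of the legacy workload: read and write
# the data with Spark so the work is spread across all worker nodes,
# instead of running run.py sequentially on the driver.
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; getOrCreate() keeps
# this sketch self-contained outside that environment.
spark = SparkSession.builder.getOrCreate()

# Read the source data in parallel (path and CSV format are assumptions).
df = spark.read.option("header", "true").csv("dbfs:/mnt/raw_data")

# Any per-record logic from run.py would be expressed as DataFrame
# transformations here, so it executes on the workers rather than the driver.

# Write the result directly to the mount, replacing the `mv` step.
df.write.mode("overwrite").parquet("dbfs:/mnt/new_data")
```

Note the design difference: the shell-based version is bounded by a single node's CPU and disk, while the Spark version scales with the cluster size and avoids the extra copy from the driver's local filesystem into DBFS.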
Author: LeetQuiz Editorial Team
The following code has been migrated to a Databricks notebook from a legacy workload:
%sh
git clone https://github.com/foo/data_loader;
python ./data_loader/run.py;
mv ~/output /dbfs/mnt/new_data
The code executes successfully and produces logically correct results, but it takes over 20 minutes to extract and load approximately 1 GB of data.
Which statement could explain this performance issue?
A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.