
Answer-first summary for fast verification
Answer: The `%sh` magic command executes shell code exclusively on the driver node, failing to utilize worker nodes or Databricks' optimized Spark engine for distributed processing.
The `%sh` magic command runs shell commands locally on the cluster's driver node. Consequently, cloning the repository, executing the standalone Python script, and moving files with `mv` are all limited to the CPU, memory, and I/O of that single node. None of this work is scheduled on the worker nodes or runs through Spark's distributed engine, which makes the approach highly inefficient for large-scale data tasks.
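A rough way to see why shell work stays on one machine: the sketch below starts a child shell command from Python and compares the node name it reports with that of the parent process. It is an illustration only, not Databricks' actual implementation of `%sh`. The helper name `shell_node_name` is hypothetical, and the code assumes a POSIX system where `uname -n` is available.

```python
import os
import subprocess


def shell_node_name() -> str:
    """Run `uname -n` as a child process and return the node name it prints."""
    result = subprocess.run(
        ["uname", "-n"],       # POSIX command that prints the machine's node name
        capture_output=True,   # collect stdout instead of printing it directly
        text=True,             # decode the output as str
        check=True,            # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Node name as seen by the Python process itself.
    parent_node = os.uname().nodename
    # Node name as seen by the child shell command.
    child_node = shell_node_name()
    print("python process runs on:", parent_node)
    print("shell command runs on: ", child_node)
    # A child process always runs on the machine that spawned it.
    print("same machine:", parent_node == child_node)
```

Run from a notebook's Python cell, both values would be expected to name the driver. Work only reaches the worker nodes when it is expressed through Spark APIs, which a plain shell command or standalone script does not do.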
Author: LeetQuiz Editorial Team
The following code has been migrated to a Databricks notebook from a legacy workload:
%sh
git clone https://github.com/foo/data_loader;
python ./data_loader/run.py;
mv ./output /dbfs/mnt/new_data
Which statement provides the most likely explanation for the high latency observed during the execution of this cell?
A. Python execution is inherently slower than Scala on Databricks; the `run.py` script should be refactored into Scala to leverage the JVM's performance.
B. The `%sh` magic command triggers a cluster restart to initialize Git, meaning the majority of the latency is caused by the cluster startup time.
C. Instead of cloning the repository, the code should use `%sh pip install` to ensure the Python logic is automatically distributed across all nodes in the cluster.
D. The `%sh` command does not support distributed file movement; the final line must be updated to use `%fs` to enable parallelized I/O across the cluster.
E. The `%sh` magic command executes shell code exclusively on the driver node, failing to utilize worker nodes or Databricks' optimized Spark engine for distributed processing.